【Translation】Everybody makes mistakes

Until now, I have presumed that scientists are capable of making statistical computations with perfect accuracy, and only err in their choice of appropriate numbers to compute. Scientists may misuse the results of statistical tests or fail to make relevant computations, but they can at least calculate a p value, right?

Perhaps not.

Surveys of statistically significant results reported in medical and psychological trials suggest that many p values are wrong, and some statistically insignificant results are actually significant when computed correctly.[252] Other reviews find examples of misclassified data, erroneous duplication of data, inclusion of the wrong dataset entirely, and other mix-ups, all concealed by papers which did not describe their analysis in enough detail for the errors to be easily noticed.[126]
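
One way such errors can be caught is to recompute a p value from the test statistic a paper reports. The sketch below makes two assumptions. First, the reported statistic is a z score from a two-sided test; t or F tests need their own distributions. Second, the function names and example numbers are hypothetical and are not taken from the surveys cited above.

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Two-sided p value for a standard normal test statistic z."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) when Z ~ N(0, 1)
    return math.erfc(abs(z) / math.sqrt(2))

def check_reported_p(z: float, reported_p: float, alpha: float = 0.05, tol: float = 0.01):
    """Compare a reported p value with the one implied by z.

    Returns the recomputed p value, whether it differs from the reported
    value by more than `tol`, and whether the significance verdict flips.
    """
    recomputed = two_sided_p_from_z(z)
    inconsistent = abs(recomputed - reported_p) > tol
    verdict_flips = (recomputed < alpha) != (reported_p < alpha)
    return recomputed, inconsistent, verdict_flips

# Hypothetical example: a paper reports z = 2.10 alongside p = 0.08 ("not significant").
p, inconsistent, flips = check_reported_p(z=2.10, reported_p=0.08)
print(f"recomputed p = {p:.4f}, inconsistent = {inconsistent}, verdict flips = {flips}")
```

In this made-up case the recomputed p value is about 0.036, so a result reported as insignificant would in fact fall below the 0.05 threshold.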

Sunshine is the best disinfectant, and many scientists have called for experimental data to be made available through the Internet. In some fields, this is now commonplace: there exist gene sequencing databases, protein structure databanks, astronomical observation databases, and earth observation collections containing the contributions of thousands of scientists. Many other fields, however, can’t share their data due to impracticality (particle physics data can include many terabytes of information), privacy issues (in medical trials), a lack of funding or technological support, or just a desire to keep proprietary control of the data and all the discoveries which result from it. And even if the data were all available, would anyone analyze it all to spot errors?

Similarly, scientists in some fields have pushed towards making their statistical analyses available through clever technological tools. A tool called Sweave, for instance, makes it easy to embed statistical analyses performed using the popular R programming language inside papers written in LaTeX, the standard for scientific and mathematical publications. The result looks just like any scientific paper, but another scientist reading the paper and curious about its methods can download the source code, which shows exactly how all the numbers were calculated. But would scientists avail themselves of the opportunity? Nobody gets scientific glory by checking code for typos.
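
Sweave's own input format mixes R code chunks into the LaTeX source. Here is a rough Python analog of the underlying principle. It is not Sweave, and the measurements and the `summarize` helper are invented for the example. Every number in the generated LaTeX sentence is computed from the data rather than typed by hand.

```python
import statistics
from string import Template

# Hypothetical measurements, invented for this illustration.
control = [5.1, 4.9, 5.3, 5.0, 5.2]
treatment = [5.6, 5.8, 5.4, 5.9, 5.7]

def summarize(group):
    """Return the mean and sample standard deviation, formatted to two decimals."""
    return f"{statistics.mean(group):.2f}", f"{statistics.stdev(group):.2f}"

# A LaTeX sentence with placeholders; \( ... \) is LaTeX inline math.
SENTENCE = Template(
    r"The control group averaged \(${c_mean} \pm ${c_sd}\) "
    r"and the treatment group \(${t_mean} \pm ${t_sd}\)."
)

c_mean, c_sd = summarize(control)
t_mean, t_sd = summarize(treatment)

# The reported numbers are generated from the data, never copied in manually.
sentence = SENTENCE.substitute(c_mean=c_mean, c_sd=c_sd, t_mean=t_mean, t_sd=t_sd)
print(sentence)
```

If the data change, the sentence changes with them, so the text and the analysis cannot silently drift apart. A reader with the source can also see exactly how each figure was produced.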

Another solution might be replication. If scientists carefully recreate the experiments of other scientists and validate their results, it is much easier to rule out the possibility of a typo causing an errant result. Replication also weeds out fluke false positives. Many scientists claim that experimental replication is the heart of science: no new idea is accepted until it has been independently tested and retested around the world and found to hold water.
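
To see why replication weeds out flukes, consider effects that do not exist at all. The Python sketch below rests on simplifying assumptions of its own:

* Each study's test statistic is a standard normal z score under the null hypothesis.
* The significance level is α = 0.05, two-sided.
* A "replication" counts only if the second, independent study is also significant in the same direction.

About 5% of original studies come out significant by chance. Only about 0.05 × 0.025 ≈ 0.125% also survive replication.

```python
import random
from statistics import NormalDist

ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # about 1.96 for a two-sided test

def simulate_null_replications(n_effects: int, seed: int = 1):
    """Simulate studies of effects that are truly zero.

    Returns (fraction significant in the original study,
             fraction also significant, with the same sign, in a replication).
    """
    rng = random.Random(seed)
    originals = 0
    replicated = 0
    for _ in range(n_effects):
        z1 = rng.gauss(0.0, 1.0)  # original study's z statistic under the null
        if abs(z1) > Z_CRIT:
            originals += 1
            z2 = rng.gauss(0.0, 1.0)  # independent replication, also under the null
            if abs(z2) > Z_CRIT and (z2 > 0) == (z1 > 0):
                replicated += 1
    return originals / n_effects, replicated / n_effects

false_pos, survived = simulate_null_replications(200_000)
print(f"false positives: {false_pos:.4f}, survive replication: {survived:.5f}")
```

Real replications are noisier than this idealized model (different samples, protocols, and power), but the arithmetic shows why a chance result rarely strikes twice.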

That’s not entirely true; scientists often take previous studies for granted, though occasionally scientists decide to systematically re-test earlier works. One new project, for example, aims to reproduce papers in major psychology journals to determine just how many papers hold up over time – and what attributes of a paper predict how likely it is to stand up to retesting.[1] In another example, cancer researchers at Amgen retested 53 landmark preclinical studies in cancer research. (By “preclinical” I mean the studies did not involve human patients, as they were testing new and unproven ideas.) Despite working in collaboration with the authors of the original papers, the Amgen researchers could only reproduce six of the studies.[5] Bayer researchers have reported similar difficulties when testing potential new drugs found in published papers.[49]
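
The Amgen figure is easy to put in proportion: six of 53 is about 11%. The Python sketch below attaches a 95% Wilson score interval to that proportion. This is purely an illustration and not something the Amgen researchers reported. It treats the 53 studies as independent trials with a common reproduction probability, which is a strong simplifying assumption.

```python
import math
from statistics import NormalDist

def wilson_interval(successes: int, trials: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    return center - half_width, center + half_width

reproduced, attempted = 6, 53  # Amgen figures quoted in the text
low, high = wilson_interval(reproduced, attempted)
print(f"reproduced {reproduced / attempted:.1%}, 95% interval {low:.1%} to {high:.1%}")
```

Under those assumptions the interval runs from roughly 5% to 23%. Even the optimistic end leaves most of the landmark studies unreproduced.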

This is worrisome. Does the trend hold true for less speculative kinds of medical research? Apparently so: of the top-cited research articles in medicine, a quarter have gone untested after their publication, and a third have been found to be exaggerated or wrong by later research.[32] That’s not as extreme as the Amgen result, but it makes you wonder what serious errors still lurk unnoticed in important research. Replication is not as prevalent as we would like it to be, and the results are not always favorable.

Source: https://www.statisticsdonewrong.com/mistakes.html