What have we wrought?

I’ve painted a grim picture. But anyone can pick out small details in published studies and produce a tremendous list of errors. Do these problems matter?

Well, yes. I wouldn’t have written this otherwise.

John Ioannidis’s famous article “Why Most Published Research Findings Are False”31 was grounded in mathematical concerns rather than an empirical test of research results. If most research articles have poor statistical power – and they do – while researchers have the freedom to choose among a multitude of analysis methods to get favorable results – and they do – when most tested hypotheses are false and most true hypotheses correspond to very small effects, we are mathematically determined to get a multitude of false positives.

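The arithmetic behind this argument can be sketched directly. The numbers below are illustrative assumptions of my own, not figures from Ioannidis’s paper: low statistical power, the conventional 5% significance level, and a small prior probability that a tested hypothesis is true.

```python
# Sketch of the positive-predictive-value arithmetic behind the
# argument above. All numbers are illustrative assumptions, not
# figures from Ioannidis's paper.

def ppv(power, alpha, prior):
    """Fraction of 'significant' findings that are actually true."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# Underpowered studies testing mostly-false hypotheses:
# 20% power, alpha = 0.05, and 1 in 10 tested hypotheses true.
print(round(ppv(power=0.2, alpha=0.05, prior=0.1), 2))  # -> 0.31
```

Under these assumptions, fewer than a third of statistically significant results reflect real effects; raising either the power or the prior raises the predictive value.
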
But if you want empiricism, you can have it, courtesy of John Ioannidis and Jonathan Schoenfeld. They studied the question “Is everything we eat associated with cancer?”51 After choosing fifty common ingredients out of a cookbook, they set out to find studies linking them to cancer rates – and found 216 studies on forty different ingredients. Of course, most of the studies disagreed with each other. Most ingredients had multiple studies claiming they increased and decreased the risk of getting cancer. Most of the statistical evidence was weak, and meta-analyses usually showed much smaller effects on cancer rates than the original studies.

Of course, being contradicted by follow-up studies and meta-analyses doesn’t prevent a paper from being cited as though it were true. Even effects which have been contradicted by massive follow-up trials with unequivocal results are frequently cited five or ten years later, with scientists apparently not noticing that the results are false.55 Of course, new findings get widely publicized in the press, while contradictions and corrections are hardly ever mentioned.23 You can hardly blame the scientists for not keeping up.

Let’s not forget the merely biased results. Poor reporting standards in medical journals mean studies testing new treatments for schizophrenia can neglect to include the scale they used to evaluate symptoms – a handy source of bias, as trials using unpublished scales tend to produce better results than those using previously validated tests.40 Other medical studies simply omit particular results if they’re not favorable or interesting, biasing subsequent meta-analyses to only include positive results. A third of meta-analyses are estimated to suffer from this problem.34

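The effect of omitting unfavorable results can be shown with a toy simulation of my own, not drawn from any of the cited studies: every study measures a true effect of zero, but a meta-analysis that only sees the “favorable” estimates comes away with a clearly nonzero answer.

```python
import random

# Toy simulation of selective reporting (illustrative, not from the
# cited studies): every study measures a true effect of zero with
# unit noise, but only 'favorable' estimates above an arbitrary
# cutoff reach the meta-analysis.
random.seed(1)
estimates = [random.gauss(0, 1) for _ in range(1000)]

full_mean = sum(estimates) / len(estimates)    # near zero
published = [e for e in estimates if e > 0.5]  # selectively reported
biased_mean = sum(published) / len(published)  # clearly positive

print(round(full_mean, 2), round(biased_mean, 2))
```

The full set of studies averages out to roughly the true effect of zero, while the selectively reported subset suggests a substantial positive effect – exactly the inflation the meta-analysis inherits.
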
Another review compared meta-analyses to subsequent large randomized controlled trials, considered the gold standard in medicine. In over a third of cases, the randomized trial’s outcome did not correspond well to the meta-analysis.39 Other comparisons of meta-analyses to subsequent research found that most results were inflated, with perhaps a fifth representing false positives.45

Let’s not forget the multitude of physical science papers which misuse confidence intervals.37 Or the peer-reviewed psychology paper allegedly providing evidence for psychic powers, on the basis of uncontrolled multiple comparisons in exploratory studies.58 Unsurprisingly, results failed to be replicated – by scientists who appear not to have calculated the statistical power of their tests.20

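Uncontrolled multiple comparisons inflate false positives in a predictable way. A quick sketch, using my own illustrative numbers, of the familywise error rate for k independent tests at a 5% significance level:

```python
# Chance of at least one false positive across k independent tests
# when every null hypothesis is true (illustrative, alpha = 0.05).

def familywise_error(alpha, k):
    """P(at least one false positive) for k independent tests."""
    return 1 - (1 - alpha) ** k

for k in (1, 10, 20):
    print(k, round(familywise_error(0.05, k), 2))
# -> 1 0.05
# -> 10 0.4
# -> 20 0.64
```

With twenty uncorrected comparisons, a researcher has better-than-even odds of finding at least one “significant” result even when nothing is there.
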
We have a problem. Let’s work on fixing it.

Source: https://www.statisticsdonewrong.com/results.html