【Translation】When differences in significance aren’t significant differences

“We compared treatments A and B with a placebo. Treatment A showed a significant benefit over placebo, while treatment B had no statistically significant benefit. Therefore, treatment A is better than treatment B.”

We hear this all the time. It’s an easy way of comparing medications, surgical interventions, therapies, and experimental results. It’s straightforward. It seems to make sense.

However, a difference in significance does not always make a significant difference.[22]

One reason is the arbitrary nature of the p<0.05 cutoff. We could get two very similar results, with p=0.04 and p=0.06, and mistakenly say they’re clearly different from each other simply because they fall on opposite sides of the cutoff. The second reason is that p values are not measures of effect size, so similar p values do not always mean similar effects. Two results with identical statistical significance can nonetheless contradict each other.
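To see how little separates p=0.04 from p=0.06, the following sketch converts each p value back to a z-score and then tests the difference between the two results directly. It rests on assumptions that are not in the text: two-sided z-tests, independent estimates, and equal standard errors. The helper names `z_from_p` and `p_from_z` are made up for the illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def z_from_p(p_two_sided):
    """z-score whose two-sided p value equals p_two_sided."""
    return STD_NORMAL.inv_cdf(1 - p_two_sided / 2)


def p_from_z(z):
    """Two-sided p value for a z-score."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


z_a = z_from_p(0.04)  # result just inside the cutoff
z_b = z_from_p(0.06)  # result just outside the cutoff

# Assumption: both estimates are independent and share one standard error,
# so the standard error of their difference is sqrt(2) times larger.
z_diff = (z_a - z_b) / sqrt(2)
p_diff = p_from_z(z_diff)

print(f"z_a={z_a:.2f}, z_b={z_b:.2f}, p for the difference={p_diff:.2f}")
```

Under these assumptions the p value for the difference comes out near 0.9, so the two results are statistically indistinguishable even though only one of them is "significant".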

Instead, think about statistical power. If we compare our new experimental drugs Fixitol and Solvix to a placebo but we don’t have enough test subjects to give us good statistical power, then we may fail to notice their benefits. If they have identical effects but we have only 50% power, then there’s a good chance we’ll say Fixitol has significant benefits and Solvix does not. Run the trial again, and it’s just as likely that Solvix will appear beneficial and Fixitol will not.
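The "good chance" can be put into numbers with a little arithmetic. The sketch below assumes the two drug-versus-placebo comparisons are independent and have the same power; that is a simplification, since a shared placebo group would make the two comparisons correlated. The function names are hypothetical.

```python
def prob_first_only(power):
    """Chance that the first comparison is significant and the second is not."""
    return power * (1 - power)


def prob_exactly_one(power):
    """Chance that exactly one of the two comparisons is significant."""
    return 2 * power * (1 - power)


power = 0.5  # the 50% power from the example above

# Assumption: the two comparisons are independent and equally powered.
fixitol_only = prob_first_only(power)
split_verdict = prob_exactly_one(power)

print(f"Fixitol significant, Solvix not: {fixitol_only:.2f}")
print(f"Exactly one drug significant:    {split_verdict:.2f}")
```

With 50% power, half of all such trials would declare one drug significant and the other not, even though the drugs are identical.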

Instead of independently comparing each drug to the placebo, we should compare them against each other. We can test the hypothesis that they are equally effective, or we can construct a confidence interval for the extra benefit of Fixitol over Solvix. If the interval includes zero, then they could be equally effective; if it doesn’t, then one medication is a clear winner. This doesn’t improve our statistical power, but it does prevent the false conclusion that the drugs are different. Our tendency to look for a difference in significance should be replaced by a check for the significance of the difference.
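A minimal sketch of such a direct comparison follows. The effect sizes and standard errors are invented for illustration, and the calculation assumes independent, approximately normal estimates (a real analysis would use whatever test suits the trial design). `two_sided_p` and `difference_test` are hypothetical helper names.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(estimate, std_error):
    """Two-sided p value for the hypothesis that the true value is zero."""
    z = estimate / std_error
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def difference_test(est_a, se_a, est_b, se_b, level=0.95):
    """Return (difference, (low, high), p) for est_a - est_b.

    Assumes the two estimates are independent and normally distributed.
    """
    diff = est_a - est_b
    se_diff = sqrt(se_a ** 2 + se_b ** 2)
    z_crit = STD_NORMAL.inv_cdf(0.5 + level / 2)
    interval = (diff - z_crit * se_diff, diff + z_crit * se_diff)
    return diff, interval, two_sided_p(diff, se_diff)


# Hypothetical benefits over placebo, each with its standard error.
fixitol, fixitol_se = 25.0, 10.0
solvix, solvix_se = 10.0, 10.0

p_fixitol = two_sided_p(fixitol, fixitol_se)  # significant on its own
p_solvix = two_sided_p(solvix, solvix_se)     # not significant on its own
diff, (low, high), p_diff = difference_test(fixitol, fixitol_se, solvix, solvix_se)

print(f"Fixitol vs placebo: p={p_fixitol:.3f}")
print(f"Solvix vs placebo:  p={p_solvix:.3f}")
print(f"Fixitol - Solvix:   {diff:.1f}, 95% CI ({low:.1f}, {high:.1f}), p={p_diff:.3f}")
```

With these made-up numbers, one drug clears the p<0.05 bar and the other does not, yet the interval for their difference includes zero, so the data do not show that the drugs differ.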

Examples of this error in common literature and news stories abound. A huge proportion of papers in neuroscience, for instance, commit the error.[44] You might also remember a study a few years ago suggesting that men with more biological older brothers are more likely to be homosexual.[9] How did they reach this conclusion? And why older brothers and not older sisters?

The authors explain their conclusion by noting that they ran an analysis of various factors and their effect on homosexuality. Only the number of older brothers had a statistically significant effect; number of older sisters, or number of nonbiological older brothers, had no statistically significant effect.

But as we’ve seen, that doesn’t guarantee that there’s a significant difference between the effects of older brothers and older sisters. In fact, taking a closer look at the data, it appears there’s no statistically significant difference between the effect of older brothers and older sisters. Unfortunately, not enough data was published in the paper to allow a direct calculation.[22]

When significant differences are missed

The problem can run the other way. Scientists routinely judge whether a significant difference exists simply by eye, making use of plots like this one:

Imagine the two plotted points indicate the estimated time until recovery from some disease in two different groups of patients, each containing ten patients. There are three different things those error bars could represent:

  1. The standard deviation of the measurements. Calculate how far each observation is from the average, square each difference, and then average the results and take the square root. This is the standard deviation, and it measures how spread out the measurements are from their mean.

  2. The standard error of some estimator. For example, perhaps the error bars are the standard error of the mean. If I were to measure many different samples of patients, each containing exactly n subjects, I can estimate that 68% of the mean times to recover I measure will be within one standard error of the “real” average time to recover. (In the case of estimating means, the standard error is the standard deviation of the measurements divided by the square root of the number of measurements, so the estimate gets better as you get more data, but not too fast.) Many statistical techniques, like least-squares regression, provide standard error estimates for their results.

  3. The confidence interval of some estimator. A 95% confidence interval is mathematically constructed to include the true value for 95 random samples out of 100, so it spans roughly two standard errors in each direction. (In more complicated statistical models this may not be exactly true.)

These three options are all different. The standard deviation is a simple measurement of my data. The standard error tells me how a statistic, like a mean or the slope of a best-fit line, would likely vary if I take many samples of patients. A confidence interval is similar, with an additional guarantee that 95% of 95% confidence intervals should include the “true” value.
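The three quantities can be computed side by side for one group of ten patients. The recovery times below are invented, and the interval uses the "roughly two standard errors" normal approximation described in the list. Many analyses would instead divide by n-1 for the standard deviation and use a t-based multiplier, which gives a somewhat wider interval for a sample this small.

```python
from math import sqrt
from statistics import NormalDist, fmean

# Hypothetical recovery times (in days) for one group of ten patients.
times = [21, 25, 19, 30, 24, 27, 22, 26, 23, 28]

n = len(times)
mean = fmean(times)

# Standard deviation as described above: average the squared deviations
# from the mean (dividing by n), then take the square root.
sd = sqrt(sum((t - mean) ** 2 for t in times) / n)

# Standard error of the mean: standard deviation over the square root of n.
sem = sd / sqrt(n)

# Approximate 95% confidence interval: about two standard errors each way.
# Assumption: normal approximation rather than a t distribution.
z_crit = NormalDist().inv_cdf(0.975)
ci = (mean - z_crit * sem, mean + z_crit * sem)

print(f"mean={mean:.2f}  sd={sd:.2f}  sem={sem:.2f}")
print(f"approximate 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The same data produce three error bars of very different lengths, which is why a plot needs to say which one it shows.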

In the example plot, we have two 95% confidence intervals which overlap. Many scientists would view this and conclude there is no statistically significant difference between the groups. After all, groups 1 and 2 might not be different – the average time to recover could be 25 in both groups, for example, and the differences only appeared because group 1 was lucky this time. But does this mean the difference is not statistically significant? What would the p value be?

In this case, p<0.05. There is a statistically significant difference between the groups, even though the confidence intervals overlap.[1]
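A small numerical sketch shows how this can happen. The means and standard errors are hypothetical (they are not the values behind the original plot), and the test is a simple z-test on independent groups; with ten patients per group a t-test would be the more usual choice.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_95 = STD_NORMAL.inv_cdf(0.975)  # about 1.96


def ci95(mean, std_error):
    """Approximate 95% confidence interval for a mean."""
    return (mean - Z_95 * std_error, mean + Z_95 * std_error)


# Hypothetical mean recovery times (days) and standard errors.
mean_1, se_1 = 22.0, 2.0
mean_2, se_2 = 28.0, 2.0

ci_1 = ci95(mean_1, se_1)
ci_2 = ci95(mean_2, se_2)
intervals_overlap = ci_1[0] <= ci_2[1] and ci_2[0] <= ci_1[1]

# Test the difference directly. Assumption: independent groups, z-test.
z = (mean_2 - mean_1) / sqrt(se_1 ** 2 + se_2 ** 2)
p = 2 * (1 - STD_NORMAL.cdf(abs(z)))

print(f"group 1 CI: ({ci_1[0]:.2f}, {ci_1[1]:.2f})")
print(f"group 2 CI: ({ci_2[0]:.2f}, {ci_2[1]:.2f})")
print(f"overlap: {intervals_overlap}, p for the difference: {p:.3f}")
```

Both intervals contain 25, so they overlap, yet the direct test of the difference gives p below 0.05. The standard error of a difference grows with the square root of the summed variances, not with the sum of the two bar lengths.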

Unfortunately, many scientists skip hypothesis tests and simply glance at plots to see if confidence intervals overlap. This is actually a much more conservative test: requiring confidence intervals to not overlap is akin to requiring p<0.01 in some cases.[50] It is easy to claim two measurements are not significantly different even when they are.

Conversely, comparing measurements with standard errors or standard deviations will also be misleading, as standard error bars are shorter than confidence interval bars. Two observations might have standard errors which do not overlap, and yet the difference between the two is not statistically significant.
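Both effects can be quantified by asking what p value corresponds to two error bars that just touch. The sketch assumes two independent estimates with equal standard errors and a two-sided z-test; `p_when_bars_touch` is a hypothetical helper, not a standard function.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_when_bars_touch(half_width_in_se):
    """Two-sided p value for the difference when two equal error bars just touch.

    Each bar extends half_width_in_se standard errors from its estimate, so
    touching bars mean the estimates differ by 2 * half_width_in_se standard
    errors. The difference itself has a standard error of sqrt(2) standard
    errors, which gives z = sqrt(2) * half_width_in_se.
    """
    z = sqrt(2) * half_width_in_se
    return 2 * (1 - STD_NORMAL.cdf(z))


# Bars drawn as 95% confidence intervals (about 1.96 standard errors each way).
p_ci = p_when_bars_touch(STD_NORMAL.inv_cdf(0.975))

# Bars drawn as one standard error each way.
p_se = p_when_bars_touch(1.0)

print(f"95% CIs just touching:        p = {p_ci:.4f}")
print(f"standard errors just touching: p = {p_se:.4f}")
```

Under these assumptions, 95% intervals that barely touch correspond to p of about 0.006, stricter than the usual cutoff, while standard error bars that barely touch correspond to p of about 0.16, which is nowhere near significant.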

A survey of psychologists, neuroscientists and medical researchers found that the majority made this simple error, with many scientists confusing standard errors, standard deviations, and confidence intervals.[6] Another survey of climate science papers found that a majority of papers which compared two groups with error bars made the error.[37] Even introductory textbooks for experimental scientists, such as An Introduction to Error Analysis, teach students to judge by eye, hardly mentioning formal hypothesis tests at all.

There are, of course, formal statistical procedures which generate confidence intervals which can be compared by eye, and even correct for multiple comparisons automatically. For example, Gabriel comparison intervals are easily interpreted by eye.[19]

Overlapping confidence intervals do not mean two values are not significantly different. Similarly, separated standard error bars do not mean two values are significantly different. It’s always best to use the appropriate hypothesis test instead. Your eyeball is not a well-defined statistical procedure.

https://www.statisticsdonewrong.com/significant-differences.html