【Translation】Stopping rules and regression to the mean

Stopping rules and regression to the mean

终止规则和均值回归

Medical trials are expensive. Supplying dozens of patients with experimental medications and tracking their symptoms over the course of months takes significant resources, and so many pharmaceutical companies develop “stopping rules,” which allow investigators to end a study early if it’s clear the experimental drug has a substantial effect. For example, if the trial is only half complete but there’s already a statistically significant difference in symptoms with the new medication, the researchers may terminate the study, rather than gathering more data to reinforce the conclusion.

医学试验是昂贵的。进行数十个患者的医疗试验并长时间记录他们的症状改变需要大量的人力物力,因此许多制药厂家制定了“终止规则”,这使得投资者能加速研究进程。例如,如果该试验只完成了一般,但是患者症状已经发现了统计显著性差异的好转,研究人员就会停止继续该试验,而不是继续进行获得更多的试验来加强该结论。

When poorly done, however, this can lead to numerous false positives.

然而,如果做的不合适,这种终止规则会导致大量的假阳性结论。

For example, suppose we’re comparing two groups of patients, one with a medication and one with a placebo. We measure the level of some protein in their bloodstreams as a way of seeing if the medication is working. In this case, though, the medication causes no difference whatsoever: patients in both groups have the same average protein levels, although of course individuals have levels which vary slightly.

例如,假设我们在对比两组患者,一组用药一组用安慰剂。我们通过测量他们血液中的某种蛋白来测试该药品是否发挥作用。在这种情况下,药品实际上没有什么效果:虽然从个人角度而言各有不同,但两组患者的蛋白含量一致。

We start with ten patients in each group, and gradually collect more data from more patients. As we go along, we do a t test to compare the two groups and see if there is a statistically significant difference between average protein levels. We might see a result like this simulation:

我们假设每组有十位患者,当我们收集的数据越来越多,我们进行t试验来比较两组是否在蛋白含量上有显著差异。我们得到的结果可能如下仿真:

This plot shows the p value of the difference between groups as we collect more data, with the horizontal line indicating the p=0.05 level of significance. At first, there appears to be no significant difference. Then we collect more data and conclude there is. If we were to stop, we’d be misled: we’d believe there is a significant difference between groups when there is none. As we collect yet more data, we realize we were mistaken – but then a bit of luck leads us back to a false positive.

改图显示不同样本大小下的p值变化,其中横虚线为p=0.05的显著性基线。一开始,似乎没有显著差异,随着数据增多结论变成有差异,但是如果我们就此停止,我们就会被误导:我们认为两组数据间有显著差异实际上没有。随着数据继续增加,我们就会发现我们错了,虽然我们仍有小概率会得出假阳性的结论。

You’d expect that the p value dip shouldn’t happen, since there’s no real difference between groups. After all, taking more data shouldn’t make our conclusions worse, right? And it’s true that if we run the trial again we might find that the groups start out with no significant difference and stay that way as we collect more data, or start with a huge difference and quickly regress to having none. But if we wait long enough and test after every data point, we will eventually cross any arbitrary line of statistical significance, even if there’s no real difference at all. We can’t usually collect infinite samples, so in practice this doesn’t always happen, but poorly implemented stopping rules still increase false positive rates significantly.53

你可能并不希望p值低于0.05,既然两组数据之间实际上并没有差别。而且按理说数据越多,结论越准不是吗?没错,如果我们再进行一次实验,我们可能一开发现没有显著差异,然后随着数据量增长保持一样的结果,也有可能一概是有一个非常显著的差异然后迅速回归的无差异。但是不管我们设置怎样的统计显著的基线,如果我们收集更长时间的数据,在每个数据点后进行试验,最终我们都会穿过该基线,即使实际上两组数据上根本没有差异。而我们又不可能收集无限多的样本,所以从实际层面我们也很少遇到上述的情况,但是错误地使用终止规则会大量提高假阳性结论的概率。

Modern clinical trials are often required to register their statistical protocols in advance, and generally pre-select only a few evaluation points at which they test their evidence, rather than testing after every observation. This causes only a small increase in the false positive rate, which can be adjusted for by carefully choosing the required significance levels and using more advanced statistical techniques.56 But in fields where protocols are not registered and researchers have the freedom to use whatever methods they feel appropriate, there may be false positive demons lurking.

现代临床试验经常需要提前设置统计协议,提前设置少量测试评估点,而非每次观测后都进行试验。这就增加了假阳性结论的可能,而这种错误是可以通过谨慎选择显著性等级和其他高级统计方法来调整的。但是在没有明确注册协议以及研究人员可以自由发挥的部分,仍然有假阳性的问题存在。

 

Truth inflation

真值膨胀

Medical trials also tend to have inadequate statistical power to detect moderate differences between medications. So they want to stop as soon as they detect an effect, but they don’t have the power to detect effects.

医药试验在检测不同药品的中等差异时往往统计功效不足。所以科研人员也希望在检测出效果后就终止继续试验,但其实他们并没有足够的功效来证明效果检测。

Suppose a medication reduces symptoms by 20% over a placebo, but the trial you’re using to test it does not have adequate statistical power to detect this difference. We know that small trials tend to have varying results: it’s easy to get ten lucky patients who have shorter colds than usual, but much harder to get ten thousand who all do.

假设某项药品比安慰剂能降低20%的症状,但是你所使用的试验并没有足够的功效检验这个规模的不同。我们知道小规模的试验通常会得到各种不同的结论:很有可能测试组中十位患者的身体素质高本身就恢复快,但是一组包括一万个这样的患者就很难了。

Now imagine running many copies of this trial. Sometimes you get unlucky patients, and so you don’t notice any statistically significant improvement from your drug. Sometimes your patients are exactly average, and the treatment group has their symptoms reduced by 20% – but you don’t have enough data to call this a statistically significant increase, so you ignore it. Sometimes the patients are lucky and have their symptoms reduced by much more than 20%, and so you stop the trial and say “Look! It works!”

现在假设我们进行了很多组这样的试验。有时不巧患者的身体条件不好,你没有发现你的药品有任何显著提升。有时患者很平均,你恰好得到20%的疗效,但是你的数据量不够,无法断定有显著功效,所以你也忽视了。有时患者身体条件好,你得到远远高于20%的效果,然后你就停止试验,然后断定“看!药品有效!”

You’ve correctly concluded that your medication is effective, but you’ve inflated the size of its effect. You falsely believe it is much more effective than it really is.

虽然你得出了药品有效的结论,但是实际上你夸大了药品的效应规模。你错误地认为药品的效果比实际上要大很多。

This effect occurs in pharmacological trials, epidemiological studies, gene association studies (“gene A causes condition B”), psychological studies, and in some of the most-cited papers in the medical literature.3032 In fields where trials can be conducted quickly by many independent researchers (such as gene association studies), the earliest published results are often wildly contradictory, because small trials and a demand for statistical significance cause only the most extreme results to be published.33

这种情况大量出现在药学试验、流行病研究、基因关联研究(“基因A导致B性状”),心理学研究和医学文献里常常引用的论文里。在那些独立研究人员进行的试验中(例如基因关联研究),早期的发表结果通常是大相径庭,因为试验规模太小,而人们盲目追求统计差异性导致发表的试验大都是那些极端情况。

As a bonus, truth inflation can combine forces with early stopping rules. If most drugs in clinical trials are not quite so effective to warrant stopping the trial early, then many trials stopped early will be the result of lucky patients, not brilliant drugs – and by stopping the trial we have deprived ourselves of the extra data needed to tell the difference. Reviews have compared trials stopped early with other studies addressing the same question which did not stop early; in most cases, the trials stopped early exaggerated the effects of their tested treatments by an average of 29%.3

作为奖励,真值膨胀加剧了早期终止规则的效果。如果大部分临床试验中的药品都不具备执行终止试验的高效,那么那些执行了终止规则的试验实际上只是侥幸遇到患者身体条件好,而不是药品有效。而进行终止试验我们就停止收集更多数据来验证真实的差异。有综述对比同一研究采取早期终止和继续收集数据试验的案例,发现那些早期终止的试验平均夸大疗效29%。

Of course, we do not know The Truth about any drug being studied, so we cannot tell if a particular study stopped early due to luck or a particularly good drug. Many studies do not even publish the original intended sample size or the stopping rule which was used to justify terminating the study.43A trial’s early stoppage is not automatic evidence that its results are biased, but it is a suggestive detail.

当然,我们并不清楚我们研究药品的真实效果,所以我们也没办法确定我们提前终止的试验是因为运气还是真的有效。很多研究甚至没有声明所需的样本大小或者该研究下可行的终止规则。试验过早终止并不是结果有偏见的先天性原因,但是需要更多详尽的解释。

Little extremes

少数的极端现象

Suppose you’re in charge of public school reform. As part of your research into the best teaching methods, you look at the effect of school size on standardized test scores. Do smaller schools perform better than larger schools? Should you try to build many small schools or a few large schools?

假设你负责公立学校的改革。最好的教学方式是你研究的一部分,你希望区分标准考试分数与学校规模之间的关系。是否学生少的学校教学质量高于人多的学校呢?我们应该建立更多的小规模学校还是少量的大规模学校呢?

To answer this question, you compile a list of the highest-performing schools you have. The average school has about 1,000 students, but the top-scoring five or ten schools are almost all smaller than that. It seems that small schools do the best, perhaps because of their personal atmosphere where teachers can get to know students and help them individually.

要解答这个问题,你列出一系列教学质量高的学校。平均每间学校有大约1000个学生,但是列表里前五或者前十的学校人数都小于平均值。看上去好像小学校做的更好一些,也许因为小范围里老师能够更好地了解和指导每个学生。

Then you take a look at the worst-performing schools, expecting them to be large urban schools with thousands of students and overworked teachers. Surprise! They’re all small schools too.

然后你把排名最差的学校也拿出来,猜想这些学校可能是城市里拥有上万学生、师资力量严重不足的大型学校。但是,但是,事实上你看到的仍然是小规模学校。

What’s going on? Well, take a look at a plot of test scores vs. school size:

到底发生了什么?我们看一下考试成绩与学校规模的散点图:

Smaller schools have more widely varying average test scores, entirely because they have fewer students. With fewer students, there are fewer data points to establish the “true” performance of the teachers, and so the average scores vary widely. As schools get larger, test scores vary less, and in fact increase on average.

小规模学校在平均成绩附近的幅度要更广,这完全是因为他们的学生他太少了。人数太少,建立真实教学质量的数据点就少,平均成绩的幅度自然就很广泛。当学校规模变大以后,考试成绩的幅度就变小了,这样也提高了平均值。

This example used simulated data, but it’s based on real (and surprising) observations of Pennsylvania public schools.59

这个例子使用的是仿真数据,但是是基于宾夕法尼亚公立学校的真实(令人意外的)观察得到的。

Another example: In the United States, counties with the lowest rates of kidney cancer tend to be Midwestern, Southern and Western rural counties. How could this be? You can think of many explanations: rural people get more exercise, inhale less polluted air, and perhaps lead less stressful lives. Perhaps these factors lower their cancer rates.

另一个例子是:在美国,貌似拥有最低肾癌率的州在中西部、南部和西部的乡下州。为什么会这样?你可以想到很多可能的原因:乡下的人运动量更大,呼吸的受污染的空气更少,也有可能生活压力更小。也许是这些因素降低了肾癌的患病率。

On the other hand, counties with the highest rates of kidney cancer tend to be Midwestern, Southern and Western rural counties.

然而另一方面,肾癌患病率最高的仍然是这些地区的州。

The problem, of course, is that rural counties have the smallest populations. A single kidney cancer patient in a county with ten residents gives that county the highest kidney cancer rate in the nation. Small counties hence have vastly more variable kidney cancer rates, simply because they have so few residents.21

这个问题,很显然,是因为这些州的人口相对来说较少。如果一个州就十个人,那么一个患者就足以给出全国最高的患病率。人口小的州的患癌率的浮动很高,简单讲就是因为人太少。