Standard methods for hypothesis testing cannot deal with optional stopping: instead one must specify a sampling plan in advance (such as “test 500 subjects with new medication; test 500 with placebo”). If you stop early (e.g. because results look futile or extremely strong), the results become essentially uninterpretable. If, after 1000 subjects, the results look promising but inconclusive, you cannot simply go on and test some more. Combining the data and calculating p-values as if the total sample size had been fixed in advance gives wildly wrong results, usually overstating evidence in favor of the treatment (“against the null”). Problems get even more serious if the new data come from a ‘similar but slightly different population’, e.g. a different hospital.
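To see how badly things can go wrong, here is a small simulation sketch in Python. It assumes a simplified setting that is not taken from any real trial: a single group of subjects, outcomes that are standard normal under the null, a known standard deviation of 1, and a two-sided z-test at α = 0.05. The function name `false_positive_rate` and all parameter values are made up for this illustration. It compares a single look at n = 500 with peeking after every 10 subjects and stopping at the first "significant" result.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_trials, n_max, peek_every, alpha, rng):
    """Fraction of simulated null experiments that reject at some look.

    Assumed toy model: observations are N(0, 1), so the null (mean 0) is
    true in every experiment and every rejection is a false positive.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(n_trials):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)
            # At each look, compute the z-statistic as if n had been fixed.
            if n % peek_every == 0 and abs(total) / n ** 0.5 > z_crit:
                rejections += 1
                break  # optional stopping: stop at the first "significant" look
    return rejections / n_trials


rng = random.Random(0)
fixed_plan = false_positive_rate(2000, 500, 500, 0.05, rng)  # one look at n = 500
peeking = false_positive_rate(2000, 500, 10, 0.05, rng)      # a look every 10 subjects
print(f"single look: {fixed_plan:.3f}   peeking: {peeking:.3f}")
```

Under these assumptions the single-look rate should come out close to 0.05, while the peeking rate is typically several times larger, even though the null is true in every simulated experiment.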
There do exist methods, though, that can deal with such optional stopping and continuation. The researcher may stop at any time she likes, for whatever reason; and she may continue if more money or patients become available. The results can keep being monitored and remain statistically valid: Type I error guarantees are preserved (see technical remark below). This goes far beyond classical sequential testing (where a researcher has to stop if evidence reaches a certain threshold) and group-sequential testing (where a researcher has to specify a final sample size in advance and can only monitor a pre-specified number of times; similar issues apply to alpha-spending approaches). Adding new batches of data with a different effect size or variability is also accounted for.
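The guarantee can be illustrated with a minimal sketch, in a toy setting rather than with the actual safe tests in the R software. Assumptions: observations are standard normal under the null, and the test statistic S is the likelihood ratio of a hypothetical alternative with mean `delta = 0.3` against the null, updated after every observation. We reject as soon as S reaches 1/α. The names `ever_rejects` and `delta` and all parameter values are illustrative only.

```python
import math
import random


def ever_rejects(n_max, delta, alpha, rng):
    """Monitor a likelihood ratio S after every observation under the null.

    Assumed toy model: data are N(0, 1) (null true); S compares N(delta, 1)
    against N(0, 1). Returns True if S ever reaches 1/alpha within n_max steps.
    """
    log_s = 0.0
    log_threshold = math.log(1.0 / alpha)
    for _ in range(n_max):
        x = rng.gauss(0.0, 1.0)
        # Log likelihood ratio of one observation: delta * x - delta^2 / 2.
        log_s += delta * x - 0.5 * delta ** 2
        if log_s >= log_threshold:
            return True  # stopped and rejected: a false positive
    return False


rng = random.Random(1)
n_trials = 2000
alpha = 0.05
false_positive_rate = sum(
    ever_rejects(500, 0.3, alpha, rng) for _ in range(n_trials)
) / n_trials
print(f"fraction of null experiments that ever reject: {false_positive_rate:.3f}")
```

Even though the data are monitored after every single observation, the fraction of null experiments that ever reject should stay at or below α, up to simulation noise. This follows from a classical result known as Ville's inequality.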
Such always-valid methods for testing have been developed by our CWI group under the name safe tests. The original ideas go back to the famous mathematicians Robbins and Lai (1970s) and Vovk and Shafer (2000s), but until now they have never caught on in practice. They are very practical, though.
Currently, we have R software for the simplest type of tests: 1-sample and 2-sample t-tests and 2×2 table testing. The software can be downloaded here, and here is the vignette, providing a gentle explanation of the ideas.
However, these methods can be implemented for a wide variety of other estimation and testing setups (including nonparametric ones). While we have no time to do such implementations ourselves at short notice, do contact us if you want more information. In these critical times, we might be able to help, for example by reanalyzing existing data.
– Q: Can I read more? A: A simple explanation of the main ideas for t-tests and 2×2 tests is in the vignette of the R software, to be found here. The general framework, explaining how such tests can be developed in general, is in the (unfortunately for these times, very technical) paper Safe Testing at https://arxiv.org/abs/1906.07801.
– Q: What does preservation of error guarantees mean? A: It means that, if the null hypothesis is true, then for any chosen α (e.g. 0.05) the probability that 1/S (our analogue of the p-value) ever, at any point in time, dives below α is at most α.
– Q: Can’t I just use Bayes factors for testing, since it is claimed they can handle optional stopping? A: Bayes factors satisfy a weaker form of validity under optional stopping than safe tests do. When used with optional stopping, their results do make more sense than those of traditional testing. In some cases (e.g. 2×2 contingency tables) the results are not as stringently valid as with safe tests. In other cases (e.g. the Bayesian t-test) the results are just as stringently valid, and Bayesian and safe tests become very similar.
– Q: What about confidence intervals rather than tests? A: Standard confidence intervals have the same issues as standard tests: the intervals lose validity under optional stopping. There do exist always-valid confidence intervals though, which do allow for optional stopping.
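As an illustration of what an always-valid confidence interval can look like, here is a sketch of one classical construction, a normal-mixture confidence sequence in the spirit of Robbins. Assumptions: observations are normal with unknown mean and known standard deviation 1, and the mixing prior on the mean shift is normal with variance `tau2` (a tuning choice picked arbitrarily here). This is not necessarily the construction used in any software mentioned above, and the function names are hypothetical.

```python
import math
import random


def confidence_sequence_radius(n, alpha, tau2=1.0):
    """Half-width of the interval around the running mean after n observations.

    Derived from the normal-mixture martingale for N(mean, 1) data:
    a candidate mean m is excluded once
    exp(tau2 * S^2 / (2 * v)) / sqrt(v) >= 1 / alpha,
    where S = sum(x_i - m) and v = n * tau2 + 1.
    """
    v = n * tau2 + 1.0
    return math.sqrt(2.0 * v / tau2 * math.log(math.sqrt(v) / alpha)) / n


def ever_misses(true_mean, n_max, alpha, rng):
    """True if the interval fails to cover true_mean at any n up to n_max."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(true_mean, 1.0)
        if abs(total / n - true_mean) >= confidence_sequence_radius(n, alpha):
            return True
    return False


rng = random.Random(2)
n_trials = 1000
alpha = 0.05
miss_rate = sum(ever_misses(0.5, 300, alpha, rng) for _ in range(n_trials)) / n_trials
print(f"half-width at n = 10:   {confidence_sequence_radius(10, alpha):.3f}")
print(f"half-width at n = 1000: {confidence_sequence_radius(1000, alpha):.3f}")
print(f"fraction of runs that ever miss the true mean: {miss_rate:.3f}")
```

The interval after n observations is the running mean plus or minus `confidence_sequence_radius(n, alpha)`. In the simulation, the fraction of runs in which the true mean ever falls outside the interval should stay at or below α, up to simulation noise, no matter when one decides to stop and report.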