I recently wrote a post, Short Selling Reduces Crashes, about a paper which used an unusual random experiment by the SEC, Regulation SHO (which temporarily lifted short-sale constraints for randomly designated stocks), as a natural experiment. A correspondent writes to ask whether I was aware that Regulation SHO has been used by more than fifty other studies to test a variety of hypotheses. I was not! The problem is obvious. If the same experiment is used multiple times, we should be imposing multiple-hypothesis standards to avoid the green jelly bean problem, otherwise known as the false positive problem. Heath, Ringgenberg, Samadi and Werner make this point and test for false positives in the extant literature:
Natural experiments have become an important tool for identifying the causal relationships between variables. While the use of natural experiments has increased the credibility of empirical economics in many dimensions (Angrist & Pischke, 2010), we show that the repeated reuse of a natural experiment significantly increases the number of false discoveries. As a result, the reuse of natural experiments, without correcting for multiple testing, is undermining the credibility of empirical research.
... To demonstrate the practical importance of the issues we raise, we examine two extensively studied real-world examples: business combination laws and Regulation SHO. Combined, these two natural experiments have been used in well over 100 different academic studies. We re-evaluate 46 outcome variables that were found to be significantly affected by these experiments, using common data frequency and observation window. Our analysis suggests that many of the existing findings in these studies may be false positives.
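To get a feel for how quickly false positives pile up when one experiment is reused, here is a back-of-the-envelope sketch. It is my illustration, not the procedure used in the paper. It assumes every null hypothesis is true and that the tests are independent, neither of which holds exactly when many papers share one experiment, and the function names are made up for this example. The Bonferroni threshold is shown only as the simplest textbook correction.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across `num_tests` tests.

    Assumes (for illustration only) that every null is true and the
    tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    # 20 tests is the jelly bean case; 50 and 100 are closer to the
    # number of studies that have reused these natural experiments.
    for m in (1, 20, 50, 100):
        fwer = familywise_error_rate(m)
        cutoff = bonferroni_threshold(m)
        print(f"{m:>3} tests: P(at least one false positive) = {fwer:.3f}, "
              f"Bonferroni per-test cutoff = {cutoff:.5f}")
```

Under these assumptions, twenty tests at the usual 5% level already give roughly a 64% chance of at least one spurious "discovery", which is the green jelly bean problem in a single number.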
There is a second, more subtle problem. If more than one of the effects is real, that calls into question the exclusion restriction. To identify the effect of X on Y1 we need to assume that X influences Y1 along only one path. But if X also influences Y2, that suggests that there might be multiple paths from X to Y1. Morck and Yeung made this point many years ago, likening the reuse of the same instrumental variables to a tragedy of the commons.
Solving these problems is made especially difficult because they are collective action problems with a time dimension. A referee who sees a paper throw the dice multiple times may demand multiple hypothesis and exclusion test corrections. But if the problem is that there are many papers each running a single test, the burden on the referee to know the literature is much larger. Moreover, do we give the first and second papers a pass and only demand multiple hypothesis corrections for the 100th paper? That seems odd, although in practice it is what happens, as more original papers can get published with weaker methods (collider bias!).
As I wrote in Why Most Published Research Findings are False, we need to address these problems with a variety of approaches:
1) In evaluating any study, try to take into account the amount of background noise. That is, remember that the more hypotheses which are tested and the less selection [this is one reason why theory is important: it strengthens selection, AT] which goes into choosing hypotheses, the more likely it is that you are looking at noise.
2) Bigger samples are better. (But note that even big samples won’t help to solve the problems of observational studies which is a whole other problem).
3) Small effects are to be distrusted.
4) Multiple sources and types of evidence are desirable.
5) Evaluate literatures not individual papers.
6) Trust empirical papers which test other people’s theories more than empirical papers which test the author’s theory.
7) As an editor or referee, don’t reject papers that fail to reject the null.