By Gabriel Lenz (UC Berkeley)
Like many researchers, I worry constantly about whether findings are true or merely the result of a process variously called data mining, fishing, capitalizing on chance, or p-hacking. Since academics face extraordinary incentives to produce novel results, many suspect that “torturing the data until it speaks” is a common practice, a suspicion reinforced by worrisome replication results (1,2).
Data torturing likely slows down the accumulation of knowledge, filling journals with false positives. Pre-analysis plans can help solve this problem. They may also help with another perverse consequence that has received less attention: a preference among many researchers for very simple approaches to analysis.
This preference has developed, I think, as a defense against data mining. For example, one of the many ways researchers can torture their data is with control variables. They can try different sets of control variables, they can recode them in various ways, and they can interact them with each other until the analysis produces the desired result. Since we almost never know exactly which control variables really do influence the outcome, researchers can usually tell themselves a story about why they chose the set or sets they publish. Since control variables could be “instruments of torture,” I’ve learned to secure my wallet whenever I see results presented with controls. Even though the goal of control variables is to rule out alternative explanations, I often find bivariate results more convincing. My sense is that many of my colleagues share these views, preferring approaches that avoid control variables, such as difference-in-differences estimators. In a sense, avoiding controls partially disarms the torturer.
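To make the worry concrete, here is a minimal simulation sketch (my own illustration, not from the post) of specification search over control variables: a "treatment" with no real effect, five irrelevant candidate controls, and an analyst who keeps whichever control set yields the smallest p-value. All names and parameter choices are hypothetical.

```python
import itertools
from math import erf, sqrt
import numpy as np

rng = np.random.default_rng(0)

def treatment_pvalue(y, x, controls):
    """OLS of y on x plus controls; two-sided p-value on x
    (normal approximation, fine for n = 500)."""
    n = len(y)
    X = np.column_stack([np.ones(n), x] + controls)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t = beta[1] / np.sqrt(cov[1, 1])
    return 2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2))))

n, n_controls, sims = 500, 5, 200
false_pos_honest = false_pos_mined = 0
for _ in range(sims):
    x = rng.normal(size=n)             # "treatment", truly unrelated to y
    z = rng.normal(size=(n, n_controls))
    y = rng.normal(size=n)             # pure noise: the null is true
    candidates = [z[:, j] for j in range(n_controls)]
    # honest analyst: one prespecified model (here, no controls)
    if treatment_pvalue(y, x, []) < 0.05:
        false_pos_honest += 1
    # data miner: tries every subset of controls, keeps the best p-value
    best = min(treatment_pvalue(y, x, [candidates[j] for j in subset])
               for r in range(n_controls + 1)
               for subset in itertools.combinations(range(n_controls), r))
    if best < 0.05:
        false_pos_mined += 1

print(f"honest false-positive rate: {false_pos_honest / sims:.2f}")
print(f"mined  false-positive rate: {false_pos_mined / sims:.2f}")
```

Because the empty control set is among the subsets searched, the miner's false-positive rate can never fall below the honest analyst's; searching over recodings and interactions, as the paragraph above notes, would widen the gap further.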
The same is true for many other aspects of data analysis: the choice of functional form, the measurement of variables, the type of standard errors, and so on. In these domains and others, the simplest possible specification disarms the researcher, reducing concerns about data mining. This explains, in part, the preference among labor economists for simple experiments, difference-in-differences, ordinary least squares, and regression discontinuity.
This preference for simple methods, however, has pernicious consequences, because more complex approaches are sometimes appropriate. Control variables often do reduce concerns about confounders. Structural estimation may help researchers generalize findings. Least squares is not always the appropriate estimator. Many of us avoid the estimation techniques we learned in graduate school, not because they are inappropriate, but because we know they will raise suspicion among reviewers even when they are appropriate. Using the simplest suitable method is always the best strategy, but the suitable method is not always simple, and the payoff from complex analysis can be large. Apparently, the three-fold gain in the accuracy of hurricane forecasts since the 1980s has come in part through structural estimation that incorporates long-understood science (1).
Pre-analysis plans offer a solution to this conundrum. By specifying the model in advance, researchers can undertake more complicated analyses without raising the same suspicion. The assumptions behind prespecified models may or may not be valid, but invalid ones will no longer bias estimates toward the desired finding. When control variables are necessary to reveal a finding, for instance, researchers can prespecify their coding and inclusion. Not only do pre-analysis plans have the potential to restore confidence in the scientific process, but they may also free researchers to use the powerful statistical techniques we spent so much time learning in graduate school.
This post is one of a ten-part series in which we ask researchers and experts to discuss transparency in empirical social science research across disciplines. The next post in the series is “The Need for Pre-Analysis: First Things First” by Richard Sedlmayr. You can find the complete list of posts here.