By Gabriel Lenz (UC Berkeley)
Like many researchers, I worry constantly about whether findings are true or merely the result of a process variously called data mining, fishing, capitalizing on chance, or p-hacking. Since academics face extraordinary incentives to produce novel results, many suspect that “torturing the data until it speaks” is a common practice, a suspicion reinforced by worrisome replication results (1,2).
Data torturing likely slows down the accumulation of knowledge, filling journals with false positives. Pre-analysis plans can help solve this problem. They may also help with another perverse consequence that has received less attention: a preference among many researchers for very simple approaches to analysis.
This preference has developed, I think, as a defense against data mining. For example, one of the many ways researchers can torture their data is with control variables. They can try different sets of control variables, they can recode them in various ways, and they can interact them with each other until the analysis produces the desired result. Since we almost never know exactly which control variables really do influence the outcome, researchers can usually tell themselves a story about why they chose the set or sets they publish.

Since control variables could be “instruments of torture,” I’ve learned to secure my wallet whenever I see results presented with controls. Even though the goal of control variables is to rule out alternative explanations, I often find bivariate results more convincing. My sense is that many of my colleagues share these views, preferring approaches that avoid control variables, such as difference-in-differences estimators. In a sense, avoiding controls partially disarms the torturer.
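The mechanism can be illustrated with a small simulation (a sketch, not from the original piece; sample size, number of controls, and correlation strength are all hypothetical choices). Treatment and outcome are pure noise, so any significant estimate is a false positive. Running every subset of a handful of candidate controls and keeping the best-looking specification inflates the false-positive rate well above the nominal 5% that the single prespecified bivariate regression delivers:

```python
import itertools
import math

import numpy as np


def ols_pvalue(y, X):
    """Two-sided p-value for the coefficient in column 1 of X
    (normal approximation to the t distribution; fine for n ~ 100)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    se = math.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return math.erfc(abs(beta[1] / se) / math.sqrt(2))


rng = np.random.default_rng(0)
n, n_controls, n_sims = 100, 6, 500
prespecified = mined = 0
for _ in range(n_sims):
    x = rng.standard_normal(n)                      # "treatment"
    y = rng.standard_normal(n)                      # outcome, truly unrelated to x
    # candidate controls correlated with x, as real covariates often are
    controls = 0.5 * x[:, None] + rng.standard_normal((n, n_controls))
    pvals = []
    for r in range(n_controls + 1):
        for subset in itertools.combinations(range(n_controls), r):
            X = np.column_stack([np.ones(n), x, controls[:, list(subset)]])
            pvals.append(ols_pvalue(y, X))
    prespecified += pvals[0] < 0.05                 # bivariate spec, chosen in advance
    mined += min(pvals) < 0.05                      # best of all 64 control sets
print(f"false positives, prespecified bivariate spec: {prespecified / n_sims:.2f}")
print(f"false positives, best of {2 ** n_controls} specs: {mined / n_sims:.2f}")
```

The prespecified regression stays near the nominal 5% rate, while picking the best of 64 specifications roughly doubles or triples it, even though every specification is a "reasonable" one a researcher could defend with a story. A pre-analysis plan removes exactly this freedom.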