CEGA is hosting a series of opinion pieces by leading scholars, all addressing issues of transparency in empirical social sciences.
By Carson Christiano (CEGA)
You may wonder why a network of development researchers is taking the lead on a transparency initiative. The answer lies in the profound and omnipresent power of failure.
Most would agree that risk-taking is essential to innovation, whether we’re talking about creating a simple hand-washing station or a state-of-the-art suspension bridge. At the same time, we tend to highlight our successes while downplaying ambiguous research results, misguided technologies, and projects that fail to achieve their desired impact. We fear humiliation and the curtailment of donor interest. Yet open discussion about what doesn’t work, in addition to what works, is critical to our eventual success as innovators. We at the Center for Effective Global Action (CEGA), like so many others working towards social change, believe strongly that there should be “no silent failures” in development.
In Silicon Valley, failure is regulated by the market. Venture capitalists don’t invest in technologies that consumers won’t buy. In the social impact space, particularly in developing countries where consumer demand is difficult to quantify, donors and governments rely on loose, assumption-laden predictions of return on investment. Because millions of people in poor countries around the world stand to gain (or potentially lose) from large-scale social and economic development programs, CEGA and our research partners maintain a steadfast commitment to research transparency as a moral imperative.
That being said, it is infinitely easier to commit to the concept of research transparency than to actually engage in it. We all know that writing a pre-analysis plan takes precious time and resources; study registration holds us accountable for the results of our research, which may not turn out as we expect. How, then, can we change behavior around research transparency, and encourage researchers to accept (and admit) failure?
By Edward Miguel (Economics, UC Berkeley)
This CEGA Blog Forum builds on a seminal research meeting held at the University of California, Berkeley on December 7, 2012. The goal was to bring together a select interdisciplinary group of scholars – from biostatistics, economics, political science and psychology – with a shared interest in promoting transparency in empirical social science research.
There has been a flurry of activity regarding research transparency in recent years, within the academy and among research funders, driven by a recognition that too many influential research findings are fragile at best, if not entirely spurious or even fraudulent. But the increasingly heated debates on these critical issues have until now been “siloed” within individual academic disciplines, limiting their synergy and broader impacts. The December meeting (see presentations and discussions) drove home the point that there is a remarkable degree of commonality in the interests, goals and challenges facing scholars across the social science disciplines.
This inaugural CEGA Blog Forum aims to bring the fascinating conversations that took place at the Berkeley meeting to a wider audience, and to spark a public dialogue on these critical issues with the goal of clarifying the most productive ways forward. This is an especially timely debate, given: the American Economic Association’s formal decision in 2012 to establish an online registry for experimental studies; the new “design registry” established by the Experiments in Governance and Politics, or EGAP, group; serious discussion about a similar registry in the American Political Science Association’s Experimental Research section; and the emergence of the Open Science Framework, developed by psychologists, as a plausible platform for registering pre-analysis plans and documenting other aspects of the research process. Yet there remains limited consensus regarding how exactly study registration will work in practice, and about the norms that could or should emerge around it. For example, is it possible – or even desirable – for all empirical social science studies to be registered? When and how should study registration be considered by funders and journals?
By Donald P. Green (Political Science, Columbia)
Not long ago, I attended a talk at which the presenter described the results of a large, well-crafted experiment. His results indicated that the average treatment effect was close to zero, with a small standard error. Later in the talk, however, the speaker revealed that when he partitioned the data into subgroups (men and women), the findings became “more interesting.” Evidently, the treatment interacts significantly with gender. The treatment has positive effects on men and negative effects on women.
A bit skeptical, I raised my hand to ask whether this treatment-by-covariate interaction had been anticipated by a planning document prior to the launch of the experiment. The author said that it had. The reported interaction now seemed quite convincing. Impressed both by the results and the prescient planning document, I exclaimed “Really?” The author replied, “No, not really.” The audience chuckled, and the speaker moved on. The reported interaction again struck me as rather unconvincing.
Why did the credibility of this experimental finding hinge on pre-registration? Let’s take a step back and use Bayes’ Rule to analyze the process by which prior beliefs were updated in light of new evidence. In order to keep the algebra to a bare minimum, consider a stylized example that makes use of Bayes’ Rule in its simplest form.
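To make the updating concrete, here is a minimal sketch of Bayes’ Rule applied to the talk described above. All of the numbers are illustrative assumptions, not figures from the talk: a 10 percent prior that the gender interaction is real, an 80 percent chance of detecting it if it is real, and two different chances of seeing a "significant" interaction when it is not real (5 percent if the test was planned in advance, 50 percent if the speaker was free to search across subgroups).

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' Rule: P(effect is real | a significant result was reported)."""
    numerator = prior * p_evidence_if_true
    return numerator / (numerator + (1 - prior) * p_evidence_if_false)

# Illustrative assumptions only; none of these values come from the source.
PRIOR = 0.10            # prior belief that the interaction is real
POWER = 0.80            # chance of a significant result if it is real
FALSE_POS_PLANNED = 0.05   # one pre-registered test at the 5% level
FALSE_POS_SEARCHED = 0.50  # many unreported subgroup splits were possible

planned = posterior(PRIOR, POWER, FALSE_POS_PLANNED)
searched = posterior(PRIOR, POWER, FALSE_POS_SEARCHED)
print(f"Posterior with a planning document:    {planned:.2f}")
print(f"Posterior without a planning document: {searched:.2f}")
```

Under these assumed values the same reported interaction moves the posterior to 0.64 when it was planned, but only to about 0.15 when it was not. The data are identical; what changes is how likely the evidence would have appeared even if there were no effect.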
By Macartan Humphreys (Political Science, Columbia & EGAP)
I am sold on the idea of research registration. Two things convinced me.
First, I have been teaching courses in which each week we try to replicate prominent results produced by political scientists and economists working on the political economy of development. I advise against doing this because it is very depressing. In many cases data is not available or results cannot be replicated even when it is. But even when results can be replicated, they often turn out to be extremely fragile. Look at them sideways and they fall over. The canon is a lot more delicate than it lets on.
Second, I have tried out registration for myself. That was also depressing, this time because of what I learned about how I usually work. Before doing the real analysis on data from a big field experiment on development aid in Congo, we (Raul Sanchez de la Sierra, Peter van der Windt and I) wrote up a “mock report” using fake data on our outcome variables. Doing this forced us to make myriad decisions about how to do our analysis without the benefit of seeing how the analyses would play out. We did this partly for political reasons: a lot of people had a lot invested in this study, and if they had different ideas about what constituted evidence, we wanted to know that up front and not after the results came in. But what really surprised us was how hard it was to do. I found that not having access to the results made it all the more obvious how much I am used to drawing on them when crafting analyses and writing, even for simple decisions such as which exact measure to use for a given concept, which analyses to deepen, and which results to emphasize. More broadly, that’s how our discipline works: the most important peer feedback we receive, from reviewers or in talks, generally comes after our main analyses are complete and after our peers are exposed to the patterns in the data. For some purposes that’s fine, but it is not hard to see how it could produce just the kind of fragility I was seeing in published work.
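The mock-report idea can be sketched in a few lines. This is not the procedure used in the Congo study, which is not described in detail here; it is a hypothetical illustration in which the function names, sample size, and estimator (a simple difference in means) are all assumptions. The point is that the analysis function is written and frozen while only placeholder outcomes exist.

```python
import random
from statistics import mean

def make_fake_data(n_units, seed=0):
    """Return (treated, outcome) pairs with placeholder outcomes.

    Treatment is a coin flip and the outcome is pure noise, so any
    "effect" the analysis finds here is spurious by construction.
    """
    rng = random.Random(seed)
    return [(rng.randint(0, 1), rng.gauss(0.0, 1.0)) for _ in range(n_units)]

def difference_in_means(data):
    """Pre-specified estimator: mean outcome of treated minus control."""
    treated = [y for t, y in data if t == 1]
    control = [y for t, y in data if t == 0]
    return mean(treated) - mean(control)

# Write the full report against fake data first (hypothetical sample size).
fake = make_fake_data(200)
mock_estimate = difference_in_means(fake)
print(f"Mock report estimate (fake data): {mock_estimate:.3f}")
```

Once the real outcomes arrive, the same `difference_in_means` function would be run on them unchanged, so choices about measures and specifications cannot be tuned to the results.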
These experiences convinced me that our current system is flawed. Registration offers one possible solution.
By Maya Petersen, Alan Hubbard, and Mark van der Laan (Public Health, UC Berkeley)
Statistics provide a powerful tool for learning about the world, in part because they allow us to quantify uncertainty and control how often we falsely reject null hypotheses. Pre-specified study designs, including analysis plans, ensure that we understand the full process, or “experiment”, that resulted in a study’s findings. Such understanding is essential for valid statistical inference.
The theoretical arguments in favor of pre-specified plans are clear. However, the practical challenges to implementing such plans can be formidable. It is often difficult, if not impossible, to generate a priori the full universe of interesting questions that a given study could be used to investigate. New research, external events, or data generated by the study itself may all suggest new hypotheses. Further, huge amounts of data are increasingly being generated outside the context of formal studies. Such data provide both a tremendous opportunity and a challenge to statistical inference.
Even when a hypothesis is pre-specified, pre-specifying an analysis plan to test the hypothesis is often challenging. For example, investigation of the effect of compliance with a randomly assigned intervention forces us to specify how we will contend with confounding. What identification strategy should we use? Which covariates should we adjust for? How should we adjust for them? The number of analytic decisions, and the impact of these decisions on conclusions, is further multiplied when losses to follow-up, biased sampling, and missing data are considered.
By David Laitin (Political Science, Stanford)
My claim in this blog entry is that political science will remain principally an observation-based discipline and that our core principles of establishing findings as significant should consequently be based upon best practices in observational research. This is not to deny that there is an expanding branch of experimental studies which may demand a different set of principles; but those principles add little to confidence in observational work. As I have argued elsewhere (“Fisheries Management” in Political Analysis 2012), our model for best practices is closer to the standards of epidemiology than to those of drug trials. Here, through a review of the research program of Michael Marmot (The Status Syndrome, New York: Owl Books, 2004), I evoke the methodological affinity of political science and epidemiology, and suggest the implications of this affinity for evolving principles of transparency in the social sciences.
Two factors drive political science into the observational mode. First, as with the Centers for Disease Control, which gets an emergency call describing an outbreak of some hideous virus in a remote corner of the world, political scientists see it as core to their domain to account for anomalous outbreaks (e.g. that of democracy in the early 1990s) wherever they occur. Not unlike epidemiologists seeking to model the hazard of SARS or AIDS, political scientists cannot randomly assign secular authoritarian governments to some countries and orthodox authoritarian governments to others to get an estimate of the hazard rate into democracy. Rather, they merge datasets looking for patterns, theorize about them, and then put the implications of the theory to the test with other observational data. Accounting for outcomes in the real world drives political scientists into the observational mode.
By Gabriel Lenz (UC Berkeley)
Like many researchers, I worry constantly about whether findings are true or merely the result of a process variously called data mining, fishing, capitalizing on chance, or p-hacking. Since academics face extraordinary incentives to produce novel results, many suspect that “torturing the data until it speaks” is a common practice, a suspicion reinforced by worrisome replication results (1,2).
Data torturing likely slows down the accumulation of knowledge, filling journals with false positives. Pre-analysis plans can help solve this problem. They may also help with another perverse consequence that has received less attention: a preference among many researchers for very simple approaches to analysis.
This preference has developed, I think, as a defense against data mining. For example, one of the many ways researchers can torture their data is with control variables. They can try different sets of control variables, they can recode them in various ways, and they can interact them with each other until the analysis produces the desired result. Since we almost never know exactly which control variables really do influence the outcome, researchers can usually tell themselves a story about why they chose the set or sets they publish. Since control variables could be “instruments of torture,” I’ve learned to secure my wallet whenever I see results presented with controls. Even though the goal of control variables is to rule out alternative explanations, I often find bivariate results more convincing. My sense is that many of my colleagues share these views, preferring approaches that avoid control variables, such as difference-in-differences estimators. In a sense, avoiding controls partially disarms the torturer.
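To get a feel for how much room control variables give the torturer, the sketch below simply counts the specifications available from a handful of candidate controls. The variable names are hypothetical, and the count ignores recodings and interactions, which would multiply it further.

```python
from itertools import combinations

def control_sets(candidates):
    """Yield every subset of the candidate controls, including the empty set."""
    for size in range(len(candidates) + 1):
        yield from combinations(candidates, size)

# Hypothetical candidate controls, for illustration only.
candidates = ["age", "income", "education", "urban", "region"]
specs = list(control_sets(candidates))
print(f"{len(candidates)} candidate controls -> {len(specs)} specifications")
```

Five candidates already allow 2**5 = 32 distinct regressions, and ten allow 1,024. The empty set in that list is the bivariate result, which is the one specification that involves no choice of controls at all.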
By Richard Sedlmayr (Philanthropic Advisor)
When we picture a desperate student running endless tests on his dataset until some feeble point finally meets statistical reporting conventions, we are quick to dismiss the results. But the underlying issue is ubiquitous: it is hard to analyze data without getting caught in a hypothesis drift, and if you do not seriously consider the repercussions on statistical inference, you too are susceptible to picking up spurious correlations. This is also true for randomized trials that otherwise go to great lengths to ensure clean causal attribution. But experimental (and other prospective) research has a trick up its sleeve: the pre-analysis plan (PAP) can credibly overcome the problem by spelling out subgroups, statistical specifications, and virtually every other detail of the analysis before the data is in. This way, it can clearly establish that tests are not a function of outcomes – in other words, that results are what they are.
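A back-of-the-envelope calculation shows why running test after test is so corrosive. The sketch below assumes the tests are independent and that every null hypothesis is true, which is a simplification (real subgroup tests on one dataset are usually correlated), and asks how often at least one test clears the conventional 5 percent threshold by chance.

```python
def familywise_error_rate(alpha, n_tests):
    """P(at least one false positive) across n independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

# Simplifying assumption: independent tests, each at the 5% level.
for k in (1, 5, 10, 20):
    rate = familywise_error_rate(0.05, k)
    print(f"{k:>2} tests: {rate:.3f}")
```

With ten such tests the chance of at least one spurious "finding" is already about 40 percent. A PAP does not change this arithmetic, but it fixes the number of tests in advance so that readers can do it.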
So should PAPs become the new reality for experimental research? Not so fast, say some, because there are costs involved. Obviously, it takes a lot of time and effort to define the meaningful analysis of a dataset that isn’t even in yet. But more importantly, there is a risk that following a PAP backfires and actually reduces the value we get out of research: perhaps one reason why hypothesis drift is so widespread is that it is a cost-effective way of learning, and by tying our hands, we might stifle the valuable processes that can only take place once data is in. Clearly, powerful insights that came out of experimental work – both in social and biomedical research – have been serendipitous. So are we stuck in limbo, “without a theory of learning” that might provide some guidance on PAPs?
By Kevin M. Esterling (Political Science, UC Riverside)
Whenever I discuss the idea of hypothesis preregistration with colleagues in political science and in psychology, the reactions I get typically range from resistance to outright hostility. These colleagues obviously understand the limitations of research founded on false positives and data overfitting. They are even more concerned, however, that instituting a preregistry would create norms that would privilege prospective, deductive research over exploratory inductive and descriptive research. For example, such norms might lead researchers to neglect problems or complications in their data so as to retain the ability to state that their study “conformed” to their original registered design.
If a study registry were to become widely used in the discipline, however, it would be much better if it were embraced and seen as constructive and legitimate. One way I think we can do this is by shifting the focus away from monitoring our colleagues’ compliance with registration norms, which implicitly privileges prospective research, and instead towards creating institutions that promote transparency in all styles of research, with preregistration being just one element of the new institutions for transparency.
Transparency solves the same problems that preregistration is intended to address, in that transparency helps other researchers to understand the provenance of results and enables researchers to value contributions for what they are. If scholars genuinely share the belief that data-driven research has scientific merit, then there really should be no stigma for indicating that this is how one reached one’s conclusions. Indeed, creating transparency should enable principled inductive research, since it creates legitimacy for this research and removes the awkward need to state inductive research as if it had been deductive.
By Temina Madon (CEGA, UC Berkeley)
As we all know, experimentation in the natural sciences far predates the use of randomized, controlled trials (RCTs) in medicine and the social sciences; some of the earliest controlled experiments were conducted in the 1920s by RA Fisher, an agricultural scientist evaluating new crop varieties across randomly assigned plots of land. Today, controlled experiments are the mainstay of research in plant, microbial, cellular and molecular biology. So what can we learn from efforts to improve research transparency in the natural sciences? And given that software engineers are now embracing experimental methods, are there lessons to be learned from computer science?
In the life sciences, advances in molecular biology and the genomics revolution, coupled with improvements in robotics and process automation, have enabled massively high-throughput data collection. Biologists can now quantitatively measure the expression of tens of thousands of genes (i.e. dependent variables) across hundreds or thousands of samples, under multiple experimental conditions. As a result, the volume of data has expanded exponentially in just a few years. And while data sharing has also improved, there is concern about the reproducibility and validity of results (particularly when hypotheses are not well defined in advance, which can quickly turn experiments into fishing expeditions). How are scientists addressing this issue?
The scientific community has identified several social norms that erode the integrity and transparency of quantitative ‘omic research. One issue is that the ‘methods’ sections of articles describing empirical studies have steadily contracted over time, providing ever fewer details of study design. Today’s publications provide little guidance for scientists wanting to re-run another researcher’s study. To combat this problem, a group of biologists has begun promoting minimum reporting guidelines for several classes of experiments. An analog for the social sciences might be a voluntary checklist for the reporting of intervention protocols, survey instruments, and methods (or scripts) for data cleaning, filtering, and analysis. This would include the disclosure of all tests of a given hypothesis, not just the ones that yield “interesting” results.
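One reason full disclosure matters is that the evidential weight of a p-value depends on how many tests it was drawn from. The sketch below uses the Holm-Bonferroni step-down procedure, which is one standard multiple-testing correction and not something the reporting guidelines above prescribe; the p-values are invented for illustration.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down: return a reject/keep flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the (rank+1)-th smallest p-value with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Hypothetical p-values from four tests of the same hypothesis.
all_tests = [0.01, 0.04, 0.03, 0.005]
print(holm_reject(all_tests))   # all four tests disclosed
print(holm_reject([0.03]))      # only the third test reported
```

With these invented values, the test with p = 0.03 counts as significant when it is reported alone, but not when it is evaluated alongside the other three tests that were actually run.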