Targeted Learning from Data: Valid Statistical Inference Using Data Adaptive Methods 1

By Maya Petersen, Alan Hubbard, and Mark van der Laan (Public Health, UC Berkeley)

Statistics provide a powerful tool for learning about the world, in part because they allow us to quantify uncertainty and control how often we falsely reject null hypotheses. Pre-specified study designs, including analysis plans, ensure that we understand the full process, or “experiment”, that resulted in a study’s findings. Such understanding is essential for valid statistical inference.

The theoretical arguments in favor of pre-specified plans are clear. However, the practical challenges to implementing such plans can be formidable. It is often difficult, if not impossible, to generate a priori the full universe of interesting questions that a given study could be used to investigate. New research, external events, or data generated by the study itself may all suggest new hypotheses. Further, huge amounts of data are increasingly being generated outside the context of formal studies. Such data provide both a tremendous opportunity and a challenge to statistical inference.

Even when a hypothesis is pre-specified, pre-specifying an analysis plan to test the hypothesis is often challenging. For example, investigation of the effect of compliance to a randomly assigned intervention forces us to specify how we will contend with confounding.  What identification strategy should we use? Which covariates should we adjust for? How should we adjust for them? The number of analytic decisions and the impact of these decisions on conclusions is further multiplied when losses to follow up, biased sampling, and missing data are considered.