irreproducible results. Thus, big data analytics offers tremendous opportunities
but also carries numerous potential pitfalls, said Daniels.
With such abundant, messy, and complex data, “statistical principles could hardly
be more important,” concluded Hogan.
Andrew Nobel cautioned that “big data isn’t necessarily the right data” for
answering a specific question. He alluded to the fundamental importance of
defining the question of interest and assessing the suitability of the available data to
support inferences about that question. Across the 2-day workshop, there was
notable variety in the inferential tasks described; for example, Sebastien Haneuse
described a comparative effectiveness study of two antidepressants to draw
inferences about differential effects on weight gain, whereas Daniela Witten described
the use of inferential tools to aid in scientific discovery. Some presenters remarked
that big data may tempt analysts to lean too heavily on exploratory analyses to define
research questions while underemphasizing the fundamental issues of data suitability and bias.
Understanding bias is particularly important with large, complex data sets such as
EHRs, explained Daniels, as analysts may not have control over sample selection,
among other sources of bias. Alfred Hero explained that when working with large
data sets that contain information on many diverse variables, quantifying bias and
understanding the conditions necessary for replicability can be particularly
challenging. Haneuse encouraged researchers using EHRs to compare the available
data with the data that an ideal randomized trial would have produced, as a strategy
for defining missing data and exploring selection bias (a simple diagnostic along
these lines is sketched below). More broadly, when analyses of big data
are used for scientific discovery, to help form scientific conclusions, or to inform
decision making, statistical reasoning and inferential formalism are required.
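To make Haneuse's benchmark concrete: under an ideal randomized trial, baseline
covariates would be balanced across treatment arms, so large imbalances in an
observational EHR cohort flag potential selection bias or confounding. The sketch
below is illustrative rather than anything presented at the workshop; it computes
standardized mean differences on a hypothetical EHR extract, and all column names
and data are assumptions.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(x_treated, x_control):
    """SMD is ~0 under ideal randomization; values above ~0.1 suggest imbalance."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

# Hypothetical EHR extract: a treatment indicator plus baseline covariates.
rng = np.random.default_rng(0)
n = 1000
ehr = pd.DataFrame({
    "treated": rng.integers(0, 2, size=n),
    "age": rng.normal(55, 12, size=n),
    "baseline_bmi": rng.normal(28, 5, size=n),
})
# Mimic non-random treatment assignment: treated patients skew older.
ehr.loc[ehr["treated"] == 1, "age"] += 4

for cov in ["age", "baseline_bmi"]:
    smd = standardized_mean_difference(
        ehr.loc[ehr["treated"] == 1, cov],
        ehr.loc[ehr["treated"] == 0, cov],
    )
    print(f"{cov}: SMD = {smd:.2f}")
```

Covariates with large standardized mean differences mark dimensions along which
the observed cohort departs from the randomized ideal and therefore warrant
explicit adjustment or sensitivity analysis.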
Inference Requires Evaluating Uncertainty
Many workshop presenters described significant advances made in
developing algorithms and methods for analyzing large, complex data sets. However, a
recurring topic of discussion was that most work to date stops short of formally
assessing the uncertainty associated with the predictions or comparisons made
with big data (as mentioned in the presentations by Michael Daniels, Alfred Hero,
Genevera Allen, Daniela Witten, Michael Kosorok, and Bin Yu). For example, data
mining algorithms that generate network structures representing a snapshot of
complex genetic processes are of limited value without some understanding of the
reliability of the nodes and edges identified, which in this case correspond to
specific genes and potential regulatory relationships, respectively. In an applied setting,
Allen and Witten suggested applying several estimation techniques to a single data
set, and likewise applying a single estimation technique to random subsamples of
the observations. In practice, results that hold up across estimation techniques and
across subsamples of the data are more likely to be scientifically useful (a brief
sketch of both checks follows).
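As a rough illustration of these stability checks, the sketch below runs two
different sparse estimators on one data set and a single estimator on repeated
random subsamples, keeping only the features selected in every case. The
estimators, regularization settings, threshold, and simulated data are all
illustrative assumptions, not methods endorsed at the workshop.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)  # two true signals

def selected(estimator, X, y, tol=1e-6):
    """Return the indices of features with non-negligible coefficients."""
    coef = estimator.fit(X, y).coef_
    return {j for j, c in enumerate(coef) if abs(c) > tol}

# Check 1: do different estimation techniques agree on one data set?
across_methods = selected(Lasso(alpha=0.1), X, y) & selected(
    ElasticNet(alpha=0.1, l1_ratio=0.5), X, y
)

# Check 2: does one technique agree with itself across random subsamples?
across_subsamples = set(range(p))
for _ in range(20):
    idx = rng.choice(n, size=n // 2, replace=False)
    across_subsamples &= selected(Lasso(alpha=0.1), X[idx], y[idx])

stable = across_methods & across_subsamples
print("features stable across methods and subsamples:", sorted(stable))
```

Retaining only the features that survive both checks is a deliberately conservative
screen; in practice one might instead report how frequently each feature is selected
across subsamples. While this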