data-analysis methods that can be applied to ever-larger data sets.
However, such optimism must be tempered by an understanding of the major
difficulties that arise in attempting to achieve the envisioned goals. In part,
these difficulties are those familiar from implementations of large-scale
databases—finding and mitigating bottlenecks, achieving simplicity and
generality of the programming interface, propagating metadata, designing
a system that is robust to hardware failure, and exploiting parallel and
distributed hardware—all at an unprecedented scale. But the challenges
for massive data go beyond the storage, indexing, and querying that have
been the province of classical database systems (and classical search
engines) and, instead, hinge on the ambitious goal of inference. Inference
is the problem of turning data into knowledge, where knowledge often is
expressed in terms of entities that are not present in the data per se but
are present in models that one uses to interpret the data. Statistical rigor is
necessary to justify the inferential leap from data to knowledge, and many
difficulties arise in attempting to bring statistical principles to bear on
massive data. Overlooking this foundation may yield results that are, at
best, not useful and, at worst, harmful. In any discussion of massive data
and inference, it is essential to be aware that it is quite possible to
turn data into something that resembles knowledge but actually is not.
Moreover, it can be quite difficult to know that this has happened.
Indeed, many issues impinge on the quality of inference. A major one
is “sampling bias.” Data may have been collected according to a certain
criterion (for example, in a way that favors “larger” items over
“smaller” items), but the inferences and decisions made may refer to a
different sampling criterion. This issue seems likely to be particularly
severe in many massive data sets, which often consist of many
subcollections of data, each collected according to a particular choice
of sampling criterion and with little control over the overall
composition.
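
As a purely illustrative sketch of this effect, suppose that items enter
a data set with probability proportional to their size, while the target
of inference is the mean over all items. The sampling scheme and the
inverse-propensity reweighting below are assumptions made for the sake of
the example, not methods prescribed by this report.

    # Hypothetical sketch: items are recorded with probability
    # proportional to their size, so the naive sample mean is biased
    # upward relative to the population mean. Reweighting each record by
    # the inverse of its (assumed known) selection propensity corrects it.
    import random

    random.seed(0)
    population = [random.uniform(1.0, 10.0) for _ in range(100_000)]
    true_mean = sum(population) / len(population)

    # Size-biased collection: an item of size x is kept with probability
    # proportional to x.
    biggest = max(population)
    sample = [x for x in population if random.random() < x / biggest]

    naive_mean = sum(sample) / len(sample)

    # Inverse-propensity weights w = 1/x; the weighted mean is consistent
    # for the population mean under the assumed sampling scheme.
    weights = [1.0 / x for x in sample]
    weighted_mean = sum(w * x for w, x in zip(weights, sample)) / sum(weights)

    print(f"population mean : {true_mean:.3f}")
    print(f"naive mean      : {naive_mean:.3f}  (biased upward)")
    print(f"reweighted mean : {weighted_mean:.3f}")

Mismatches between the collection criterion and the inference criterion
that go unmodeled in this way cannot be corrected after the fact.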
Another major issue is “provenance.” Many systems involve layers of
inference, where “data” are not the original observations but are the
products of an inferential procedure of some kind. This often occurs,
for example, when there are missing entries in the original data. In a
large system involving interconnected inferences, it can be difficult to
avoid circularity, which can introduce additional biases and can amplify
noise.
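
To make the hazard concrete, the following hypothetical sketch (the
missingness rate and the mean-imputation step are illustrative choices,
not ones drawn from this report) shows how treating imputed values as raw
observations understates variability in everything computed downstream.

    # Hypothetical sketch: 40 percent of the entries are missing; a
    # mean-imputation step fills them in, and the "completed" column is
    # then handed downstream with no record of which values were
    # inferred. Its sample standard deviation understates the truth.
    import random
    import statistics

    random.seed(1)
    n = 10_000
    truth = [random.gauss(0.0, 1.0) for _ in range(n)]

    observed = [x for x in truth if random.random() < 0.6]
    fill_value = statistics.mean(observed)

    # Layer of inference: impute, then lose the provenance of the entries.
    completed = observed + [fill_value] * (n - len(observed))

    print(f"std. dev., observed entries : {statistics.stdev(observed):.3f}")
    print(f"std. dev., after imputation : {statistics.stdev(completed):.3f}")
    # The second figure is markedly smaller (roughly sqrt(0.6) times the
    # first), so downstream standard errors and confidence intervals
    # computed from the completed data will be overconfident.
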
Finally, there is the major issue of controlling error rates when many
hypotheses are being considered. Indeed, massive data sets generally
involve growth not merely in the number of individuals represented (the
“rows” of the database) but also in the number of descriptors of those
individuals (the “columns” of the database). Moreover, we are often
interested in the predictive ability associated with combinations of the
descriptors; this can lead to exponential growth in the number of
hypotheses considered, with severe consequences for error rates. That is,
a naive appeal to a “law of large numbers” for massive data is unlikely
to be justified.
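
A hypothetical simulation along the following lines (the family size, the
significance level, and the Bonferroni adjustment are assumptions made
for illustration) shows why per-hypothesis error control breaks down at
this scale.

    # Hypothetical sketch: 100,000 null hypotheses are tested on pure
    # noise, so every rejection is a false positive. A per-test level of
    # 0.05 yields roughly 5,000 spurious "discoveries"; a Bonferroni
    # correction, which tests each hypothesis at level alpha / m,
    # controls the chance of even one false positive in the family.
    import random
    from statistics import NormalDist

    random.seed(2)
    m = 100_000
    z = [random.gauss(0.0, 1.0) for _ in range(m)]  # null test statistics

    alpha = 0.05
    norm = NormalDist()
    per_test_cut = norm.inv_cdf(1 - alpha / 2)         # about 1.96
    bonferroni_cut = norm.inv_cdf(1 - alpha / (2 * m))

    naive_hits = sum(abs(s) > per_test_cut for s in z)
    bonferroni_hits = sum(abs(s) > bonferroni_cut for s in z)

    print(f"hypotheses tested           : {m}")
    print(f"false positives, per-test   : {naive_hits}")
    print(f"false positives, Bonferroni : {bonferroni_hits}")

The corrected threshold keeps the family-wise error rate near alpha, but
at the cost of power, which is one reason that error control for massive
data remains a genuine research challenge rather than a solved problem.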