them; they usually do not plan on updating or adding fresh data to them. The poor
management of these data often leads to their misplacement, thereby generating dark
data—data that is suspected to exist or ought to exist but is difficult or impossible to find.
The problem of dark data is real and prevalent among the myriad small, locally collected
datasets. The utter lack of central management of data in the tail of the data size
distribution invariably causes these sets of data to be forgotten. Although most data is
not big, it is primarily the Big Data sets that exhibit exponential growth, driving the
number of bytes created by humans upward every day.
Big Data differs substantially from other data not only in its size and velocity, but also in
its scope and density. Big Data is large in scope, that is, it is created by everyone and by
itself and thus is informative about a wide audience. This characteristic makes it very
useful for studying populations, as the inferences we can make generalize to large groups
of people. Compare that with, say, opinions gleaned from a focus group or small survey.
These opinions, while highly accurate and easy to obtain, may or may not be reflective of
the views of the wider public. Thus, Big Data's scope is a real benefit, at least in terms of
generalizing evidence to wide populations.
However, Big Data's density is fairly low. By density, we mean the degree to which Big
Data, and especially social data, is directly applicable to questions we want to answer.
Again, a comparison to small data is useful. Prior to the explosion of Big Data and the
proliferation of tools used to harness it, companies or political campaigns largely used
focus groups or surveys to obtain information about public sentiments relevant to their
endeavors. The focus groups and surveys furnished organizations with data that was
directly applicable to their purpose, and often this data would already be measured with
meaningful units. For instance, respondents would describe how much they liked or
disliked a new product, or rate a political candidate's TV appearances from 1 to 5.
Compare that with social data, where opinion-laden text is buried among terabytes of
unrelated information and comes in a form that must be subjected to analysis just to
generate a measure of the opinion. Thus, the low density of big social data presents unique
challenges to organizations trying to utilize opinion data.
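
To make the idea of density concrete, the following minimal Python sketch treats density as the share of records that bear on the question at hand. The posts, the topic phrase, and the tiny word lists are all hypothetical and chosen only for illustration; a real project would use a far larger corpus and a proper sentiment method.

```python
# Hypothetical mini-corpus; real social data would hold millions of posts.
posts = [
    "Traffic was terrible this morning",
    "I love the new phone, the camera is great",
    "Anyone watching the game tonight?",
    "The new phone battery is awful",
    "Made pasta for dinner",
]

# Assumed topic filter and toy sentiment lexicon (illustrative only).
TOPIC = "new phone"
POSITIVE = {"love", "great"}
NEGATIVE = {"awful", "terrible"}


def score(text):
    """Crude opinion score: positive word count minus negative word count."""
    words = text.lower().replace(",", " ").replace("?", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


# Only posts that mention the topic are usable for our question.
relevant = [p for p in posts if TOPIC in p.lower()]
density = len(relevant) / len(posts)   # share of posts relevant to the question
scores = [score(p) for p in relevant]  # opinion measure produced by analysis
```

Unlike a survey response on a 1-to-5 scale, none of these posts arrives with a score attached: the relevant ones must first be found, and a measure of opinion must then be derived from their text.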
The size and scope of Big Data help us overcome some of the hurdles caused by its low
density. For instance, even though each unique piece of social data may have little
applicability to our particular task, these small bits of information quickly become useful
as we aggregate them across thousands or millions of people. Like the sticks in the
proverbial bundle, none of these small bits of information could support inferences alone,
but tied together they can be a powerful tool for understanding the opinions of the online
populace.
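
The effect of aggregation can be illustrated with a small simulation. The sketch below assumes, purely for illustration, that each person's post carries the same underlying opinion plus a large amount of independent random noise; the values of TRUE_OPINION and NOISE_SD are hypothetical. Real social data is messier, and errors that are shared across people do not average away.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_OPINION = 0.3   # hypothetical population-level sentiment
NOISE_SD = 2.0       # each individual signal is far noisier than the effect


def noisy_signal():
    """One person's post: the true opinion swamped by unrelated noise."""
    return random.gauss(TRUE_OPINION, NOISE_SD)


def aggregate(n):
    """Average n individual signals into a single estimate."""
    return statistics.fmean(noisy_signal() for _ in range(n))


# Estimates from larger samples should settle closer to TRUE_OPINION.
for n in (1, 100, 10_000, 100_000):
    print(f"n={n:>9,}  estimate={aggregate(n):+.3f}")
```

Under these assumptions, the typical error of the average shrinks in proportion to one over the square root of the number of signals, which is why a single post says almost nothing while a hundred thousand of them can pin the opinion down closely.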