1.2. STATISTICAL LIMITS ON DATA MINING 5
1.2.1 Total Information Awareness
In 2002, the Bush administration put forward a plan to mine all the data it could
find, including credit-card receipts, hotel records, travel data, and many other
kinds of information in order to track terrorist activity. This idea naturally
caused great concern among privacy advocates, and the project, called TIA,
or Total Information Awareness, was eventually killed by Congress, although
it is unclear whether the project in fact exists under another name. It is not
the purpose of this book to discuss the difficult issue of the privacy-security
tradeoff. However, the prospect of TIA or a system like it does raise technical
questions about its feasibility and the realism of its assumptions.
The concern raised by many is that if you look at so much data, and you
try to find within it activities that look like terrorist behavior, are you not
going to find many innocent activities — or even illicit activities that are not
terrorism — that will result in visits from the police and maybe worse than
just a visit? The answer is that it all depends on how narrowly you define the
activities that you look for. Statisticians have seen this problem in many guises
and have a theory, which we introduce in the next section.
1.2.2 Bonferroni’s Principle
Suppose you have a certain amount of data, and you look for events of a certain
type within that data. You can expect events of this type to occur, even if
the data is completely random, and the number of occurrences of these events
will grow as the size of the data grows. These occurrences are “bogus,” in the
sense that they have no cause other than that random data will always have
some number of unusual features that look significant but aren’t. A theorem
of statistics, known as the Bonferroni correction, gives a statistically sound way
to avoid most of these bogus positive responses to a search through the data.
Without going into the statistical details, we offer an informal version,
Bonferroni’s principle, that helps us avoid treating random occurrences as if they
were real. Calculate the expected number of occurrences of the events you are
looking for, on the assumption that data is random. If this number is
significantly larger than the number of real instances you hope to find, then you must
expect almost anything you find to be bogus, i.e., a statistical artifact rather
than evidence of what you are looking for. This observation is the informal
statement of Bonferroni’s principle.
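As a rough illustration, the check described by the principle can be sketched in
Python. The function names, the parameters, and the factor of 10 that stands in
for "significantly larger" are assumptions made for this sketch; they are not
part of the principle itself, and the numbers are hypothetical.

```python
def expected_bogus(num_candidates: int, prob_by_chance: float) -> float:
    """Expected number of chance occurrences in purely random data.

    num_candidates: how many places in the data the event could occur
    prob_by_chance: probability that any one candidate shows the event
                    when the data is random
    """
    # By linearity of expectation, the expected count is n * p.
    return num_candidates * prob_by_chance


def mostly_bogus(num_candidates: int, prob_by_chance: float,
                 expected_real: float, factor: float = 10.0) -> bool:
    """Return True if chance occurrences are expected to swamp real ones.

    `factor` is an arbitrary stand-in for "significantly larger".
    """
    return expected_bogus(num_candidates, prob_by_chance) > factor * expected_real


# Hypothetical numbers, chosen only for illustration.
print(expected_bogus(1_000_000, 0.001))    # roughly 1000 chance occurrences
print(mostly_bogus(1_000_000, 0.001, 5))   # loosely defined event
print(mostly_bogus(1_000_000, 1e-8, 5))    # narrowly defined event
```

With the loosely defined event, about a thousand chance occurrences would be
expected against a handful of real ones, so nearly every hit would be bogus;
with the narrowly defined event, chance occurrences are expected to be rarer
than real ones.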
In a situation like searching for terrorists, where we expect that there are
few terrorists operating at any one time, Bonferroni’s principle says that we
can detect terrorists only by looking for events that are so rare that they are
unlikely to occur in random data. We shall give an extended example in the
next section.
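Before that example, a small simulation may help make the point concrete. The
sketch below assumes, purely for illustration, that each record in a random
data set shows the pattern of interest independently with a fixed probability;
the function name, the probabilities, and the data sizes are all hypothetical.

```python
import random


def count_chance_events(num_records: int, prob: float, rng: random.Random) -> int:
    """Count records that show the pattern purely by chance."""
    return sum(1 for _ in range(num_records) if rng.random() < prob)


rng = random.Random(0)  # fixed seed so that runs are repeatable

# A loosely defined pattern: chance hits grow with the amount of data.
for n in (1_000, 10_000, 100_000):
    print(n, count_chance_events(n, 0.01, rng))

# A narrowly defined pattern: chance hits are rare even in the largest data set.
print(100_000, count_chance_events(100_000, 1e-7, rng))
```

None of the simulated records has any cause behind it, so every hit counted
here is bogus. Only when the pattern is defined narrowly enough does the number
of chance hits fall low enough that a real instance could stand out.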