A driving force behind KDD is the database field (the second D in KDD). Indeed, the problem of effective data manipulation when data cannot fit in the main memory is of fundamental importance to KDD. Database techniques for gaining efficient data access, grouping and ordering operations when accessing data, and optimizing queries constitute the basics for scaling algorithms to larger data sets. Most data-mining algorithms from statistics, pattern recognition, and machine learning assume data are in the main memory and pay no attention to how the algorithm breaks down if only limited views of the data are possible.
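As a minimal sketch of what an algorithm looks like when it does not assume the data fit in main memory, the following computes a mean and sample variance in a single pass using Welford's update, holding only constant state. The `read_records` generator is a hypothetical stand-in for a database cursor or file scan; it is not something the article describes.

```python
from typing import Iterable, Iterator, Tuple


def running_mean_variance(values: Iterable[float]) -> Tuple[int, float, float]:
    """One-pass (Welford) mean and sample variance using O(1) memory."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


def read_records(n: int) -> Iterator[float]:
    """Hypothetical stand-in for a cursor over a table too large for memory."""
    for i in range(n):
        yield float(i % 10)


# Records are consumed one at a time; the full data set is never materialized.
count, mean, variance = running_mean_variance(read_records(100_000))
print(count, mean, variance)
```

The same pattern (a sequential scan with a small summary kept in memory) is one way the database view of data access could shape an analysis algorithm; many learning methods need more elaborate restructuring than this.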
A related field evolving from databases is
data warehousing, which refers to the popular
business trend of collecting and cleaning
transactional data to make them available for
online analysis and decision support. Data
warehousing helps set the stage for KDD in
two important ways: (1) data cleaning and (2)
data access.
Data cleaning: As organizations are forced to think about a unified logical view of the wide variety of data and databases they possess, they have to address the issues of mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors when possible.
Data access: Uniform and well-defined methods must be created for accessing the data and providing access paths to data that were historically difficult to get to (for example, stored offline).
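The data-cleaning step described above can be illustrated with a small sketch that maps source-specific field names to one naming convention and represents missing values uniformly. The field names, the mapping table, and the list of missing-value markers are all hypothetical examples chosen for illustration.

```python
from typing import Dict, Optional

# Hypothetical mapping from source-specific field names to one convention.
CANONICAL_NAMES: Dict[str, str] = {
    "cust_id": "customer_id",
    "CustomerID": "customer_id",
    "amt": "amount",
    "Amount": "amount",
}

# Hypothetical markers that different sources use to mean "no value".
MISSING_MARKERS = {"", "NA", "N/A", "null", "-999"}


def clean_record(record: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Rename fields to the canonical convention and map missing markers to None."""
    cleaned: Dict[str, Optional[str]] = {}
    for name, value in record.items():
        canonical = CANONICAL_NAMES.get(name, name)
        text = value.strip()
        cleaned[canonical] = None if text in MISSING_MARKERS else text
    return cleaned


# Two records from differently structured sources end up with the same schema.
source_a = {"cust_id": "17", "amt": "N/A"}
source_b = {"CustomerID": "18", "Amount": " 42.50 "}
unified = [clean_record(source_a), clean_record(source_b)]
print(unified)
```

Handling noise and errors, the third issue mentioned, usually needs domain knowledge and is not attempted here.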
Once organizations and individuals have solved the problem of how to store and access their data, the natural next step is the question, What else do we do with all the data? This is where opportunities for KDD naturally arise.
A popular approach for analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles proposed by Codd (1993). OLAP tools focus on providing multidimensional data analysis, which is superior to SQL in computing summaries and breakdowns along many dimensions. OLAP tools are targeted toward simplifying and supporting interactive data analysis, but the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.
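To make "summaries and breakdowns along many dimensions" concrete, here is a small sketch that totals one measure for every combination of dimensions, which is the kind of structure usually called a data cube. The sales rows, dimension names, and the `build_cube` helper are hypothetical and are not taken from any OLAP product.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("region", "product", "quarter")

# Hypothetical fact rows; "amount" is the measure being summarized.
sales = [
    {"region": "east", "product": "widget", "quarter": "Q1", "amount": 100},
    {"region": "east", "product": "gadget", "quarter": "Q1", "amount": 50},
    {"region": "west", "product": "widget", "quarter": "Q2", "amount": 70},
    {"region": "west", "product": "widget", "quarter": "Q1", "amount": 30},
]


def build_cube(rows, dimensions, measure):
    """Sum the measure for every subset of the dimensions (2**d groupings)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


cube = build_cube(sales, DIMENSIONS, "amount")
print(cube[()])                     # grand total
print(cube[("region",)])            # breakdown by region
print(cube[("region", "quarter")])  # breakdown by region and quarter
```

Each entry of `cube` corresponds to one grouping that would otherwise be requested as a separate aggregate query; the sketch rescans the rows for every grouping, which a real system would avoid.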
Basic Definitions
KDD is the nontrivial process of identifying
valid, novel, potentially useful, and ultimate-
and still run efficiently, how results can be interpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported. The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any one particular discipline such as machine learning. In this context, there are clear opportunities for other fields of AI (besides machine learning) to contribute to KDD. KDD places a special emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge. Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes scaling and robustness properties of modeling algorithms for large noisy data sets.
Related AI research fields include machine discovery, which targets the discovery of empirical laws from observation and experimentation (Shrager and Langley 1990) (see Kloesgen and Zytkow [1996] for a glossary of terms common to KDD and machine discovery), and causal modeling for the inference of causal models from data (Spirtes, Glymour, and Scheines 1993). Statistics in particular has much in common with KDD (see Elder and Pregibon [1996] and Glymour et al. [1996] for a more detailed discussion of this synergy). Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and framework for quantifying the uncertainty that results when one tries to infer general patterns from a particular sample of an overall population. As mentioned earlier, the term data mining has had negative connotations in statistics since the 1960s when computer-based data analysis techniques were first introduced. The concern arose because if one searches long enough in any data set (even randomly generated data), one can find patterns that appear to be statistically significant but, in fact, are not. Clearly, this issue is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics. Much of this work is of direct relevance to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical aspects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate (to the degree possible) the entire process of data analysis and the statistician’s “art” of hypothesis selection.
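The concern about finding apparently significant patterns in randomly generated data can be demonstrated with a short simulation. The sketch below correlates 1,000 random features with a random target and counts how many pass a naive 5 percent significance cutoff, then repeats the count with a Bonferroni-corrected cutoff. The sample sizes, the seed, the Fisher z approximation for the cutoff, and the choice of Bonferroni correction are all assumptions made for illustration; the article does not prescribe a particular remedy.

```python
import math
import random
from statistics import NormalDist


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def critical_r(n, alpha):
    """Two-sided cutoff for |r| using the Fisher z approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.tanh(z / math.sqrt(n - 3))


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples, n_features, alpha = 100, 1000, 0.05

# Target and features are independent noise, so any "pattern" is spurious.
target = [rng.gauss(0, 1) for _ in range(n_samples)]
correlations = [
    pearson([rng.gauss(0, 1) for _ in range(n_samples)], target)
    for _ in range(n_features)
]

naive_cutoff = critical_r(n_samples, alpha)
corrected_cutoff = critical_r(n_samples, alpha / n_features)  # Bonferroni
naive_hits = sum(abs(r) > naive_cutoff for r in correlations)
corrected_hits = sum(abs(r) > corrected_cutoff for r in correlations)
print(naive_hits, corrected_hits)
```

With a 5 percent cutoff, roughly 5 percent of the pure-noise features are expected to look significant; adjusting the cutoff for the number of hypotheses searched removes most or all of them.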
Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that produce a particular enumeration of patterns (or models) over the data.
Articles
40 AI MAGAZINE