Data Preparation for Data Mining
Dorian Pyle
Senior Editor: Diane D. Cerra
Director of Production & Manufacturing: Yonie Overton
Production Editor: Edward Wade
Editorial Assistant: Belinda Breyer
Cover Design: Wall-To-Wall Studios
Cover Photograph: © 1999 PhotoDisc, Inc.
Text Design & Composition: Rebecca Evans & Associates
Technical Illustration: Dartmouth Publishing, Inc.
Copyeditor: Gary Morris
Proofreader: Ken DellaPenta
Indexer: Steve Rath
Printer: Courier Corp.
Designations used by companies to distinguish their products are often claimed
as trademarks or registered trademarks. In all instances where Morgan Kaufmann
Publishers, Inc. is aware of a claim, the product names appear in initial capital or all
capital letters. Readers, however, should contact the appropriate companies for more
complete information regarding trademarks and registration.
Morgan Kaufmann Publishers, Inc.
Editorial and Sales Office
340 Pine Street, Sixth Floor
San Francisco, CA 94104-3205
USA
Telephone 415-392-2665
Facsimile 415-982-2665
Email mkp@mkp.com
WWW http://www.mkp.com
Order toll free 800-745-7323
© 1999 by Morgan Kaufmann Publishers, Inc.
All rights reserved

No part of this publication may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means—electronic, mechanical, photocopying, or
otherwise—without the prior written permission of the publisher.
Dedication
To my dearly beloved Pat, without whose love, encouragement, and support, this book, and
very much more, would never have come to be

Table of Contents
Data Preparation for Data Mining
Preface
Introduction
Chapter 1 - Data Exploration as a Process
Chapter 2 - The Nature of the World and Its Impact on Data Preparation
Chapter 3 - Data Preparation as a Process
Chapter 4 - Getting the Data: Basic Preparation
Chapter 5 - Sampling, Variability, and Confidence
Chapter 6 - Handling Nonnumerical Variables
Chapter 7 - Normalizing and Redistributing Variables
Chapter 8 - Replacing Missing and Empty Values
Chapter 9 - Series Variables
Chapter 10 - Preparing the Data Set
Chapter 11 - The Data Survey
Chapter 12 - Using Prepared Data
Appendix A - Using the Demonstration Code on the CD-ROM
Appendix B - Further Reading

Preface
What This Book Is About
This book is about what to do with data to get the most out of it. There is a lot more to that
statement than first meets the eye.
Much information is available today about data warehouses, data mining, KDD, OLTP,
OLAP, and a whole alphabet soup of other acronyms that describe techniques and
methods of storing, accessing, visualizing, and using data. There are books and
magazines about building models for making predictions of all types—fraud, marketing,
new customers, consumer demand, economic statistics, stock movement, option prices,
weather, sociological behavior, traffic demand, resource needs, and many more.
In order to use the techniques, or make the predictions, industry professionals almost
universally agree that one of the most important parts of any such project, and one of the
most time-consuming and difficult, is data preparation. Unfortunately, data preparation
has been much like the weather: as the old aphorism has it, “Everyone talks about it, but
no one does anything about it.” This book takes a detailed look at the problems in
preparing data, the solutions, and how to use the solutions to get the most out of the
data—whatever you want to use it for. This book tells you what can be done about it,
exactly how it can be done, and what it achieves, and puts a powerful kit of tools directly in
your hands that allows you to do it.
How important is adequate data preparation? After finding the right problem to solve, data
preparation is often the key to solving the problem. It can easily be the difference between
success and failure, between useable insights and incomprehensible murk, between
worthwhile predictions and useless guesses.
For instance, in one case data carefully prepared for warehousing proved useless for
modeling. The preparation for warehousing had destroyed the useable information content
for the needed mining project. Preparing the data for mining, rather than warehousing,
produced a 550% improvement in model accuracy. In another case, a commercial baker
achieved a bottom-line improvement approaching $1 million by using data prepared with the
techniques described in this book instead of previous approaches.
Who This Book Is For
This book is written primarily for the computer-savvy analyst or modeler who works with
data on a daily basis and who wants to use data mining to get the most out of data. The
type of data the analyst works with is not important. It may be financial, marketing,
business, stock trading, telecommunications, healthcare, medical, epidemiological,
genomic, chemical, process, meteorological, marine, aviation, physical, credit, insurance,
retail, or any type of data requiring analysis. What is important is that the analyst needs to
get the most information out of the data.
At a second level, this book is also intended for anyone who needs to understand the issues
in data preparation, even if they are not directly involved in preparing or working with data.
Reading this book will give anyone who uses analyses drawn from an analyst’s work a
much better understanding of the results and limitations that the analyst works with, and a far
deeper insight into what the analyses mean, where they can be used, and what can be
reasonably expected from any analysis.
Why I Wrote It
There are many good books available today that discuss how to collect data, particularly
in government and business. Simply look for titles about databases and data
warehousing. There are many equally good books about data mining that discuss tools
and algorithms. But few books, if any, address what to do with the “dirty data” after it is
collected and before exploring it with a data mining tool. Yet this part of the process is
critical.
I wrote this book to address that gap in the process between identifying data and building
models. It will take you from the point where data has been identified in some form or
other, if not assembled. It will walk you through the process of identifying an appropriate
problem, relating the data back to the world from which it was collected, assembling the
data into mineable form, discovering problems with the data, fixing the problems, and
discovering what is in the data—that is, whether continuing with mining will deliver what
you need. It walks you through the whole process, starting with data discovery, and
deposits you on the very doorstep of building a data-mined model.
This is not an easy journey, but it is one that I have trodden many times in many projects.
There is a “beaten path,” and my express purpose in writing this book is to show exactly
where the path leads, why it goes where it does, and to provide tools and a map so that you
can tread it again on your own when you need to.
Special Features
A CD-ROM accompanies the book. Preparing data requires manipulating it and looking at
it in various ways. All of the actual data manipulation techniques that are conceptually
described in the book, mainly in Chapters 5 through 8 and 10, are illustrated by C
programs. For ease of understanding, each technique is illustrated, so far as possible, in a
separate, well-commented C source file. If compiled as an integrated whole, these
provide an automated data preparation tool.
The CD-ROM also includes demonstration versions of other tools mentioned, and useful