Data Mining: An Introductory Guide for the Social Sciences

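"Data.Mining.for.the.Social.Sciences.An.Introduction" is a book co-authored by Paul Attewell and David B. Monaghan. It offers a simple, accessible introduction to data mining for the social sciences. Written for the era of big data, it aims to help social scientists understand and apply data mining techniques to uncover patterns and trends in large volumes of behavioral data.

The book has two parts. The first part introduces concepts: the basic definition of data mining (Chapter 1), how it differs from traditional statistical methods (Chapter 2), general strategies for data mining (Chapter 3), and the key stages of a data mining project (Chapter 4). This part builds an understanding of the basic concepts and lays the groundwork for the analytical methods that follow.

The second part consists of worked examples, with a series of chapters showing how to carry out actual data analyses. The authors explain in detail how to prepare training and test datasets (Chapter 5), tools for variable selection (Chapter 6), methods for creating new variables such as binning and decision trees (Chapter 7), feature extraction (Chapter 8), classifiers (Chapter 9), and classification trees (Chapter 10). They also cover neural networks (Chapter 11), cluster analysis (Chapter 12), latent class analysis and mixture models (Chapter 13), and association rules (Chapter 14). These chapters give practical demonstrations of data analysis using various statistical software packages.

Every chapter tries to demystify the data mining process and discusses the strengths and limitations of each method, to help social scientists choose the tools best suited to their research questions. In this way the book both teaches data mining techniques and encourages social scientists to add these methods to their research toolbox. It is especially useful for social scientists who want to move into big data analysis, because it supplies practical experience alongside the theoretical background, so that readers can apply these techniques to real problems.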

Data mining (DM) is the name given to a variety of computer-intensive techniques for

discovering structure and for analyzing patterns in data. Using those patterns, DM can

create predictive models, or classify things, or identify different groups or clusters of cases

within data. Data mining and its close cousins machine learning and predictive analytics

are already widely used in business and are starting to spread into social science and other

areas of research.

A partial list of current data mining methods includes:

association rules

recursive partitioning or decision trees, including CART (classification and regression trees) and CHAID (chi-squared automatic interaction detection), boosted trees, forests, and bootstrap forests

multi-layer neural network models and “deep learning” methods

naive Bayes classifiers and Bayesian networks

clustering methods, including hierarchical, k-means, nearest neighbor, linear and nonlinear manifold clustering

support vector machines

“soft modeling” or partial least squares latent variable modeling

DM is a young area of scholarship, but it is growing very rapidly. As we speak, new methods

are appearing, old ones are being modified, and strategies and skills in using these

WHAT IS DATA MINING?

Attewell - 9780520280977.indd 3Attewell - 9780520280977.indd 3 21/02/15 7:25 PM21/02/15 7:25 PM


CHAPTER ONE

methods are accumulating. The potential and importance of DM are becoming widely

recognized. In just the last two years the National Science Foundation has poured millions

of dollars into new research initiatives in this area.

DM methods can be applied to quite different domains, for example to visual data, in

reading handwriting or recognizing faces within digital pictures. DM is also being

used to analyze texts—for example to classify the content of scientific papers or other

documents—hence the term text mining. In addition, DM analytics can be applied to

digitized sound, to recognize words in phone conversations, for example. In this book,

however, we focus on the most common domain: the use of DM methods to analyze

quantitative or numerical data.

Miners look for veins of ore and extract these valuable parts from the surrounding rock.

By analogy, data mining looks for patterns or structure in data. But what does it mean to

say that we look for structure in data? Think of a computer screen that displays thousands

of pixels, points of light or dark. Those points are raw data. But if you scan those pixels by

eye and recognize in them the shapes of letters and words, then you are finding structures

in the data—or, to use another metaphor, you are turning data into information.

The equivalent to the computer screen for numerical data is a spreadsheet or matrix,

where each column represents a single variable and each row contains data for a different

case or person. Each cell within the spreadsheet contains a specific value for one person

on one particular variable.

How do you recognize patterns or regularities or structures in this kind of raw numerical

data? Statistics provides various ways of expressing the relations between the columns

and rows of data in a spreadsheet. The most familiar one is a correlation matrix.

Instead of repeating the raw data, with its thousands of observations and dozens of variables,

a correlation matrix represents just the relations between each variable and each

other variable. It is a summary, a simplification of the raw data.
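As a minimal sketch of that idea, the Python below reduces a tiny spreadsheet to a correlation matrix. The three variables and their values are invented purely for illustration, and `pearson` is a helper written here, not a function from any package discussed in this book.

```python
from math import sqrt

# Hypothetical toy "spreadsheet": each key is a column (variable),
# each list position is a row (case). All values are invented.
data = {
    "age":    [25, 32, 47, 51, 62],
    "income": [28, 35, 50, 58, 61],
    "debt":   [12, 10, 7, 5, 2],
}

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# The correlation matrix: one number for every pair of variables.
names = list(data)
corr = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

for a in names:
    print(a.ljust(8), "  ".join(f"{corr[a][b]:6.2f}" for b in names))
```

However many rows the spreadsheet has, the summary stays the same size: one entry per pair of variables.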

Few of us can read a correlation matrix easily, or recognize a meaningful pattern in it,

so we typically go through a second step in looking for structures in numerical data. We

create a model that summarizes the relations in the correlation matrix. An ordinary least

squares (OLS) regression model is one common example. It translates a correlation

matrix into a much smaller regression equation that we can more easily understand and

interpret.
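For the simplest case, one predictor, the link between a correlation and a regression equation can be sketched directly: the OLS slope is the correlation rescaled by the two standard deviations. The schooling and wage figures below are invented for illustration, and the variable names are assumptions of this sketch.

```python
from math import sqrt

# Invented example values: years of schooling (x) and hourly wage (y).
x = [10, 12, 12, 14, 16, 18]
y = [11.0, 14.0, 13.0, 17.0, 20.0, 24.0]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

r = sxy / sqrt(sxx * syy)                       # correlation between x and y
sd_x, sd_y = sqrt(sxx / (n - 1)), sqrt(syy / (n - 1))

# One-predictor OLS: slope = r * (sd of y / sd of x).
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

# Cross-check against the direct least-squares formula for the slope.
slope_direct = sxy / sxx

print(f"wage = {intercept:.2f} + {slope:.2f} * schooling")
```

With several predictors the same translation needs the full correlation matrix rather than a single coefficient, but the principle is unchanged.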

A statistical model is more than just a summary derived from raw data, though. It

is also a tool for prediction, and it is this second property that makes DM especially useful.

Banks accumulate huge databases about customers, including records of who

defaulted on loans. If bank analysts can turn those data into a model to accurately predict

who will default on a loan, then they can reject the riskiest new loan applications

and avoid losses. If Amazon.com can accurately assess your tastes in books, based on

your previous purchases and your similarity to other customers, and then tempt you

with a well-chosen book recommendation, then the company will make more profit. If a


physician can obtain an NMR scan of cell tissue and predict from that data whether a

tumor is likely to be malignant or benign, then the doctor has a powerful tool at her

disposal.

Our world is awash with digital data. By finding patterns in data, especially patterns

that can accurately predict important outcomes, DM is providing a very valuable service.

Accurate prediction can inform a decision and lead to an action. If that cell tissue is most

likely malignant, then one should schedule surgery. If that person’s predicted risk of

default is high, then don’t approve the loan.
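The step from prediction to action can be sketched as a simple decision rule. Everything specific below is hypothetical: the coefficients stand in for a model that would normally be estimated from a bank's own records, and the 20 percent cut-off is an assumed policy, not a value from the text.

```python
from math import exp

# Hypothetical coefficients standing in for a fitted logistic model.
INTERCEPT = -4.0
COEF = {"debt_to_income": 5.0, "late_payments": 0.8}
THRESHOLD = 0.20  # assumed cut-off: reject if predicted risk exceeds 20%

def predicted_risk(applicant):
    """Predicted probability of default from the logistic model."""
    score = INTERCEPT + sum(COEF[k] * applicant[k] for k in COEF)
    return 1.0 / (1.0 + exp(-score))

def decide(applicant):
    """Turn the predicted risk into an action."""
    return "reject" if predicted_risk(applicant) > THRESHOLD else "approve"

# Two invented applicants.
low = {"debt_to_income": 0.10, "late_payments": 0}
high = {"debt_to_income": 0.60, "late_payments": 3}

for label, person in (("low", low), ("high", high)):
    print(label, round(predicted_risk(person), 3), decide(person))
```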

But why do we need DM for this? Wouldn’t traditional statistical methods fulfill the

same function just as well?

Conventional statistical methods do provide predictive models, but they have significant

weaknesses. DM methods offer an alternative to conventional methods, in some

cases a superior alternative that is less subject to those problems. We will later enumerate

several advantages of DM, but for now we point out just the most obvious one. DM

is especially well suited to analyzing very large datasets with many variables and/or many

cases—what’s known as Big Data.

Conventional statistical methods sometimes break down when applied to very large

datasets, either because they cannot handle the computational aspects, or because they

face more fundamental barriers to estimation. An example of the latter is when a dataset

contains more variables than observations, a combination that conventional regression

models cannot handle, but that several DM methods can.
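A small numeric sketch shows why more variables than observations defeats conventional regression. OLS has to invert the matrix X'X, and with two cases and three predictors that matrix is singular. The text does not name which DM methods cope with this, so the ridge-style penalty below is only one assumed illustration of how an adjusted method can still produce a solution. The data values are invented.

```python
# Two observations (rows) but three predictors (columns): more variables than cases.
X = [[1, 2, 3],
     [4, 5, 6]]

def gram(rows):
    """Return X'X, the matrix that OLS must invert."""
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

xtx = gram(X)
det_ols = det3(xtx)      # zero determinant: no unique OLS solution

# Ridge-style adjustment (an assumption of this sketch): add a penalty
# lam to the diagonal, which makes the matrix invertible.
lam = 1
xtx_ridge = [[xtx[i][j] + (lam if i == j else 0) for j in range(3)]
             for i in range(3)]
det_ridge = det3(xtx_ridge)

print("det(X'X) =", det_ols)
print("det(X'X + lam*I) =", det_ridge)
```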

DM not only overcomes certain limitations of conventional statistical methods, it also

helps transcend some human limitations. A researcher faced with a dataset containing

hundreds of variables and many thousands of cases is likely to overlook important features

of the data because of limited time and attention. It is relatively easy, for example,

to inspect a half-dozen variables to decide whether to transform any of them, to make

them more closely resemble a bell curve or normal distribution. However, a human

analyst will quickly become overwhelmed trying to decide the same thing for hundreds

of variables. Similarly, a researcher may wish to examine statistical interactions between

predictors in a dataset, but what happens when that person has to consider interactions

between dozens of predictors? The number of potential combinations grows so large that

any human analyst would be stymied.
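The growth in the number of possible interactions is easy to quantify. The sketch below counts two-way and three-way combinations for a few predictor counts. The counts 6, 24, and 48 are arbitrary examples, and `interaction_counts` is a helper defined here for illustration.

```python
from math import comb

def interaction_counts(n_predictors, max_order=3):
    """Number of possible k-way interaction terms for k = 2..max_order."""
    return {k: comb(n_predictors, k) for k in range(2, max_order + 1)}

# Half a dozen predictors is manageable by hand; dozens are not.
for p in (6, 24, 48):
    print(p, "predictors:", interaction_counts(p))
```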

DM techniques help in this situation because they partly “automate” data analysis by

identifying the most important predictors among a large number of independent variables,

or by transforming variables automatically into more useful distributions, or by

detecting complex interactions among variables, or by discovering what forms of heterogeneity

are prevalent in a dataset. The human researcher still makes critical decisions,

but DM methods leverage the power of computers to compare numerous alternatives

and identify patterns that human analysts might easily overlook (Larose 2005; McKinsey

Global Institute 2011; Nisbet, Elder, and Miner 2009).
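One of these automated steps, screening variables and transforming the badly skewed ones, can be sketched in a few lines. This is an assumed, simplified rule (log-transform any strictly positive variable whose skewness exceeds 1), not the procedure of any particular DM package. The two columns are invented, and a real dataset might have hundreds.

```python
from math import log

def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Invented columns standing in for a much wider dataset.
columns = {
    "income": [12, 15, 18, 20, 22, 25, 30, 45, 90, 400],
    "age":    [21, 25, 30, 34, 38, 42, 47, 51, 56, 60],
}

CUTOFF = 1.0  # assumed rule of thumb, not a value from the text

transformed = {}
for name, values in columns.items():
    if skewness(values) > CUTOFF and min(values) > 0:
        # A log transform pulls in a long right tail.
        transformed[name] = [log(v) for v in values]
    else:
        transformed[name] = values

for name in columns:
    print(name, round(skewness(columns[name]), 2),
          "->", round(skewness(transformed[name]), 2))
```

The loop applies the same check to every column, which is what makes the approach practical when there are far too many variables to inspect by eye. The log transform reduces the skew here but does not remove it, so the analyst still has to judge the result.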


It follows that DM is very computationally intensive. It uses computer power to scour

data for patterns, to search for “hidden” interactions among variables, and to try out

alternative methods or combine models to maximize its accuracy in prediction.

THE GOALS OF THIS BOOK

There are many books on DM, so what’s special about this one? One can think of the

literature on DM as a layer cake. The bottom layer deals with the mathematical concepts

and theorems that underlie DM. These are fundamental but are difficult to understand.

This book doesn’t try to operate at that technically demanding level, but interested readers

can get a taste by looking at the online version of the classic text by Hastie, Tibshirani,

and Friedman (2009): The Elements of Statistical Learning: Data Mining, Inference, and

Prediction (there is a free version at www.stanford.edu/~hastie/local.ftp/Springer/OLD//ESLII_print4.pdf).

Moving upward, the next layer of the DM literature covers computer algorithms that

apply those mathematical concepts to data. Critical issues here are how to minimize the

time needed to perform various mathematical and matrix operations and choosing efficient

computational strategies that can analyze data one case at a time or make the

minimum number of passes through a large dataset. Fast, efficient computer strategies

are especially critical when analyzing big data containing hundreds of thousands of

observations. An inefficient computer program might run for days to accomplish a single

analysis. This book doesn’t go into the algorithmic level either. Interested readers can

consult the books by Tan, Steinbach, and Kumar (2005) and Witten, Frank, and Hall (2011)

listed in the bibliography.

At the top layer of the DM literature one finds books about the use of DM. Several are

exhortations to managers and employees to revolutionize their firms by embracing DM

or “business analytics” as a business strategy. That’s not our goal, however. What this

book provides is a brief, nontechnical introduction to DM for people who are interested

in using it to analyze quantitative data but who don’t yet know much about these methods.

Our primary goal is to explain what DM does and how it differs from more familiar

or established kinds of statistical analysis and modeling, and to provide a sense of DM’s

strengths and weaknesses. To communicate those ideas, this book begins by discussing

DM in general, especially its distinctive perspective on data analysis. Later, it introduces

the main methods or tools within DM.

This book mostly avoids math. It does presume a basic knowledge of conventional

statistics; at a minimum you should know a little about multiple regression and logistic

regression. The second half of this book provides examples of data analyses for each

application or DM tool, walks the reader through the interpretation of the software output,

and discusses what each example has taught us. It covers several “tricks” that data

miners use in analyses, and it highlights some pitfalls to avoid, or suggests ways to get

round them.


After reading this book you should understand in general terms what DM is and what

a data analyst might use it for. You should be able to pick out appropriate DM tools for

particular tasks and be able to interpret their output. After that, using DM tools is mainly

a matter of practice, and of keeping up with a field that is advancing at an extraordinarily

rapid pace.

SOFTWARE AND HARDWARE FOR DATA MINING

Large corporations use custom-written computer programs for their DM applications,

and they run them on fast mainframes or powerful computer clusters. Those are probably

the best computer environments for analyzing big data, but they are out of reach for

most of us. Fortunately, there are several products that combine multiple DM tools into

a single package or software suite that runs under Windows on a personal computer.

JMP Pro (pronounced “jump pro”) was developed by the company that sells the SAS

suite of statistical software. You can download a free trial version, and the company provides

online tutorials and other learning tools. JMP is relatively easy to use, employing a

point-and-click approach. However, it lacks some of the more recent DM analytical tools.

SPSS (Statistical Package for the Social Sciences), now owned by IBM, is one of the

oldest and most established software products for analyzing data using conventional

statistical methods such as regression, cross-tabulation, t-tests, factor analysis, and so on.

In its more recent versions (20 and above), the “professional” version of SPSS includes

several data mining methods, including neural network models, automated linear models,

and clustering. These are easy to use because they are point-and-click programs and

their inputs and outputs are well designed. This may be the best place for a beginner to

get a taste of some DM methods.

A more advanced data mining package called IBM SPSS Modeler includes a much

larger choice of DM methods. This program is more complicated to learn than regular

SPSS: one has to arrange various icons into a process and set various options or parameters.

However, Modeler provides a full range of DM tools.

There are other commercial software products for PCs that include some DM tools

within their general statistics software. Among these, MathWorks MATLAB offers data

mining within two specialized “toolboxes”: Statistics and Neural Networks. StatSoft’s

Statistica package includes an array of DM programs; and XLMiner is a commercial add-on

for data mining that works with Microsoft’s Excel spreadsheet program.

Beyond the commercial software, there are several free data mining packages for PCs.

RapidMiner is an extensive suite of DM programs developed in Germany. It has

recently incorporated programs from the Weka DM suite (see below), and also many DM

programs written in the R language. As a result, RapidMiner offers by far the largest

variety of DM programs currently available in any single software product. It is also free

(see http://rapid-i.com for information). The software takes considerable time to master;

it uses a flowchart approach that involves dragging icons onto a workspace and linking
