Data Mining: Concepts and Techniques, Second Edition: Detailed Solutions to the Exercises


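"Solutions to Data Mining: Concepts and Techniques (Second Edition)"

This document contains the solutions to the end-of-chapter exercises of Data Mining: Concepts and Techniques, second edition, by Jiawei Han and Micheline Kamber of the University of Illinois at Urbana-Champaign. The textbook is a classic in the data mining field and covers a broad range of topics, from data preprocessing to applications and trends.

1. Introduction to data mining. Data mining is the process of discovering valuable knowledge from large amounts of data. It includes methods such as pattern recognition, association rule learning, and classification. It goes beyond simple data analysis and draws on advanced statistical analysis, machine learning, and artificial intelligence. The Chapter 1 exercises help readers understand the basic concepts of data mining and its applications in different domains.

2. Data preprocessing. Data preprocessing is an important step in data mining. It includes data cleaning (removing noise and inconsistencies), data integration (merging data from multiple sources), data transformation (such as normalization and standardization), and data reduction (reducing the data volume while preserving its information content). The exercises in this chapter may cover how to handle missing values and outliers and how to perform feature selection.

3. Data warehouse and OLAP technology. A data warehouse is a large data store that provides a single view for enterprise decision making, while online analytical processing (OLAP) supports multidimensional data analysis. Chapter 3 discusses data warehouse construction, OLAP operations (such as slice, dice, drill-down, and pivot), and their use in business intelligence.

4. Data cube computation and data generalization. A data cube is a data warehouse concept used to summarize multidimensional data quickly. Data generalization abstracts low-level data to higher conceptual levels, for example through attribute-oriented induction. The Chapter 4 exercises may cover how to construct data cubes, optimize OLAP queries, and carry out data generalization effectively.

5. Mining frequent patterns, association rules, and correlations. Chapter 5 covers frequent pattern mining, association rule learning (such as the Apriori algorithm), and sequential pattern mining. These techniques are commonly used for market basket analysis and for predicting user behavior.

6. Classification and prediction. Classification assigns data to predefined classes based on known attributes, while prediction estimates unknown outcomes from historical data. Chapter 6 covers decision trees, Bayesian classification, neural networks, and support vector machines, with exercises that deepen understanding of these methods.

7. Cluster analysis. Clustering is a form of unsupervised learning that partitions a data set by grouping similar objects. Chapter 7 introduces methods for discovering natural groups, such as k-means, hierarchical clustering, and DBSCAN, with exercises for practicing these algorithms.

8. Mining stream, time-series, and sequence data. As the volume of real-time data grows, Chapter 8 explores how to mine data streams, time series, and sequence data, using techniques such as sliding windows and evolving clustering.

9. Graph mining and social network analysis. Chapter 9 discusses mining graph-structured data, including community detection, influence propagation in social networks, and multirelational data mining.

10. Mining object, spatial, multimedia, text, and Web data. Chapter 10 covers the mining of unstructured data such as geographic location information, images, audio, text, and Web pages, emphasizing the techniques specific to these data types.

11. Applications and trends in data mining. The final chapter summarizes applications of data mining in fields such as healthcare, finance, and e-commerce, and discusses future directions such as deep learning and big data analytics.

The exercises at the end of each chapter are designed to consolidate the concepts covered and to improve the reader's ability to solve practical problems. They are the key practical component for understanding and mastering data mining techniques. By working through them, readers can deepen their understanding of data mining theory and methods and prepare for real projects.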

14 CHAPTER 2. DATA PREPROCESSING

2.3. Give three additional commonly used statistical measures (i.e., not illustrated in this chapter) for the characterization of data dispersion, and discuss how they can be computed efficiently in large databases.

Answer:

Data dispersion, also known as variance analysis, is the degree to which numeric data tend to spread and can be characterized by such statistical measures as mean deviation, measures of skewness, and the coefficient of variation.

The mean deviation is defined as the arithmetic mean of the absolute deviations from the mean and is calculated as:

$$\text{mean deviation} = \frac{\sum_{i=1}^{N} |x_i - \bar{x}|}{N}, \qquad (2.1)$$

where $\bar{x}$ is the arithmetic mean of the values and $N$ is the total number of values. This value will be greater for distributions with a larger spread.

A common measure of skewness is:

$$\frac{\bar{x} - \text{mode}}{s}, \qquad (2.2)$$

which indicates how far (in standard deviations, $s$) the mean ($\bar{x}$) is from the mode and whether it is greater or less than the mode.

The coefficient of variation is the standard deviation expressed as a percentage of the arithmetic mean and is calculated as:

$$\text{coefficient of variation} = \frac{s}{\bar{x}} \times 100 \qquad (2.3)$$

The variability in groups of observations with widely differing means can be compared using this measure.
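As an illustration, the three measures can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function names and the sample values are hypothetical, the population standard deviation (`statistics.pstdev`) is assumed for s, and `statistics.mode` returns a single mode, so the skewness function is meaningful only for unimodal data.

```python
import statistics


def mean_deviation(values):
    """Arithmetic mean of the absolute deviations from the mean."""
    mean = statistics.fmean(values)
    return sum(abs(x - mean) for x in values) / len(values)


def mode_skewness(values):
    """(mean - mode) / s, assuming the data have a single mode."""
    mean = statistics.fmean(values)
    return (mean - statistics.mode(values)) / statistics.pstdev(values)


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the arithmetic mean."""
    return statistics.pstdev(values) / statistics.fmean(values) * 100


# Hypothetical sample, chosen so that the results are easy to check by hand.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean_deviation(sample), mode_skewness(sample), coefficient_of_variation(sample))
```

For this sample the mean is 5, the mode is 4, and the population standard deviation is 2, so the three measures come out as 1.5, 0.5, and 40.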

Note that all of the input values used to calculate these three statistical measures are algebraic measures. Thus, the value for the entire database can be efficiently calculated by partitioning the database, computing the values for each of the separate partitions, and then merging these values into an algebraic equation that can be used to calculate the value for the entire database.
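The partition-and-merge idea can be sketched for the coefficient of variation as follows. The names `Partial`, `summarize`, `merge`, and `cv_from_partial` are hypothetical, the partitions are made-up values, and the population standard deviation is assumed. Each partition contributes only its count, sum, and sum of squares, so no partition has to be revisited after it is summarized.

```python
import math
from dataclasses import dataclass
from functools import reduce


@dataclass
class Partial:
    """Per-partition summary: count, sum, and sum of squares."""
    n: int
    total: float
    total_sq: float


def summarize(values):
    """Scan one partition once and keep only its three summary values."""
    return Partial(len(values), sum(values), sum(v * v for v in values))


def merge(a, b):
    """Combine two partition summaries without touching the raw data."""
    return Partial(a.n + b.n, a.total + b.total, a.total_sq + b.total_sq)


def cv_from_partial(p):
    """Coefficient of variation (in percent) from a merged summary."""
    mean = p.total / p.n
    variance = p.total_sq / p.n - mean * mean
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(variance, 0.0)) / mean * 100


# Hypothetical partitions of one data set.
partitions = [[2, 4, 4], [4, 5, 5], [7, 9]]
merged = reduce(merge, (summarize(p) for p in partitions))
print(cv_from_partial(merged))
```

The sum-of-squares formula is simple to merge but can lose precision when the mean is large relative to the spread, so a production system might prefer a numerically more stable update.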

The measures of dispersion described here were obtained from: Statistical Methods in Research and Production, fourth ed., edited by Owen L. Davies and Peter L. Goldsmith, Hafner Publishing Company, NY:NY, 1972.

2.4. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

(a) What is the mean of the data? What is the median?

(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).

(c) What is the midrange of the data?

(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?

(e) Give the five-number summary of the data.

(f) Show a boxplot of the data.

(g) How is a quantile-quantile plot different from a quantile plot?

2.8. EXERCISES 15

Answer:

(a) What is the mean of the data? What is the median?

The (arithmetic) mean of the data is: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = 809/27 \approx 30$ (Equation 2.1). The median (middle value of the ordered set, as the number of values in the set is odd) of the data is: 25.

(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).

This data set has two values that occur with the same highest frequency and is, therefore, bimodal. The modes (values occurring with the greatest frequency) of the data are 25 and 35.

(c) What is the midrange of the data?

The midrange (average of the largest and smallest values in the data set) of the data is: (70 + 13)/2 = 41.5

(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?

The first quartile (corresponding to the 25th percentile) of the data is: 20. The third quartile (corresponding to the 75th percentile) of the data is: 35.

(e) Give the five-number summary of the data.

The five-number summary of a distribution consists of the minimum value, first quartile, median value, third quartile, and maximum value. It provides a good summary of the shape of the distribution and for this data is: 13, 20, 25, 35, 70.

(f) Show a boxplot of the data. (Omitted here. Please refer to Figure 2.3 of the textbook.)

(g) How is a quantile-quantile plot different from a quantile plot?

A quantile plot is a graphical method used to show the approximate percentage of values below or equal to the independent variable in a univariate distribution. Thus, it displays quantile information for all the data, where the values measured for the independent variable are plotted against their corresponding quantile.

A quantile-quantile plot, however, graphs the quantiles of one univariate distribution against the corresponding quantiles of another univariate distribution. Both axes display the range of values measured for their corresponding distribution, and points are plotted that correspond to the quantile values of the two distributions. A line (y = x) can be added to the graph, along with points marking the first, second, and third quartiles, to increase the graph’s informational value. Points that lie above such a line indicate a correspondingly higher value for the distribution plotted on the y-axis than for the distribution plotted on the x-axis at the same quantile. The opposite effect is true for points lying below this line.
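The numeric answers to Exercise 2.4 can be checked with Python's standard `statistics` module. This is a sketch rather than the manual's own method: the variable names are hypothetical, and the quartiles rely on the module's default "exclusive" interpolation rule, which is one of several common conventions and happens to agree with the rough quartiles given above.

```python
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.fmean(ages)              # 809 / 27
median = statistics.median(ages)           # middle value of the 27 sorted values
modes = statistics.multimode(ages)         # every value tied for the highest frequency
midrange = (min(ages) + max(ages)) / 2     # average of the smallest and largest values
q1, q2, q3 = statistics.quantiles(ages, n=4)   # default method is "exclusive"
five_number = (min(ages), q1, median, q3, max(ages))

print(round(mean, 2), median, modes, midrange, five_number)
```

Other quartile conventions (for example, `method="inclusive"`) can give slightly different values for Q1 and Q3, which is why the exercise asks only for rough quartiles.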

2.5. In many applications, new data sets are incrementally added to the existing large data sets. Thus, an important consideration when computing descriptive data summaries is whether a measure can be computed efficiently in an incremental manner. Use count, standard deviation, and median as examples to show that a distributive or algebraic measure facilitates efficient incremental computation, whereas a holistic measure does not.

Answer:

• Count: The current count can be stored as a value, and when x new values are added, we can easily update it to count + x. This is a distributive measure and is easily updated for incremental additions.

• Standard deviation: If we store the sum of the existing values, the sum of their squares, and their count, we can easily generate the new standard deviation using the formula provided in the book. We simply need to calculate the sum and the sum of squares of the new numbers, add them to the stored sums, update the count, and plug the results into the calculation to obtain the new standard deviation. All of this is done without looking at the whole data set, so this algebraic measure is easy to compute incrementally.


• Median: To accurately calculate the median, we have to look at every value in the dataset. When we add a new value or values, we have to sort the new set and then find the median based on that new sorted set. This is much harder and thus makes the incremental addition of new values difficult.
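The contrast between the three measures can be sketched as below. The class `RunningSummary` and its sample batches are hypothetical, and the population standard deviation is assumed. Count and standard deviation are updated from three stored numbers, while the median forces the sketch to keep every value in a sorted list.

```python
import bisect
import math


class RunningSummary:
    """Incrementally maintained count, standard deviation, and median."""

    def __init__(self):
        self.count = 0            # distributive: just add the batch size
        self.total = 0.0          # sum of values, needed for the mean
        self.total_sq = 0.0       # sum of squared values
        self.sorted_values = []   # holistic: the median needs all the data

    def add(self, values):
        for v in values:
            self.count += 1
            self.total += v
            self.total_sq += v * v
            bisect.insort(self.sorted_values, v)  # keep the full list sorted

    def std(self):
        """Population standard deviation from the three stored numbers."""
        mean = self.total / self.count
        return math.sqrt(max(self.total_sq / self.count - mean * mean, 0.0))

    def median(self):
        """Exact median, which requires the complete sorted list."""
        mid = self.count // 2
        s = self.sorted_values
        return s[mid] if self.count % 2 else (s[mid - 1] + s[mid]) / 2


summary = RunningSummary()
summary.add([2, 4, 4, 4])   # existing data (hypothetical)
summary.add([5, 5, 7, 9])   # incrementally added batch
print(summary.count, summary.std(), summary.median())
```

The memory used for count and standard deviation stays constant as data arrive, whereas the storage for the exact median grows with the data set.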

2.6. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe

various methods for handling this problem.

Answer:

The various methods for handling the problem of missing values in data tuples include:

(a) Ignoring the tuple: This is usually done when the class label is missing (assuming the mining task involves classification or description). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.

(b) Manually filling in the missing value: In general, this approach is time-consuming and may not be a reasonable task for large data sets with many missing values, especially when the value to be filled in is not easily determined.

(c) Using a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown,” or −∞. If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of “Unknown.” Hence, although this method is simple, it is not recommended.

(d) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical (nominal) values: For example, suppose that the average income of AllElectronics customers is $28,000. Use this value to replace any missing values for income.

(e) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical (nominal) values, for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.

(f) Using the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using Bayesian formalism, or decision tree induction. For example, using the other customer attributes in the data set, we can construct a decision tree to predict the missing values for income.
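Methods (d) and (e) can be sketched in plain Python as follows. The records, the attribute names `income` and `credit_risk`, and both function names are hypothetical and chosen only to mirror the examples in the text. Missing values are represented by `None`.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical customer records; None marks a missing income.
customers = [
    {"credit_risk": "low", "income": 52000},
    {"credit_risk": "low", "income": None},
    {"credit_risk": "low", "income": 48000},
    {"credit_risk": "high", "income": 21000},
    {"credit_risk": "high", "income": None},
    {"credit_risk": "high", "income": 19000},
]


def fill_with_global_mean(rows, attr):
    """Method (d): replace missing values with the overall attribute mean."""
    mean = fmean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: mean if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, class_attr):
    """Method (e): replace missing values with the mean of the same class.

    Assumes every class has at least one known value for the attribute.
    """
    known = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            known[r[class_attr]].append(r[attr])
    means = {label: fmean(values) for label, values in known.items()}
    return [
        {**r, attr: means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]


by_global = fill_with_global_mean(customers, "income")
by_class = fill_with_class_mean(customers, "income", "credit_risk")
print(by_global[1]["income"], by_class[1]["income"], by_class[4]["income"])
```

In this made-up data the global mean pulls both missing incomes to the same value, while the class-conditional mean keeps the low-risk and high-risk groups apart, which is the motivation for method (e).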

2.7. Using the data for age given in Exercise 2.4, answer the following.

(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.

(b) How might you determine outliers in the data?

(c) What other methods are there for data smoothing?

Answer:

(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.

The following steps are required to smooth the above data using smoothing by bin means with a bin depth of 3.

• Step 1: Sort the data. (This step is not required here as the data are already sorted.)

• Step 2: Partition the data into equal-frequency bins of size 3.

Bin 1: 13, 15, 16    Bin 2: 16, 19, 20    Bin 3: 20, 21, 22
Bin 4: 22, 25, 25    Bin 5: 25, 25, 30    Bin 6: 33, 33, 35
Bin 7: 35, 35, 35    Bin 8: 36, 40, 45    Bin 9: 46, 52, 70

• Step 3: Calculate the arithmetic mean of each bin.

• Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.

Bin 1: 14 2/3, 14 2/3, 14 2/3    Bin 2: 18 1/3, 18 1/3, 18 1/3    Bin 3: 21, 21, 21
Bin 4: 24, 24, 24                Bin 5: 26 2/3, 26 2/3, 26 2/3    Bin 6: 33 2/3, 33 2/3, 33 2/3
Bin 7: 35, 35, 35                Bin 8: 40 1/3, 40 1/3, 40 1/3    Bin 9: 56, 56, 56

(b) How might you determine outliers in the data?

Outliers in the data may be detected by clustering, where similar values are organized into groups, or “clusters”. Values that fall outside of the set of clusters may be considered outliers. Alternatively, a combination of computer and human inspection can be used where a predetermined data distribution is implemented to allow the computer to identify possible outliers. These possible outliers can then be verified by human inspection with much less effort than would be required to verify the entire initial data set.

(c) What other methods are there for data smoothing?

Other methods that can be used for data smoothing include alternate forms of binning such as smoothing by bin medians or smoothing by bin boundaries. Alternatively, equal-width bins can be used to implement any of the forms of binning, where the interval range of values in each bin is constant. Methods other than binning include using regression techniques to smooth the data by fitting it to a function, such as through linear or multiple regression. Classification techniques can be used to implement concept hierarchies that can smooth the data by rolling up lower-level concepts to higher-level concepts.
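The four smoothing steps from part (a) of Exercise 2.7 can be sketched as follows. The function name `smooth_by_bin_means` is hypothetical, and `fractions.Fraction` is used only so that the bin means stay exact (for example 44/3, which is 14 2/3) instead of being rounded.

```python
from fractions import Fraction


def smooth_by_bin_means(sorted_values, depth):
    """Replace each value by the mean of its equal-frequency bin."""
    smoothed = []
    for start in range(0, len(sorted_values), depth):
        bin_values = sorted_values[start:start + depth]      # Step 2: one bin
        mean = Fraction(sum(bin_values), len(bin_values))    # Step 3: bin mean
        smoothed.extend([mean] * len(bin_values))            # Step 4: replace
    return smoothed


# Step 1: the age data from Exercise 2.4 are already sorted.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

smoothed = smooth_by_bin_means(ages, 3)
for i in range(0, len(smoothed), 3):
    print(f"Bin {i // 3 + 1}: {smoothed[i]}")
```

If the number of values is not a multiple of the bin depth, this sketch simply lets the last bin be smaller, which is one possible convention rather than a rule stated in the text.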

2.8. Discuss issues to consider during data integration.

Answer:

Data integration involves combining data from multiple sources into a coherent data store. Issues that must

be considered during such integration include:

• Schema integration: The metadata from the different data sources must be integrated in order to match up equivalent real-world entities. This is referred to as the entity identification problem.

• Handling redundant data: Derived attributes may be redundant, and inconsistent attribute naming may also lead to redundancies in the resulting data set. Duplications at the tuple level may occur and thus need to be detected and resolved.

• Detection and resolution of data value conflicts: Differences in representation, scaling, or encoding may cause the same real-world entity attribute values to differ in the data sources being integrated.

2.9. Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the following results:

age    23    23    27    27    39    41    47    49    50
%fat   9.5   26.5  7.8   17.8  31.4  25.9  27.4  27.2  31.2

age    52    54    54    56    57    58    58    60    61
%fat   34.6  42.5  28.8  33.4  30.2  34.1  32.9  41.2  35.7

(a) Calculate the mean, median and standard deviation of age and %fat.

(b) Draw the boxplots for age and %fat.
