Data Mining: Concepts and Techniques (English 2nd Edition): Solutions to End-of-Chapter Exercises


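Data Mining: Concepts and Techniques (2nd edition, in English) is a textbook by Jiawei Han and Micheline Kamber that covers the theory and techniques of data mining in detail. The book was designed for students and instructors at the University of Illinois at Urbana-Champaign and was published by the copyright holder, Morgan Kaufmann, in 2006. Its coverage is broad: data preprocessing, an overview of data warehouse and OLAP technology, data cube computation and data generalization, frequent pattern mining, association and correlation analysis, classification and prediction, cluster analysis, mining of stream, time-series, and sequence data, graph mining, social network analysis, and multirelational data mining.

The authors use end-of-chapter exercises to help readers consolidate both theory and practical skills. For example, in Chapter 1, "Introduction", Exercise 1.1 examines the definition of data mining and asks students to consider: (a) which specific tasks data mining covers, such as pattern recognition, anomaly detection, and prediction; (b) how data mining differs from traditional statistical analysis; and (c) how it applies to fields such as business intelligence, machine learning, and artificial intelligence. Chapters 2 through 10 examine the importance of data preprocessing, the use of data warehouses and online analytical processing tools, the construction of data cubes and their role in data generalization, the mining of frequent patterns and association rules, the building of classification and prediction models, the principles and applications of clustering methods, and the handling of real-time stream data, time-series data, and complex data structures such as graphs and text. Each chapter ends with detailed exercises intended to help readers understand and master data mining techniques and algorithms through practice.

The book also discusses trends and developments in applied data mining, such as new challenges and opportunities in the era of the Internet of Things, social media, and big data. The final exercise, 11.7, may involve applying the material as a whole to analyze a real-world data mining problem, or discussing future directions for data mining technology. By reading the textbook and working through the exercises, readers build a solid foundation in data mining and learn its key techniques and tools, which prepares them to find problems and deliver solutions effectively in the IT industry. For anyone who wants to improve their data-processing skills, explore latent patterns in data, and predict trends, the book is an indispensable learning resource.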

14 CHAPTER 2. DATA PREPROCESSING

2.3. Give three additional commonly used statistical measures (i.e., not illustrated in this chapter) for the characterization of data dispersion, and discuss how they can be computed efficiently in large databases.

Answer:

Data dispersion, also known as variance analysis, is the degree to which numeric data tend to spread and can be characterized by such statistical measures as mean deviation, measures of skewness, and the coefficient of variation.

The mean deviation is defined as the arithmetic mean of the absolute deviations from the mean and is calculated as:

$$\text{mean deviation} = \frac{\sum_{i=1}^{N} |x_i - \bar{x}|}{N}, \tag{2.1}$$

where $\bar{x}$ is the arithmetic mean of the values and $N$ is the total number of values. This value will be greater for distributions with a larger spread.

A common measure of skewness is:

$$\frac{\bar{x} - \text{mode}}{s}, \tag{2.2}$$

which indicates how far (in standard deviations, $s$) the mean ($\bar{x}$) is from the mode and whether it is greater or less than the mode.

The coefficient of variation is the standard deviation expressed as a percentage of the arithmetic mean and is calculated as:

$$\text{coefficient of variation} = \frac{s}{\bar{x}} \times 100 \tag{2.3}$$

The variability in groups of observations with widely differing means can be compared using this measure.

Note that all of the input values used to calculate these three statistical measures are algebraic measures. Thus, the value for the entire database can be efficiently calculated by partitioning the database, computing the values for each of the separate partitions, and then merging these values into an algebraic equation that can be used to calculate the value for the entire database.
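The partition-and-merge idea can be sketched for the coefficient of variation as follows. This is only an illustration under stated assumptions: the partition contents are hypothetical, the helper names are invented for this sketch, and $s$ is taken in its population form. Each partition contributes only a count, a sum, and a sum of squares, and these are added together before Equation (2.3) is applied. The mean deviation needs $\bar{x}$ before the absolute deviations can be summed, so under this scheme it would take a second pass over the partitions.

```python
import math

def partial_sums(values):
    """Return (count, sum, sum of squares) for one partition."""
    return (len(values), sum(values), sum(v * v for v in values))

def merge(partials):
    """Combine per-partition triples by component-wise addition."""
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    return n, total, total_sq

def coefficient_of_variation(n, total, total_sq):
    """Population standard deviation as a percentage of the mean."""
    mean = total / n
    variance = total_sq / n - mean * mean
    return math.sqrt(max(variance, 0.0)) / mean * 100

# Hypothetical partitions of one numeric attribute.
partitions = [[13, 15, 16, 16, 19], [20, 20, 21, 22], [22, 25, 25, 25, 25, 30]]

merged = merge([partial_sums(p) for p in partitions])
cv_merged = coefficient_of_variation(*merged)

# Same measure computed over the unpartitioned values, for comparison.
all_values = [v for p in partitions for v in p]
cv_direct = coefficient_of_variation(*partial_sums(all_values))
```

Because the three stored quantities simply add across partitions, `cv_merged` and `cv_direct` agree without any partition being rescanned.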

The measures of dispersion described here were obtained from: Statistical Methods in Research and Production, fourth ed., edited by Owen L. Davies and Peter L. Goldsmith, Hafner Publishing Company, New York, NY, 1972.

2.4. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

(a) What is the mean of the data? What is the median?

(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).

(c) What is the midrange of the data?

(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?

(e) Give the five-number summary of the data.

(f) Show a boxplot of the data.

(g) How is a quantile-quantile plot different from a quantile plot?

2.8. EXERCISES 15

Answer:

(a) What is the mean of the data? What is the median?

The (arithmetic) mean of the data is: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = 809/27 \approx 30$ (Equation 2.1). The median (middle value of the ordered set, as the number of values in the set is odd) of the data is: 25.

(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).

This data set has two values that occur with the same highest frequency and is, therefore, bimodal. The modes (values occurring with the greatest frequency) of the data are 25 and 35.

(c) What is the midrange of the data?

The midrange (average of the largest and smallest values in the data set) of the data is: (70 + 13)/2 = 41.5.

(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?

The first quartile (corresponding to the 25th percentile) of the data is: 20. The third quartile (corresponding to the 75th percentile) of the data is: 35.

(e) Give the five-number summary of the data.

The five-number summary of a distribution consists of the minimum value, first quartile, median value, third quartile, and maximum value. It provides a good summary of the shape of the distribution and for this data is: 13, 20, 25, 35, 70.
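The figures in parts (a) through (e) can be checked with a short sketch using Python's standard `statistics` module. One assumption matters here: quartile conventions differ between tools, and this sketch relies on the module's default "exclusive" method, which happens to reproduce the quartiles quoted above. Other conventions can give slightly different values.

```python
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(ages)              # 809 / 27
median = statistics.median(ages)          # middle (14th) value of 27
modes = statistics.multimode(ages)        # every value with the top frequency
midrange = (min(ages) + max(ages)) / 2

# Three cut points splitting the data into four groups (default: exclusive).
q1, q2, q3 = statistics.quantiles(ages, n=4)

five_number = (min(ages), q1, median, q3, max(ages))
```

Here `modes` holds both 25 and 35, which is what makes the data bimodal, and `five_number` matches the summary given in part (e).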

(f) Show a boxplot of the data. (Omitted here. Please refer to Figure 2.3 of the textbook.)

(g) How is a quantile-quantile plot different from a quantile plot?

A quantile plot is a graphical method used to show the approximate percentage of values below or equal to the independent variable in a univariate distribution. Thus, it displays quantile information for all the data, where the values measured for the independent variable are plotted against their corresponding quantile.

A quantile-quantile plot, however, graphs the quantiles of one univariate distribution against the corresponding quantiles of another univariate distribution. Both axes display the range of values measured for their corresponding distribution, and points are plotted that correspond to the quantile values of the two distributions. A line (y = x) can be added to the graph along with points representing where the first, second and third quantiles lie to increase the graph’s informational value. Points that lie above such a line indicate a correspondingly higher value for the distribution plotted on the y-axis than for the distribution plotted on the x-axis at the same quantile. The opposite effect is true for points lying below this line.
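The difference between the two plots can be made concrete by building the point sets each one would draw. The sketch below rests on assumptions not stated in this answer: the quantile plot pairs the i-th sorted value with $f_i = (i - 0.5)/N$, the two samples in the q-q plot have equal size (unequal sizes would need interpolation), and the branch data are hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

def qq_plot_points(sample_x, sample_y):
    """Pair same-rank values from two samples (equal sizes assumed)."""
    if len(sample_x) != len(sample_y):
        raise ValueError("this sketch assumes samples of equal size")
    return list(zip(sorted(sample_x), sorted(sample_y)))

# Hypothetical unit prices observed at two branches.
branch_1 = [40, 43, 47, 74, 75, 78]
branch_2 = [42, 45, 52, 72, 80, 88]

single = quantile_plot_points(branch_1)    # one distribution against f
points = qq_plot_points(branch_1, branch_2)  # two distributions, rank by rank

# Points above y = x: branch_2 is higher than branch_1 at that quantile.
above_line = [(x, y) for x, y in points if y > x]
```

A quantile plot uses only `single`, while the q-q plot uses `points`, and `above_line` picks out the quantiles at which the y-axis distribution has the higher value.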

2.5. In many applications, new data sets are incrementally added to the existing large data sets. Thus an important consideration for computing a descriptive data summary is whether a measure can be computed efficiently in an incremental manner. Use count, standard deviation, and median as examples to show that a distributive or algebraic measure facilitates efficient incremental computation, whereas a holistic measure does not.

Answer:

• Count: The current count can be stored as a value, and when x new values are added, we can easily update count with count + x. This is a distributive measure and is easily updated for incremental additions.

• Standard deviation: If we store the sum of the existing values, the sum of their squares, and their count, we can easily generate the new standard deviation using the formula provided in the book. We simply need to calculate the sum and the sum of squares of the new numbers, add them to the stored sums, update the count, and plug these into the calculation to obtain the new standard deviation. All of this is done without looking at the whole data set and is thus easy to compute.


• Median: To accurately calculate the median, we have to look at every value in the dataset. When we add a new value or values, we have to sort the new set and then find the median based on that new sorted set. This is much harder and thus makes the incremental addition of new values difficult.
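The contrast among the three measures can be sketched in a few lines. The class name, the sample values, and the choice of the population form of the standard deviation are assumptions made for this illustration. Count and standard deviation are maintained from three running totals, while the median has to be recomputed from all of the values.

```python
import math
import statistics

class RunningStats:
    """Keeps count, sum, and sum of squares so new batches can be folded in."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, new_values):
        # Only the new values are scanned; stored data is never revisited.
        self.count += len(new_values)
        self.total += sum(new_values)
        self.total_sq += sum(v * v for v in new_values)

    def std_dev(self):
        # Population form: sigma^2 = (1/N) * (sum(x^2) - (1/N) * sum(x)^2)
        variance = (self.total_sq - self.total ** 2 / self.count) / self.count
        return math.sqrt(max(variance, 0.0))

existing = [13, 15, 16, 16, 19, 20]   # hypothetical stored data
incoming = [20, 21, 22, 22]           # hypothetical new batch

stats = RunningStats()
stats.add(existing)
stats.add(incoming)

# The median has no such summary: all values must be kept and re-ordered.
median = statistics.median(existing + incoming)
```

The update cost for `RunningStats` depends only on the size of the new batch, whereas the median computation touches the full combined data each time.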

2.6. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe

various methods for handling this problem.

Answer:

The various methods for handling the problem of missing values in data tuples include:

(a) Ignoring the tuple: This is usually done when the class label is missing (assuming the mining task involves classification or description). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.

(b) Manually filling in the missing value: In general, this approach is time-consuming and may not be a reasonable task for large data sets with many missing values, especially when the value to be filled in is not easily determined.

(c) Using a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown,” or −∞. If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common: that of “Unknown.” Hence, although this method is simple, it is not recommended.

(d) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical (nominal) values: For example, suppose that the average income of AllElectronics customers is $28,000. Use this value to replace any missing values for income.

(e) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical (nominal) values, for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.

(f) Using the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using Bayesian formalism, or decision tree induction. For example, using the other customer attributes in the data set, we can construct a decision tree to predict the missing values for income.
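Methods (d) and (e) can be sketched as follows for a numeric attribute. The customer tuples, attribute names, and function names are hypothetical and exist only for this illustration, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical customer tuples; None marks a missing income.
customers = [
    {"credit_risk": "low", "income": 40000},
    {"credit_risk": "low", "income": None},
    {"credit_risk": "low", "income": 36000},
    {"credit_risk": "high", "income": 18000},
    {"credit_risk": "high", "income": None},
    {"credit_risk": "high", "income": 22000},
]

def fill_with_attribute_mean(rows, attr):
    """Method (d): replace missing values with the overall attribute mean."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]

def fill_with_class_mean(rows, attr, class_attr):
    """Method (e): replace missing values with the mean of the tuple's class."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: mean(vals) for c, vals in by_class.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]

global_filled = fill_with_attribute_mean(customers, "income")
class_filled = fill_with_class_mean(customers, "income", "credit_risk")
```

Both functions return new lists and leave the input untouched. The class-based version fails with a `KeyError` if a class has no observed value at all, a case a fuller implementation would need to handle.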

2.7. Using the data for age given in Exercise 2.4, answer the following.

(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.

(b) How might you determine outliers in the data?

Answer:

(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.

The following steps are required to smooth the above data using smoothing by bin means with a bin depth of 3.

• Step 1: Sort the data. (This step is not required here as the data are already sorted.)

• Step 2: Partition the data into equal-frequency bins of size 3.

Bin 1: 13, 15, 16    Bin 2: 16, 19, 20    Bin 3: 20, 21, 22
Bin 4: 22, 25, 25    Bin 5: 25, 25, 30    Bin 6: 33, 33, 35
Bin 7: 35, 35, 35    Bin 8: 36, 40, 45    Bin 9: 46, 52, 70

• Step 3: Calculate the arithmetic mean of each bin.

• Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.

Bin 1: 14 2/3, 14 2/3, 14 2/3    Bin 2: 18 1/3, 18 1/3, 18 1/3    Bin 3: 21, 21, 21
Bin 4: 24, 24, 24                Bin 5: 26 2/3, 26 2/3, 26 2/3    Bin 6: 33 2/3, 33 2/3, 33 2/3
Bin 7: 35, 35, 35                Bin 8: 40 1/3, 40 1/3, 40 1/3    Bin 9: 56, 56, 56
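The four steps above can be sketched in a few lines of Python. The function name is an assumption for this illustration, and `fractions.Fraction` is used only so that bin means such as 14 2/3 stay exact instead of being rounded.

```python
from fractions import Fraction

def smooth_by_bin_means(values, depth):
    """Equal-frequency binning: replace each value by its bin's mean."""
    ordered = sorted(values)                      # Step 1
    smoothed = []
    for start in range(0, len(ordered), depth):   # Step 2: bins of `depth` values
        bin_values = ordered[start:start + depth]
        bin_mean = Fraction(sum(bin_values), len(bin_values))  # Step 3
        smoothed.extend([bin_mean] * len(bin_values))          # Step 4
    return smoothed

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

smoothed = smooth_by_bin_means(ages, depth=3)
```

With 27 values and a depth of 3 the data divide evenly into nine bins. If the length were not a multiple of the depth, this sketch would simply leave a shorter final bin.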

(b) How might you determine outliers in the data?

Outliers in the data may be detected by clustering, where similar values are organized into groups, or “clusters”. Values that fall outside of the set of clusters may be considered outliers. Alternatively, a combination of computer and human inspection can be used where a predetermined data distribution is implemented to allow the computer to identify possible outliers. These possible outliers can then be verified by human inspection with much less effort than would be required to verify the entire initial data set.

Other methods that can be used for data smoothing include alternate forms of binning such as smoothing by bin medians or smoothing by bin boundaries. Alternatively, equal-width bins can be used to implement any of the forms of binning, where the interval range of values in each bin is constant. Methods other than binning include using regression techniques to smooth the data by fitting it to a function such as through linear or multiple regression. Classification techniques can be used to implement concept hierarchies that can smooth the data by rolling-up lower level concepts to higher-level concepts.

2.8. Discuss issues to consider during data integration.

Answer:

Data integration involves combining data from multiple sources into a coherent data store. Issues that must be considered during such integration include:

• Schema integration: The metadata from the different data sources must be integrated in order to match up equivalent real-world entities. This is referred to as the entity identification problem.

• Handling redundant data: Derived attributes may be redundant, and inconsistent attribute naming may also lead to redundancies in the resulting data set. Duplications at the tuple level may occur and thus need to be detected and resolved.

• Detection and resolution of data value conflicts: Differences in representation, scaling, or encoding may cause the same real-world entity attribute values to differ in the data sources being integrated.

2.9. Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the following result:

age    23    23    27    27    39    41    47    49    50
%fat   9.5   26.5  7.8   17.8  31.4  25.9  27.4  27.2  31.2

age    52    54    54    56    57    58    58    60    61
%fat   34.6  42.5  28.8  33.4  30.2  34.1  32.9  41.2  35.7

(a) Calculate the mean, median and standard deviation of age and %fat.

(b) Draw the boxplots for age and %fat.
