A Survey of Outlier Analysis Methods in Text Mining: Key Techniques for the Text Domain


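Outlier Analysis, second edition, by Charu C. Aggarwal, focuses on important methods and algorithms in the text mining field. With the rapid growth of social media, the web, and information-centric applications, text data has appeared in large volumes, and the need to process this unstructured data effectively has become increasingly urgent. The book aims to give readers a comprehensive perspective, with particular attention to the analysis techniques commonly used in text mining. Outlier analysis is an important part of data analysis: it is concerned with identifying observations or samples that deviate from the norm in a dataset. In the IT industry it has practical value in many scenarios, such as network log analysis, financial fraud detection, identification of anomalous behavior in marketing, and mining of user behavior patterns. In the era of big data, outliers often carry valuable information and may signal new trends, errors, or latent problems. The book covers the following key topics:

1. **Data mining fundamentals**: The basic concepts and techniques of data mining, including data preprocessing, feature selection, pattern discovery, and classification, all of which are prerequisites for outlier detection.
2. **Text data processing**: How to handle the particular properties of text data through text cleaning, tokenization, stemming, and feature engineering, in preparation for later analysis.
3. **Statistical methods**: The use of statistical principles such as the z-score, the IQR (interquartile range), and box plots to measure how far a data point lies from the rest.
4. **Machine learning algorithms**: Supervised learning (such as support vector machines and random forests), unsupervised learning (such as clustering and DBSCAN), and deep learning (such as neural networks) applied to anomaly detection; these methods can learn and identify potential anomalous patterns automatically.
5. **Anomaly detection models**: Models such as Isolation Forest, One-Class SVM, and LOF (Local Outlier Factor), which can identify outliers in data effectively.
6. **Real-time and online anomaly detection**: How to achieve efficient, real-time outlier detection on large-scale streaming data.
7. **Case studies and practical applications**: Real cases that show how to apply the theory to practical problems, helping readers understand the concrete procedures and optimization strategies of outlier analysis in different domains.
8. **Recent advances and future directions**: A summary of current frontier research in outlier detection and a discussion of possible future developments and challenges.

Outlier Analysis is a practical reference for professionals in data science, machine learning, and text mining. It explains the basic concepts accessibly and also emphasizes strategies and techniques for practical use, which makes it valuable for understanding and solving real-world anomaly detection problems.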
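The statistical measures named in the topic list (z-score and IQR) can be illustrated with a minimal sketch. This is not code from the book: the function names, the thresholds (3.0 and 1.5 are common conventions), and the sample data are assumptions chosen for illustration, and only Python's standard `statistics` module is used.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]


def iqr_outliers(values, k=1.5):
    """Return the values outside the box-plot fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one obviously extreme value.
data = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
print(iqr_outliers(data))
print(zscore_outliers(data, threshold=2.0))
print(zscore_outliers(data, threshold=3.0))
```

On this small sample the extreme value inflates the standard deviation, so the conventional z-score threshold of 3 flags nothing, while the IQR fences and a lower z-score threshold both flag 25.0. This is one reason the two rules are often used side by side.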


We note that these books are quite outdated, and the most recent among them is a decade old. Furthermore, this (most recent) book is really focused on the relationship between regression and outlier analysis, rather than the latter. Outlier analysis is a much broader area, in which regression analysis is only a small part. The other books are even older, and are between 15 and 25 years old. They are exclusively targeted to the statistics community. This is not surprising, given that the first mainstream computer science conference in data mining (KDD) was organized in 1995. Most of the work in the data-mining community was performed after the writing of these books. Therefore, many key topics of interest to the broader data mining community are not covered in these books. Given that outlier analysis has been explored by a much broader community, including databases, data mining, statistics, and machine learning, we feel that our book incorporates perspectives from a much broader audience and brings together different points of view.

The chapters of this book have been organized carefully, with a view of covering the area extensively in a natural order. Emphasis was placed on simplifying the content, so that students and practitioners can also benefit from the book. While we did not originally intend to create a textbook on the subject, it evolved during the writing process into a work that can also be used as a teaching aid. Furthermore, it can also be used as a reference book, since each chapter contains extensive bibliographic notes. Therefore, this book serves a dual purpose, providing a comprehensive exposition of the topic of outlier detection from multiple points of view.

Additional Notes for the Second Edition

The second edition of this book is a significant enhancement over the first edition. In particular, most of the chapters have been upgraded with new material and recent techniques. More explanations have been added at several places and newer techniques have also been added. An entire chapter on outlier ensembles has been added. Many new topics have been added to the book such as feature selection, one-class support vector machines, one-class neural networks, matrix factorization, spectral methods, wavelet transforms, and supervised learning. Every chapter has been updated with the latest algorithms on the topic.

Last but not least, the first edition was classified by the publisher as a monograph, whereas the second edition is formally classified as a textbook. The writing style has been enhanced to be easily understandable to students. Many algorithms have been described in greater detail, as one might expect from a textbook. It is also accompanied with a solution manual for classroom teaching.


Acknowledgments

First Edition

I would like to thank my wife and daughter for their love and support during the writing of this book. The writing of a book requires significant time that is taken away from family members. This book is the result of their patience with me during this time. I also owe my late parents a debt of gratitude for instilling in me a love of education, which has played an important inspirational role in my book-writing efforts.

I would also like to thank my manager Nagui Halim for providing the tremendous support necessary for the writing of this book. His professional support has been instrumental for my many book efforts in the past and present.

Over the years, I have benefited from the insights of numerous collaborators. An incomplete list of these long-term collaborators in alphabetical order is Tarek F. Abdelzaher, Jiawei Han, Thomas S. Huang, Latifur Khan, Mohammad M. Masud, Spiros Papadimitriou, Guojun Qi, and Philip S. Yu. I would like to thank them for their collaborations and insights over the course of many years.

I would also like to specially thank my advisor James B. Orlin for his guidance during my early years as a researcher. While I no longer work in the same area, the legacy of what I learned from him is a crucial part of my approach to research. In particular, he taught me the importance of intuition and simplicity of thought in the research process. These are more important aspects of research than is generally recognized. This book is written in a simple and intuitive style, and is meant to improve accessibility of this area to both researchers and practitioners.

Finally, I would like to thank Lata Aggarwal for helping me with some of the figures created using PowerPoint graphics in this book.

Acknowledgments for Second Edition

I received significant feedback from various colleagues during the writing of the second edition. In particular, I would like to acknowledge Leman Akoglu, Chih-Jen Lin, Saket Sathe, Jiliang Tang, and Suhang Wang. Leman and Saket provided detailed feedback on several sections and chapters of this book.


Author Biography

Charu C. Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his undergraduate degree in Computer Science from the Indian Institute of Technology at Kanpur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996. He has worked extensively in the field of data mining. He has published more than 300 papers in refereed conferences and journals and authored over 80 patents. He is the author or editor of 15 books, including a textbook on data mining and a comprehensive book on outlier analysis. Because of the commercial value of his patents, he has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of two IBM Outstanding Technical Achievement Awards (2009, 2015) for his work on data streams and high-dimensional data, respectively. He received the EDBT 2014 Test of Time Award for his work on condensation-based privacy-preserving data mining. He is also a recipient of the IEEE ICDM Research Contributions Award (2015), which is one of the two highest awards for influential research contributions in the field of data mining.

He has served as the general co-chair of the IEEE Big Data Conference (2014) and as the program co-chair of the ACM CIKM Conference (2015), the IEEE ICDM Conference (2015), and the ACM KDD Conference (2016). He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the ACM Transactions on Knowledge Discovery from Data, an associate editor of the IEEE Transactions on Big Data, an action editor of the Data Mining and Knowledge Discovery Journal, editor-in-chief of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal. He serves on the advisory board of the Lecture Notes on Social Networks, a publication by Springer. He has served as the vice-president of the SIAM Activity Group on Data Mining and is a member of the SIAM industry committee. He is a fellow of the SIAM, ACM, and the IEEE, for “contributions to knowledge discovery and data mining algorithms.”


Chapter 1

An Introduction to Outlier Analysis

“Never take the comment that you are different as a condemnation, it might be a compliment. It might mean that you possess unique qualities that, like the most rarest of diamonds is ... one of a kind.” – Eugene Nathaniel Butler

1.1 Introduction

An outlier is a data point that is significantly different from the remaining data. Hawkins defined [249] an outlier as follows:

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”

Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature. In most applications, the data is created by one or more generating processes, which could either reflect activity in the system or observations collected about entities. When the generating process behaves unusually, it results in the creation of outliers. Therefore, an outlier often contains useful information about abnormal characteristics of the systems and entities that impact the data generation process. The recognition of such unusual characteristics provides useful application-specific insights. Some examples are as follows:

• Intrusion detection systems: In many computer systems, different types of data are collected about the operating system calls, network traffic, or other user actions. This data may show unusual behavior because of malicious activity. The recognition of such activity is referred to as intrusion detection.

• Credit-card fraud: Credit-card fraud has become increasingly prevalent because of greater ease with which sensitive information such as a credit-card number can be compromised. In many cases, unauthorized use of a credit card may show different patterns, such as buying sprees from particular locations or very large transactions. Such patterns can be used to detect outliers in credit-card transaction data.

C.C. Aggarwal, Outlier Analysis, DOI 10.1007/978-3-319-47578-3_1


• Interesting sensor events: Sensors are often used to track various environmental and location parameters in many real-world applications. Sudden changes in the underlying patterns may represent events of interest. Event detection is one of the primary motivating applications in the field of sensor networks. As discussed later in this book, event detection is an important temporal version of outlier detection.

• Medical diagnosis: In many medical applications, the data is collected from a variety of devices such as magnetic resonance imaging (MRI) scans, positron emission tomography (PET) scans or electrocardiogram (ECG) time-series. Unusual patterns in such data typically reflect disease conditions.

• Law enforcement: Outlier detection finds numerous applications in law enforcement, especially in cases where unusual patterns can only be discovered over time through multiple actions of an entity. Determining fraud in financial transactions, trading activity, or insurance claims typically requires the identification of unusual patterns in the data generated by the actions of the criminal entity.

• Earth science: A significant amount of spatiotemporal data about weather patterns, climate changes, or land-cover patterns is collected through a variety of mechanisms such as satellites or remote sensing. Anomalies in such data provide significant insights about human activities or environmental trends that may be the underlying causes.

In all these applications, the data has a “normal” model, and anomalies are recognized as deviations from this normal model. Normal data points are sometimes also referred to as inliers. In some applications such as intrusion or fraud detection, outliers correspond to sequences of multiple data points rather than individual data points. For example, a fraud event may often reflect the actions of an individual in a particular sequence. The specificity of the sequence is relevant to identifying the anomalous event. Such anomalies are also referred to as collective anomalies, because they can only be inferred collectively from a set or sequence of data points. Such collective anomalies are often a result of unusual events that generate anomalous patterns of activity. This book will address these different types of anomalies.
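The difference between an individual outlier and a collective anomaly can be made concrete with a small sketch. This is an illustration only, not a method from the book: the function names, the fixed limits (`point_limit`, `window_limit`), the window width, and the purchase counts are all hypothetical. No single value in the sequence is unusual on its own, but one run of consecutive values is.

```python
def point_anomalies(values, point_limit):
    """Indices of individual values that exceed `point_limit` on their own."""
    return [i for i, v in enumerate(values) if v > point_limit]


def collective_anomalies(values, width, window_limit):
    """Start indices of length-`width` windows whose total exceeds `window_limit`."""
    starts = []
    for start in range(len(values) - width + 1):
        if sum(values[start:start + width]) > window_limit:
            starts.append(start)
    return starts


# Hypothetical hourly purchase counts for one account. A single busy hour
# (index 2) is treated as normal; three busy hours in a row (indices 9-11)
# are only suspicious when viewed together.
purchases = [1, 0, 9, 0, 1, 0, 2, 1, 0, 8, 9, 9, 0, 1, 0, 1]

print(point_anomalies(purchases, point_limit=10))
print(collective_anomalies(purchases, width=3, window_limit=20))
```

A fixed limit is the simplest possible choice here; in practice the window threshold would more likely be derived from the distribution of window totals observed under normal behavior.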

The output of an outlier detection algorithm can be one of two types:

• Outlier scores: Most outlier detection algorithms output a score quantifying the level of “outlierness” of each data point. This score can also be used to rank the data points in order of their outlier tendency. This is a very general form of output, which retains all the information provided by a particular algorithm, but it does not provide a concise summary of the small number of data points that should be considered outliers.

• Binary labels: A second type of output is a binary label indicating whether a data point is an outlier or not. Although some algorithms might directly return binary labels, outlier scores can also be converted into binary labels. This is typically achieved by imposing thresholds on outlier scores, and the threshold is chosen based on the statistical distribution of the scores. A binary labeling contains less information than a scoring mechanism, but it is the final result that is often needed for decision making in practical applications.
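The two output types can be related with a short sketch: ranking points by score, then converting scores to binary labels with a threshold taken from the score distribution. The rule used here (mean plus `k` standard deviations of the scores) is one plausible choice and an assumption of this example, as are the function names, the convention that a higher score means more outlying, and the sample scores.

```python
import statistics


def rank_by_score(scores):
    """Return point indices ordered from most to least outlying."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def scores_to_labels(scores, k=2.0):
    """Label a point as an outlier if its score exceeds mean + k * std of all scores."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    threshold = mean + k * std
    return [s > threshold for s in scores]


# Hypothetical outlier scores produced by some detector (higher = more outlying).
scores = [0.2, 0.3, 0.1, 0.25, 0.15, 0.2, 0.3, 0.1, 0.2, 5.0]

ranking = rank_by_score(scores)
labels = scores_to_labels(scores)
print(ranking[0])
print(labels)
```

The ranking keeps every score, whereas the labels keep only the decision, which matches the trade-off described above: the labeling carries less information but is the form a downstream decision usually needs.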

It is often a subjective judgement, as to what constitutes a “sufficient” deviation for a point to be considered an outlier. In real applications, the data may be embedded in a
