Case Studies in Text Mining and Visualization with Open-Source Tools

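Text Mining and Visualization, edited by Markus Hofmann and Andrew Chisholm, is part of the Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. The book focuses on text mining and visualization with open-source tools. It introduces the concepts, principles, and methods of natural language processing and demonstrates their practical application through case studies.

Text mining is the process of using computer algorithms to extract valuable information from large volumes of text. It combines techniques from natural language processing (NLP), information retrieval, machine learning, and other fields. NLP is an important branch of artificial intelligence concerned with understanding and generating human language, including lexical analysis, syntactic parsing, and semantic understanding. The book explores the following key topics:

1. **Natural language processing basics**: lexical analysis (tokenization), syntactic analysis (identifying sentence structure), semantic analysis (understanding the deeper meaning of words), and sentiment analysis (understanding the emotions and opinions expressed in a text).
2. **Text preprocessing**: the first step in text mining. It involves removing stop words (such as "的", "是", and "和" in Chinese, or "the", "is", and "and" in English), stemming (stripping affixes to reduce a word to its stem), and lemmatization (mapping the inflected forms of a word to a single dictionary form).
3. **Topic modeling**: algorithms such as Latent Dirichlet Allocation (LDA) identify the main topics in a text collection, which helps in understanding the overall thematic structure of a large set of documents.
4. **Sentiment analysis and opinion mining**: determining subjective information in text, such as positive or negative evaluations. This is useful for market research, product review analysis, and similar tasks.
5. **Entity recognition and relation extraction**: identifying proper nouns in text (names of people, places, organizations, and so on) and extracting the relations between entities, which supports information extraction and knowledge graph construction.
6. **Text classification and clustering**: supervised learning assigns texts to predefined categories, while unsupervised learning groups texts automatically by similarity.
7. **Visualization techniques**: the book stresses visualization with open-source tools. This may include word clouds, network graphs, and time series analysis, which help users grasp the patterns and trends in text data.
8. **Case studies**: practical cases show how to apply the theory above to concrete problems such as social media analysis, news aggregation, and public opinion monitoring.
9. **Open-source tools**: the preface names KNIME, RapidMiner, Weka, R, and Python as the main tools, with the Many Eyes website used for visualization. In the Python ecosystem, popular libraries such as NLTK (Natural Language Toolkit), spaCy, Gensim, and scikit-learn cover text processing, and Matplotlib, Seaborn, and NetworkX cover visualization.

As part of the Data Mining and Knowledge Discovery Series, Text Mining and Visualization offers theoretical background while emphasizing tool use and practical application, which makes it a valuable resource for readers who want to master text analysis and visualization. Working through the book prepares readers to process large-scale text data and to uncover valuable information in it.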

Foreword

Data analysis has received a lot of attention in recent years and the newly coined data scientist is on everybody’s radar. However, in addition to the inherent crop of new buzz words, two fundamental things have changed. Data analysis now relies on more complex and heterogeneous data sources; users are no longer content with analyzing a few numbers. They want to integrate data from different sources, scrutinizing data of diverse types. Almost more importantly, tool providers and users have realized that no single proprietary software vendor can provide the wealth of tools required for the job. This has sparked a huge increase in open-source software used for professional data analysis.

The timing of this book could not be better. It focuses on text mining, text being one of the data sources still to be truly harvested, and on open-source tools for the analysis and visualization of textual data. It explores the top two representatives of two very different types of tools: programming languages and visual workflow editing environments. R and Python are now in widespread use and allow experts to program highly versatile code for sophisticated analytical tasks. At the other end of the spectrum are visual workflow tools that enable even nonexperts to use predefined templates (or blueprints) and modify analyses. Using a visual workflow has the added benefit that intuitive documentation and guidance through the process are created implicitly. RapidMiner (version 5.3, which is still open source) and KNIME are examples of these types of tools. It is worth noting that especially the latter stands on the shoulders of giants: KNIME integrates not only R and Python but also various libraries. (Stanford’s NLP package and the Apache openNLP project, among others, are examined more closely in the book.) These enable the use of state-of-the-art methods via an easy-to-use graphical workflow editor.

In a way, the four parts of this book could therefore be read front to back. The reader starts with a visual workbench, assembling complex analytical workflows. But when a certain method is missing, the user can draw on the preferred analytical scripting language to access bleeding-edge technology that has not yet been exposed natively as a visual component. The reverse order also works. Expert coders can continue to work the way they like to work by quickly writing efficient code, and at the same time they can wrap their code into visual components and make that wisdom accessible to nonexperts as well!

Markus and Andrew have done an outstanding job bringing together this volume of both introductory and advanced material about text mining using modern open-source technology in a highly accessible way.

Prof. Dr. Michael Berthold (University Konstanz, Germany)

Preface

When people communicate, they do it in lots of ways. They write books and articles, create blogs and webpages, interact by sending messages in many different ways, and of course they speak to one another. When this happens electronically, these text data become very accessible and represent a significant and increasing resource that has tremendous potential value to a wide range of organisations. This is because text data represent what people are thinking or feeling and with whom they are interacting, and thus can be used to predict what people will do, how they are feeling about a particular product or issue, and also who else in their social group could be similar. The process of extracting value from text data, known as text mining, is the subject of this book.

There are challenges, of course. In recent years, there has been an undeniable explosion of text data being produced from a multitude of sources in large volumes and at great speed. This is within the context of the general huge increases in all forms of data. This volume and variety require new techniques to be applied to the text data to deal with them effectively. It is also true that text data by their nature tend to be unstructured, which requires specific techniques to be adopted to clean and restructure them. Interactions between people lead to the formation of networks, and to understand and exploit these requires an understanding of some potentially complex techniques.

It remains true that organisations wishing to exploit text data need new ways of working to stay ahead and to take advantage of what is available. These include general knowledge of the latest and most powerful tools, understanding the data mining process, understanding specific text mining activities, and simply getting an overview of what possibilities there are.

This book provides an introduction to text mining using some of the most popular and powerful open-source tools: KNIME, RapidMiner, Weka, R, and Python. In addition, the Many Eyes website is used to help visualise results. The chapters show text data being gathered and processed from a wide variety of sources, including books, server-access logs, websites, social media sites, and message boards. Each chapter within the book is presented as an example use case that the reader can follow as part of a step-by-step reproducible example. In the real world, no two problems are the same, and it would be impossible to produce a use case example for every one. However, the techniques, once learned, can easily be applied to other problems and extended. All the examples are downloadable from the website that accompanies this book and the use of open-source tools ensures that they are readily accessible. The book’s website is

http://www.text-mining-book.com

Text mining is a subcategory within data mining as a whole, and therefore the chapters illustrate a number of data mining techniques including supervised learning using classifiers such as naïve Bayes and support vector machines; cross-validation to estimate model performance using a variety of performance measures; and unsupervised clustering to partition data into clusters.
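To make the supervised-learning and cross-validation ideas concrete, here is a minimal pure-Python sketch. It is not taken from the book: the function names, the toy review data, and the choice of add-one smoothing and round-robin folds are all assumptions made for illustration, and the book's chapters rely on the tools' own implementations instead.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model from tokenized documents."""
    class_counts = Counter(labels)        # documents per class
    word_counts = defaultdict(Counter)    # word counts per class
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cross_validate(docs, labels, k):
    """Mean accuracy over k folds; fold i holds out items i, i+k, i+2k, ...

    Assumes k <= len(docs) so that no fold is empty.
    """
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, len(docs), k))
        train_docs = [d for i, d in enumerate(docs) if i not in test_idx]
        train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        model = train_nb(train_docs, train_labels)
        correct = sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Hypothetical toy data, invented for this sketch.
docs = [["great", "film"], ["loved", "great", "acting"], ["boring", "plot"],
        ["bad", "boring", "film"], ["great", "fun"], ["bad", "acting"]]
labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

model = train_nb(docs, labels)
print(predict_nb(model, ["great", "fun", "film"]))
print(cross_validate(docs, labels, k=3))
```

On a data set this small the cross-validated accuracy is far lower than the accuracy on the training data, which is exactly the optimism that cross-validation is meant to expose.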

Data mining requires significant preprocessing activities such as cleaning, restructuring, and handling missing values. Text mining also requires these activities, particularly when text data are extracted from webpages. Text mining also introduces new preprocessing techniques such as tokenizing, stemming, and generation of n-grams. These techniques are amply illustrated in many of the chapters. In addition, some novel techniques for applying network methods to text data gathered in the context of message websites are shown.
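The preprocessing steps named above (tokenizing, stop word removal, stemming, and n-gram generation) can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the book's procedure: the stop word list is a tiny made-up sample, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "we"}

def tokenize(text):
    """Lowercase the text and keep alphabetic runs only.

    Punctuation, digits, and white space are dropped as a side effect.
    """
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word set."""
    return [t for t in tokens if t not in stop_words]

def crude_stem(token):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n):
    """Return all contiguous word n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "We the People, in 1789, heard the speeches of presidents."
tokens = remove_stop_words(tokenize(text))
stems = [crude_stem(t) for t in tokens]
print(stems)
print(ngrams(stems, 2))
```

The crude stemmer over-strips some words (for example, it turns "mining" into "min"), which is one reason real projects use an established stemmer or a lemmatizer.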

What Is the Structure of This Book, and Which Chapters Should I Read?

The book consists of four main parts corresponding to the main tools used: RapidMiner, KNIME, Python, and R.

Part 1 about RapidMiner usage contains two chapters. Chapter 1 is titled “RapidMiner for Text Analytic Fundamentals” and is a practical introduction to the use of various open-source tools to perform the basic but important preprocessing steps that are usually necessary when performing any type of text mining exercise. RapidMiner is given particular focus, but the MySQL database and Many Eyes visualisation website are also used. The specific text corpus that is used consists of the inaugural speeches made by US presidents, and the objective of the chapter is to preprocess and import these sufficiently to give visibility to some of the features within them. The speeches themselves are available on the Internet, and the chapter illustrates how to use RapidMiner to access their locations to download the content as well as to parse it so that only the text is used. The chapter illustrates storing the speeches in a database and goes on to show how RapidMiner can be used to perform tasks like tokenising to eliminate punctuation, numbers, and white space as part of building a word vector. Stop word removal using both standard English and a custom dictionary is shown. Creation of word n-grams is also shown as well as techniques for filtering them. The final part of the chapter shows how the Many Eyes online service can take the output from the process to visualise it using a word cloud. At all stages, readers are encouraged to recreate and modify the processes for themselves.

Chapter 2 is more advanced and is titled “Empirical Zipf-Mandelbrot Variation for Sequential Windows within Documents”. It relates to the important area of authorship attribution within text mining. This technique is used to determine the author of a piece of text or sometimes who the author is not. Many attribution techniques exist, and some are based to a certain extent on departures from Zipf’s law. This law states that the rank and frequency of common words when multiplied together yield a constant. Clearly this is a simplification, and the deviations from this for a particular author may reveal a style representative of the author. Modifications to Zipf’s law have been proposed, one of which is the Zipf-Mandelbrot law. The deviations from this law may reveal similarities for works produced by the same author. This chapter uses an advanced RapidMiner process to fit, using a genetic algorithm approach, works by different authors to Zipf-Mandelbrot models and determines the deviations to visualize what similarities there are between authors. Additionally, an author’s work is randomised to produce a random sampling to determine how different the actual works are from a random book to show whether the order of words in a book contributes to an author’s style. The results are visualised using R and show some evidence that different authors have similarities of style that are not random.
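As a rough illustration of the two laws, the sketch below computes rank-times-frequency products and the squared deviation of observed frequencies from a Zipf-Mandelbrot curve. It assumes the usual form f(r) = c / (r + b)^a, which reduces to Zipf's law for a = 1 and b = 0. The function and parameter names are assumptions, and no fitting is performed: the chapter fits the parameters with a genetic algorithm in RapidMiner, which is not reproduced here.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

def zipf_products(tokens):
    """Rank x frequency per word; Zipf's law predicts a roughly constant value."""
    return [rank * freq for rank, _, freq in rank_frequency(tokens)]

def zipf_mandelbrot(rank, c, a, b):
    """Predicted frequency under f(r) = c / (r + b) ** a."""
    return c / (rank + b) ** a

def sum_squared_deviation(tokens, c, a, b):
    """Total squared gap between observed and predicted frequencies."""
    return sum((freq - zipf_mandelbrot(rank, c, a, b)) ** 2
               for rank, _, freq in rank_frequency(tokens))

# Hypothetical token stream built to follow Zipf's law exactly.
tokens = ["the"] * 6 + ["of"] * 3 + ["and"] * 2
print(zipf_products(tokens))
print(sum_squared_deviation(tokens, c=6, a=1, b=0))
```

A fitting procedure would search for the `c`, `a`, and `b` that minimise `sum_squared_deviation` for each text window, and the remaining deviations would then be compared across authors.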

Part 2 of the book describes the use of the Konstanz Information Miner (KNIME) and again contains two chapters. Chapter 3 introduces the text processing capabilities of KNIME and is titled “Introduction to the KNIME Text Processing Extension”. KNIME is a popular open-source platform that uses a visual paradigm to allow processes to be rapidly assembled and executed to allow all data processing, analysis, and mining problems to be addressed. The platform has a plug-in architecture that allows extensions to be installed, and one such is the text processing feature. This chapter describes the installation and use of this extension as part of a text mining process to predict sentiment of movie reviews. The aim of the chapter is to give a good introduction to the use of KNIME in the context of this overall classification process, and readers can use the ideas and techniques for themselves. The chapter gives more background details about the important preprocessing activities that are typically undertaken when dealing with text. These include entity recognition such as the identification of names or other domain-specific items, and tagging parts of speech to identify nouns, verbs, and so on. An important point that is especially relevant as data volumes increase is the possibility to perform processing activities in parallel to take advantage of available processing power, and to reduce the total time to process. Common preprocessing activities such as stemming, number and punctuation removal, and handling of small and stop words that are described in other chapters with other tools can also be performed with KNIME. The concepts of documents and the bag of words representation are described and the different types of word or document vectors that can be produced are explained. These include term frequencies but can use inverse document frequencies if the problem at hand requires it. Having described the background, the chapter then uses the techniques to build a classifier to predict positive or negative movie reviews based on available training data. This shows use of other parts of KNIME to build a classifier on training data, to apply it to test data, and to observe the accuracy of the prediction.
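The bag of words representation with term frequencies and inverse document frequencies can be sketched as follows. This is an illustrative assumption rather than KNIME's implementation: it uses relative term frequency and the plain idf = log(N / df) variant, whereas tools differ in their exact weighting and smoothing, and the function names are invented for the example.

```python
import math
from collections import Counter

def term_frequencies(doc_tokens):
    """Relative frequency of each term in one (non-empty) document."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {term: count / total for term, count in counts.items()}

def inverse_document_frequencies(corpus):
    """idf(term) = log(N / df), where df counts documents containing the term."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf_vectors(corpus):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    idf = inverse_document_frequencies(corpus)
    vocab = sorted(idf)
    vectors = []
    for doc in corpus:
        tf = term_frequencies(doc)
        vectors.append([tf.get(term, 0.0) * idf[term] for term in vocab])
    return vocab, vectors

# Hypothetical two-review corpus, already tokenized.
corpus = [["good", "movie"], ["bad", "movie"]]
vocab, vectors = tfidf_vectors(corpus)
print(vocab)
print(vectors)
```

A term that occurs in every document, such as "movie" here, gets an idf of zero under this variant, so it contributes nothing to telling the reviews apart.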

Chapter 4 is titled “Social Media Analysis — Text Mining Meets Network Mining” and presents a more advanced use of KNIME with a novel way to combine sentiment of users with how they are perceived as influencers in the Slashdot online forum. The approach is motivated by the marketing needs that companies have to identify users with certain traits and find ways to influence them or address the root causes of their views. With the ever increasing volume and types of online data, this is a challenge in its own right, which makes finding something actionable in these fast-moving data sources difficult. The chapter has two parts that combine to produce the result. First, a process is described that gathers user reviews from the Slashdot forum to yield an attitude score for each user. This score is the difference between positive and negative words, which is derived from a lexicon, the MPQA subjectivity lexicon in this case, although others could be substituted as the domain problem dictates. As part of an exploratory confirmation, a tag cloud of words used by an individual user is also drawn where negative and positive words are rendered in different colours. The second part of the chapter uses network analysis to find users who are termed leaders and those who are followers. A leader is one whose published articles gain more comments from others, whereas a follower is one who tends to comment more. This is done in KNIME by using the HITS algorithm often used to rate webpages. In this case, users take the place of websites, and authorities become equivalent to leaders and hubs to followers. The two different views are then combined to determine the characteristics of leaders compared with followers from an attitude perspective. The result is that leaders tend to score more [...]
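The two ingredients of Chapter 4, a lexicon-based attitude score and HITS hub and authority scores, can be sketched in plain Python. Everything specific here is an assumption for illustration: the word sets are made up (the chapter uses the MPQA subjectivity lexicon), the user names are invented, and each edge is read as "commenter -> article author" so that heavily commented authors come out as authorities. The chapter itself does this with KNIME nodes, not with this code.

```python
import math

def attitude_score(tokens, positive, negative):
    """Number of positive words minus number of negative words."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    return pos - neg

def hits(edges, iterations=50):
    """Iterative HITS on a list of (source, target) edges.

    Returns (hub, authority) score dictionaries, each L2-normalised.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of nodes linking in.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub update: sum the authority scores of nodes linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return hub, auth

# Hypothetical data: (commenter, author) pairs and a toy lexicon.
edges = [("bob", "alice"), ("carol", "alice"), ("bob", "dave")]
hub, auth = hits(edges)
leader = max(auth, key=auth.get)    # most commented-on author
follower = max(hub, key=hub.get)    # most active commenter
print(leader, follower)
print(attitude_score(["good", "bad", "great", "the"],
                     positive={"good", "great"}, negative={"bad"}))
```

Combining the two views then amounts to comparing the attitude scores of users with high authority against those of users with high hub scores.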
