Preface xix
Additionally, each author’s work is randomised to produce a random sample, and the actual
works are compared with these random books to show whether the order of words in a book
contributes to an author’s style. The results are visualised using R and show some
evidence that different authors have similarities of style that are not random.
Part 2 of the book describes the use of the Konstanz Information Miner (KNIME)
and again contains two chapters. Chapter 3 introduces the text processing capabilities of
KNIME and is titled “Introduction to the KNIME Text Processing Extension”. KNIME is
a popular open-source platform that uses a visual paradigm in which processes can be rapidly
assembled and executed to address data processing, analysis, and mining problems.
The platform has a plug-in architecture that allows extensions to be installed,
and one such extension is the text processing feature. This chapter describes the installation
and use of this extension as part of a text mining process to predict the sentiment of movie
reviews. The aim of the chapter is to give a good introduction to the use of KNIME in the
context of this overall classification process, so that readers can apply the ideas and
techniques themselves.
The chapter gives further background on the important preprocessing activities
that are typically undertaken when dealing with text. These include entity recognition,
such as the identification of names or other domain-specific items, and tagging parts of
speech to identify nouns, verbs, and so on. An important point, especially relevant as
data volumes increase, is that processing activities can be performed in parallel to take
advantage of available processing power and reduce the total processing time. Common
preprocessing activities such as stemming, number and punctuation removal, and the handling
of small words and stop words, which are described in other chapters with other tools, can
also be performed with KNIME. The concepts of documents and the bag-of-words representation
are described, and the different types of word or document vectors that can be produced are
explained. These include term frequencies, and inverse document frequencies can also be used
if the problem at hand requires them. Having described the background, the chapter then uses
the techniques to build a classifier that predicts whether movie reviews are positive or
negative, based on available training data. This shows the use of other parts of KNIME to
build a classifier on training data, to apply it to test data, and to observe the accuracy
of the prediction.
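The word vectors mentioned above can be illustrated outside KNIME with a minimal Python
sketch. It is not the chapter's workflow: the function names, the toy reviews, and the
particular weighting (relative term frequency multiplied by log(N / document frequency))
are assumptions chosen for illustration, and other TF-IDF variants are equally common.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Relative frequency of each term within one tokenised document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def inverse_document_frequencies(docs):
    """idf(t) = log(N / df(t)), where df(t) counts the documents containing t."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf_vectors(docs):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    idf = inverse_document_frequencies(docs)
    vocabulary = sorted(idf)
    vectors = []
    for tokens in docs:
        tf = term_frequencies(tokens)
        vectors.append([tf.get(term, 0.0) * idf[term] for term in vocabulary])
    return vocabulary, vectors

# Toy, made-up reviews (hypothetical data, not taken from the chapter).
docs = [
    "a great film with a great cast".split(),
    "a dull film".split(),
]
vocabulary, vectors = tfidf_vectors(docs)
```

With this weighting, a term that occurs in every document (such as "a" in the toy data)
receives a weight of zero, which is why inverse document frequencies are useful when very
common words would otherwise dominate the vectors.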
Chapter 4 is titled “Social Media Analysis — Text Mining Meets Network Mining” and
presents a more advanced use of KNIME with a novel way to combine the sentiment of users
with how they are perceived as influencers in the Slashdot online forum. The approach is
motivated by the marketing needs that companies have to identify users with certain traits
and find ways to influence them or address the root causes of their views. With the
ever-increasing volume and types of online data, this is a challenge in its own right, which makes
finding something actionable in these fast-moving data sources difficult. The chapter has
two parts that combine to produce the result. First, a process is described that gathers
user reviews from the Slashdot forum to yield an attitude score for each user. This score
is the difference between the numbers of positive and negative words used, where the polarity
of each word is taken from a lexicon, the MPQA subjectivity lexicon in this case, although
others could be substituted as the domain problem dictates. As part of an exploratory
confirmation, a tag cloud of words used by an
individual user is also drawn where negative and positive words are rendered in different
colours. The second part of the chapter uses network analysis to find users who are termed
leaders and those who are followers. A leader is one whose published articles gain more
comments from others, whereas a follower is one who tends to comment more. This is done
in KNIME by using the HITS algorithm, which is often used to rate webpages. In this case,
users take the place of websites, authorities become equivalent to leaders, and hubs to
followers. The
two different views are then combined to determine the characteristics of leaders compared
with followers from an attitude perspective. The result is that leaders tend to score more