Identifying Comparative Sentences in Text Documents
Nitin Jindal and Bing Liu
Department of Computer Science
University of Illinois at Chicago
851 South Morgan Street
Chicago, IL 60607-7053
{njindal, liub}@cs.uic.edu
ABSTRACT
This paper studies the problem of identifying comparative
sentences in text documents. The problem is related to but quite
different from sentiment/opinion sentence identification or
classification. Sentiment classification studies the problem of
classifying a document or a sentence based on the subjective
opinion of the author. An important application area of
sentiment/opinion identification is business intelligence as a
product manufacturer always wants to know consumers’ opinions
on its products. Comparisons, on the other hand, can be subjective
or objective. Furthermore, a comparison is not concerned with an
object in isolation. Instead, it compares the object with others. An
example opinion sentence is “the sound quality of CD player X is
poor”. An example comparative sentence is “the sound quality of
CD player X is not as good as that of CD player Y”. Clearly, these
two sentences give different information. Their language
constructs are quite different too. Identifying comparative
sentences is also useful in practice because direct comparisons are
perhaps one of the most convincing ways of evaluation, which
may even be more important than opinions on each individual
object. This paper proposes to study the comparative sentence
identification problem. It first categorizes comparative sentences
into different types, and then presents a novel integrated pattern
discovery and supervised learning approach to identifying
comparative sentences from text documents. Experimental results
using three types of documents, news articles, consumer reviews
of products, and Internet forum postings, show a precision of 79%
and recall of 81%. More detailed results are given in the paper.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval – Information filtering. I.2.7 [Artificial
Intelligence]: Natural Language Processing – text analysis.
General Terms
Algorithms, Performance.
Keywords
Comparative sentences, sentiment classification, text mining.
1. INTRODUCTION
Comparisons are one of the most convincing ways of evaluation.
Extracting comparative sentences from text is useful for many
applications. For example, in the business environment, whenever
a new product comes onto the market, the product manufacturer wants
to know consumer opinions on the product, and how the product
compares with those of its competitors. Much of this information
is now readily available on the Web in the form of customer
reviews, forum discussions, blogs, etc. Extracting such
information can significantly help businesses in their marketing
and product benchmarking efforts. In this paper, we focus on
comparisons. Clearly, product comparisons are useful not only to
product manufacturers but also to potential customers, as they
enable customers to make better purchasing decisions.
In the past few years, a significant amount of research has been done
on sentiment and opinion extraction and classification. In Section
2, we will discuss the existing literature and compare it with our
work, where related research from linguistics is also included.
Comparisons are related to but also quite different from sentiments
and opinions, which are subjective. Comparisons, on the other
hand, can be subjective or objective. For example, an opinion
sentence on a car may be “Car X is very ugly”. A subjective
comparative sentence may be
“Car X is much better than Car Y”
An objective comparative sentence may be
“Car X is 2 feet longer than Car Y”
We can see that in general comparative sentences use quite
different language constructs from typical opinion sentences
(although the first comparative sentence above is also an opinion). In this
paper, we aim to study the problem of identifying comparative
sentences in text documents, e.g., news articles, consumer reviews
of products, and forum discussions. This problem is challenging
because although we can see that the above example sentences all
contain some indicators (comparative adverbs and comparative
adjectives), i.e., “better”, “longer”, many sentences that contain
such words are not comparatives, e.g., “I cannot agree with you
more”. Similarly, many sentences that do not contain such
indicators are comparative sentences, e.g., “Cellphone X has
Bluetooth, but cellphone Y does not have.”
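To make the difficulty concrete, the following minimal sketch applies a naive keyword filter to the example sentences above. It is only an illustrative baseline, not the approach proposed in this paper; the indicator list `INDICATORS` and the function `has_indicator` are hypothetical names introduced here, and the word list is an assumption rather than the paper's keyword set.

```python
import re

# Hypothetical indicator words chosen for illustration only;
# this is NOT the keyword set used in the paper.
INDICATORS = {"better", "worse", "longer", "more", "less", "than"}


def has_indicator(sentence: str) -> bool:
    """Return True if the sentence contains any indicator word."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(token in INDICATORS for token in tokens)


# (sentence, is it actually a comparative sentence?)
examples = [
    ("Car X is much better than Car Y", True),
    ("Car X is 2 feet longer than Car Y", True),
    ("I cannot agree with you more", False),
    ("Cellphone X has Bluetooth, but cellphone Y does not have.", True),
]

for sentence, is_comparative in examples:
    predicted = has_indicator(sentence)
    status = "correct" if predicted == is_comparative else "wrong"
    print(f"{status:7s} predicted={predicted!s:5s} {sentence}")
```

Under these assumptions the filter handles the two car sentences, but it flags "I cannot agree with you more" as comparative and misses the Bluetooth sentence, which is the kind of error that motivates going beyond indicator words alone.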
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
SIGIR’06, August 6-11, 2006, Seattle, Washington, USA.
Copyright 2006 ACM 1-59593-369-7/06/0008...$5.00.
In this paper, we first classify comparative sentences into
different categories based on existing linguistic research. We also
expand them with additional categories that are important in
practice. We then propose a novel approach based on pattern
discovery and supervised learning to identify comparative
sentences. The basic idea of our technique is to first use a