Using TF-IDF to Determine Word Relevance in Document Queries
Juan Ramos JURAMOS@EDEN.RUTGERS.EDU
Department of Computer Science, Rutgers University, 23515 BPO Way, Piscataway, NJ, 08855
Abstract
In this paper, we examine the results of applying
Term Frequency Inverse Document Frequency
(TF-IDF) to determine what words in a corpus of
documents might be more favorable to use in a
query. As the term implies, TF-IDF calculates
values for each word in a document through an
inverse proportion of the frequency of the word
in a particular document to the percentage of
documents the word appears in. Words with
high TF-IDF numbers imply a strong
relationship with the document they appear in,
suggesting that if that word were to appear in a
query, the document could be of interest to the
user. We provide evidence that this simple
algorithm efficiently categorizes relevant words
that can enhance query retrieval.
1. Introduction
Before proceeding in depth into our experiments, it is
useful to describe the nature of the query retrieval
problem for a corpus of documents and the different
approaches used to solve it, including TF-IDF.
1.1 Query Retrieval Problem
The task of retrieving data from a user-defined query has
become so common and natural in recent years that some
might not give it a second thought. However, this
growing use of query retrieval warrants continued
research and enhancements to generate better solutions to
the problem.
Informally, query retrieval can be described as the task of
searching a collection of data, be that text documents,
databases, networks, etc., for specific instances of that
data. First, we will limit ourselves to searching a
collection of English documents. The refined problem
then becomes the task of searching this corpus for
documents that the query retrieval system considers
relevant to what the user entered as the query.
Let us describe this problem more formally. We have a
set of documents D, with the user entering a query q = w
1
,
w
2
, …, w
n
for a sequence of words w
i
. Then we wish to
return a subset D
*
of D such that for each d є D
*
, we
maximize the following probability:
P(d | q, D) (1)
(Berger & Lafferty, 1999). As the above notation
suggests, numerous approaches to this problem involve
probability and statistics, while others propose vector-
based models to enhance the retrieval.
1.2 Algorithms for Ad-Hoc Retrieval
Let us briefly examine other approaches used for
responding to queries. Intuitively, given the formal
notation we present for the problem, the use of statistical
methods has proven both popular and efficient in
responding to the problem. (Berger & Lafferty, 1999) for
example, propose a probabilistic framework that
incorporates the user’s mindset at the time the query was
entered to enhance their approximations. They suggest
that the user has a specific information need G, which is
approximated as a sequence of words q in the actual
query. By accounting for this noisy transformation of G
into q and applying Bayes’ Law to equation (1), they
show good results on returning appropriate documents
given q.
Vector-based methods for performing query retrieval also
show good promise. (Berry, Dumais & O’Brien, 1994)
suggest performing query retrieval using a popular matrix
algorithm called Latent Semantic Indexing (LSI). In
essence, the algorithm creates a reduced-dimensional
vector space that captures an n-dimensional representation
of a set of documents. When a query is entered, its
numerical representation is compared the cosine-distance
of other documents in the document space, and the
algorithm returns documents where this distance is small.
The authors’ experimental results show that this algorithm
is highly effective in query retrieval, even when the
problem entails performing information retrieval over
documents written in different languages (Littman &
Keim 1997). If certain criteria are met, they suggest that
the LSI approach can be extended to more than two
languages.
The procedure we examine with more detail is Term
Frequency Inverse Document Frequency (TF-IDF). This
weighing scheme can be categorized as a statistical