Bag of Tricks for Efficient Text Classification
Armand Joulin Edouard Grave Piotr Bojanowski Tomas Mikolov
Facebook AI Research
{ajoulin,egrave,bojanowski,tmikolov}@fb.com
Abstract
This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.
1 Introduction
Text classification is an important task in Natural Language Processing with many applications, such as web search, information retrieval, ranking and document classification (Deerwester et al., 1990; Pang and Lee, 2008). Recently, models based on neural networks have become increasingly popular (Kim, 2014; Zhang and LeCun, 2015; Conneau et al., 2016). While these models achieve very good performance in practice, they tend to be relatively slow both at train and test time, limiting their use on very large datasets.
Meanwhile, linear classifiers are often considered strong baselines for text classification problems (Joachims, 1998; McCallum and Nigam, 1998; Fan et al., 2008). Despite their simplicity, they often obtain state-of-the-art performance if the right features are used (Wang and Manning, 2012). They also have the potential to scale to very large corpora (Agarwal et al., 2014).
In this work, we explore ways to scale these baselines to very large corpora with a large output space, in the context of text classification. Inspired by recent work in efficient word representation learning (Mikolov et al., 2013; Levy et al., 2015), we show that linear models with a rank constraint and a fast loss approximation can train on a billion words within ten minutes, while achieving performance on par with the state of the art. We evaluate the quality of our approach fastText¹ on two different tasks, namely tag prediction and sentiment analysis.
2 Model architecture
A simple and efficient baseline for sentence classification is to represent sentences as bags of words (BoW) and train a linear classifier, e.g., a logistic regression or an SVM (Joachims, 1998; Fan et al., 2008). However, linear classifiers do not share parameters among features and classes. This can limit their generalization in the context of a large output space where some classes have very few examples. Common solutions to this problem are to factorize the linear classifier into low-rank matrices (Schütze, 1992; Mikolov et al., 2013) or to use multilayer neural networks (Collobert and Weston, 2008; Zhang et al., 2015).
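As an illustration of this baseline (our sketch, not code from the paper), a BoW representation and a linear classifier can be combined in a few lines of scikit-learn; the toy texts, labels, and default hyperparameters are placeholders.

# Bag-of-words + linear classifier baseline (illustrative sketch, not the
# authors' system). Toy data stands in for a real classification corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie", "terrible plot", "loved it", "boring and slow"]
labels = [1, 0, 1, 0]  # e.g., positive / negative sentiment

# Each document becomes a sparse bag-of-words count vector; a logistic
# regression is then trained directly on these features.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["slow but great"]))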
Figure 1 shows a simple linear model with a rank constraint. The first weight matrix A is a look-up table over the words. The word representations are averaged into a text representation, which is in turn fed to a linear classifier. The text representation is a hidden variable which can potentially be reused. This architecture is similar to the cbow model of Mikolov et al. (2013), where the middle word is replaced by a label. We use the softmax function f to compute the probability distribution over the predefined classes. For a set of N documents, this leads to minimizing the negative log-likelihood over the classes:
$$-\frac{1}{N}\sum_{n=1}^{N} y_n \log\bigl(f(BAx_n)\bigr),$$
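For concreteness, the following minimal NumPy sketch (ours, not the paper's implementation; the vocabulary size, dimensions, and random weights are illustrative) evaluates this objective for one document: word vectors from the look-up table A are averaged into a text representation, multiplied by the classifier matrix B, and passed through the softmax f.

import numpy as np

V, d, K = 10_000, 10, 5  # vocabulary size, embedding dim, number of classes (illustrative)
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(V, d))  # word look-up table (first weight matrix)
B = rng.normal(scale=0.1, size=(K, d))  # linear classifier over the text representation

def predict_proba(word_ids):
    """Average the word vectors into a text representation, then apply softmax."""
    hidden = A[word_ids].mean(axis=0)  # averaging is A applied to the normalized BoW x_n
    scores = B @ hidden                # BA x_n
    scores -= scores.max()             # shift for numerical stability
    p = np.exp(scores)
    return p / p.sum()

def nll(word_ids, label):
    """Negative log-likelihood of one document under the objective above."""
    return -np.log(predict_proba(word_ids)[label])

print(nll([3, 17, 256], label=2))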
¹ https://github.com/facebookresearch/fastText