How to Fine-Tune BERT for Text Classification?
Chi Sun, Xipeng Qiu∗, Yige Xu, Xuanjing Huang
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
825 Zhangheng Road, Shanghai, China
{sunc17,xpqiu,ygxu18,xjhuang}@fudan.edu.cn
∗Corresponding author
Abstract
Language model pre-training has proven to be useful in learning universal language representations. As a state-of-the-art pre-trained language model, BERT (Bidirectional Encoder Representations from Transformers) has achieved amazing results in many language understanding tasks. In this paper, we conduct exhaustive experiments to investigate different fine-tuning methods of BERT on the text classification task and provide a general solution for BERT fine-tuning. Finally, the proposed solution obtains new state-of-the-art results on eight widely-studied text classification datasets.¹
¹The source code is available at https://github.com/xuyige/BERT4doc-Classification.
1 Introduction
Text classification is a classic problem in Natural
Language Processing (NLP). The task is to assign
predefined categories to a given text sequence. An
important intermediate step is the text representa-
tion. Previous work uses various neural models
to learn text representations, including convolutional
models (Kalchbrenner et al., 2014; Zhang et al.,
2015; Conneau et al., 2016; Johnson and Zhang,
2017; Zhang et al., 2017; Shen et al., 2018), re-
current models (Liu et al., 2016; Yogatama et al.,
2017; Seo et al., 2017), and attention mechanisms
(Yang et al., 2016; Lin et al., 2017).
Alternatively, substantial work has shown that models pre-trained on a large corpus are beneficial for text classification and other NLP tasks, since they avoid training a new model from scratch. One kind of pre-trained model is word embeddings, such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), or contextualized word embeddings, such as CoVe (McCann et al., 2017) and ELMo (Peters et al., 2018). These word embeddings are often used
as additional features for the main task. Another kind of pre-trained model operates at the sentence level. Howard and Ruder (2018) propose ULMFiT, a fine-tuning method for a pre-trained language model that achieves state-of-the-art results on six widely studied text classification datasets. More recently, pre-trained language models have been shown to be useful in learning common language rep-
resentations by utilizing a large amount of unla-
beled data: e.g., OpenAI GPT (Radford et al.,
2018) and BERT (Devlin et al., 2018). BERT is
based on a multi-layer bidirectional Transformer
(Vaswani et al., 2017) and is trained on plain text
for masked word prediction and next sentence pre-
diction tasks.
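To make these two objectives concrete, the toy Python sketch below builds a single pre-training example for masked word prediction and next sentence prediction. It is illustrative only: the function make_pretraining_example is hypothetical, a plain whitespace tokenizer and a uniform masking rule stand in for BERT's WordPiece vocabulary and its 15% masking schedule (in which a selected token is replaced by [MASK], by a random token, or kept unchanged).

```python
import random

# Toy sketch of a BERT-style pre-training example. Assumptions: whitespace
# tokenization and uniform masking stand in for WordPiece and BERT's real
# masking scheme; this only illustrates the input format.
def make_pretraining_example(sent_a, sent_b, is_next, mask_prob=0.15):
    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split() + ["[SEP]"]
    segment_ids = [0] * (len(sent_a.split()) + 2) + [1] * (len(sent_b.split()) + 1)

    # Masked word prediction: hide some ordinary tokens and remember the
    # original words as the targets the model must recover.
    masked_tokens, mlm_targets = [], {}
    for i, tok in enumerate(tokens):
        if tok not in ("[CLS]", "[SEP]") and random.random() < mask_prob:
            mlm_targets[i] = tok            # position -> word to predict
            masked_tokens.append("[MASK]")
        else:
            masked_tokens.append(tok)

    # Next sentence prediction: binary label saying whether sent_b really
    # followed sent_a in the corpus.
    return {"tokens": masked_tokens,
            "segment_ids": segment_ids,
            "mlm_targets": mlm_targets,
            "nsp_label": int(is_next)}

print(make_pretraining_example("the movie was great",
                               "i would watch it again", is_next=True))
```

A real implementation would additionally map tokens to vocabulary ids and pad sequences to a fixed length.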
Although BERT has achieved amazing results
in many natural language understanding (NLU)
tasks, its potential has yet to be fully explored.
There has been little research on enhancing BERT to further improve its performance on target tasks.
In this paper, we investigate how to maximize
the utilization of BERT for the text classifica-
tion task. We explore several ways of fine-tuning
BERT to enhance its performance on the text classification task. We design exhaustive experiments to
make a detailed analysis of BERT.
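In the typical setup, fine-tuning BERT for classification means feeding the final hidden state of the [CLS] token into a softmax classifier and updating all parameters on the labeled target data. The minimal sketch below shows one such training step on toy data; it assumes the HuggingFace transformers API and default-style hyperparameters (e.g., a 2e-5 learning rate), which are illustrative choices and not necessarily those of the paper's released code.

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

# Assumption: the HuggingFace `transformers` API. BertForSequenceClassification
# places a classification head on top of the pooled [CLS] representation.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

# Toy labeled data standing in for a real text classification dataset.
texts = ["a gripping, well-acted thriller", "dull and far too long"]
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One fine-tuning step: all BERT parameters are updated, not just the classifier.
batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                  return_tensors="pt")
loss = model(**batch, labels=labels).loss   # cross-entropy over the class logits
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

The fine-tuning methods investigated in this paper start from this basic recipe and vary, among other things, how long inputs are truncated, which layers are used, and how learning rates are assigned per layer.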
The contributions of our paper are as follows:
• We propose a general solution to fine-tune
the pre-trained BERT model, which includes
three steps: (1) further pre-train BERT on
within-task training data or in-domain data;
(2) optionally fine-tune BERT with multi-
task learning if several related tasks are avail-
able; (3) fine-tune BERT for the target task.
• We also investigate the fine-tuning meth-
ods for BERT on the target task, including preprocessing of long text, layer selection, layer-
wise learning rate, catastrophic forgetting,