IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2020
Precision measures the ability of a NER system to present
only correct entities, and Recall measures the ability of a
NER system to recognize all entities in a corpus.
Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
F-score is the harmonic mean of precision and recall, and
the balanced F-score is most commonly used:

F-score = 2 × (Precision × Recall) / (Precision + Recall)
As most NER systems involve multiple entity types, it is
often required to assess the performance across all entity
classes. Two measures are commonly used for this purpose:
macro-averaged F-score and micro-averaged F-score.
Macro-averaged F-score computes the F-score independently
for each entity type, then takes the average (hence treating
all entity types equally). Micro-averaged F-score aggregates
the contributions of entities from all classes to compute the
average (treating all entities equally). The latter can be
heavily affected by the quality of recognizing entities in
large classes in the corpus.
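The contrast between the two averages can be sketched with per-class TP/FP/FN counts; the classes and counts below are hypothetical, chosen so that one large class dominates:

```python
# Sketch: macro- vs micro-averaged F-score from per-class TP/FP/FN counts.
# The entity classes and counts are hypothetical, for illustration only.

def f_score(tp, fp, fn):
    """Balanced F-score from true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A large PER class and a much smaller, poorly recognized MISC class.
counts = {
    "PER":  {"tp": 900, "fp": 100, "fn": 100},
    "MISC": {"tp": 10,  "fp": 40,  "fn": 40},
}

# Macro: F-score per class, then average (classes weighted equally).
macro_f = sum(f_score(**c) for c in counts.values()) / len(counts)

# Micro: pool TP/FP/FN over all classes, then one F-score
# (entities weighted equally).
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())
micro_f = f_score(tp, fp, fn)

print(round(macro_f, 3), round(micro_f, 3))  # prints: 0.55 0.867
```

The micro average sits close to the large class's score, illustrating why it is heavily affected by large classes.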
2.3.2 Relaxed-match Evaluation
MUC-6 [10] defines a relaxed-match evaluation: a correct
type is credited if an entity is assigned its correct type,
regardless of its boundaries, as long as there is an overlap
with the ground-truth boundaries; a correct boundary is
credited regardless of an entity's type assignment. Then
ACE [12] proposes a more complex evaluation procedure. It
resolves a few issues such as partial match and wrong type,
and considers subtypes of named entities. However, it is
problematic because the final scores are comparable only
when parameters are fixed [1], [22], [23]. Complex
evaluation methods are not intuitive and make error analysis
difficult. Thus, complex evaluation methods are not widely
used in recent studies.
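The MUC-style relaxed-match criterion above can be sketched as follows, assuming entities are represented as hypothetical `(start, end, type)` tuples with end-exclusive offsets (a simplifying assumption, not the MUC scorer itself):

```python
# Sketch of MUC-style relaxed matching; entity representation and
# function names are illustrative assumptions, not the official scorer.

def spans_overlap(pred_span, gold_span):
    """True if two (start, end) spans share at least one position."""
    return pred_span[0] < gold_span[1] and gold_span[0] < pred_span[1]

def relaxed_match(pred, gold):
    """Credit TYPE if types match and spans overlap; credit BOUNDARY
    if spans are identical, regardless of the type assignment."""
    type_ok = pred[2] == gold[2] and spans_overlap(pred[:2], gold[:2])
    boundary_ok = pred[:2] == gold[:2]
    return type_ok, boundary_ok

# Right type but shifted boundary: type credit only.
print(relaxed_match((0, 5, "PER"), (0, 8, "PER")))  # (True, False)
# Exact boundary but wrong type: boundary credit only.
print(relaxed_match((0, 8, "ORG"), (0, 8, "PER")))  # (False, True)
```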
2.4 Traditional Approaches to NER
Traditional approaches to NER are broadly classified into
three main streams: rule-based, unsupervised learning, and
feature-based supervised learning approaches [1], [26].
2.4.1 Rule-based Approaches
Rule-based NER systems rely on hand-crafted rules. Rules
can be designed based on domain-specific gazetteers [9],
[42] and syntactic-lexical patterns [43]. Kim [44] proposed
to use the Brill rule inference approach for speech input.
This system generates rules automatically based on Brill's
part-of-speech tagger. In the biomedical domain, Hanisch et
al. [45] proposed ProMiner, which leverages a pre-processed
synonym dictionary to identify protein mentions and
potential gene names in biomedical text. Quimbaya et
al. [46] proposed a dictionary-based approach for NER in
electronic health records. Experimental results show the
approach improves recall while having limited impact on
precision.
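The core of such dictionary-based systems is gazetteer lookup; a minimal sketch, with a hypothetical toy gazetteer and a greedy longest-match strategy (one common design choice, not the specific matching used by ProMiner or [46]):

```python
# Minimal sketch of gazetteer (dictionary) lookup for rule-based NER.
# The gazetteer entries and the greedy longest-match policy are
# illustrative assumptions.

gazetteer = {
    ("new", "york"): "LOC",
    ("acme", "corp"): "ORG",
    ("aspirin",): "DRUG",
}
max_len = max(len(key) for key in gazetteer)

def gazetteer_match(tokens):
    """Greedy longest-match lookup of token n-grams against the gazetteer."""
    tokens = [t.lower() for t in tokens]
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in gazetteer:
                hits.append((" ".join(key), gazetteer[key]))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts here
    return hits

print(gazetteer_match("He took Aspirin in New York".split()))
# [('aspirin', 'DRUG'), ('new york', 'LOC')]
```

The lookup only finds what the dictionary contains, which is exactly why incomplete dictionaries yield high precision but low recall.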
Some other well-known rule-based NER systems include
LaSIE-II [47], NetOwl [48], Facile [49], SAR [50],
FASTUS [51], and LTG [52]. These systems are mainly based
on hand-crafted semantic and syntactic rules to recognize
entities. Rule-based systems work very well when the
lexicon is exhaustive. Due to domain-specific rules and
incomplete dictionaries, high precision and low recall are
often observed from such systems, and the systems cannot be
transferred to other domains.
2.4.2 Unsupervised Learning Approaches
A typical approach in unsupervised learning is
clustering [1]. Clustering-based NER systems extract named
entities from the clustered groups based on context
similarity. The key idea is that lexical resources, lexical
patterns, and statistics computed on a large corpus can be
used to infer mentions of named entities. Collins et
al. [53] observed that the use of unlabeled data reduces the
requirements for supervision to just seven simple "seed"
rules. The authors then presented two unsupervised
algorithms for named entity classification. Similarly,
KNOWITALL [9] leveraged a set of predicate names as input
and bootstrapped its recognition process from a small set of
generic extraction patterns.
Nadeau et al. [54] proposed an unsupervised system for
gazetteer building and named entity ambiguity resolution.
This system combines entity extraction and disambiguation
based on simple yet highly effective heuristics. In
addition, Zhang and Elhadad [43] proposed an unsupervised
approach to extracting named entities from biomedical text.
Instead of supervision, their model resorts to terminolo-
gies, corpus statistics (e.g., inverse document frequency
and context vectors) and shallow syntactic knowledge (e.g.,
noun phrase chunking). Experiments on two mainstream
biomedical datasets demonstrate the effectiveness and gen-
eralizability of their unsupervised approach.
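The seed-rule bootstrapping idea described above can be sketched with a toy corpus of (left-context, word) pairs; the corpus, seed set, and the simple promote-everything rule are all illustrative assumptions (real systems such as [53] rank and filter candidate rules rather than accepting them all):

```python
# Toy sketch of seed bootstrapping in the spirit of unsupervised NER:
# seed names label occurrences, their contexts become extraction
# patterns, and patterns in turn label new names. All data hypothetical.

corpus = [
    ("mr", "cooper"), ("mr", "smith"), ("president", "cooper"),
    ("president", "obama"), ("the", "table"), ("the", "idea"),
]
seeds = {"cooper"}                 # seed entity names
names, patterns = set(seeds), set()

for _ in range(3):                 # a few bootstrapping rounds
    # Promote left-contexts of known names to extraction patterns...
    patterns |= {ctx for ctx, word in corpus if word in names}
    # ...then label any word occurring with a known pattern as a name.
    names |= {word for ctx, word in corpus if ctx in patterns}

print(sorted(names))  # ['cooper', 'obama', 'smith']
```

Note that "table" and "idea" are never labeled, because their context "the" was never promoted; unranked promotion like this would drift on real corpora, which is why practical systems score candidate patterns.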
2.4.3 Feature-based Supervised Learning Approaches
Applying supervised learning, NER is cast as a multi-class
classification or sequence labeling task. Given annotated
data samples, features are carefully designed to represent
each training example. Machine learning algorithms are then
utilized to learn a model to recognize similar patterns from
unseen data.
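The sequence-labeling formulation can be illustrated with the widely used BIO scheme (B-/I- prefixes mark entity beginnings and insides, O marks non-entities); the sentence and labels below are a hypothetical training example:

```python
# Sketch: NER cast as sequence labeling with the common BIO scheme.
# The sentence and gold labels are a hypothetical example.

tokens = ["Michael", "Jordan", "was", "born", "in", "Brooklyn", "."]
labels = ["B-PER",   "I-PER",  "O",   "O",    "O",  "B-LOC",    "O"]

def decode(tokens, labels):
    """Recover (entity_text, type) spans from a BIO label sequence."""
    entities, current, etype = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):          # a new entity begins
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], lab[2:]
        elif lab.startswith("I-") and current:
            current.append(tok)           # continue the open entity
        else:                             # O label closes any open entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

print(decode(tokens, labels))
# [('Michael Jordan', 'PER'), ('Brooklyn', 'LOC')]
```

A sequence model (HMM, CRF, etc.) is trained to predict one such label per token; decoding then turns label sequences back into entities.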
Feature engineering is critical in supervised NER systems.
Feature vector representation is an abstraction over text
where a word is represented by one or many Boolean, numeric,
or nominal values [1], [55]. Word-level features (e.g.,
case, morphology, and part-of-speech tag) [56]–[58], list
lookup features (e.g., Wikipedia gazetteer and DBpedia
gazetteer) [59]–[62], and document and corpus features
(e.g., local syntax and multiple occurrences) [63]–[66] have
been widely used in various supervised NER systems. More
feature designs are discussed in [1], [28], [67].
Based on these features, many machine learning algorithms
have been applied in supervised NER, including Hidden Markov
Models (HMM) [68], Decision Trees [69], Maximum Entropy
Models [70], Support Vector Machines (SVM) [71], and
Conditional Random Fields (CRF) [72].
Bikel et al. [73], [74] proposed the first HMM-based NER
system, named IdentiFinder, to identify and classify names,
dates, time expressions, and numerical quantities. In
addition, Szarvas et al. [75] developed a multilingual NER
system using the C4.5 decision tree and the AdaBoostM1
learning algorithm. A major merit is that it provides an
opportunity to train several independent decision tree
classifiers through different subsets of features, then
combine their decisions