and that is how all those lessons (and blogs, forums, tweets, etc.) are being communicated. The Web contains information in all forms of media—including texts, images, movies, and sounds—and language is the communication medium that allows people to understand the content, and to link the content to other media. However, while computers are excellent at delivering this information to interested users, they are much less adept at understanding language itself.
Theoretical and computational linguistics are focused on unraveling the deeper nature of language and capturing the computational properties of linguistic structures. Human language technologies (HLTs) attempt to adopt these insights and algorithms and turn them into functioning, high-performance programs that can impact the ways we interact with computers using language. With more and more people using the Internet every day, the amount of linguistic data available to researchers has increased significantly, allowing linguistic modeling problems to be viewed as ML tasks, rather than being limited to the relatively small amounts of data that humans are able to process on their own.
However, it is not enough to simply provide a computer with a large amount of data and
expect it to learn to speak—the data has to be prepared in such a way that the computer can more easily find patterns and draw inferences. This is usually done by adding relevant
metadata to a dataset. Any metadata tag used to mark up elements of the dataset is called
an annotation over the input. However, in order for the algorithms to learn efficiently
and effectively, the annotation done on the data must be accurate, and relevant to the
task the machine is being asked to perform. For this reason, the discipline of language
annotation is a critical link in developing intelligent human language technologies.
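For instance, an annotation can be as simple as attaching part-of-speech metadata to each token of a sentence. The sketch below is purely illustrative (the data structure is hypothetical, though the tag names follow the Penn Treebank convention):

```python
# A minimal, hypothetical example of annotation: part-of-speech
# metadata attached to each token of the input sentence.
sentence = "The cat sat"
annotation = [
    {"token": "The", "pos": "DT"},   # determiner
    {"token": "cat", "pos": "NN"},   # singular noun
    {"token": "sat", "pos": "VBD"},  # past-tense verb
]

# The raw text plus its annotation together form one entry in an
# annotated corpus; an ML algorithm trains on many such pairs.
for entry in annotation:
    print(entry["token"], entry["pos"])
```

The metadata never replaces the original text; it is layered over it, which is why the book speaks of an annotation *over* the input.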
Giving an ML algorithm too much information can slow it down and
lead to inaccurate results, or result in the algorithm being so molded to
the training data that it becomes “overfit” and provides less accurate
results than it might otherwise on new data. It’s important to think
carefully about what you are trying to accomplish, and what information is most relevant to that goal. Later in the book we will give examples
of how to find that information, and how to determine how well your
algorithm is performing at the task you’ve set for it.
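The overfitting effect described in this note can be sketched numerically. In this hypothetical example (not taken from the book), a high-degree polynomial fitted to a handful of noisy points matches the training data almost perfectly, yet generalizes worse than a simple straight line:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a simple underlying trend (y = x).
x_train = np.linspace(0.0, 1.0, 8)
y_train = x_train + rng.normal(0.0, 0.1, size=8)
x_test = np.linspace(0.05, 0.95, 8)   # held-out points the models never saw
y_test = x_test + rng.normal(0.0, 0.1, size=8)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# A degree-7 polynomial can pass through all 8 training points exactly:
# it "memorizes" the noise rather than learning the trend.
overfit = np.polyfit(x_train, y_train, 7)
# A degree-1 fit (a line) captures the trend without the noise.
simple = np.polyfit(x_train, y_train, 1)

print("overfit, train:", mse(overfit, x_train, y_train))  # near zero
print("overfit, test: ", mse(overfit, x_test, y_test))    # much larger
print("simple,  test: ", mse(simple, x_test, y_test))
```

The same logic applies to annotation: extra metadata, like extra polynomial degrees, only helps when it reflects the pattern you actually want the algorithm to learn.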
Datasets of natural language are referred to as corpora, and a single set of data annotated
with the same specification is called an annotated corpus. Annotated corpora can be
used to train ML algorithms. In this chapter we will define what a corpus is, explain
what is meant by an annotation, and describe the methodology used for enriching a
linguistic data collection with annotations for machine learning.
2 | Chapter 1: The Basics