In the top constituent task (TopConst), sen-
tences must be classified in terms of the sequence
of top constituents immediately below the sen-
tence (S) node. An encoder that successfully ad-
dresses this challenge is not only capturing latent
syntactic structures, but also clustering them by con-
stituent types. TopConst was introduced by Shi
et al. (2016). Following them, we frame it as a
20-way classification problem: 19 classes for the
most frequent top constructions, and one for all
other constructions. As an example, “[Then] [very
dark gray letters on a black screen] [appeared] [.]”
has top constituent sequence: “ADVP NP VP .”.
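To make the labeling concrete, the following sketch (our illustration, not the original preprocessing code) shows how a TopConst label could be read off a bracketed parse with NLTK; the TOP19 set below is a hypothetical stand-in for the 19 most frequent top-constituent sequences.

from nltk import Tree

# Hypothetical stand-in for the 19 most frequent top-constituent sequences.
TOP19 = {"ADVP NP VP .", "NP VP ."}

def top_constituent_label(parse_str):
    """Return the sequence of constituents immediately below the S node."""
    tree = Tree.fromstring(parse_str)
    # Skip a ROOT/TOP wrapper node if the parser adds one.
    s_node = tree[0] if tree.label() in ("ROOT", "TOP") else tree
    sequence = " ".join(child.label() for child in s_node)
    # 20-way classification: 19 frequent sequences plus a catch-all class.
    return sequence if sequence in TOP19 else "OTHER"

print(top_constituent_label(
    "(ROOT (S (ADVP (RB Then)) (NP (JJ dark) (NNS letters)) (VP (VBD appeared)) (. .)))"
))  # -> ADVP NP VP .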
Note that, while we would not expect an un-
trained human subject to be explicitly aware of
tree depth or top constituency, similar information
must be implicitly computed to correctly parse
sentences, and there is suggestive evidence that the
brain tracks something akin to tree depth during
sentence processing (Nelson et al., 2017).
Semantic information. These tasks also rely on
syntactic structure, but they further require some
understanding of what a sentence denotes. The
Tense task asks for the tense of the main-clause
verb (VBP/VBZ forms are labeled as present,
VBD as past). No target form occurs in more than
one of the train/dev/test partitions, so that classifiers cannot rely
on specific words (it is not clear that Shi and col-
leagues, who introduced this task, controlled for
this factor). The subject number (SubjNum) task
focuses on the number of the subject of the main
clause (number in English is more often explic-
itly marked on nouns than verbs). Again, there
is no target overlap across partitions. Similarly,
object number (ObjNum) tests for the number of
the direct object of the main clause (again, avoid-
ing lexical overlap). To solve the previous tasks
correctly, an encoder must not only capture tense
and number, but also extract structural informa-
tion (about the main clause and its arguments).
We grouped Tense, SubjNum and ObjNum with
the semantic tasks, since, at least for models that
treat words as unanalyzed input units (without
access to morphology), they must rely on what
a sentence denotes (e.g., whether the described
event took place in the past), rather than on struc-
tural/syntactic information. We recognize, how-
ever, that the boundary between syntactic and se-
mantic tasks is somewhat arbitrary.
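As an illustration of the overlap constraint (a sketch under our own assumptions, not the released data-construction script), one way to guarantee that no target form is shared across partitions is to assign all occurrences of a form to a single split:

import random
from collections import defaultdict

def split_without_target_overlap(examples, seed=0):
    """examples: (sentence, target_form, label) tuples; the target form is,
    e.g., the main-clause verb for Tense or the subject head noun for SubjNum."""
    by_form = defaultdict(list)
    for sent, form, label in examples:
        by_form[form].append((sent, form, label))
    splits = {"train": [], "dev": [], "test": []}
    rng = random.Random(seed)
    for form, exs in by_form.items():
        # Every occurrence of a target form lands in exactly one partition
        # (80/10/10 is an assumed ratio), so no form is seen at both train
        # and test time.
        part = rng.choices(["train", "dev", "test"], weights=[8, 1, 1])[0]
        splits[part].extend(exs)
    return splits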
In the semantic odd man out (SOMO) task, we
modified sentences by replacing a random noun
or verb $o$ with another noun or verb $r$. To make
the task more challenging, the bigrams formed by
the replacement with the previous and following
words in the sentence have frequencies that are
comparable (on a log-scale) with those of the orig-
inal bigrams. That is, if the original sentence con-
tains bigrams $w_{n-1}\,o$ and $o\,w_{n+1}$, the corresponding
bigrams $w_{n-1}\,r$ and $r\,w_{n+1}$ in the modified
sentence will have comparable corpus frequencies.
No sentence is included in both original and modi-
fied format, and no replacement is repeated across
train/dev/test sets. The task of the classifier is to
tell whether a sentence has been modified or not.
An example modified sentence is: “No one could
see this Hayes and I wanted to know if it was
real or a spoonful (orig.: ploy).” Note that judg-
ing plausibility of a syntactically well-formed sen-
tence of this sort will often require grasping rather
subtle semantic factors, ranging from selectional
preference to topical coherence.
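A rough reconstruction of the replacement criterion (our sketch; bigram_freq and the tolerance tol are assumptions, not values from the original pipeline) is:

import math

def acceptable_replacement(prev_w, orig, next_w, cand, bigram_freq, tol=1.0):
    """Accept candidate `cand` only if both bigrams it forms with the
    neighbouring words have log-frequencies within `tol` of the originals."""
    pairs = [((prev_w, cand), (prev_w, orig)),   # left bigram: new vs. original
             ((cand, next_w), (orig, next_w))]   # right bigram: new vs. original
    for new_bg, old_bg in pairs:
        f_new, f_old = bigram_freq.get(new_bg, 0), bigram_freq.get(old_bg, 0)
        if f_new == 0 or f_old == 0:
            return False
        if abs(math.log(f_new) - math.log(f_old)) > tol:
            return False
    return True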
The coordination inversion (CoordInv) bench-
mark contains sentences made of two coordinate
clauses. In half of the sentences, we inverted the
order of the clauses. The task is to tell whether
a sentence is intact or modified. Sentences
are balanced in terms of clause length, and no
sentence appears in both original and inverted
versions. As an example, original “They might
be only memories, but I can still feel each one”
becomes: “I can still feel each one, but they might
be only memories.” Often, addressing CoordInv
requires an understanding of broad discourse and
pragmatic factors.
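The inversion itself can be pictured with a toy sketch (our illustration, assuming the two clauses are joined by an explicit coordinator such as ", but"):

def invert_coordination(sentence, conj=", but "):
    """Swap the two coordinate clauses around the conjunction."""
    if conj not in sentence:
        return None  # not a usable two-clause sentence
    left, right = sentence.split(conj, 1)
    punct = right[-1] if right and right[-1] in ".!?" else ""
    right = right.rstrip(".!?")
    # Re-capitalise the new first clause and lower-case the old one.
    return right[0].upper() + right[1:] + conj + left[0].lower() + left[1:] + punct

print(invert_coordination(
    "They might be only memories, but I can still feel each one."
))  # -> I can still feel each one, but they might be only memories.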
The Hum. Eval. row of Table 2 reports human-
validated “reasonable” upper bounds for all the
tasks, estimated in different ways, depending on
the tasks. For the surface ones, there is always a
straightforward correct answer that a human an-
notator with enough time and patience could find.
The upper bound is thus estimated at 100%. The
TreeDepth, TopConst, Tense, SubjNum and Ob-
jNum tasks depend on automated PoS and pars-
ing annotation. In these cases, the upper bound
is given by the proportion of sentences correctly
annotated by the automated procedure. To esti-
mate this quantity, one linguistically-trained au-
thor checked the annotation of 200 randomly sam-
pled test sentences from each task. Finally, the
BShift, SOMO and CoordInv manipulations can
accidentally generate acceptable sentences. For