minated with a closing brace (‘}’). We ensure that
all arrays started with an open square bracket (‘[’)
are terminated with a closing square bracket ( ‘]’).
The result is written in document order on a tape:
an array of 64-bit words. The tape contains a word
for each node value (string, number, true, false, null)
and a word at the beginning and at the end of each
object or array. To ensure fast navigation, the words
on the tape corresponding to braces or brackets are
annotated so that we can go from the word at the
start of an object or array to the word at the end of
the array without reading the content of the array
or object.
We have a secondary array where normalized string
values are stored. Other parsers like RapidJSON or
sajson may store the normalized strings directly in
the input bytes.
At the end of the two stages, we report whether the
JSON document is valid [4]. All strings are normalized
and all numbers have been parsed and validated.
Our two-stage design is motivated by performance
concerns. Stage 1 operates directly on the input bytes,
processing the data in batches of 64 bytes. In this man-
ner, we can make full use of the SIMD instructions that
are key to our good performance. Except for unicode
validation, we deliberately delay number and string val-
idation to stage 2, as these tasks are comparatively
expensive and difficult to perform unconditionally and
cheaply over our entire input.
3.1 Stage 1: Structural and Pseudo-Structural
Elements
The first stage of our processing must identify key points
in our input: the structural characters of JSON (brace,
bracket, colon and comma), the start and end of strings
as delineated by double quote characters, other JSON
atoms that are not distinguishable by simple charac-
ters ( true, false, null and numbers), as well as dis-
covering these characters and atoms in the presence of
both quoting conventions and backslash escaping con-
ventions.
In JSON, a first pass over the input can efficiently
discover the significant characters that delineate syntac-
tic elements (objects and arrays). Unfortunately, these
characters may also appear between quotes, so we need
to identify quotes. It is also necessary to identify the
backslash character because JSON allows escaped char-
acters: ‘\”’, ‘\\’, ‘\/’, ‘\b’, ‘\f’, ‘\n’, ‘\r’, ‘\t’, as well
as escaped unicode characters (e.g. \uDD1E).
A point of reference is Mison [12], a fast parser in
C++. Mison uses vector instructions to identify the
colons, braces, quotes and backslashes. The detected
quotes and backslashes are used to filter out the in-
significant colons and braces. We follow the broad out-
line of the construction of a structural index as set forth
in Mison; first, the discovery of odd-length sequences
of backslash characters—which will cause quote char-
acters immediately following to be escaped and not
serve their quoting role but instead be literal charac-
ters, second, the discovery of quote pairs—which cause
structural characters within the quote pairs to also be
merely literal characters and have no function as struc-
tural characters, then finally the discovery of structural
characters not contained within the quote pairs. We
depart from the Mison paper in method and overall de-
sign. The Mison authors loop over the results of their
initial SIMD identification of characters, while we pro-
pose branchless sequences to accomplish similar tasks.
For example, to locate escaped quote characters, they
iterate over the repeated quote characters. Their Al-
gorithm 1 identifies the location of the quoted charac-
ters by iterating through the unescaped quote charac-
ters. We have no such loops in our stage 1: it is es-
sentially branchless, with a fixed cost per input bytes
(except for character-encoding validation, § 3.1.5). Fur-
thermore, Mison’s processing is more limited by design
as it does not identify the locations of the atoms, it does
not process the white-space characters and it does not
validate the character encoding.
3.1.1 Identification of the quoted substrings
Identifying escaped quotes is less trivial than it appears.
While it is easy to recognize that the string “\"” is made
of an escaped quote since a quote character immediately
preceded by a backslash, if a quote is preceded by an
even number of backslashes (e.g., “\\"”), then it is not
escaped since \\ is an escaped backslash. We distinguish
sequences of backslash characters starting at an odd
index location from sequences starting at even index
location. A sequence of characters that starts at an odd
(resp. even) index location and ends at an odd (resp.
even) index location must have an even length, and it is
therefore a sequence of escaped backslashes. Otherwise,
the sequence contains an odd number of backslashes
and any quote character following it must be considered
escaped. We provide the code sequence with an example
in Fig. 2 where two quote characters are escaped.
1
1
We simplify this sequence for clarity. Our results are af-
fected by the previous iteration over the preceding 64 byte
input if any. Suppose a single backslash ended the previous
64 byte input; this alters the results of the previous algorithm.
We similarly elide the full details of the adjustments for pre-
vious loop state in our presentation of subsequent algorithms.
4