Parsing Gigabytes of JSON per Second on a Single Core: simdjson Performance Optimization


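This document discusses the paper "Parsing Gigabytes of JSON per Second" by Geoff Langdale and Daniel Lemire, written in February 2019 (arXiv:1902.08318). JSON (JavaScript Object Notation) is a data exchange format used widely on the Internet, and processing it can become a performance bottleneck when the volume of data is large. Although JSON parsing is a mature problem, the authors show that substantial speedups are still possible. They present simdjson, a standard-compliant JSON parser that processes gigabytes of data per second on a single core of a commodity processor, considerably faster than a state-of-the-art reference parser such as RapidJSON. A key feature is its extensive use of Single Instruction, Multiple Data (SIMD) instructions, which sets it apart from other validating parsers. simdjson is available as open-source software under a liberal license, which supports reuse of the code and community participation.

In the introduction, the authors note the widespread use of JSON in browser-server communication and its support in database systems such as MySQL, PostgreSQL, IBM DB2, SQL Server and Oracle. Beyond the technical details of how SIMD instructions speed up parsing, the paper discusses the central role of JSON in modern Internet architectures and the importance of optimized parsers. It is a useful reference for developers who process data at scale.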

... terminated with a closing brace (‘}’). We ensure that all arrays started with an open square bracket (‘[’) are terminated with a closing square bracket (‘]’). The result is written in document order on a tape: an array of 64-bit words. The tape contains a word for each node value (string, number, true, false, null) and a word at the beginning and at the end of each object or array. To ensure fast navigation, the words on the tape corresponding to braces or brackets are annotated so that we can go from the word at the start of an object or array to the word at the end of that object or array without reading its content.
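The tape layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the word layout (an 8-bit tag in the top bits, a 56-bit payload below), the tag characters, the helper names and the side list `scalars` are all hypothetical, and the standard `json` module does the actual parsing so that the sketch can focus on the tape.

```python
import json

TAG_SHIFT = 56                       # assumed layout: 8-bit tag, 56-bit payload
PAYLOAD_MASK = (1 << TAG_SHIFT) - 1

def make_word(tag: str, payload: int = 0) -> int:
    """Pack a one-character tag and a payload into one 64-bit tape word."""
    return (ord(tag) << TAG_SHIFT) | (payload & PAYLOAD_MASK)

def tag_of(word: int) -> str:
    return chr(word >> TAG_SHIFT)

def payload_of(word: int) -> int:
    return word & PAYLOAD_MASK

def write_tape(value, tape: list, scalars: list) -> None:
    """Append the words for `value` to `tape` in document order."""
    if isinstance(value, (dict, list)):
        is_object = isinstance(value, dict)
        open_tag, close_tag = ("{", "}") if is_object else ("[", "]")
        start = len(tape)
        tape.append(0)                             # placeholder, patched below
        if is_object:
            for key, member in value.items():
                write_tape(key, tape, scalars)
                write_tape(member, tape, scalars)
        else:
            for element in value:
                write_tape(element, tape, scalars)
        end = len(tape)
        tape.append(make_word(close_tag, start))   # end word points back to start
        tape[start] = make_word(open_tag, end)     # start word points to end
    elif value is True:
        tape.append(make_word("t"))
    elif value is False:
        tape.append(make_word("f"))
    elif value is None:
        tape.append(make_word("n"))
    else:
        # Strings and numbers: the payload indexes a side list (an assumption).
        tag = '"' if isinstance(value, str) else "d"
        scalars.append(value)
        tape.append(make_word(tag, len(scalars) - 1))

def build_tape(text: str):
    tape, scalars = [], []
    write_tape(json.loads(text), tape, scalars)
    return tape, scalars
```

With this layout, skipping an array is a single lookup: if `tape[i]` is a `[` word, the matching `]` word sits at index `payload_of(tape[i])`, and none of the words in between need to be read.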

We have a secondary array where normalized string values are stored. Other parsers like RapidJSON or sajson may store the normalized strings directly in the input bytes.

At the end of the two stages, we report whether the JSON document is valid [4]. All strings are normalized and all numbers have been parsed and validated.

Our two-stage design is motivated by performance concerns. Stage 1 operates directly on the input bytes, processing the data in batches of 64 bytes. In this manner, we can make full use of the SIMD instructions that are key to our good performance. Except for Unicode validation, we deliberately delay number and string validation to stage 2, as these tasks are comparatively expensive and difficult to perform unconditionally and cheaply over our entire input.
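To make the 64-byte batching concrete, the following Python sketch turns each block into a 64-bit bitmask with one bit per input byte. It is a scalar stand-in for what SIMD comparisons would compute, not the paper's code: the function names, the space padding of the last block and the `STRUCTURAL` constant are assumptions made for illustration.

```python
BLOCK_SIZE = 64
STRUCTURAL = b"{}[]:,"   # braces, brackets, colon and comma

def split_into_blocks(data: bytes):
    """Yield 64-byte blocks, padding the last one with spaces (an assumption)."""
    for offset in range(0, len(data), BLOCK_SIZE):
        yield data[offset:offset + BLOCK_SIZE].ljust(BLOCK_SIZE, b" ")

def class_bitmask(block: bytes, chars: bytes) -> int:
    """Set bit i when block[i] is one of `chars`.

    A SIMD implementation would get this from a few vector comparisons;
    the byte-by-byte loop here only illustrates the result.
    """
    mask = 0
    for i, byte in enumerate(block):
        if byte in chars:
            mask |= 1 << i
    return mask

def bit_positions(mask: int) -> list:
    """List the indexes of the set bits, lowest first."""
    return [i for i in range(BLOCK_SIZE) if (mask >> i) & 1]
```

For the input `{"a":[1,2]}`, the structural bitmask of the single block has bits 0, 4, 5, 7, 9 and 10 set. Once the input is reduced to such masks, later steps can work on 64 positions at a time with ordinary integer operations.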

3.1 Stage 1: Structural and Pseudo-Structural Elements

The first stage of our processing must identify key points in our input: the structural characters of JSON (brace, bracket, colon and comma), the start and end of strings as delineated by double quote characters, and other JSON atoms that are not distinguishable by simple characters (true, false, null and numbers). It must discover these characters and atoms in the presence of both quoting conventions and backslash escaping conventions.

In JSON, a first pass over the input can efficiently discover the significant characters that delineate syntactic elements (objects and arrays). Unfortunately, these characters may also appear between quotes, so we need to identify quotes. It is also necessary to identify the backslash character because JSON allows escaped characters: ‘\"’, ‘\\’, ‘\/’, ‘\b’, ‘\f’, ‘\n’, ‘\r’, ‘\t’, as well as escaped Unicode characters (e.g., \uDD1E).

A point of reference is Mison [12], a fast parser in C++. Mison uses vector instructions to identify the colons, braces, quotes and backslashes. The detected quotes and backslashes are used to filter out the insignificant colons and braces. We follow the broad outline of the construction of a structural index as set forth in Mison: first, the discovery of odd-length sequences of backslash characters, which cause quote characters immediately following them to be escaped, so that they are literal characters and do not serve their quoting role; second, the discovery of quote pairs, which cause structural characters within the quote pairs to be merely literal characters with no function as structural characters; and finally, the discovery of structural characters not contained within the quote pairs.

We depart from the Mison paper in method and overall design. The Mison authors loop over the results of their initial SIMD identification of characters, while we propose branchless sequences to accomplish similar tasks. For example, to locate escaped quote characters, they iterate over the repeated quote characters. Their Algorithm 1 identifies the location of the quoted characters by iterating through the unescaped quote characters. We have no such loops in our stage 1: it is essentially branchless, with a fixed cost per input byte (except for character-encoding validation, § 3.1.5). Furthermore, Mison's processing is more limited by design: it does not identify the locations of the atoms, it does not process the white-space characters, and it does not validate the character encoding.
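The last two steps of this outline (quote pairs, then structural characters outside them) can be sketched without any data-dependent loop. The Python below is a hedged illustration rather than the paper's code: it assumes the block contains no escaped quotes, and it uses a prefix XOR built from a fixed series of shifts as one possible way to turn quote positions into an "inside a string" mask. The excerpt does not say how that mask is computed, so this choice is an assumption, as are all function names.

```python
MASK64 = (1 << 64) - 1

def bitmask(block: bytes, chars: bytes) -> int:
    """Set bit i when block[i] is one of `chars` (scalar stand-in for SIMD)."""
    mask = 0
    for i, byte in enumerate(block[:64]):
        if byte in chars:
            mask |= 1 << i
    return mask

def prefix_xor(mask: int) -> int:
    """Bit i of the result is the XOR of bits 0..i of `mask`.

    The number of steps is fixed (six shifts), whatever the input is.
    """
    for shift in (1, 2, 4, 8, 16, 32):
        mask ^= (mask << shift) & MASK64
    return mask

def structural_outside_strings(block: bytes) -> int:
    quotes = bitmask(block, b'"')      # assumes every quote here is unescaped
    in_string = prefix_xor(quotes)     # from each opening quote up to, not including, the closing one
    return bitmask(block, b"{}[]:,") & ~in_string

def bit_positions(mask: int) -> list:
    return [i for i in range(64) if (mask >> i) & 1]
```

For the block `{"a:b":"c,d"}`, the colon at index 3 and the comma at index 9 lie inside strings and are filtered out, leaving the structural characters at indexes 0, 6 and 12.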

3.1.1 Identification of the quoted substrings

Identifying escaped quotes is less trivial than it appears. While it is easy to recognize that the string “\"” is made of an escaped quote, since the quote character is immediately preceded by a backslash, a quote preceded by an even number of backslashes (e.g., “\\"”) is not escaped, since \\ is an escaped backslash. We distinguish sequences of backslash characters starting at an odd index location from sequences starting at an even index location. A sequence of backslash characters that starts at an odd (resp. even) index location and ends at an odd (resp. even) index location (taking the end to be the index just past the last backslash) must have an even length, and it is therefore a sequence of escaped backslashes. Otherwise, the sequence contains an odd number of backslashes and any quote character following it must be considered escaped. We provide the code sequence with an example in Fig. 2 where two quote characters are escaped.
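Since Fig. 2 is not part of this excerpt, here is a minimal Python sketch of the odd/even-start idea, assuming one loop-free way to realize it: adding the start bit of a backslash run to the backslash mask makes a carry ripple through the run and land on the index just past it, whose parity can then be compared with the parity of the start. The function names are hypothetical, this is not necessarily the authors' exact sequence, and the state carried over from a previous 64-byte block is ignored.

```python
MASK64 = (1 << 64) - 1
EVEN_BITS = 0x5555555555555555   # bits at even index locations
ODD_BITS = 0xAAAAAAAAAAAAAAAA    # bits at odd index locations

def char_bitmask(block: bytes, char: bytes) -> int:
    """Set bit i when block[i] equals the single byte `char`."""
    mask = 0
    for i, byte in enumerate(block[:64]):
        if byte == char[0]:
            mask |= 1 << i
    return mask

def after_odd_backslash_run(backslashes: int) -> int:
    """Mark the index just past every odd-length run of backslashes."""
    starts = backslashes & ~(backslashes << 1)      # first backslash of each run
    # Runs starting at an even index: the carry lands just past the run.
    even_carries = (backslashes + (starts & EVEN_BITS)) & MASK64
    even_ends = even_carries & ~backslashes
    # Runs starting at an odd index: same trick.
    odd_carries = (backslashes + (starts & ODD_BITS)) & MASK64
    odd_ends = odd_carries & ~backslashes
    # Even start with odd end, or odd start with even end: odd length.
    return (even_ends & ODD_BITS) | (odd_ends & EVEN_BITS)

def escaped_quote_positions(block: bytes) -> list:
    quotes = char_bitmask(block, b'"')
    escaped = quotes & after_odd_backslash_run(char_bitmask(block, b"\\"))
    return [i for i in range(64) if (escaped >> i) & 1]
```

For example, in the four bytes `a\"b` the quote at index 2 follows a single backslash and is reported as escaped, while in `a\\"b` the quote follows two backslashes and is not.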

We simplify this sequence for clarity. Our results are affected by the previous iteration over the preceding 64-byte input, if any. Suppose a single backslash ended the previous 64-byte input; this alters the results of the previous algorithm. We similarly elide the full details of the adjustments for previous loop state in our presentation of subsequent algorithms.
