PDF 32000-1:2008
12 © Adobe Systems Incorporated 2008 – All rights reserved
level syntactic entities, principally objects, which are the basic data values from which a PDF document is
constructed.
A non-encrypted PDF can be entirely represented using byte values corresponding to the visible printable
subset of the character set defined in ANSI X3.4-1986, plus white space characters. However, a PDF file is not
restricted to the ASCII character set; it may contain arbitrary bytes, subject to the following considerations:
• The tokens that delimit objects and that describe the structure of a PDF file shall use the ASCII character
set. In addition, all the reserved words and the names used as keys in PDF standard dictionaries and
certain types of arrays shall be defined using the ASCII character set.
• The data values of string and stream objects may be written either entirely using the ASCII character set
or entirely in binary data. In actual practice, data that is naturally binary, such as sampled images, is
usually represented in binary for compactness and efficiency.
• A PDF file containing binary data shall be transported as a binary file rather than as a text file to ensure that
all bytes of the file are faithfully preserved.
NOTE 1 A binary file is not portable to environments that impose reserved character codes, maximum line lengths,
end-of-line conventions, or other restrictions.
NOTE 2 In this clause, the usage of the term character is entirely independent of any logical meaning that the value
may have when it is treated as data in specific contexts, such as representing human-readable text or
selecting a glyph from a font.
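The considerations above suggest a simple check for whether a file can be handled as text or has to be transported as binary. The following is a minimal sketch, not part of this standard. It assumes that the "visible printable subset" means byte values 21h through 7Eh and that "white space characters" means the six values listed in Table 1; the function name needs_binary_transport is hypothetical.

```python
# Assumption: "visible printable" ASCII is taken to be 0x21..0x7E inclusive.
PRINTABLE_ASCII = frozenset(range(0x21, 0x7F))

# Assumption: white space is the set of six byte values listed in Table 1.
WHITE_SPACE = frozenset({0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20})

ALLOWED_IN_TEXT = PRINTABLE_ASCII | WHITE_SPACE


def needs_binary_transport(data: bytes) -> bool:
    """Return True if data contains any byte outside printable ASCII plus white space.

    Hypothetical helper: a True result means the file holds binary data and
    should be moved as a binary file so that every byte is preserved.
    """
    return any(b not in ALLOWED_IN_TEXT for b in data)
```

A file for which this returns False could, in principle, survive text-mode transfer, although treating every PDF file as binary is the safer default.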
7.2.2 Character Set
The PDF character set is divided into three classes, called regular, delimiter, and white-space characters. This
classification determines the grouping of characters into tokens. The rules defined in this sub-clause apply to
all characters in the file except within strings, streams, and comments.
The white-space characters shown in Table 1 separate syntactic constructs such as names and numbers from
each other. All white-space characters are equivalent, except
in comments, strings, and streams. In all other
contexts, PDF treats any sequence of consecutive white-space characters as one character.
The CARRIAGE RETURN (0Dh) and LINE FEED (0Ah) characters, also called newline characters, shall be
treated as end-of-line (EOL) markers. The combination of a CARRIAGE RETURN followed immediately by a
LINE FEED shall be treated as one EOL marker. EOL markers may be treated the same as any other
white-space characters. However, an EOL marker is sometimes required or recommended, namely when it precedes a
token that must appear at the beginning of a line.
NOTE The examples in this standard use a convention that arranges tokens into lines. However, the examples’ use of
white space for indentation is purely for clarity of exposition and need not be included in practical use.
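The EOL rule above (CARRIAGE RETURN alone, LINE FEED alone, or CARRIAGE RETURN followed immediately by LINE FEED each count as one marker) can be illustrated with a short sketch. This is an illustration under stated assumptions, not a conforming reader: it applies the rule to raw bytes and ignores the exceptions for strings, streams, and comments. The names EOL, split_lines, and count_eol_markers are hypothetical.

```python
import re

# The CR LF alternative is listed first so that the pair is matched as one
# marker before a lone CR or a lone LF is considered.
EOL = re.compile(rb"\r\n|\r|\n")


def split_lines(data: bytes) -> list[bytes]:
    """Split data at every EOL marker (CR LF, CR, or LF)."""
    return EOL.split(data)


def count_eol_markers(data: bytes) -> int:
    """Count EOL markers, treating CR immediately followed by LF as one."""
    return len(EOL.findall(data))
```

Note that the order matters only for CARRIAGE RETURN followed by LINE FEED; a LINE FEED followed by a CARRIAGE RETURN is counted as two separate markers.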
Table 1 – White-space characters
Decimal  Hexadecimal  Octal  Name
0        00           000    Null (NUL)
9        09           011    HORIZONTAL TAB (HT)
10       0A           012    LINE FEED (LF)
12       0C           014    FORM FEED (FF)
13       0D           015    CARRIAGE RETURN (CR)
32       20           040    SPACE (SP)
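As an illustration of how Table 1 might be used, the sketch below tests a byte for membership in the white-space class and splits a byte sequence at white space, treating any run of consecutive white-space characters as a single separator. It is a simplification under stated assumptions: delimiter characters are not handled, and the exceptions for comments, strings, and streams are ignored. The names is_white_space and split_on_white_space are hypothetical.

```python
# The six white-space byte values from Table 1.
WHITE_SPACE = frozenset({0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20})


def is_white_space(byte: int) -> bool:
    """Return True if the byte value is one of the Table 1 white-space characters."""
    return byte in WHITE_SPACE


def split_on_white_space(data: bytes) -> list[bytes]:
    """Split data into runs of non-white-space bytes.

    Any sequence of consecutive white-space characters acts as one separator,
    so no empty pieces are produced.
    """
    pieces = []
    current = bytearray()
    for b in data:
        if is_white_space(b):
            if current:  # close the piece that was being collected
                pieces.append(bytes(current))
                current.clear()
        else:
            current.append(b)
    if current:  # flush the final piece, if any
        pieces.append(bytes(current))
    return pieces
```

Unlike Python's built-in bytes.split() with no arguments, this treats NUL (00h) as white space, which matches Table 1.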