A survey of Unicode compression January 30, 2004
(CR) must be preceded by a quote tag. The most annoying impediment to ASCII control-code
transparency is that 0C (form feed) must be quoted even though it isn’t used as an SCSU tag.
Form feeds appear in a wide variety of text files, including nearly all Internet RFCs.
Here is an example of a short sentence, using the Latin and Cyrillic scripts, encoded in SCSU:
“Moscow” is Москва.
05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E
In this example, the “curly” double quotation marks U+201C and U+201D are each encoded using
an SQ4 tag (byte 05), signaling the use of static window 4, which begins at U+2000, followed by
a byte (1C or 1D) giving the offset of the desired character within the window. SCSU compressors
usually encode non-ASCII punctuation characters like the curly quotes by quoting from static
windows instead of switching to dynamic windows, because such characters usually appear in
isolation and this approach is more efficient than switching windows. For the Cyrillic characters,
the encoder used an SC2 tag (byte 12) to switch to dynamic window 2, which is pre-defined to the
Cyrillic block starting at U+0400. Each Cyrillic letter then occupies a single byte, 80 plus its
offset within the window; for example, М (U+041C) becomes 9C.
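The decoding steps just described can be traced with a minimal sketch. It is not a complete SCSU decoder: it assumes single-byte mode only, the default window positions, and just the SQn and SCn tags that the example uses. The function name `decode_scsu_subset` is hypothetical.

```python
# Minimal sketch of an SCSU decoder for a SUBSET of the format:
# single-byte mode, default window positions, SQn and SCn tags only.
# Window start values are the defaults; everything else is unsupported.

STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300,
                  0x2000, 0x2080, 0x2100, 0x3000]
DEFAULT_DYNAMIC_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600,
                           0x0900, 0x3040, 0x30A0, 0xFF00]


def decode_scsu_subset(data: bytes) -> str:
    """Decode bytes that use only pass-through bytes, SQn and SCn."""
    active = 0  # dynamic window 0 is active in the initial state
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if 0x01 <= b <= 0x08:
            # SQn: quote one character from window n, no state change.
            if i >= len(data):
                raise ValueError("truncated input after SQn tag")
            n = b - 0x01
            q = data[i]
            i += 1
            if q < 0x80:
                out.append(chr(STATIC_WINDOWS[n] + q))
            else:
                out.append(chr(DEFAULT_DYNAMIC_WINDOWS[n] + q - 0x80))
        elif 0x10 <= b <= 0x17:
            # SCn: make dynamic window n the active window.
            active = b - 0x10
        elif b >= 0x80:
            # Byte in the upper half: offset into the active window.
            out.append(chr(DEFAULT_DYNAMIC_WINDOWS[active] + b - 0x80))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            # ASCII text and the four pass-through controls.
            out.append(chr(b))
        else:
            raise ValueError(f"tag {b:#04x} is outside this sketch")
    return "".join(out)


example = bytes.fromhex(
    "05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E"
)
print(decode_scsu_subset(example))
```

Run on the 22 bytes shown above, the sketch recovers the original sentence, including both curly quotes and the Cyrillic word.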
This example shows how a string of 19 Unicode characters can be compressed to 22 bytes; the
same text in UTF-8 would require 29 bytes. Note that this sample string is only intended as a
demonstration of how SCSU works, not as a benchmark. Actual compression performance will
vary, depending on the input text.
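The character and byte counts quoted for this sample can be checked directly. The sketch below simply measures the string and the byte sequence given in the example; the variable names are illustrative.

```python
# Check the sizes quoted for the sample sentence.
text = "\u201cMoscow\u201d is \u041c\u043e\u0441\u043a\u0432\u0430."
scsu_bytes = bytes.fromhex(
    "05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E"
)

sizes = {
    "characters": len(text),
    "SCSU bytes": len(scsu_bytes),
    "UTF-8 bytes": len(text.encode("utf-8")),
}
print(sizes)  # {'characters': 19, 'SCSU bytes': 22, 'UTF-8 bytes': 29}
```

In UTF-8 each curly quote costs three bytes and each Cyrillic letter two, which accounts for the difference of seven bytes.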
Text encoded in SCSU can be prefixed by a signature, U+FEFF, also known as the byte order
mark or BOM. Like UTF-8, SCSU is a byte-oriented encoding⁶, so there is no need to indicate
the byte order, but a signature identifying the following text as being encoded in SCSU may be
useful in some situations. UTS #6 suggests the sequence 0E FE FF, which uses the SQU tag to
introduce an isolated Unicode-mode character without changing the state of the encoder or
decoder. This means the signature can be automatically added to or stripped from the
beginning of a text stream without affecting the surrounding text, just as in UTF-8. However,
prefixing a signature to an SCSU-encoded file means the file is no longer ASCII-transparent,
which interferes with its suitability for XML.
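Because the suggested signature does not change decoder state, adding or removing it is a plain prefix operation. The following is a minimal sketch under that assumption; the helper names `has_signature`, `add_signature`, and `strip_signature` are hypothetical and not part of any standard library.

```python
# Sketch: handling the SCSU signature 0E FE FF (SQU tag + U+FEFF).
# The SQU tag quotes one Unicode-mode character without changing
# encoder or decoder state, so the rest of the stream is unaffected.
SCSU_SIGNATURE = bytes([0x0E, 0xFE, 0xFF])


def has_signature(data: bytes) -> bool:
    """True if the stream begins with the SCSU signature."""
    return data.startswith(SCSU_SIGNATURE)


def add_signature(data: bytes) -> bytes:
    """Prefix the signature unless it is already present."""
    return data if has_signature(data) else SCSU_SIGNATURE + data


def strip_signature(data: bytes) -> bytes:
    """Remove a leading signature, if any."""
    return data[len(SCSU_SIGNATURE):] if has_signature(data) else data


payload = bytes.fromhex("4D 6F 73 63 6F 77")  # "Moscow" in SCSU
signed = add_signature(payload)
print(signed.hex(" "))
```

Note that the signed stream begins with byte 0E, which is in the ASCII control range; this is the loss of ASCII transparency mentioned above.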
SCSU is not perfect; it has some disadvantages that limit its attractiveness as a compression
format. First, as mentioned above, there is the perception that a “good” SCSU encoder is
difficult to write. In fact, a “good enough” encoder that takes advantage of some of the features
of SCSU can still achieve good compression, and is less complex than many algorithms that have
found their way into commonly used software. Single-character look-ahead, used in the sample
encoders described at the end of this section, is often sufficient to achieve good compression.
Apprehension over the effort required to create an “optimal” encoder should not be a major
hindrance to the use of SCSU.
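To illustrate how little machinery single-character look-ahead requires, here is a minimal sketch of an encoder. It is not one of the sample encoders referred to above, and it makes strong simplifying assumptions: it never leaves single-byte mode, never redefines a dynamic window, and rejects any character that the default windows do not cover. The name `encode_scsu_subset` is hypothetical.

```python
# Sketch of a "good enough" SCSU encoder with one-character look-ahead.
# Assumptions (not the full format): single-byte mode only, default
# window positions only, no window redefinition, BMP characters only.

STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300,
                  0x2000, 0x2080, 0x2100, 0x3000]
DEFAULT_DYNAMIC_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600,
                           0x0900, 0x3040, 0x30A0, 0xFF00]
PASS_THROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}


def _find_window(starts, cp):
    """Index of the first 128-character window containing cp, or None."""
    for n, start in enumerate(starts):
        if start <= cp < start + 0x80:
            return n
    return None


def encode_scsu_subset(text: str) -> bytes:
    out = bytearray()
    active = 0  # dynamic window 0 is active in the initial state
    for i, ch in enumerate(text):
        cp = ord(ch)
        nxt = ord(text[i + 1]) if i + 1 < len(text) else None
        if cp < 0x80:
            if cp < 0x20 and cp not in PASS_THROUGH_CONTROLS:
                out += bytes([0x01, cp])  # SQ0: quote a control code
            else:
                out.append(cp)            # ASCII passes through
            continue
        start = DEFAULT_DYNAMIC_WINDOWS[active]
        if start <= cp < start + 0x80:
            out.append(0x80 + cp - start)  # already in the active window
            continue
        dyn = _find_window(DEFAULT_DYNAMIC_WINDOWS, cp)
        if dyn is not None:
            wstart = DEFAULT_DYNAMIC_WINDOWS[dyn]
            if nxt is not None and wstart <= nxt < wstart + 0x80:
                # Look-ahead: the next character shares the window,
                # so switching (SCn) pays off.
                out.append(0x10 + dyn)
                active = dyn
                out.append(0x80 + cp - wstart)
            else:
                # Isolated character: quote it (SQn) and stay put.
                out += bytes([0x01 + dyn, 0x80 + cp - wstart])
            continue
        stat = _find_window(STATIC_WINDOWS, cp)
        if stat is not None:
            out += bytes([0x01 + stat, cp - STATIC_WINDOWS[stat]])
            continue
        raise ValueError(f"U+{cp:04X} is outside this sketch's windows")
    return bytes(out)


sample = "\u201cMoscow\u201d is \u041c\u043e\u0441\u043a\u0432\u0430."
print(encode_scsu_subset(sample).hex(" ").upper())
```

For the sample sentence, this sketch produces the same 22 bytes as the example earlier in this section. A practical encoder would also need to define new dynamic windows and fall back to Unicode mode, which is where most of the additional complexity lies.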
Another concern is that SCSU uses bytes in the ASCII control range, not only for tags in
single-byte mode but also as part of ordinary characters in Unicode mode. This makes SCSU unsuitable for
protocols such as MIME (Multipurpose Internet Mail Extensions) that interpret these control
bytes without decoding them. BOCU-1, described in the next section, solves this particular
problem by avoiding the most important ASCII control bytes.
⁶ Even in “Unicode mode,” SCSU tags are interpreted as single bytes.