Unicode
J. Stanley Warford
Computer Science Department
Pepperdine University
Malibu, CA 90263
The first electronic computers were developed to perform mathematical calculations
with numbers. Eventually, they processed textual data as well, and the ASCII code
became a widespread standard for processing text with the Latin alphabet. As computer
technology spread around the world, text processing in languages with different
alphabets produced many incompatible systems. The Unicode Consortium was established
to collect and catalog all the alphabets of all the spoken languages in the world, both
current and ancient, as a first step toward a standard system for the worldwide interchange,
processing, and display of texts in these natural languages.
Strictly speaking, the standard organizes characters into scripts, not languages. It
is possible for one script to be used in multiple languages. For example, the extended
Latin script can be used for many European and American languages. Version 7.0 of the
Unicode standard has 123 scripts for natural language and 15 scripts for other symbols.
Examples of natural language scripts are Balinese, Cherokee, Egyptian Hieroglyphs,
Greek, Phoenician, and Thai. Examples of other symbols are Braille Patterns,
Emoticons, Mathematical Symbols, and Musical Symbols.
Each character in every script has a unique identifying number, called a code point,
which is usually written in hexadecimal. The hexadecimal number is preceded by “U+”
to indicate that it is a Unicode code point. Corresponding to a code point is a glyph,
which is the graphic representation of the symbol on the page or screen. For example,
in the Hebrew script the code point U+05D1 has the glyph ב.
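The relationship between a code point and its character can be sketched in a few lines of Python. This is only an illustration, not part of the Unicode standard: the helper name `lookup` is hypothetical, while `chr`, `ord`, and the `unicodedata` module come from the Python standard library.

```python
import unicodedata


def lookup(code_point: int) -> tuple[str, str]:
    """Return the character for a code point and its official Unicode name.

    `lookup` is a hypothetical helper written for this illustration.
    """
    ch = chr(code_point)  # code point (an integer) -> one-character string
    return ch, unicodedata.name(ch)


ch, name = lookup(0x05D1)
print(f"U+{ord(ch):04X} is named {name}")
# print(ch) would display the glyph itself, provided the terminal
# and its font support the Hebrew script.
```

Note that the program only handles the code point. Which glyph appears on the screen is decided by the font and the rendering system.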
Figure 1 shows some example code points and glyphs in the Unicode standard.
The CJK Unified script is for the written languages of China, Japan, and Korea, which
share a common character set with some variations. There are tens of thousands of
characters in these Asian writing systems, all based on a common set of Han characters.
To minimize unnecessary duplication, the Unicode Consortium merged the characters
into a single set of unified characters. This Han unification is an ongoing process carried
out by a group of experts from the Chinese-speaking countries, North and South Korea,
Japan, Vietnam, and other countries.
Code points are backward compatible with ASCII. For example, from the ASCII
table the Latin letter S is stored with seven bits as 101 0011 (bin), which is 53 (hex).
So, the Unicode code point for S is U+0053. The standard requires at least four hex
digits following U+, padding the number with leading zeros if necessary.
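The conversion from the seven-bit ASCII pattern to the U+ notation can be checked with a short Python sketch. The function name `u_plus` is an assumption made for this example. The format specifier `04X` pads with leading zeros to a minimum of four hex digits and leaves longer values unchanged.

```python
def u_plus(ch: str) -> str:
    """Format a character's code point as U+ followed by at least four hex digits.

    `u_plus` is a hypothetical helper written for this illustration.
    """
    return f"U+{ord(ch):04X}"


# The seven ASCII bits for the Latin letter S, read as a binary number.
ascii_value = int("1010011", 2)

# The ASCII value and the Unicode code point of S are the same number.
assert ascii_value == ord("S")

print(u_plus("S"))           # U+0053 (padded to four digits)
print(u_plus("\u05d1"))      # U+05D1 (already four digits)
print(u_plus("\U0001F600"))  # U+1F600 (five digits, no padding needed)
```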
A single code point can have more than one glyph. For example, an Arabic letter