[Python-Dev] Unicode howto in the works - feedback appreciated
Stephen J. Turnbull
01 May 2002 16:52:13 +0900
>>>>> "Skip" == Skip Montanaro <firstname.lastname@example.org> writes:
Skip> I began working on a Unicode HOWTO a few weeks ago, got a
Skip> little ways on it, then ignored it until this morning. I
Skip> added a little bit more to it then decided I should get some
Skip> feedback. You can view it at
A few comments.
Overall, I like this intro. Technically it's horrible<wink> but I
think it will hit your target audience where they live.
[What Is Unicode?]
1. Characters are "atomic units of text" that have properties. Since
they're atoms, we represent them by integers in computer programs.
Among the properties are their glyphs (graphical representation),
classes (alpha, num, whitespace, etc), and so on. It is a bad
idea to identify characters with their glyphs.
2. Alphabets are abstract sets of characters. Coded character sets
map characters to integer representations. "Encoding" is a
reasonable synonym for "coded character set". Avoid "charset"
except when talking about the charset parameter of Content-Type.
3. Typo in last sentence "I will suggest that YOU should use UTF-8."
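To make points 1 and 2 concrete, here is a small sketch, assuming a current Python 3 and its standard unicodedata module (neither is taken from the HOWTO draft itself). It shows one character as an integer with properties, and two different coded representations of that same character.

```python
import unicodedata

ch = "\u00e9"  # one abstract character, written as an escape to avoid ambiguity

# The character is represented by an integer (its code point) ...
code_point = ord(ch)

# ... and carries properties that have nothing to do with its glyph.
name = unicodedata.name(ch)
category = unicodedata.category(ch)  # general category, e.g. lowercase letter

# Two encodings map the same character to different byte sequences.
utf8_bytes = ch.encode("utf-8")
latin1_bytes = ch.encode("latin-1")

print(hex(code_point), name, category, utf8_bytes, latin1_bytes)
```

The glyph you see on screen depends on the font; the integer, the name, and the category do not.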
1. Most programming languages are restricted to ASCII, except perhaps
for user-defined identifiers. This means that programming tools
need only be 8-bit clean to handle UTF-8 without corruption.
2. Space efficiency is _not_ a general advantage of UTF-8 vs. UTF-16.
For ASCII and most Western European languages, yes. Greek, Hebrew, Arabic
or Russian will be nearly a wash (whitespace, punctuation, and
numerals give you what savings you're gonna get), and everybody
east of Eden takes a 50% hit. The real tradeoff is "string ==
array of fixed-width object" semantics vs upward compatibility from
ASCII for languages where most tokens contain only ASCII.
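The space tradeoff in point 2 is easy to check. The following is a rough sketch, assuming Python 3; the sample strings and the helper name encoded_sizes are made up for illustration, and "utf-16-le" is used so that no byte order mark is counted.

```python
# Hypothetical sample texts, one per script family mentioned above.
samples = {
    "English": "Hello, world",
    "Russian": "Привет, мир",
    "Japanese": "こんにちは世界",
}

def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

sizes = {lang: encoded_sizes(text) for lang, text in samples.items()}

for lang, (u8, u16) in sizes.items():
    print(f"{lang}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

With these samples, English halves in UTF-8, Russian saves only the punctuation and whitespace, and Japanese is 50% larger in UTF-8 than in UTF-16.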
1. If you don't get a Content-Type charset parameter, you _must_ assume
[Mildly Corrupt Data]
1. You can expect people to develop libraries for this kind of thing,
but they are unlikely to be distributed. Suggest that newbies ask
[1] On 8-bit clean tools being enough: this isn't quite true; consider
the Lisp ?A notation for character literals. A naive byte-oriented
parser will pick up only the leading byte of a non-ASCII UTF-8
character, and probably choke fatally on the trailing bytes. But
Python, C, Java, et al. don't have such literals---tokens with
delimiters that are ASCII characters are safe, both strings and
identifiers. You can ignore this issue.
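A minimal sketch of why ASCII-delimited tokens are safe, assuming Python 3 bytes objects stand in for a byte-oriented parser (the sample line is invented). Every byte of a non-ASCII character in UTF-8 has its high bit set, so it can never be mistaken for an ASCII delimiter, while a reader that grabs a single byte after a prefix gets only a fragment.

```python
# A made-up source line containing a string literal with non-ASCII text.
line = 'label = "日本語 text"'.encode("utf-8")

# Bytes that belong to non-ASCII characters are all >= 0x80.
non_ascii_bytes = [b for b in line if b >= 0x80]

# A naive byte-oriented split on the ASCII quote still isolates the body.
before, body, after = line.split(b'"')
decoded = body.decode("utf-8")

# By contrast, taking just one byte (as a ?A-style reader might) yields
# only the lead byte of a multi-byte character, which is not valid alone.
lead_byte = "日".encode("utf-8")[:1]
try:
    lead_byte.decode("utf-8")
    lead_decodes = True
except UnicodeDecodeError:
    lead_decodes = False

print(decoded, len(non_ascii_bytes), lead_decodes)
```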
[2] On fixed-width semantics: which UTF-16 actually doesn't give you! Grrr.
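The complaint about UTF-16 can be seen with a character outside the Basic Multilingual Plane. This is a sketch assuming Python 3; the helper name utf16_units is hypothetical.

```python
bmp_char = "\u65e5"         # U+65E5, inside the Basic Multilingual Plane
astral_char = "\U0001d11e"  # U+1D11E MUSICAL SYMBOL G CLEF, outside it

def utf16_units(text):
    """Count the 16-bit code units needed to encode text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

units = (utf16_units(bmp_char), utf16_units(astral_char))

# UTF-32, by comparison, always uses one 32-bit unit per code point.
utf32_units = len(astral_char.encode("utf-32-le")) // 4

print(units, utf32_units)
```

The second character needs a surrogate pair, so indexing a UTF-16 buffer by code unit does not index it by character.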
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
My nostalgia for Icon makes me forget about any of the bad things. I don't
have much nostalgia for Perl, so its faults I remember. Scott Gilbert c.l.py