[I18n-sig] Re: [Python-Dev] Unicode debate

M.-A. Lemburg mal@lemburg.com
Tue, 02 May 2000 10:36:43 +0200

Just a small note on the subject of a character being atomic
which seems to have been forgotten by the discussing parties:

Unicode itself can be understood as a multi-word character
encoding, just like UTF-8. The reason is that Unicode entities
can be combined to produce single display characters (e.g.
u"e"+u"\u0301" will print "é" in a Unicode-aware renderer).
Slicing such a combined Unicode string will have the same
effect as slicing UTF-8 data.
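
A minimal sketch of that point, assuming a present-day Python 3
interpreter (where every str is Unicode, so the u"" prefix is
optional). The variable names are made up for illustration:

```python
# One display character built from two code points.
s = "e" + "\u0301"       # 'e' followed by COMBINING ACUTE ACCENT

length = len(s)          # len() counts code points, not display characters
head = s[:1]             # the slice falls between the letter and its accent
tail = s[1:]             # what is left over is the orphaned accent

# The same kind of split on the UTF-8 bytes of the precomposed U+00E9.
utf8 = "\u00e9".encode("utf-8")   # two bytes: 0xC3 0xA9
broken = utf8[:1]                 # a lone lead byte, no longer valid UTF-8
```

In both cases the slice is perfectly legal as far as the container
is concerned; it is only the meaning of the data that gets cut in half.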

It seems that most Latin-1 proponents have single
display characters in mind. While the same is true for
many Unicode entities, there are quite a few cases of
combining characters in Unicode 3.0, and the Unicode
normalization algorithm uses these as its basis.
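
As a hedged illustration of how normalization relates to combining
characters, the following uses the unicodedata module from the
standard library of a present-day Python 3 (an assumption; it is not
meant to describe what was available at the time of writing):

```python
import unicodedata

decomposed = "e\u0301"   # base letter + COMBINING ACUTE ACCENT

# NFC composes the pair into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# NFD goes the other way and restores the two-code-point form.
roundtrip = unicodedata.normalize("NFD", composed)

# A non-zero canonical combining class marks a combining character.
accent_class = unicodedata.combining("\u0301")
letter_class = unicodedata.combining("e")
```

Note that normalizing to NFC only helps where a precomposed form
exists; other combining sequences stay multi-code-point even after
normalization.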

So in the end the "UTF-8 doesn't slice" argument holds for
Unicode itself too, just as it also does for many Asian
multi-byte variable length character encodings,
image formats, audio formats, database formats, etc.

You can't really expect slicing to always "just work"
without some knowledge about the data you are slicing.
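
One way to picture "knowledge about the data" is a slicing helper
that refuses to separate a base character from its accents. The
function split_clusters below is hypothetical and deliberately
simplified: it only looks at the canonical combining class and is
not a full implementation of Unicode text segmentation.

```python
import unicodedata

def split_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Simplified sketch: a code point with a non-zero canonical combining
    class is attached to the preceding cluster; everything else starts
    a new one.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "cafe\u0301s"                  # "cafés" in decomposed form
clusters = split_clusters(word)

naive_prefix = word[:4]               # cuts the accent off the final 'e'
safe_prefix = "".join(clusters[:4])   # keeps 'e' and its accent together
```

Here the naive slice counts code points and the cluster-based slice
counts display characters, which is the distinction the slicing
argument keeps running into.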

Marc-Andre Lemburg
