[I18n-sig] Re: [Python-Dev] Unicode debate
M.-A. Lemburg
mal@lemburg.com
Tue, 02 May 2000 17:18:21 +0200
Just van Rossum wrote:
>
> At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote:
> >I think /F's point was that the Unicode standard prescribes different
> >behavior here: for UTF-8, a missing or lone continuation byte is an
> >error; for Unicode, accents are separate characters that may be
> >inserted and deleted in a string but whose display is undefined under
> >certain conditions.
> >
> >(I just noticed that this doesn't work in Tkinter but it does work in
> >wish. Strange.)
> >
> >> FYI: Normalization is needed to make comparing Unicode
> >> strings robust, e.g. u"È" should compare equal to u"e\u0301".
                            ^
                            |
Here's a good example of what encoding errors can do: the
above character was originally an "e" with acute accent (u"é").
It looks like one mailer converted it to some other code page and
another converted it back to Latin-1 again, even though the
Content-Type message header clearly states that the
document uses ISO-8859-1.
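
A minimal sketch of how such a round trip can mangle a character, written
for a present-day Python. Which code pages the mailers really used is not
known; Mac Roman is only an assumption picked for illustration because it
happens to reproduce the character seen above.

```python
# Hypothetical reconstruction: the code pages actually involved are unknown.
original = "\u00e9"                        # "e" with acute accent

# The message declares ISO-8859-1, so the character travels as one byte.
wire_bytes = original.encode("latin-1")    # b"\xe9"

# Assumption: a mailer ignores the header and reads the byte as Mac Roman.
misread = wire_bytes.decode("mac_roman")

# A second step writes the misread character back out as Latin-1.
resent = misread.encode("latin-1")

print(repr(original), "->", repr(misread), "->", resent)
```

Each step is a legal conversion on its own; the damage comes from decoding
with a different code page than the one the sender declared.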
> >
> >Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> >!= len(v). /F's world will collapse. :-)
>
> Does the Unicode spec *really* specify that u should compare equal to v?
This behaviour is needed in order to implement sorting of Unicode
strings. See the www.unicode.org site for more information and the
technical reports describing this.
Note that I haven't mentioned anything about "automatic"
normalization. Normalization should be a method on Unicode strings,
which could then be used in sort comparison callbacks.
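
As a rough sketch of explicit (not automatic) normalization, the example
below uses unicodedata.normalize() from the standard library of a
present-day Python. That function is a stand-in for the string method
proposed above, not the method itself, and the helper name nfc and the
sample words are made up for illustration.

```python
import unicodedata

u = "\u00e9"     # precomposed "e" with acute accent
v = "e\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT


def nfc(s):
    """Return s in Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", s)


# Without normalization the two spellings differ in length and compare unequal.
print(len(u), len(v), u == v)

# After explicit normalization they compare equal.
print(nfc(u) == nfc(v))

# Used as a sort key, so both spellings of the same word sort together.
words = ["e\u0301clair", "\u00e9clair", "eclair"]
sorted_words = sorted(words, key=nfc)
print([ascii(w) for w in sorted_words])
```

The strings themselves are left untouched; only the comparison sees the
normalized form. Note that this orders by code point after normalization,
which is not the same as language-aware collation.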
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/