[Python-Dev] UTF-16 code point comparison

Fredrik Lundh <effbot@telia.com>
Thu, 27 Jul 2000 16:05:25 +0200


> [me]
> > To summarize, here's the "character encoding guidelines" for
> > Python 2.0:
> >
> >     In Unicode context, 8-bit strings contain ASCII. Characters
> >     in the 0x80-0xFF range are invalid.  16-bit strings contain
> >     UCS-2.  Characters in the 0xD800-0xDFFF range are invalid.
> >
> >     If you want to use any other encoding, use the codecs
> >     provided by the Unicode subsystem.  If you need to use Unicode
> >     characters that cannot be represented as UCS-2, you cannot
> >     use Python 2.0's Unicode subsystem.
> >
> > Anything else is just a hack.

[guido]
> I wouldn't go so far as raising an exception when a comparison
> involves 0xD800-0xDFFF; after all we don't raise an exception when an
> ASCII string contains 0x80-0xFF either (except when converting to
> Unicode).

that's what the "unicode context" qualifier means: 8-bit strings
can contain anything, unless you're using them as unicode strings.

> The invalidity of 0xD800-0xDFFF means that these aren't valid Unicode
> code points; it doesn't mean that we should trap all attempts to use
> these values.  That way, apps that need UTF-16 awareness can code it
> themselves.
>
> Why?  Because I don't want to proliferate code that explicitly traps
> 0xD800-0xDFFF throughout the code.

you only need to check "on the way in", and leave it to the
decoders to make sure they generate UCS-2 only.
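
to make the "on the way in" check concrete, here's a minimal sketch in
present-day Python (not 2.0, and not the actual C code).  it assumes the
incoming data is a sequence of integers, one per 16-bit character, and
check_ucs2 is a made-up name used only for illustration:

```python
# Hypothetical helper for illustration; not part of Python's Unicode code.
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
UCS2_MAX = 0xFFFF


def check_ucs2(code_units):
    """Return the values as a list if each one is a valid UCS-2 character.

    Raises ValueError on the first value that does not fit in 16 bits or
    that falls inside the surrogate range 0xD800-0xDFFF.
    """
    checked = []
    for index, unit in enumerate(code_units):
        if not 0 <= unit <= UCS2_MAX:
            raise ValueError(
                "position %d: 0x%X does not fit in 16 bits" % (index, unit))
        if SURROGATE_FIRST <= unit <= SURROGATE_LAST:
            raise ValueError(
                "position %d: 0x%04X is a surrogate, not UCS-2" % (index, unit))
        checked.append(unit)
    return checked
```

once every entry point runs a check like this, code further down (such as
the compare function) can assume plain UCS-2 and never has to look for
surrogates itself.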

the original unicode implementation did just that, but Bill and
Marc-Andre recently lowered the shields: the UTF-8 decoder
now generates UTF-16 encoded data.  (so does \N{}, but
that's a non-issue:

(oddly enough, the UTF-16 decoder won't accept anything
that isn't UCS-2 ;-)
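
for readers who want to see what "generates UTF-16 encoded data" means in
practice: a four-byte UTF-8 sequence decodes to a code point above 0xFFFF,
which only fits in 16-bit storage if it is split into a surrogate pair.
the sketch below uses present-day Python and the standard UTF-16
arithmetic; to_surrogate_pair is a name invented here, not a function in
the decoder:

```python
def to_surrogate_pair(code_point):
    """Split a code point in 0x10000-0x10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("0x%X is not outside the 16-bit range" % code_point)
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) as four UTF-8 bytes
utf8_bytes = b"\xf0\x9d\x84\x9e"
code_point = ord(utf8_bytes.decode("utf-8"))
high, low = to_surrogate_pair(code_point)
print("U+%04X -> 0x%04X 0x%04X" % (code_point, high, low))
# prints: U+1D11E -> 0xD834 0xDD1E
```

both halves land in the 0xD800-0xDFFF range that the guidelines above
declare invalid for 16-bit strings.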

my proposal is to tighten this up in 2.0: ifdef out the UTF-16
code in the UTF-8 decoder, and ifdef out the UTF-16 stuff in
the compare function.
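
a rough illustration of why the compare function cares at all: once
surrogate pairs are allowed in, comparing 16-bit units one by one no
longer matches code point order.  this is a sketch in present-day Python
over lists of 16-bit integers, not the C compare function itself; all
three function names are made up for the example:

```python
def decode_utf16_units(units):
    """Combine surrogate pairs into code points; other units pass through."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_pair = (0xD800 <= unit <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points


def compare_code_units(a, b):
    """Plain UCS-2 style comparison: -1, 0 or 1 by 16-bit unit values."""
    return (a > b) - (a < b)


def compare_code_points(a, b):
    """UTF-16 aware comparison: -1, 0 or 1 by decoded code points."""
    pa, pb = decode_utf16_units(a), decode_utf16_units(b)
    return (pa > pb) - (pa < pb)


bmp_last = [0xFFFF]             # U+FFFF
first_astral = [0xD800, 0xDC00]  # U+10000 as a surrogate pair

print(compare_code_units(bmp_last, first_astral))   # prints: 1
print(compare_code_points(bmp_last, first_astral))  # prints: -1
```

if strings are guaranteed to be UCS-2 on the way in, the two orderings
never disagree and the simple unit-by-unit comparison is enough.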

let's wait until 2.1 before we support the full unicode character
set (and I'm pretty sure "the right way" is UCS-4 storage and a
unified string implementation, but let's discuss that later).

patch coming.

</F>