[Python-Dev] New Py_UNICODE doc

Wed May 11 09:18:59 CEST 2005

On May 10, 2005, at 7:34 PM, James Y Knight wrote:
> If you're going to call python's implementation UTF-16, I'd consider 
> all these very serious deficiencies:

The --enable-unicode option declares a character encoding form (CEF), 
not a character encoding scheme (CES).  It is unfortunate that UTF-16 
is a valid option for both of these things, but supporting the CEF does 
not imply supporting the CES.  All of your complaints would be valid if 
we claimed that Python supported the UTF-16 CES, but the language 
itself only needs to support a CEF that everyone understands how to 
work with.

It is widely recognized, I believe, that the general level of unicode 
support exposed to Python users is somewhat lacking when it comes to 
high surrogate pairs.  I'd love for us to fix that problem, or, better 
yet, integrate something like ICU, but this isn't that discussion.

> - unicodedata doesn't work for 2-char strings containing a surrogate 
> pairs, nor integers. Therefore it is impossible to get any data on 
> chars > 0xFFFF.
> - there are no methods for determining if something is a surrogate 
> pair and turning it into a integer codepoint.
> - Given that unicodedata doesn't work, I doubt also that .toupper/etc 
> work right on surrogate pairs, although I haven't tested.
> - As has been noted before, the regexp engine doesn't properly treat 
> surrogate pairs as a single unit.
> - Is there a method that is like unichr but that will work for 
> codepoints > 0xFFFF?
>
> I'm sure there's more as well. I think it's a mistake to consider 
> python to be implementing UTF-16 just because it properly 
> encodes/decodes surrogate pairs in the UTF-8 codec.

Users should understand (and we should write doc to help them 
understand), that using 2-byte wide unicode support in Python means 
that all operations will be done on Code Units, and not Code Points.  
Once you understand this, you can work with the data that is given to 
you, although it's certainly not as nice as what you would have come to 
expect from Python.  (For example, you can correctly construct a regexp 
to find the surrogate pair you're looking for by using the constituent 
code units).

--
Nick