[Python-Dev] New Py_UNICODE doc
Nicholas Bastin
nbastin at opnet.com
Wed May 11 09:18:59 CEST 2005
On May 10, 2005, at 7:34 PM, James Y Knight wrote:
> If you're going to call python's implementation UTF-16, I'd consider
> all these very serious deficiencies:
The --enable-unicode option declares a character encoding form (CEF),
not a character encoding scheme (CES). It is unfortunate that UTF-16
is a valid option for both of these things, but supporting the CEF does
not imply supporting the CES. All of your complaints would be valid if
we claimed that Python supported the UTF-16 CES, but the language
itself only needs to support a CEF that everyone understands how to
work with.
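To make the CEF/CES distinction concrete, here is a minimal sketch. It assumes a present-day Python 3 interpreter rather than the 2-byte build under discussion, and the helper name utf16_code_units is made up for illustration. The encoding form is the sequence of 16-bit code units; the encoding scheme is one particular byte serialization of those units.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units (the CEF view) of text as a list of ints.

    Hypothetical helper: the little-endian codec is used only as a
    convenient way to get at the 16-bit units, not as part of the CEF.
    """
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point above 0xFFFF

# CEF: one sequence of 16-bit code units, independent of byte order.
units = utf16_code_units(ch)

# CES: the byte serialization, which differs with the scheme chosen.
be = ch.encode("utf-16-be")
le = ch.encode("utf-16-le")

print([hex(u) for u in units])
print(be.hex(), le.hex())
```

The two byte strings differ while the list of code units stays the same, which is the sense in which supporting the form does not imply supporting a scheme.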
It is widely recognized, I believe, that the general level of Unicode
support exposed to Python users is somewhat lacking when it comes to
surrogate pairs. I'd love for us to fix that problem or, better
yet, integrate something like ICU, but this isn't that discussion.
> - unicodedata doesn't work for 2-char strings containing a surrogate
> pair, nor for integers. Therefore it is impossible to get any data on
> chars > 0xFFFF.
> - there are no methods for determining if something is a surrogate
> pair and turning it into an integer codepoint.
> - Given that unicodedata doesn't work, I doubt also that .toupper/etc
> work right on surrogate pairs, although I haven't tested.
> - As has been noted before, the regexp engine doesn't properly treat
> surrogate pairs as a single unit.
> - Is there a method that is like unichr but that will work for
> codepoints > 0xFFFF?
>
> I'm sure there's more as well. I think it's a mistake to consider
> python to be implementing UTF-16 just because it properly
> encodes/decodes surrogate pairs in the UTF-8 codec.
Users should understand (and we should write docs to help them
understand) that using 2-byte-wide Unicode support in Python means
that all operations are done on code units, not code points.
Once you understand this, you can work with the data that is given to
you, although it's certainly not as nice as what you would have come to
expect from Python. (For example, you can correctly construct a regexp
to find the surrogate pair you're looking for by using the constituent
code units).
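As a rough illustration of working at the code-unit level, the sketch below splits a code point into its two surrogates, recombines them, and builds a regexp from the constituent code units. It assumes a present-day Python 3 interpreter, where a str holding the two surrogates separately stands in for what a 2-byte build stores; to_surrogates and from_surrogates are hypothetical helper names, not existing Python functions.

```python
import re

def to_surrogates(cp):
    """Split a code point above 0xFFFF into (high, low) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the supplementary range")
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1D11E)  # MUSICAL SYMBOL G CLEF

# Emulate the 2-byte view: each surrogate is its own element of the string.
narrow = "a" + chr(high) + chr(low) + "b"

# The pattern is written in terms of the two code units, not the code point.
pattern = re.compile(chr(high) + chr(low))
match = pattern.search(narrow)

print(hex(high), hex(low), hex(from_surrogates(high, low)))
print(len(narrow), match.span())
```

Note that len(narrow) counts code units here, so the single character above 0xFFFF contributes two to the length and the match spans two positions.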
--
Nick