[Python-Dev] Divorcing str and unicode (no more implicit conversions).

"Martin v. Löwis" martin at v.loewis.de
Tue Oct 25 23:21:43 CEST 2005


Guido van Rossum wrote:
> Yes but why? What does this invariant do for him?

I don't know about this person, but there are a few things that
don't work properly in UTF-16 mode:

- the Unicode character database fails to lookup things.
   u"\U0001D670".isupper() gives false, but should give true
   (since it denotes MATHEMATICAL MONOSPACE CAPITAL A).
   It gives true in UCS-4 mode
- As a result, normalization on these doesn't work, either.
   It should normalize to "LATIN CAPITAL LETTER A" under
   NFKC, but doesn't.
- regular expressions only have limited support. In
   particular, adding non-BMP characters to character classes
   is not possible. [\U0001D670] will match any character
   that is either \uD835 or \uDE70, whereas it only matches
   MATHEMATICAL MONOSPACE CAPITAL A in UCS-4 mode.

There might be more limitations, but those are the ones that
come to mind easily. While I could imagine fixing the first
two with some effort, the third one is really tricky (unless
you would accept a "wide" representation of a character
class even if the Unicode representation is only narrow).

Regards,
Martin


More information about the Python-Dev mailing list