[Python-Dev] Divorcing str and unicode (no more implicit conversions).
"Martin v. Löwis"
martin at v.loewis.de
Tue Oct 25 23:21:43 CEST 2005
Guido van Rossum wrote:
> Yes but why? What does this invariant do for him?
I don't know about this person, but there are a few things that
don't work properly in UTF-16 mode:
- the Unicode character database fails to look things up.
u"\U0001D670".isupper() gives False, but should give True
(since it denotes MATHEMATICAL MONOSPACE CAPITAL A).
It gives True in UCS-4 mode.
- As a result, normalization on these doesn't work, either.
It should normalize to "LATIN CAPITAL LETTER A" under
NFKC, but doesn't.
- regular expressions only have limited support. In
particular, adding non-BMP characters to character classes
is not possible. [\U0001D670] will match any character
that is either \uD835 or \uDE70, whereas it only matches
MATHEMATICAL MONOSPACE CAPITAL A in UCS-4 mode.
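The three points above can be checked with a short sketch. It assumes a modern Python 3 (3.3 or later), where str always behaves like a UCS-4 build, so it shows the wide-build results rather than the UTF-16 failures; the variable names are illustrative only.

```python
import re
import unicodedata

ch = "\U0001D670"  # MATHEMATICAL MONOSPACE CAPITAL A

# 1. Character database lookup: works when the character is one code point.
name = unicodedata.name(ch)
is_upper = ch.isupper()

# 2. Compatibility normalization maps it to LATIN CAPITAL LETTER A.
nfkc = unicodedata.normalize("NFKC", ch)

# 3. A character class containing the non-BMP character matches it as a whole.
class_match = re.fullmatch("[\U0001D670]", ch) is not None

# The two UTF-16 code units a narrow build would store for this character.
data = ch.encode("utf-16-be")
high = int.from_bytes(data[0:2], "big")
low = int.from_bytes(data[2:4], "big")

print(name, is_upper, nfkc, class_match, hex(high), hex(low))
```

On a narrow build the string has length 2 (the surrogates \uD835 and \uDE70), which is why each of these operations sees two separate code units instead of one character.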
There might be more limitations, but those are the ones that
come to mind easily. While I could imagine fixing the first
two with some effort, the third one is really tricky (unless
you would accept a "wide" representation of a character
class even if the Unicode representation is only narrow).
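A rough sketch of what such a "wide" character class could look like: the class is kept as a set of full code points, and the narrow string is decoded pair by pair before testing membership. This is only an illustration under stated assumptions, not how the re module is implemented; decode_utf16_units, to_units and the set-based classes are hypothetical names made up for the example.

```python
def to_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def decode_utf16_units(units):
    """Yield code points from 16-bit code units, combining surrogate pairs."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Combine high and low surrogate into one code point.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character or lone surrogate: pass through unchanged.
            yield u
            i += 1


# "Wide" class: one entry per real character.
wide_class = {0x1D670}
# What a narrow class effectively becomes: one entry per code unit.
narrow_class = {0xD835, 0xDE70}

units = to_units("\U0001D670")
wide_hits = [cp in wide_class for cp in decode_utf16_units(units)]
lone_wide = [cp in wide_class for cp in decode_utf16_units([0xD835])]
lone_narrow = [u in narrow_class for u in [0xD835]]

print(units, wide_hits, lone_wide, lone_narrow)
```

With the wide class, the pair matches once as a single character and a lone \uD835 does not match, whereas the narrow class accepts the lone surrogate.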
Regards,
Martin