Python's handling of unicode surrogates

"Martin v. Löwis" martin at v.loewis.de
Mon Apr 23 07:24:21 CEST 2007


> The Unicode standard doesn't require that you support surrogates, or
> any other kind of character, so no you wouldn't be lying.

There is the notion of Unicode implementation levels, and each of them
does include a set of characters to support. In level 1, combining
characters need not be supported (which is sufficient for scripts
that can be represented without combining characters, such as Latin
and Cyrillic, using precomposed characters if necessary). In level 2,
combining characters must be supported for some scripts that absolutely
need them, and in level 3, all characters must be supported.
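
To make the precomposed/combining distinction concrete, here is a
small sketch. It assumes a current Python 3 and the standard
unicodedata module; the helper name has_combining is made up for
illustration, and it only checks the canonical combining class, which
is a rough approximation of "is a combining character".

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def has_combining(text):
    """Return True if any code point has a nonzero canonical combining class."""
    return any(unicodedata.combining(ch) != 0 for ch in text)

# NFC normalization folds the base letter and the accent back into
# the single precomposed code point where one exists.
composed_again = unicodedata.normalize("NFC", decomposed)

print(has_combining(precomposed), has_combining(decomposed))
print(composed_again == precomposed)
```

A level 1 style consumer could get by with the precomposed form alone;
text that has no precomposed equivalent is where the higher levels
start to matter.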

What "supported" means is probably a matter of interpretation. Python
clearly supports Unicode level 1 (leaving aside the issue that it
can't render all these characters out of the box, as it doesn't ship
any fonts); it could be argued that it implements level 3, as it is
capable of representing all Unicode characters (but, of course, so
does Python 1.5.2, if you put UTF-8 into byte strings).
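
As a rough illustration of "capable of representing", the sketch below
takes one character outside the Basic Multilingual Plane and shows its
UTF-8 bytes and its UTF-16 surrogate pair. It assumes a current
Python 3; the variable names are only for illustration, and on a
narrow build of the 2.x series the string itself would have been
stored as the two surrogates.

```python
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point above U+FFFF

# UTF-8 needs four bytes for this code point; a plain byte string
# can carry them without knowing anything about Unicode.
utf8_bytes = clef.encode("utf-8")

# UTF-16 represents the same code point as a surrogate pair.
utf16_bytes = clef.encode("utf-16-be")
units = [
    int.from_bytes(utf16_bytes[i:i + 2], "big")
    for i in range(0, len(utf16_bytes), 2)
]

print(utf8_bytes)
print([hex(u) for u in units])
```

Storing the bytes is the easy part; whether that counts as "support"
is exactly the interpretation question.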

Regards,
Martin


