[Python-Dev] New Py_UNICODE doc

Sun May 8 10:53:11 CEST 2005

M.-A. Lemburg wrote:
> Unicode has many code points that are meant only for composition
> and don't have any standalone meaning, e.g. a combining acute
> accent (U+0301), yet they are perfectly valid code points -
> regardless of UCS-2 or UCS-4. It is easily possible to break
> such a combining sequence using slicing, so the most
> often presented argument for using UCS-4 instead of UCS-2
> (+ surrogates) is rather weak if seen by daylight.

I disagree. It is not just about slicing, it is also about
searching for a character (either through the "in" operator,
or through regular expressions). If you define an SRE character
class, such a character class cannot hold a non-BMP character
in UTF-16 mode, but it can in UCS-4 mode. Consequently,
implementing XML's lexical classes (such as Name, NCName, etc.)
is much easier in UCS-4 than it is in UCS-2. In this case,
combining characters do not matter much, because the XML
spec is defined in terms of Unicode coded characters, causing
combining characters to appear as separate entities for lexical
purposes (unlike half surrogates).

Regards,
Martin