[Python-Dev] UCS2/UCS4 default

Thu Jul 3 14:39:29 CEST 2008

Jeroen Ruigrok van der Werven wrote:
> The documentation for len() says:
> Return the length (the number of items) of an object.

So what this tells us is that in a UCS-2 build of Python, the "items" in 
a unicode string are not, strictly speaking, Unicode code points or 
characters. Instead, they are successive 16-bit fragments of a UTF-16 
encoded string (which correspond to characters only if there are no 
surrogate pairs present in the string).
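
As a rough illustration (a sketch, not anything from the len() docs): the 
number of 16-bit items can be reproduced on any build by encoding to UTF-16 
and counting code units. The helper name utf16_code_units is made up for 
this example.

```python
def utf16_code_units(text):
    """Return how many 16-bit code units text occupies as UTF-16."""
    # utf-16-le writes no byte order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2

bmp_text = u"\u00e9"         # U+00E9, inside the BMP
astral_text = u"\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

print(utf16_code_units(bmp_text))     # 1
print(utf16_code_units(astral_text))  # 2 (one surrogate pair)
```

For the second string, a UCS-2 build reports a len() of 2 while a UCS-4 
build reports 1.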

Let's look at the options here:

1. System is NOT memory limited (i.e. most desktops): use a UCS-4 Python 
build, which is what most Linux distributions do (I'm not sure about the 
Windows or Mac OS X builds provided by python.org).

2. System is memory limited, only BMP Unicode code points are used: use 
a UCS-2 Python build, limit yourself to characters on the BMP (possibly 
enforced by use of an appropriate codec to decode input text).

3. System is memory limited, but needs to support characters beyond the 
BMP: use a UCS-2 Python build, handling any code points outside the BMP 
in application code.
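
For option 3, "handling in application code" could look something like the 
following sketch, which walks a string and joins surrogate pairs back into 
single code points. The function name iter_code_points is hypothetical, and 
passing unpaired surrogates through unchanged is just one possible policy.

```python
def iter_code_points(text):
    """Yield integer code points, joining UTF-16 surrogate pairs."""
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        # A high surrogate followed by a low surrogate encodes one code point.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < n:
            low = ord(text[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                i += 2
                continue
        # BMP character, or an unpaired surrogate passed through unchanged.
        yield unit
        i += 1

# U+1D11E written out explicitly as its surrogate pair D834 DD1E.
print([hex(cp) for cp in iter_code_points(u"a\ud834\udd1eb")])
```

The same loop is harmless on a UCS-4 build, where characters beyond the BMP 
already arrive as single items.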

The current Python approach handles all three cases relatively 
gracefully and with minimal overhead. Dealing natively with surrogate 
pair issues could easily result in pointless complexity for cases 1 and 
2, while completely disallowing code points beyond the BMP in a UCS-2 
build would needlessly rule out option 3.

So here's the challenge:

1. If you are advocating disallowing the use of characters outside the 
BMP in a UCS-2 build, enumerate the advantages of doing so (paying 
particular attention to any advantages which cannot be obtained simply 
by using an appropriate codec that disallows non-BMP characters).

2. If you are advocating making the "items" in a Unicode string code 
points even in a UCS-2 build, enumerate all of the string behaviours 
that would have to change, as well as indicating how to avoid causing a 
reduction in speed for cases 1 and 2 above.
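
For reference, the "appropriate codec" approach mentioned in the first 
challenge can be approximated with a plain wrapper around decode(). This is 
only a sketch: decode_bmp_only is a hypothetical helper, not a registered 
codec, and raising ValueError is an assumed policy.

```python
def decode_bmp_only(data, encoding="utf-8"):
    """Decode bytes, rejecting any character outside the BMP."""
    text = data.decode(encoding)
    for index, ch in enumerate(text):
        cp = ord(ch)
        # UCS-4 builds expose non-BMP characters as one item above U+FFFF;
        # UCS-2 builds expose them as surrogates in U+D800..U+DFFF.
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("non-BMP character at index %d" % index)
    return text

print(decode_bmp_only(b"caf\xc3\xa9") == u"caf\u00e9")  # BMP-only input passes
try:
    decode_bmp_only(b"\xf0\x9d\x84\x9e")  # UTF-8 bytes for U+1D11E
except ValueError as exc:
    print("rejected: %s" % exc)
```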

Sure, the second of these (code point "items" even in a UCS-2 build) might 
be nice to have, but the purity argument isn't going to be anywhere near 
enough motivation to justify the additional code complexity - there need 
to be practical benefits that aren't better met just by sacrificing a bit 
of memory efficiency and switching to a UCS-4 build.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org