[Python-Dev] UCS2/UCS4 default
Nick Coghlan
ncoghlan at gmail.com
Thu Jul 3 14:39:29 CEST 2008
Jeroen Ruigrok van der Werven wrote:
> The documentation for len() says:
> Return the length (the number of items) of an object.
So what this tells us is that in a UCS-2 build of Python, the "items" in
a unicode string are not, strictly speaking, Unicode code points or
characters. Instead, they are successive 16-bit fragments of a UTF-16
encoded string (which correspond to characters only if there are no
surrogate pairs present in the string).
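To make that concrete, here is a minimal sketch. It assumes a modern Python 3 interpreter (3.3 or later), where str always holds code points and narrow builds no longer exist, so the UCS-2 view is only imitated by counting 16-bit units of the UTF-16 encoding. The helper name utf16_units is made up for this illustration.

```python
def utf16_units(text):
    """Count the 16-bit code units in the UTF-16 encoding of text.

    Hypothetical helper: it approximates what len() reported on a
    UCS-2 build, where each "item" was one 16-bit unit.
    """
    # "utf-16-le" emits no byte order mark; every code unit is 2 bytes.
    return len(text.encode("utf-16-le")) // 2


bmp_only = "caf\u00e9"            # every character is on the BMP
with_astral = "a\U0001D11Eb"      # U+1D11E lies outside the BMP

code_points = len(with_astral)    # counts code points on Python 3.3+
units = utf16_units(with_astral)  # counts 16-bit units, as a UCS-2 build would
```

For the BMP-only string the two counts agree; for the second string they differ by one, because U+1D11E needs a surrogate pair.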
Let's look at the options here:
1. System is NOT memory limited (i.e. most desktops): use a UCS-4 Python
build, which is what most Linux distributions do (I'm not sure about the
Windows or Mac OS X builds provided by python.org).
2. System is memory limited, only BMP Unicode code points are used: use
a UCS-2 Python build, limit yourself to characters on the BMP (possibly
enforced by use of an appropriate codec to decode input text).
3. System is memory limited, but needs to support characters beyond the
BMP: use a UCS-2 Python build, handling any code points outside the BMP
in application code.
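As a rough illustration of what "handling in application code" means for case 3, the sketch below walks a sequence of 16-bit units and joins surrogate pairs into code points. The function name iter_code_points and the choice to pass lone surrogates through unchanged are my assumptions for the example, not anything Python provides.

```python
def iter_code_points(units):
    """Yield code points from a sequence of UTF-16 code units (ints).

    Hypothetical helper: a lone surrogate is yielded unchanged rather
    than treated as an error.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by low surrogate: one code point.
            low = units[i + 1]
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        else:
            # BMP code point, or an unpaired surrogate.
            yield unit
            i += 1


# "a", then U+1D11E as a surrogate pair, then "b"
units = [0x0061, 0xD834, 0xDD1E, 0x0062]
points = list(iter_code_points(units))
```

Only code that actually cares about characters beyond the BMP pays for this loop; everything else keeps plain indexing.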
The current Python approach handles all three cases relatively
gracefully and with minimal overhead. Dealing natively with surrogate
pair issues could easily result in pointless complexity for cases 1 and
2, while completely disallowing code points beyond the BMP in a UCS-2
build would needlessly rule out case 3.
So here's the challenge:
1. If you are advocating disallowing the use of characters outside the
BMP in a UCS-2 build, enumerate the advantages of doing so (paying
particular attention to any advantages which cannot be obtained simply
by using an appropriate codec that disallows non-BMP characters).
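For reference, the "appropriate codec" route can be approximated without touching the interpreter. The sketch below is not a registered codec, just a wrapper around bytes.decode() that rejects anything beyond the BMP; the name decode_bmp_only and the use of ValueError are assumptions for the example, and it is written for a modern Python 3 where each item of a str is a full code point.

```python
def decode_bmp_only(data, encoding="utf-8"):
    """Decode bytes, refusing any character outside the BMP.

    Hypothetical helper standing in for a BMP-only codec.
    """
    text = data.decode(encoding)
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise ValueError(
                "non-BMP character U+%04X at index %d" % (ord(ch), index)
            )
    return text


accepted = decode_bmp_only("caf\u00e9".encode("utf-8"))

try:
    decode_bmp_only("\U0001D11E".encode("utf-8"))
    rejected = False
except ValueError:
    rejected = True
```

Anything this check already gives you is not an argument for building the restriction into the string type itself.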
2. If you are advocating making the "items" in a Unicode string code
points even in a UCS-2 build, enumerate all of the string behaviours
that would have to change, as well as indicating how to avoid causing a
reduction in speed for cases 1 and 2 above.
Sure, approach 2 (code point items even in a UCS-2 build) might be nice
to have, but the purity argument alone isn't going to be anywhere near
enough motivation to justify the additional code complexity - there need
to be practical benefits that aren't better met just by sacrificing a
bit of memory efficiency and switching to a UCS-4 build.
Cheers,
Nick.
--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org