[Python-Dev] UCS2/UCS4 default

Thu Jul 3 15:14:47 CEST 2008

On Thu, Jul 3, 2008 at 5:39 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> 1. If you are advocating disallowing the use of characters outside the BMP
> in a UCS-2 build, enumerate the advantages of doing so (paying particular
> attention to any advantages which cannot be obtained simply by using an
> appropriate codec that disallows non-BMP characters).

Right now, the same python code has different meaning, depending on a
compile-time option that most users didn't even set for themselves.
Moreover, the errors caused by this semantic difference are not
reported. There's just no way to justify that.

You can't solve this problem by saying 'programmers should choose a
codec that limits them to the BMP when they target 2-byte python,'
because the problem specifically arises when code that works correctly
in a 4-byte python is placed into a 2-byte python, an operation
performed by the users rather than by programmers.

Since 2-byte python is apparently a holdover for memory-limited (and
presumably CPU-limited as well) systems, it doesn't make sense to
impose on it the requirement of correctly dealing with surrogate
pairs. Given that, it seems to me that the best solution would be to
make 4-byte python the default, and also to make 2-byte python raise
an exception when it encounters characters outside the BMP. This way,
a mysterious and unreported semantic error becomes an explicit
syntactic error.

For programmers who want to target a 2-byte format (for win32
compatibility, for example), the correct choice of codec is a superior
solution to forcing a 2-byte internal representation on python.