
Alternatively, a Unicode object could *internally* be either 8, 16 or 32 bits wide (to be clear: not per character, but per string). Also a lot of work, but it'll be a lot less wasteful.
Depending on what you prefer to waste: developers' time or computer resources. I bet that if you try to measure the wasted space you'll find that it wastes very little compared to all the other overheads in a typical Python program: CPU time compared to writing your code in C, memory overhead for integers, etc. It so happened that the Unicode support was written to make it very easy to change the compile-time code unit size; but making this a per-string (or even global) run-time variable is much harder without touching almost every place that uses Unicode (not to mention slowing down the common case). Nobody was enthusiastic about fixing this, so our choice was really between staying with 16 bits or making 32 bits an option for those who need it.
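(As it happens, the per-string internal width floated above is essentially what CPython later adopted in 3.3 with PEP 393, which makes the space cost being debated here easy to measure today. A small sketch, assuming a PEP 393 interpreter:)

```python
import sys

# CPython 3.3+ (PEP 393) stores each string with 1, 2, or 4 bytes per
# character, chosen per string -- so wide characters cost extra space only
# in the strings that actually contain them.
ascii_s  = "a" * 1000            # 1 byte per character
bmp_s    = "\u0416" * 1000       # 2 bytes per character (Cyrillic Zhe)
astral_s = "\U00010330" * 1000   # 4 bytes per character (outside the BMP)

for s in (ascii_s, bmp_s, astral_s):
    print(len(s), sys.getsizeof(s))  # same length, increasing storage
```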
Not a lot of people will want to work with 16 or 32 bit chars directly,
How do you know? There are more Chinese than Americans and Europeans together, and they will soon all have computers. :-)
but I think a less wasteful solution to the surrogate pair problem *will* be desired by people. Why use 32 bits for all strings in a program when only a tiny percentage actually *needs* more than 16? (Or even 8...)
So work in UTF-8 -- a lot of work can be done in UTF-8.
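(One reason a lot of work *can* be done directly in UTF-8: every byte of a multi-byte sequence has its high bit set, so an ASCII delimiter byte can never appear inside an encoded character. A sketch, with made-up sample text:)

```python
# Splitting raw UTF-8 bytes on an ASCII delimiter is safe: b"," can never
# occur inside the encoding of a non-ASCII character.
raw = "naïve,café,日本語".encode("utf-8")
fields = [chunk.decode("utf-8") for chunk in raw.split(b",")]
print(fields)  # ['naïve', 'café', '日本語']
```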
But this is not the Unicode philosophy. All the variable-length character manipulation is supposed to be taken care of by the codecs, and then the application can deal in arrays of characters.
Right: this is the way it should be.
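(That philosophy -- decode at the edges, manipulate fixed-width characters inside -- looks roughly like this in practice; the sample text is made up:)

```python
# The codec absorbs all variable-length detail at the I/O boundary; inside
# the program, a Unicode string behaves as a plain array of characters.
data = "Grüße, 世界".encode("utf-8")  # bytes as they might arrive from a socket
s = data.decode("utf-8")              # decode once, at the boundary
assert len(s) == 9                    # counted in characters, not bytes
assert s[7] == "世"                   # direct indexing by character
out = s.replace("世界", "world").encode("utf-8")  # encode once, on the way out
print(out.decode("utf-8"))
```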
My difficulty with PEP 261 is that I'm afraid few people will actually enable 32-bit support (*what*?! all unicode strings become 32 bits wide? no way!), therefore making programs non-portable in very subtle ways.
My hope and expectation is that those folks who need 32-bit support will enable it. If this solution is not sufficient, we may have to provide something else in the future, but given that the implementation effort for PEP 261 was very minimal (certainly less than the time expended in discussing it) I am very happy with it.

It will take quite a while until lots of folks will need the 32-bit support (there aren't that many characters defined outside the basic plane yet). In the meantime, those who need 32-bit support should be happy that we allow them to rebuild Python with it. In the next 5-10 years, the 32-bit support requirement will become more common -- as will the memory upgrades to make it painless. It's not like Python is making this decision in a vacuum either: Linux already has a 32-bit wchar_t. 32-bit characters will eventually be common (even in Windows, which probably has the largest investment in 16-bit Unicode of any system at the moment). Like IPv6, we're trying to enable uncommon uses of Python without breaking things for the not-so-early adopters.

Again, don't see PEP 261 as the ultimate answer to all your 32-bit Unicode questions. Just consider that realistically we have two choices: stick with 16-bit support only, or make 32-bit support an option. Other approaches (more surrogate support, run-time choices, transparent variable-length encodings) simply aren't realistic -- no one has the time to code them.

It should be easy to write portable Python programs that work correctly with 16-bit Unicode characters on a "narrow" interpreter and also work correctly with 21-bit Unicode on a "wide" interpreter: just avoid using surrogates. If you *need* to work with surrogates, try to limit yourself to very simple operations like concatenation of valid strings, and splitting strings at known delimiters only. There's a lot you can do with this.

--Guido van Rossum (home page: http://www.python.org/~guido/)
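(The narrow/wide difference described above can be observed directly: a build's width is exposed as sys.maxunicode, and on a narrow build a non-BMP character occupies two code units. A sketch that runs correctly on either build:)

```python
import sys

# U+10330 (GOTHIC LETTER AHSA) lies outside the Basic Multilingual Plane.
c = "\U00010330"

if sys.maxunicode == 0xFFFF:      # "narrow" build: 16-bit code units
    assert len(c) == 2            # stored as a surrogate pair
else:                             # "wide" build (all of CPython 3.3+)
    assert len(c) == 1            # one character, one code point

# Concatenation and splitting at known non-surrogate delimiters behave
# identically on both builds -- which is why they are safe, portable operations.
assert (c + "," + c).split(",") == [c, c]
```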