
Guido van Rossum wrote:
<PEP: 261>
The problem I have with this PEP is that it is a compile-time option, which makes it hard to work with both 32-bit and 16-bit strings in one program. Cannot the 32-bit string type be introduced as an additional type?
Not without an outrageous amount of additional coding (every place in the code that currently uses PyUnicode_Check() would have to be bifurcated into a 16-bit and a 32-bit variant).
Alternatively, a Unicode object could *internally* be either 8, 16 or 32 bits wide (to be clear: not per character, but per string). Also a lot of work, but it'll be a lot less wasteful.
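This per-string (rather than per-character) width selection is roughly the idea CPython later adopted. A minimal sketch of the selection rule, in modern Python for illustration only (the names here are hypothetical, not CPython's internals):

```python
def storage_width(s: str) -> int:
    """Return the smallest unit size (in bits) that can represent every
    character of s directly, without resorting to surrogate pairs."""
    max_cp = max(map(ord, s), default=0)
    if max_cp < 0x100:
        return 8    # fits in one byte per character (Latin-1 range)
    if max_cp < 0x10000:
        return 16   # fits in the Basic Multilingual Plane
    return 32       # astral characters need full 32-bit units

print(storage_width("hello"))         # pure ASCII: 8 bits per character
print(storage_width("\u20ac"))        # euro sign, BMP: 16 bits
print(storage_width("\U0001F40D"))    # astral character: 32 bits
```

The point is that the wide representation is paid for only by the strings that actually need it.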
I doubt that the desire to work with both 16- and 32-bit characters in one program is typical for folks using Unicode -- that's mostly limited to folks writing conversion tools. Python will offer the necessary codecs so you shouldn't have this need very often.
Not a lot of people will want to work with 16- or 32-bit chars directly, but I think a less wasteful solution to the surrogate pair problem *will* be desired by people. Why use 32 bits for all strings in a program when only a tiny percentage actually *needs* more than 16? (Or even 8...)
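The waste is easy to quantify: a mostly-ASCII string stored in 32-bit units costs four bytes per character, versus two in a 16-bit representation and one in UTF-8. A small illustration using the encoded byte lengths as a proxy for the in-memory cost:

```python
# One astral character in an otherwise ASCII string.
text = "mostly ASCII with one astral char: \U0001F40D"

# Encoded size stands in for storage cost at each unit width.
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(text.encode(enc)), "bytes")
```

The UTF-32 form is roughly four times the size of the UTF-8 form for text like this, which is exactly the overhead a blanket 32-bit build imposes on every string.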
Iteration through the code units in a string is a problem waiting to bite you and string APIs should encourage behaviour which is correct when faced with variable width characters, both DBCS and UTF style.
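The trap is that naive iteration sees code units, not characters: an astral character stored as a surrogate pair shows up as two units, and any consumer of a 16-bit string API must do the pairing itself to be character-correct. A sketch of that bookkeeping (the helper names are invented for illustration):

```python
def utf16_code_units(s: str):
    """Yield the raw 16-bit code units of s, as a 16-bit build would store them."""
    data = s.encode("utf-16-le")
    for i in range(0, len(data), 2):
        yield int.from_bytes(data[i:i + 2], "little")

def chars_from_units(units):
    """Recombine surrogate pairs into code points -- the extra step naive
    per-unit iteration silently skips."""
    units = iter(units)
    for u in units:
        if 0xD800 <= u < 0xDC00:          # high surrogate: consume its partner
            low = next(units)             # assumes well-formed input
            yield 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
        else:
            yield u

s = "a\U0001F600b"
units = list(utf16_code_units(s))
print(len(units))                         # 4 code units...
print(len(list(chars_from_units(units)))) # ...but only 3 characters
```

Any API that hands out bare code units invites every caller to get this wrong independently, which is the argument for keeping it behind the string type.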
But this is not the Unicode philosophy. All the variable-length character manipulation is supposed to be taken care of by the codecs, and then the application can deal in arrays of characters.
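That workflow can be illustrated in modern Python: decode once at the boundary, and from then on indexing and length are per character, regardless of how many bytes each character took on the wire.

```python
# Bytes arriving from the outside world, in a variable-width encoding.
raw = "ma\u00f1ana \U0001F40D".encode("utf-8")

# The codec absorbs all variable-width handling at the boundary...
text = raw.decode("utf-8")

# ...so the application sees a plain array of characters.
print(len(text))   # 8 characters, though the UTF-8 form is longer
print(text[7])     # the astral character, addressed by character index
```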
Right: this is the way it should be. My difficulty with PEP 261 is that I'm afraid few people will actually enable 32-bit support (*what*?! all unicode strings become 32 bits wide? no way!), therefore making programs non-portable in very subtle ways. Just