[Python-Dev] len(chr(i)) = 2?

Terry Reedy tjreedy at udel.edu
Thu Nov 25 06:39:30 CET 2010


On 11/24/2010 3:06 PM, Alexander Belopolsky wrote:

> Any non-trivial text processing is likely to be broken in the presence
> of surrogates.  Producing them on input is just trading a known issue
> for an unknown one.  Processing surrogate pairs in Python code is hard.
> Software that has to support non-BMP characters will most likely be
> written for a wide build and contain subtle bugs when run under a
> narrow build.  Note that my latest proposal does not abolish
> surrogates outright.  Users who want them can still use something like
> the "surrogateescape" error handler for non-BMP characters.

It seems to me that what you are asking for is an alternate, optional
utf-8-bmp codec that would raise an error, in either direction, for
non-BMP characters. Then, as you suggest, code that is not prepared for
surrogates would never be handed them.
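
A rough sketch of what such a codec might look like follows. This is only
an illustration, not an existing codec: the name "utf-8-bmp" and the
helpers bmp_encode, bmp_decode and _first_non_bmp are made up here, only
the "strict" error handler is supported, and it assumes a Python 3.3+
interpreter where each str item is a whole code point (on a narrow build
the check would have to look for surrogate pairs instead).

```python
import codecs

CODEC_NAME = "utf-8-bmp"  # hypothetical name, not a codec shipped with Python


def _first_non_bmp(text):
    """Return the index of the first code point above U+FFFF, or -1."""
    for pos, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            return pos
    return -1


def bmp_encode(text, errors="strict"):
    """Encode to UTF-8, refusing any non-BMP character."""
    pos = _first_non_bmp(text)
    if pos >= 0:
        raise UnicodeEncodeError(CODEC_NAME, text, pos, pos + 1,
                                 "non-BMP character not allowed")
    return text.encode("utf-8", "strict"), len(text)


def bmp_decode(data, errors="strict"):
    """Decode from UTF-8, refusing any non-BMP character."""
    raw = bytes(data)
    text = raw.decode("utf-8", "strict")
    pos = _first_non_bmp(text)
    if pos >= 0:
        # Report byte offsets: a non-BMP character is 4 bytes in UTF-8.
        start = len(text[:pos].encode("utf-8"))
        raise UnicodeDecodeError(CODEC_NAME, raw, start, start + 4,
                                 "non-BMP character not allowed")
    return text, len(raw)


def _search(name):
    # The lookup name arrives normalized; newer versions turn "-" into "_".
    if name.replace("-", "_") == "utf_8_bmp":
        return codecs.CodecInfo(bmp_encode, bmp_decode, name=CODEC_NAME)
    return None


codecs.register(_search)
```

With that registered, "caf\u00e9".encode("utf-8-bmp") behaves like plain
UTF-8, while "\U0001F600".encode("utf-8-bmp") and
b"\xf0\x9f\x98\x80".decode("utf-8-bmp") both raise.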

-- 
Terry Jan Reedy


