Re: [Python-Dev] PEP 393 Summer of Code Project

26 Aug 2011

On Wed, Aug 24, 2011 at 8:34 PM, Greg Ewing wrote:
> ...
> What about things like the surrogateescape codec that
> deliberately use code units in non-standard ways? Will
> tricks like that still be possible if the code-unit
> level is hidden from the programmer?

I would think that it should still be possible to explicitly put
surrogates into a string, using the appropriate \uxxxx escape or
chr(i) or some such approach; the basic string operations IMO
shouldn't bother with checking for well-formed character sequences
(just as they shouldn't care about normal forms). But decoding bytes
from UTF-16 should not leave any surrogate pairs in, since
interpreting those is part of the decoding.
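
To make the distinction concrete, here is a minimal sketch, assuming a
CPython 3.3 or later interpreter (that is, one with the PEP 393 string
representation; this describes later behaviour, not the builds
available when this message was written). The variable names are
illustrative only.

```python
# Surrogates can be put into a str explicitly, via an escape or chr().
s = "\ud83d" + chr(0xDE00)
# Basic string operations do not validate or combine them:
# the result is two separate code points.
s_len = len(s)
joined = s + "abc"

# Decoding UTF-16, by contrast, interprets the surrogate pair.
# 3D D8 00 DE is the pair D83D DE00 in little-endian order.
utf16_bytes = b"\x3d\xd8\x00\xde"
decoded = utf16_bytes.decode("utf-16-le")

# The surrogateescape handler maps an undecodable byte 0xE9 to the
# lone surrogate U+DCE9, and maps it back again on encoding.
escaped = b"caf\xe9".decode("utf-8", "surrogateescape")
restored = escaped.encode("utf-8", "surrogateescape")
```

Here `decoded` holds the single code point U+1F600, while `s` still
holds two surrogates that nothing has tried to combine.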

I'm not sure what should happen with UTF-8 when it (in flagrant
violation of the standard, I presume) contains two separately-encoded
surrogates forming a valid surrogate pair; probably whatever the UTF-8
codec does on a wide build today should be good enough. Similarly for
encoding to UTF-8 on a wide build if one managed to create a string
containing a surrogate pair. Basically, I'm for a
garbage-in-garbage-out approach (with separate library functions to
detect garbage if the app is worried about it).
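
As a rough sketch of what such a separate detection function might
look like: `find_surrogates` below is a hypothetical helper, not an
existing library function. The codec calls assume a current CPython 3,
whose strict UTF-8 codec rejects encoded surrogates and whose
"surrogatepass" error handler lets them through; that is not
necessarily what a wide build did at the time of this message.

```python
def find_surrogates(s):
    """Return the indices of code points in the range U+D800..U+DFFF.

    Hypothetical helper: it only reports surrogates, it does not
    try to repair or combine them.
    """
    return [i for i, ch in enumerate(s) if 0xD800 <= ord(ch) <= 0xDFFF]


# U+D83D and U+DE00, each encoded separately as three UTF-8 bytes.
raw = b"\xed\xa0\xbd\xed\xb8\x80"

# The strict codec refuses this input.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogatepass the garbage goes in and comes back out unchanged.
passed = raw.decode("utf-8", "surrogatepass")
round_trip = passed.encode("utf-8", "surrogatepass")
```

An application that cares can call `find_surrogates(passed)` and decide
for itself what to do; the string operations stay out of it.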

-- 
--Guido van Rossum (python.org/~guido)