On Wed, Aug 24, 2011 at 8:34 PM, Greg Ewing wrote:
> What about things like the surrogateescape codec that deliberately use
> code units in non-standard ways? Will tricks like that still be possible
> if the code-unit level is hidden from the programmer?

I would think that it should still be possible to explicitly put surrogates into a string, using the appropriate \uxxxx escape or chr(i) or some such approach; the basic string operations IMO shouldn't bother with checking for well-formed character sequences (just as they shouldn't care about normal forms).

But decoding bytes from UTF-16 should not leave any surrogate pairs in, since interpreting those is part of the decoding. I'm not sure what should happen with UTF-8 when it (in flagrant violation of the standard, I presume) contains two separately-encoded surrogates forming a valid surrogate pair; probably whatever the UTF-8 codec does on a wide build today should be good enough. Similarly for encoding to UTF-8 on a wide build if one managed to create a string containing a surrogate pair.

Basically, I'm for a garbage-in-garbage-out approach (with separate library functions to detect garbage if the app is worried about it).

-- 
--Guido van Rossum (python.org/~guido)
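
For illustration only, here is a minimal sketch of the behaviours discussed above, as they appear in a present-day CPython 3 interpreter (3.3 or later is assumed; this is not the 2011 wide build referred to in the message, and the strict UTF-8 codec there may have behaved differently). The helper has_surrogates is a hypothetical name standing in for the kind of "detect garbage" library function mentioned above; it is not an existing standard-library API.

```python
# Surrogates can be placed in a string explicitly; basic string
# operations such as concatenation and len() do not validate them.
s = "abc" + chr(0xDC80)
also_s = "abc\udc80"

# surrogateescape maps the undecodable byte 0x80 to the lone surrogate
# U+DC80 on decoding, and restores the original byte on encoding.
raw = b"abc\x80"
text = raw.decode("utf-8", errors="surrogateescape")
round_trip = text.encode("utf-8", errors="surrogateescape")

# Decoding UTF-16 interprets a surrogate pair as one code point.
# D83D DE00 (little-endian bytes below) is the pair for U+1F600.
decoded = b"\x3d\xd8\x00\xde".decode("utf-16-le")

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    s.encode("utf-8")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True

# The surrogatepass handler lets two separately-encoded surrogates
# through in both directions, without combining them into one character.
pair = "\ud83d\ude00"
passed = pair.encode("utf-8", errors="surrogatepass")
back = passed.decode("utf-8", errors="surrogatepass")


def has_surrogates(value):
    """Hypothetical checker: True if any code point is a surrogate."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in value)
```

With this split, ordinary string operations stay cheap and unchecked, and an application that cares about well-formedness calls something like has_surrogates(text) at the boundary where it matters.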