
I'd suggest not using the term "character" in this PEP at all; this is also what Mark Davis recommends in his paper on Unicode.
I like this idea! I know that I *still* have a hard time not thinking "C 'char' datatype, i.e. an 8-bit byte" when I read "character"...
Why not make the codec that Python uses to convert Unicode literals to Unicode strings an option, just like the default encoding?
That way we could have a version of the unicode-escape codec which supports surrogates and one which doesn't.
Smart idea, but how practical is this? Can you spec this out a bit more?
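To make the two-variant idea a bit more concrete, here is a rough sketch of what a surrogate-aware and a surrogate-free decoder for \UXXXXXXXX escapes might look like. The names decode_escapes and use_surrogates are made up for illustration and are not part of any existing codec; the output is a plain list of integers so that the example does not depend on how the interpreter stores Unicode strings internally.

```python
import re

# Hypothetical helper, not an existing codec API: matches \UXXXXXXXX escapes.
_ESCAPE = re.compile(r"\\U([0-9a-fA-F]{8})")


def _emit(cp, use_surrogates, units):
    """Append code point cp to units, split into a surrogate pair if requested."""
    if cp > 0x10FFFF:
        raise ValueError("code point out of range: %#x" % cp)
    if use_surrogates and cp > 0xFFFF:
        offset = cp - 0x10000
        units.append(0xD800 + (offset >> 10))    # high (lead) surrogate
        units.append(0xDC00 + (offset & 0x3FF))  # low (trail) surrogate
    else:
        units.append(cp)


def decode_escapes(text, use_surrogates):
    """Decode \\UXXXXXXXX escapes in text into a list of integer units.

    With use_surrogates=True the result is a list of 16-bit code units;
    with use_surrogates=False it is a list of full code points.
    """
    units = []
    pos = 0
    for match in _ESCAPE.finditer(text):
        # Copy everything before the escape unchanged.
        for ch in text[pos:match.start()]:
            _emit(ord(ch), use_surrogates, units)
        _emit(int(match.group(1), 16), use_surrogates, units)
        pos = match.end()
    # Copy whatever follows the last escape.
    for ch in text[pos:]:
        _emit(ord(ch), use_surrogates, units)
    return units
```

The choice between the two behaviours is then a single flag, which is roughly what making the literal codec "an option" would amount to; how such an option would actually be set is left open here.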
+1 on removing knowledge about surrogates from the Unicode implementation core (it's also the easiest: there is none :-)
Except for \U currently -- or is that not part of the implementation core?
We should add a new module with a few handy utilities though: functions which provide code point-, character-, word- and line-based indexing into Unicode strings.
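As an illustration of what code point-based indexing could mean when the core knows nothing about surrogates, here is a minimal sketch that treats a string as a sequence of 16-bit code units and pairs up surrogates on the fly. The function name code_point_at is hypothetical; nothing here is a proposal for the actual module.

```python
def code_point_at(units, index):
    """Return the index-th code point in a sequence of 16-bit code units.

    A high surrogate followed by a low surrogate counts as one code point;
    an unpaired surrogate is returned as-is.
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    i = 0
    count = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < n
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            step = 2
        else:
            cp = unit
            step = 1
        if count == index:
            return cp
        count += 1
        i += step
    raise IndexError("code point index out of range")
```

Note that this lookup is linear in the length of the string, which is one reason to keep it in a utility module rather than in the indexing operation of the string type itself.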
But its design is outside the scope of this PEP, I'd say.

--Guido van Rossum (home page: http://www.python.org/~guido/)