[Python-ideas] Processing surrogates in

Stephen J. Turnbull stephen at xemacs.org
Wed May 6 08:36:23 CEST 2015


Andrew Barnert writes:

 > But the PEP 393 machinery doesn't know when it's dealing with
 > strings that are ultimately destined for a UCS-2 application,
 > any more than it can know when it's dealing with strings that have
 > to be pure ASCII or CP1252 or any other character set.

Of course[1] it doesn't, and that's why I say the whole issue is just
frustration speaking.  Whatever we do, it's going to require that the
programmers know what they're doing, or they're just throwing their
garbage in the neighbor's yard.

With respect to doing the check in the str machinery, you can provide
an option that tells PEP 393 str to raise an "OutOfRepertoireError"
(subclass of UnicodeError) on introduction of astral characters to an
instance of str, or provide an API to ask an instance if it's wide
enough to accommodate astral characters.
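
For illustration, here is a minimal pure-Python sketch of both options,
done outside the str machinery rather than as an option inside it.
OutOfRepertoireError is the name floated above; has_astral and
require_bmp are hypothetical helper names, not existing APIs, and the
sketch assumes "astral" means any code point above U+FFFF.

```python
class OutOfRepertoireError(UnicodeError):
    """Hypothetical error: text contains characters outside the allowed repertoire."""


def has_astral(s):
    """Return True if s contains any code point above U+FFFF (outside the BMP)."""
    return any(ord(ch) > 0xFFFF for ch in s)


def require_bmp(s):
    """Return s unchanged, or raise if it contains an astral character."""
    for index, ch in enumerate(s):
        if ord(ch) > 0xFFFF:
            raise OutOfRepertoireError(
                "astral character U+%04X at index %d" % (ord(ch), index))
    return s
```

Note that the application still has to call require_bmp at every point
where text enters it, which is exactly the design burden at issue.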

Either way, the programmer needs to design and implement the
application to use those features, and that's hard.  "Toto!  I don't
think we're in Kansas anymore!"

 > If you want to print emoji to a CP1252 console or write them to a
 > Shift-JIS text file, you get an error from an explicit or implicit
 > `str.encode` that you can debug.
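
The failure mode described in the quote can be reproduced with a short
sketch like the following. try_encode is a hypothetical helper used only
to capture the exception for inspection; the codecs named are the
standard "cp1252" and "shift_jis" codecs.

```python
def try_encode(text, encoding):
    """Return (encoded_bytes, None) on success or (None, exception) on failure."""
    try:
        return text.encode(encoding), None
    except UnicodeEncodeError as exc:
        return None, exc


emoji = "\U0001F600"  # GRINNING FACE, an astral (non-BMP) character

# Neither legacy codec has a mapping for the emoji, so both calls fail.
data_cp1252, err_cp1252 = try_encode(emoji, "cp1252")
data_sjis, err_sjis = try_encode(emoji, "shift_jis")

# The exception records where the offending character sits in the string,
# which is what makes the failure debuggable.
position = err_cp1252.start
```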

Yup, and these proposals for str2str conversions propose to sneak data
with unknown meaning into the application as if it were well-formed.
This is just like assuming the modular arithmetic that is performed in
registers is actually mathematical integer arithmetic.  You'll almost
never get burned.  Isn't that good enough?

That's not for me to say, but apparently, "small integer arithmetic"
is *not* good enough for Python.
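
The analogy can be made concrete with a small sketch. add_u32 is a
hypothetical helper that imitates a 32-bit register by masking the
result; Python's own int does not wrap and instead grows to hold the
exact value.

```python
def add_u32(a, b):
    """Add as an unsigned 32-bit register would: the result wraps modulo 2**32."""
    return (a + b) & 0xFFFFFFFF


largest = 0xFFFFFFFF            # largest unsigned 32-bit value

wrapped = add_u32(largest, 1)   # register-style: silently wraps around
exact = largest + 1             # Python int: mathematically correct result

# For small operands the two agree, which is why the wraparound
# "almost never" burns anyone.
small_agrees = add_u32(2, 3) == 2 + 3
```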

Footnotes: 
[1]  In the current implementation.  We could provide a fontconfig-
like charset facility to describe repertoire restrictions in str, and
code to enforce it.  But this is a delicate question.  Users almost
always hate repertoire restrictions when imposed for the programmer's
convenience: they want to insert emoji, or write foreign words
correctly, or cut-and-paste from email or web pages, or whatever.  And
of course the restrictions may vary depending on the output media.


