[Python-ideas] Processing surrogates in

Tue May 5 10:23:52 CEST 2015

Serhiy Storchaka writes:

 > Use cases include programs that use tkinter (common build of Tcl/Tk 
 > don't accept non-BMP characters), email or wsgiref.

So, consider Tcl/Tk.  If you use it for input, no problem, it *can't*
produce non-BMP characters.  So you're using it for output.  If,
knowing that your design involves tkinter, you deduce that you must not
accept non-BMP characters on input, where's your problem?
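
For concreteness, here is a minimal sketch of what "must not accept
non-BMP characters on input" could look like.  The function names are
hypothetical, made up for illustration, and the check assumes a Tcl/Tk
build limited to the BMP (code points up to U+FFFF):

```python
def has_non_bmp(text):
    """Return True if text contains a code point above U+FFFF."""
    return any(ord(ch) > 0xFFFF for ch in text)


def check_tk_input(text):
    """Reject input that a BMP-only Tcl/Tk build could not display later.

    Hypothetical helper: validate once at the input boundary instead of
    patching strings up just before they reach tkinter.
    """
    if has_non_bmp(text):
        raise ValueError("non-BMP character in input destined for tkinter")
    return text
```

Text such as "na\u00efve \u20ac" passes through unchanged, while anything
containing, say, "\U0001F600" is refused before it gets near a widget.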

And ... you looked twice at your proposal?  You have basically
reproduced the codec error handling API for .decode and .encode in a
bunch of str2str "rehandle" functions.  In other words, you need to
know as much to use "rehandle_*" properly as you do to use .decode and
.encode.  I do not see a win for the programmer who is mostly innocent
of encoding knowledge.  What you're going to see is what Ezio points
out in issue18814:

    With Python 2 I've seen lot of people blindingly trying .decode
    when .encode failed (and the other way around) whenever they were
    getting an UnicodeError[...].

    I'm afraid that confused developers will try to (mis)use redecode
    as a workaround to attempt to fix something that shouldn't be
    broken in the first place, without actually understanding what the
    real problem is.

If we apply these rehandle_* thumbs to the holes in the I18N dike,
it's just going to spring more leaks elsewhere.
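
To illustrate how much codec knowledge a str2str step needs, here is a
rough sketch using only the existing .encode/.decode API.  The helper
name is hypothetical and is not one of the proposed functions; the
point is only that the caller still has to choose a codec and two
error handlers:

```python
def replace_escaped_bytes(text):
    """Turn surrogateescape'd bytes in a str into U+FFFD.

    Hypothetical helper built from the existing codec API.  It assumes
    the lone surrogates came from decoding UTF-8 with the
    'surrogateescape' handler; other lone surrogates make .encode raise.
    """
    raw = text.encode("utf-8", errors="surrogateescape")
    return raw.decode("utf-8", errors="replace")


cleaned = replace_escaped_bytes("caf\udce9")   # 'caf\ufffd'

try:
    replace_escaped_bytes("\ud800")            # not an escaped byte
    failed = False
except UnicodeEncodeError:
    failed = True                              # "those damned Unicode errors" again
```

Pick the wrong handler for where the string came from and the error
just moves somewhere else.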

 > See issue18814. It is not so easy to get desirable result.

That's because it is damn hard to get desirable results, end of story,
nothing to see here, move along, people, move along!  The only way
available to consistently get desirable results is a Swiftian "Modest
Proposal": euthanize all those miserable folks using non-UTF-8
encodings, and start the world over again.

Seriously, I see nothing in issue18814 except frustration.  There's no
plausible account of how these new functions are going to enable naive
programmers to get better results, just complaints that the current
situation is unbearable.  I can't speak to wsgiref, but in email I
think David is overly worried about efficiency: on pretty much all
real mail flows, the cost of the occasional need to mess with
surrogates is going to be far overshadowed by spam/virus filtering and
authentication (DKIM signature verification and DMARC/DKIM/SPF DNS
lookups).

So this proposal merely amounts to a reintroduction of the Python 2 str
confusion into Python 3.  It is dangerous *precisely because* the
current situation is so frustrating.  These functions will not be used
by "consenting adults", in most cases.  Those with sufficient
knowledge for "informed consent" also know enough to decode encoded
text ASAP, and encode internal text ALAP, with appropriate handlers,
in the first place.
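
A minimal sketch of that discipline, assuming UTF-8 as the expected
encoding and "surrogateescape" as the appropriate handler (both are
assumptions for the example, as are the function names):

```python
def read_text(raw):
    """Decode as soon as the bytes arrive.

    Undecodable bytes become lone surrogates (U+DC80..U+DCFF) rather
    than raising, so they can be written back out unchanged.
    """
    return raw.decode("utf-8", errors="surrogateescape")


def write_text(text):
    """Encode as late as possible, with the matching handler."""
    return text.encode("utf-8", errors="surrogateescape")


raw_in = b"caf\xe9 latte"        # Latin-1 bytes, not valid UTF-8
text = read_text(raw_in)         # 'caf\udce9 latte'
raw_out = write_text(text.upper())
# raw_out == b"CAF\xe9 LATTE": the stray byte round-trips untouched
```

Everything between read_text and write_text works on str only, and the
choice of handler is made exactly twice, at the boundaries.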

Rather, these str2str functions will be used by programmers at the
ends of their ropes, desperate to suppress "those damned Unicode
errors" by any means available.  In fact, they are most likely to be
used and recommended by *library* writers, because they're the ones
who are least likely to have control over input, or to know their
clients' requirements for output.  "Just use rehandle_* to ameliorate
the errors" is going to be far too tempting for them to resist.

That Nick, of all people, supports this proposal is to me just
confirmation that it's frustration, and only frustration, speaking
here.  He used to be one of the strongest supporters of keeping
"native text" (Unicode) and "encoded text" separate by keeping the
latter in bytes.