[Python-ideas] Processing surrogates in
Andrew Barnert
abarnert at yahoo.com
Tue May 5 12:00:53 CEST 2015
On May 5, 2015, at 01:23, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>
> Serhiy Storchaka writes:
>
>> Use cases include programs that use tkinter (common builds of Tcl/Tk
>> don't accept non-BMP characters), email or wsgiref.
>
> So, consider Tcl/Tk. If you use it for input, no problem, it *can't*
> produce non-BMP characters. So you're using it for output. If
> knowing that your design involves tkinter, you deduce you must not
> accept non-BMP characters on input, where's your problem?
The real issue with tkinter (and similar cases that can't handle non-BMP characters) is that they're actually UCS-2, and we paper over that by pretending the interface is Unicode. Maybe it would be better to have the low-level interfaces take `bytes` rather than `str`, and put an explicit `.encode('UCS-2')` in the higher-level interfaces (or even in user code?), so that the problem is obvious and debuggable instead of hidden.
(I'm not sure if we actually have a UCS-2 codec, but if not, it's trivial to write--it's just UTF-16 without surrogates.)
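Something like the following is a minimal sketch of that idea, not an existing stdlib codec. The name `encode_ucs2` and its `byteorder` parameter are made up for illustration; the only real machinery it relies on is the built-in `utf-16-le`/`utf-16-be` codecs and the `UnicodeEncodeError` constructor.

```python
def encode_ucs2(text, byteorder="little"):
    """Encode text as UCS-2: UTF-16 code units with no surrogates allowed.

    Hypothetical helper, not part of the standard library.
    """
    if byteorder not in ("little", "big"):
        raise ValueError("byteorder must be 'little' or 'big'")

    for index, char in enumerate(text):
        code = ord(char)
        if code > 0xFFFF:
            # Non-BMP characters would need a surrogate pair in UTF-16,
            # which UCS-2 cannot represent.
            raise UnicodeEncodeError(
                "ucs-2", text, index, index + 1, "non-BMP character")
        if 0xD800 <= code <= 0xDFFF:
            # Lone surrogates (e.g. from surrogateescape) are not characters.
            raise UnicodeEncodeError(
                "ucs-2", text, index, index + 1, "surrogate code point")

    # Every remaining character is exactly one 16-bit code unit.
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    return text.encode(codec)
```

With this, a string containing a non-BMP character fails at the boundary with a `UnicodeEncodeError` that points at the offending index, rather than failing somewhere inside Tcl/Tk.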
> And ... you looked twice at your proposal? You have basically
> reproduced the codec error handling API for .decode and .encode in a
> bunch of str2str "rehandle" functions. In other words, you need to
> know as much to use "rehandle_*" properly as you do to use .decode and
> .encode. I do not see a win for the programmer who is mostly innocent
> of encoding knowledge. What you're going to see is what Ezio points
> out in issue18814:
>
> With Python 2 I've seen a lot of people blindly trying .decode
> when .encode failed (and the other way around) whenever they were
> getting a UnicodeError[...].
>
> I'm afraid that confused developers will try to (mis)use redecode
> as a workaround to attempt to fix something that shouldn't be
> broken in the first place, without actually understanding what the
> real problem is.
>
> If we apply these rehandle_* thumbs to the holes in the I18N dike,
> it's just going to spring more leaks elsewhere.
>
>> See issue18814. It is not so easy to get a desirable result.
>
> That's because it is damn hard to get desirable results, end of story,
> nothing to see here, move along, people, move along! The only way
> available to consistently get desirable results is a Swiftian "Modest
> Proposal": euthanize all those miserable folks using non-UTF-8
> encodings, and start the world over again.
>
> Seriously, I see nothing in issue18814 except frustration. There's no
> plausible account of how these new functions are going to enable naive
> programmers to get better results, just complaints that the current
> situation is unbearable. I can't speak to wsgiref, but in email I
> think David is overly worried about efficiency: in most mail flows,
> the occasional need to mess with surrogates is going to be far
> overshadowed by spam/virus filtering and authentication (DKIM
> signature verification and DMARC/DKIM/SPF DNS lookups) on pretty much
> all real mailflows.
>
> So this proposal merely amounts to reintroduction of the Python 2 str
> confusion into Python 3. It is dangerous *precisely because* the
> current situation is so frustrating. These functions will not be used
> by "consenting adults", in most cases. Those with sufficient
> knowledge for "informed consent" also know enough to decode encoded
> text ASAP, and encode internal text ALAP, with appropriate handlers,
> in the first place.
>
> Rather, these str2str functions will be used by programmers at the
> ends of their ropes desperate to suppress "those damned Unicode
> errors" by any means available. In fact, they are most likely to be
> used and recommended by *library* writers, because they're the ones
> who are least likely to have control over input, or to know their
> clients' requirements for output. "Just use rehandle_* to ameliorate
> the errors" is going to be far too tempting for them to resist.
>
> That Nick, of all people, supports this proposal is to me just
> confirmation that it's frustration, and only frustration, speaking
> here. He used to be one of the strongest supporters of keeping
> "native text" (Unicode) and "encoded text" separate by keeping the
> latter in bytes.
>