[Python-ideas] Python 3000 TIOBE -3%

Tue Feb 14 09:45:24 CET 2012

On Tue, Feb 14, 2012 at 6:02 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>  > and "correctly handle any ASCII compatible encoding, but still
>  > throw UnicodeEncodeError if I'm about to emit corrupted data"
>  > ("ascii+surrogateescape").
>
> Not if I understand what ascii+surrogateescape would do correctly.
> Yes, you can pass through verbatim, but AFAICS you would have to work
> quite hard to do anything to that stream that would cause a
> UnicodeError in your program, even though you corrupt it.  (Eg, delete
> half of a multibyte EUC character.)
>
> The question is what happens if you run into a validating processor
> internally -- then you'll see an error (even though you're just
> passing it through verbatim!)

If you're only round-tripping (i.e. writing back out as
"ascii+surrogateescape") it's very hard to corrupt your data stream
with processing that assumes an ASCII-compatible encoding (as you
point out, you'd have to be splitting on arbitrary codepoints instead
of searching for ASCII first).
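
To make the round-trip case concrete, here is a minimal sketch. The
sample data and the choice of EUC-JP are my own assumptions for
illustration, not anything taken from a real program:

```python
# Sample bytes in an ASCII-compatible multibyte encoding.
# EUC-JP and the "key=" prefix are illustrative assumptions only.
raw = "key=\u65e5\u672c\u8a9e\n".encode("euc_jp")

# Decode as ASCII; each non-ASCII byte is carried through as a lone
# surrogate code point (U+DC80..U+DCFF) instead of raising an error.
text = raw.decode("ascii", errors="surrogateescape")

# Processing that only searches for ASCII markers never cuts through
# the middle of a multibyte sequence.
key, sep, value = text.partition("=")

# Encoding with the same codec and error handler restores the bytes.
round_tripped = (key + sep + value).encode("ascii", errors="surrogateescape")
```

As long as the output side uses the same codec and error handler, the
bytes that come out should match the bytes that went in.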

However, it's trivial to get an error when you go to encode the data
stream without one of the silencing error handlers set. In particular,
sys.stdout has error handling set to strict, which I believe is likely
to throw UnicodeEncodeError if you try to feed a string containing
surrogate escaped bytes to an encoding that can't handle them. (Of
course, if sys.stdout.encoding is "UTF-8", then you're right, those
characters will just be displayed as gibberish, as they would in the
latin-1 case. I guess it's only on Windows and in any other locations
with a more restrictive default stdout encoding that errors are
particularly likely).
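
A rough sketch of that failure mode, using plain ASCII as a stand-in
for a restrictive output encoding (the byte string is made up, and
other target codecs may behave differently):

```python
# One stray non-ASCII byte (0xE9) in otherwise ASCII data.
text = b"caf\xe9".decode("ascii", errors="surrogateescape")

# Encoding with the default "strict" handler refuses the escaped byte.
try:
    text.encode("ascii")
except UnicodeEncodeError:
    failed = True
else:
    failed = False

# Naming the error handler again on output restores the original byte.
restored = text.encode("ascii", errors="surrogateescape")
```

Here the strict encode fails on the escaped byte, while the
surrogateescape encode gives back the original bytes.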

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia