On Tue, Feb 14, 2012 at 6:02 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
> and "correctly handle any ASCII compatible encoding, but still > throw UnicodeEncodeError if I'm about to emit corrupted data" > ("ascii+surrogateescape").
Not if I understand what ascii+surrogateescape would do correctly. Yes, you can pass through verbatim, but AFAICS you would have to work quite hard to do anything to that stream that would cause a UnicodeError in your program, even though you corrupt it. (Eg, delete half of a multibyte EUC character.)
The question is what happens if you run into a validating processor internally -- then you'll see an error (even though you're just passing it through verbatim!)
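For concreteness, here's a rough sketch of that scenario (my own
illustration, using an arbitrary EUC-JP character; any non-ASCII bytes
would do):

# A single two-byte EUC-JP character, smuggled through as
# "ascii+surrogateescape" text:
data = "あ".encode("euc-jp")
text = data.decode("ascii", "surrogateescape")

# Round-tripping it back out reproduces the original bytes:
assert text.encode("ascii", "surrogateescape") == data

# Deleting half of the multibyte character corrupts the stream, but
# still encodes back out without complaint:
text[1:].encode("ascii", "surrogateescape")

# It's a validating processor in the middle (here, a strict UTF-8
# encode) that actually raises:
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print("validating step:", exc)
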
If you're only round-tripping (i.e. writing back out as
"ascii+surrogateescape"), it's very hard to corrupt your data stream
with processing that assumes an ASCII compatible encoding (as you point
out, you'd have to be splitting on arbitrary code points instead of
searching for ASCII first).

However, it's trivial to get an error when you go to encode the data
stream without one of the silencing error handlers set. In particular,
sys.stdout has its error handling set to strict, which I believe is
likely to throw UnicodeEncodeError if you try to feed a string
containing surrogate-escaped bytes to an encoding that can't handle
them.

(Of course, if sys.stdout.encoding is "UTF-8", then you're right, those
characters will just be displayed as gibberish, as they would in the
latin-1 case. I guess it's only on Windows and in other locations with
a more restrictive default stdout encoding that errors are particularly
likely.)
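To illustrate (a rough sketch using in-memory streams in place of the
real sys.stdout, so the exact outcome on your system will depend on
your locale's encoding):

import io

text = b"caf\xe9".decode("ascii", "surrogateescape")

# Writing back out with the silencing handler round-trips the raw byte:
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="ascii", errors="surrogateescape")
out.write(text)
out.flush()
assert raw.getvalue() == b"caf\xe9"

# A stream left on the default strict handler (as sys.stdout typically
# is) refuses to emit the smuggled byte:
strict = io.TextIOWrapper(io.BytesIO(), encoding="ascii", errors="strict")
try:
    strict.write(text)
    strict.flush()
except UnicodeEncodeError as exc:
    print("strict stream:", exc)

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia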