Nick Coghlan writes:
 > I'd hazard a guess that the non-ASCII-compatible encoding most likely to be encountered outside Asia is UTF-16.
In other words, only people who insist on messing with application/octet-stream files (like Word ;-). They don't deserve the pain, but they're gonna feel it anyway.
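For concreteness, a minimal Python sketch of what "not ASCII compatible" means here (the exact byte values assume a little-endian machine):

    # Even pure-ASCII text encoded as UTF-16 picks up a BOM and interleaved
    # NUL bytes, so byte-oriented ASCII tooling will mangle it.
    data = "Word doc".encode("utf-16")
    print(data)                    # b'\xff\xfeW\x00o\x00r\x00d\x00 \x00d\x00o\x00c\x00'
    print(data.decode("latin-1"))  # "decodes" without error, but the text is garbage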
 > The choice is really between "never give me UnicodeErrors, but feel free to silently corrupt the data stream if I do the wrong thing with that data" (i.e. "latin-1")
Yes.
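To make that trade-off concrete, here's a rough Python sketch of the "latin-1" approach (the byte string is just an illustrative UTF-8 sample):

    # "latin-1" maps bytes 0x00-0xFF one-to-one onto the first 256 code
    # points, so any byte sequence decodes without error and round-trips.
    raw = b"caf\xc3\xa9\n"                 # actually UTF-8 for "café\n", but we don't know that
    text = raw.decode("latin-1")           # never raises, whatever the real encoding was
    assert text.encode("latin-1") == raw   # byte-exact round trip

    # The "silent corruption" half: do the wrong thing and re-encode as UTF-8.
    mangled = text.encode("utf-8")         # no error, but the data is now mojibake
    print(mangled)                         # b'caf\xc3\x83\xc2\xa9\n' (double-encoded)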
and "correctly handle any ASCII compatible encoding, but still throw UnicodeEncodeError if I'm about to emit corrupted data" ("ascii+surrogateescape").
Not if I understand correctly what ascii+surrogateescape would do. Yes, you can pass the data through verbatim, but AFAICS you would have to work quite hard to do anything to that stream that would cause a UnicodeError in your program, even though you corrupt it (e.g., delete half of a multibyte EUC character). The question is what happens if you run into a validating processor internally -- then you'll see an error (even though you're just passing the data through verbatim!).
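A rough sketch of that behaviour with plain Python codecs (the EUC-JP byte values are only illustrative):

    # Undecodable bytes become lone surrogates (U+DC80..U+DCFF), so the
    # stream passes through verbatim...
    raw = b"\xb4\xc1\xbb\xfa is EUC-JP"     # two multibyte EUC-JP characters plus ASCII
    text = raw.decode("ascii", errors="surrogateescape")
    assert text.encode("ascii", errors="surrogateescape") == raw

    # ...and nothing stops you quietly corrupting it, e.g. slicing off half
    # of a multibyte EUC character -- no exception here.
    broken = text[1:]

    # But a validating processor that insists on strict, well-formed output
    # raises, even though you were only passing the data through.
    try:
        text.encode("utf-8")                # strict encode of lone surrogates
    except UnicodeEncodeError as exc:
        print("UnicodeEncodeError:", exc)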