On 28 Aug 2014, at 19:54, Glenn Linderman wrote:
On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+python@g.nevcal.com> wrote: [...] Also for cases where the data stream is *supposed* to be in a given encoding, but contains undecodable bytes. Showing the stuff that incorrectly decodes as whatever it decodes to is generally what you want in that case. Sure, people can learn to recognize mojibake for what it is, and maybe even learn to recognize it for what it was intended to be, in limited domains. But suppressing/replacing the surrogates doesn't help with
On 8/28/2014 10:41 AM, R. David Murray wrote: that... would it not be better to replace the surrogates with an escape sequence that shows the original, undecodable, byte value? Like \xNN ?
For that we could extend the "backslashreplace" codec error callback, so that it can be used for decoding too, not just for encoding. I.e. b"a\xffb".decode("utf-8", "backslashreplace") would return "a\\xffb" Servus, Walter