On Thu, 28 Aug 2014 10:54:44 -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 8/28/2014 10:41 AM, R. David Murray wrote:
On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 8/28/2014 12:30 AM, MRAB wrote:
There'll be a surrogate escape if a byte couldn't be decoded, but just because a byte could be decoded, it doesn't mean that it's correct.
If you picked the wrong encoding, the other codepoints could be wrong too.
Aha! Thanks for pointing out the flaw in my reasoning. But that means it is also pretty useless to "replace_surrogate_escapes" at all, because it only cleans out the non-decodable characters, not the incorrectly decoded characters.
Well, replace would still be useful for ASCII+surrogateescape.
How?
Because there "can't" be any incorrectly decoded bytes in the ASCII part, so all undecodable bytes turning into 'unrecognized character' glyphs is useful. "can't" is in quotes because of course if you decode random binary data as ASCII+surrogateescape you could get a mess just like any other encoding, so this is really a "more *likely* to be useful" version of my second point, because "real" ASCII with some junk bytes mixed in is much more likely to be encountered in the wild than, say, utf-8 with some junk bytes mixed in (although that is probably changing as use of utf-8 becomes more widespread, so this point applies to utf-8 as well).
Also for cases where the data stream is *supposed* to be in a given encoding, but contains undecodable bytes. Showing the stuff that incorrectly decodes as whatever it decodes to is generally what you want in that case.
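A minimal Python sketch of the two cases above, for illustration only. The helper name `replace_surrogate_escapes` is borrowed from the wording in this thread and is hypothetical, not an existing stdlib function; the sample byte strings are made up.

```python
import re

# The surrogateescape error handler maps each undecodable byte 0x80..0xFF
# to a lone surrogate in the range U+DC80..U+DCFF.
_ESCAPED = re.compile('[\udc80-\udcff]')

def replace_surrogate_escapes(text, replacement='\ufffd'):
    """Hypothetical helper: swap each escaped byte for U+FFFD."""
    return _ESCAPED.sub(replacement, text)

# "Real" ASCII with one junk byte mixed in.
raw = b'caf\xe9 menu'
as_ascii = raw.decode('ascii', errors='surrogateescape')
# The ASCII part is trustworthy; only the junk byte became a surrogate,
# so replacing it gives something displayable.
cleaned = replace_surrogate_escapes(as_ascii)

# Wrong-but-decodable case: UTF-8 bytes read as Latin-1 produce no
# surrogates at all, so the helper has nothing to replace and the
# mojibake passes through unchanged.
mojibake = 'caf\u00e9'.encode('utf-8').decode('latin-1', errors='surrogateescape')
still_mojibake = replace_surrogate_escapes(mojibake)
```

Here `cleaned` contains a single replacement character where the junk byte was, while `still_mojibake` is identical to `mojibake`, which is the limitation MRAB pointed out.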
Sure, people can learn to recognize mojibake for what it is, and maybe even learn to recognize it for what it was intended to be, in limited domains. But suppressing/replacing the surrogates doesn't help with
Well, it does if the alternative is not being able to display the string to the user at all. And yeah, people being able to recognize mojibake in specific problem domains is what I'm talking about...not perhaps a great use case, but it is a use case.
that... would it not be better to replace the surrogates with an escape sequence that shows the original, undecodable, byte value? Like \xNN ?
Yeah, that idea has been floated as well, and I think it would indeed be more useful than the 'unknown character' glyph. I've also seen fonts that display the hex code inside a box character when the code point is unknown, which would be cool...but that can hardly be part of Unicode, can it? :)
--David
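For reference, a rough sketch of the \xNN idea in Python. The helper name `show_escaped_bytes` is hypothetical and the sample bytes are made up; this is one possible way to do it, not a proposal for the actual API.

```python
import re

# Lone surrogates produced by the surrogateescape error handler.
_ESCAPED = re.compile('[\udc80-\udcff]')

def show_escaped_bytes(text):
    """Hypothetical helper: render each surrogate-escaped byte as \\xNN."""
    # surrogateescape maps byte 0xNN to U+DCNN, so subtracting 0xDC00
    # recovers the original byte value.
    return _ESCAPED.sub(
        lambda m: '\\x%02x' % (ord(m.group()) - 0xDC00), text)

decoded = b'caf\xe9 menu'.decode('ascii', errors='surrogateescape')
visible = show_escaped_bytes(decoded)
```

Note that encoding with the existing `backslashreplace` error handler would show the surrogate code point (`\udce9`) rather than the original byte value, which is why the sketch does the arithmetic itself.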