Steven D'Aprano writes:
MRAB wrote:
encoding="ascii-ish" # gets the sloppiness right
+0.8 I'd prefer the more precise "ascii-compatible". Shift JIS is "ASCII-ish", but should not be decoded with this codec.
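To make the Shift JIS point concrete, here is a minimal sketch. The proposed codec does not exist, so plain ASCII with the surrogateescape handler stands in for it; that substitution is my assumption, not anything specified in this thread. The character chosen is one whose Shift JIS trail byte falls in the ASCII range.

```python
# Assumption: ascii + surrogateescape stands in for the proposed
# "ascii-compatible" codec, which does not exist.

# U+8868 encodes in Shift JIS as the two bytes 0x95 0x5C.
sjis_bytes = "\u8868".encode("shift_jis")

# Decoding byte by byte treats the trail byte 0x5C as an ASCII
# backslash, although it is really half of a two-byte character.
decoded = sjis_bytes.decode("ascii", errors="surrogateescape")

# The lead byte is escaped as a lone surrogate; the trail byte
# now looks like an ordinary backslash.
has_fake_backslash = "\\" in decoded
```

Any code that then splits on or escapes backslashes would cut a character in half, which is why "ASCII-ish" is too weak a criterion.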
encoding="mojibake" # :-)
You have a smiley, but I think that's the best name I've seen yet. It's explicit in what you get -- mojibake.
Explicit, but incorrect. Mojibake ("bake" means "change") is what you get when you use one encoding to encode characters, and another to decode them. Here, not only are we talking about using the same codec at both ends, but in fact it's inside out (we are decoding then encoding). This is GIGO, not mojibake.
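A small sketch of the distinction, using UTF-8 and Latin-1 purely as illustrative codecs (neither is the codec under discussion):

```python
# Mojibake: encode with one codec, decode with a different one.
text = "caf\u00e9"                                  # "café"
mojibake = text.encode("utf-8").decode("latin-1")   # e-acute becomes two characters

# The case at hand: decode then encode with the *same* codec.
# Whatever bytes came in go back out unchanged, garbage or not.
raw = b"caf\xe9"
same_codec = raw.decode("latin-1").encode("latin-1")
```

In the first case the text is visibly changed; in the second nothing is changed at all, which is the GIGO behavior.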
why not just teach them the very slightly more complex recipe
open(filename, encoding='ascii', errors='surrogateescape')
which captures the user's intent ("I want ASCII, with some way of escaping errors so I don't have to deal with them") much more accurately.
Why not? Because 'surrogateescape' does not express the user's intent. That user *will* have to deal with errors as soon as she invokes modules that validate their input, or include some portion of the text being treated in output of any kind, unless they use an error-suppressing handler themselves. Surrogates are errors in Unicode, and that's the way it should be. That's precisely why Martin felt it necessary to use this technique in PEP 383: to ensure that errors *will* occur unless you are very careful in handling strings produced with the surrogateescape handler active. It's arguable that most applications *should* want errors in these cases; I've made that argument myself. But it's quite clearly not the user's intent.
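A minimal sketch of how the error resurfaces, using an in-memory byte string rather than a file (the sample bytes are made up for illustration):

```python
# A byte string with one non-ASCII byte (0xE9).
raw = b"caf\xe9"

# Decoding succeeds: the bad byte is smuggled in as the lone
# surrogate U+DCE9, per PEP 383.
text = raw.decode("ascii", errors="surrogateescape")

# Anything that encodes strictly, e.g. writing UTF-8 output,
# now fails, because lone surrogates are not valid Unicode text.
error = None
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    error = exc

# Only code that also uses the handler gets the original bytes back.
roundtrip = text.encode("ascii", errors="surrogateescape")
```

So the user who only wanted to stop thinking about errors has deferred them to whichever module first encodes or validates the string.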