Carl M. Johnson writes:
> On Feb 10, 2012, at 5:32 PM, Stephen J. Turnbull wrote:
> > will founder on 'Óscar Fuentes' as author, unless you know what coding system is used, or know enough to use latin-1 (because it's effectively binary, not because it's the actual encoding).
> Or just use errors="surrogateescape". I think we should tell people who are scared of unicode and refuse to learn how to use it to just add an errors="surrogateescape" keyword to their file open arguments. Obviously, it's the wrong thing to do, but it's wrong in the same way that Python 2 bytes are wrong, so if you're absolutely committed to remaining ignorant of encodings, you can continue to do that.
No, it's not the same as Python 2, and it's *subtly* the wrong thing to do, too. surrogateescape is intended to round-trip input from a specific API to unchanged output to that same API, and that's all it is guaranteed to do.

Less pedantically, if you use latin-1, the internal representation is valid Unicode but (partially) incorrect content. No UnicodeErrors. If you use errors="surrogateescape", any code that insists on valid Unicode will crash. Here I'm talking about a use case where the user believes that as long as the ASCII content is correct they will get correct output.

It's arguable that using errors="surrogateescape" is a better approach, *because* of the possibility of a validity check. I tend to think not. But that's a different argument from "same as Python 2".
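
To make the difference concrete, here is a minimal sketch. It assumes the author line happens to be UTF-8 bytes and that the reader doesn't know that; the sample bytes and the helper is_valid_unicode() are made up for illustration, and "insists on valid Unicode" is approximated by a strict UTF-8 encode.

```python
# Hypothetical input: 'Óscar Fuentes' encoded as UTF-8, treated as bytes of
# unknown encoding by the reader.
raw = b"Author: \xc3\x93scar Fuentes\n"

# latin-1 maps every byte to a code point, so decoding cannot fail.
as_latin1 = raw.decode("latin-1")

# surrogateescape turns each undecodable byte into a lone surrogate
# in the range U+DC80..U+DCFF.
as_escaped = raw.decode("ascii", errors="surrogateescape")

# Both round-trip to the original bytes through the matching codec.
latin1_roundtrip = as_latin1.encode("latin-1") == raw
escaped_roundtrip = as_escaped.encode("ascii", errors="surrogateescape") == raw


def is_valid_unicode(text):
    """Stand-in for 'code that insists on valid Unicode': strict UTF-8 encode."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# The latin-1 string is valid Unicode with wrong content (mojibake);
# the surrogateescape string contains lone surrogates and is rejected.
latin1_ok = is_valid_unicode(as_latin1)
escaped_ok = is_valid_unicode(as_escaped)

print(latin1_roundtrip, escaped_roundtrip, latin1_ok, escaped_ok)
```

In both cases the ASCII part ("Author: ") comes through intact. The latin-1 string is wrong where the name is concerned but never raises; the surrogateescape string raises UnicodeEncodeError as soon as something tries a strict encode.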