On 13 February 2012 05:12, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Paul Moore writes:
> I'm now 100% convinced that
> encoding="ascii",errors="surrogateescape" is the way to say this in
> code.
It probably is, for you. If that ever gives you a UnicodeError, you know how to find out how to deal with it. And it probably won't.<wink/>
And yet, after your earlier posting on latin-1, and your comments here, I'm less certain. Thank you so much :-) Seriously, I find these discussions about Unicode immensely useful. I now have a much better feel for how to deal with (and think about) text in "unknown but mostly ASCII" format, which can only be a good thing.
I don't think either argument applies to everybody who needs such a recipe, though. Many will be best served with encoding='latin-1' by some name.
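For readers following along, here is a minimal sketch of the two options under discussion, assuming Python 3.1 or later (where the surrogateescape error handler is available). The sample bytes and the names `raw`, `text_se`, `text_l1`, `strict_utf8_failed` and `utf8_from_l1` are purely illustrative and do not come from the thread.

```python
# Illustrative input: mostly ASCII, with one byte (0xE9) whose encoding is unknown.
raw = b"caf\xe9 menu\n"

# Option 1: ASCII with surrogateescape. Each non-ASCII byte is decoded to a
# lone surrogate (0xE9 becomes U+DCE9), and encoding with the same settings
# restores the original bytes exactly.
text_se = raw.decode("ascii", errors="surrogateescape")
assert text_se == "caf\udce9 menu\n"
assert text_se.encode("ascii", errors="surrogateescape") == raw

# Option 2: latin-1. Every byte maps to the code point with the same value,
# so decoding never fails and the data also round-trips.
text_l1 = raw.decode("latin-1")
assert text_l1 == "caf\xe9 menu\n"
assert text_l1.encode("latin-1") == raw

# Where they differ: escaped bytes refuse to pass through a strict encoder,
# so the unknown data is flagged instead of being silently reinterpreted.
try:
    text_se.encode("utf-8")
except UnicodeEncodeError:
    strict_utf8_failed = True
else:
    strict_utf8_failed = False

# The latin-1 text encodes without complaint, whether or not the original
# byte really was latin-1.
utf8_from_l1 = text_l1.encode("utf-8")
```

The same `encoding` and `errors` arguments can be passed to `open()` when reading a file in text mode. Which behaviour is preferable depends on whether a late UnicodeError or silently mis-labelled characters is the lesser problem for the code at hand.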
Probably the key question is, how do we encapsulate this debate in a simple form suitable for people to find out about *without* feeling like they "have to learn all about Unicode"? A note in the Unicode HOWTO seems worthwhile, but how do we get people to look there, given that these are people who don't want to delve too deeply into Unicode issues?

Just to be clear, my reluctance to "do the right thing" was *not* because I didn't want to understand Unicode - far from it, I'm interested in, and inclined towards, "doing Unicode right". The problem is that I know enough to realise that "proper" handling of files where I don't know the encoding, and where it seems to be inconsistent sometimes (both between files, and even on occasion within a file), is a seriously hard issue. And I don't want to get into really hard Unicode issues for what, in practical terms, is a simple problem, as it's one-off code and minor corruption isn't really an issue.

Paul.