On 11 February 2012 19:46, Masklinn <masklinn@masklinn.net> wrote:
Besides, it's perfectly possible to process bytes in Python 3. You just have to open the file in binary mode and do the processing at the byte string level.
I think that's the route which should be taken.
Oh, absolutely not. When it's text, it's best to process it as Unicode.
Except it's not processed as text, it's processed as "stuff with ASCII characters in it". It might just as well be cp-1252, or UTF-8, or Shift JIS (which is kinda-sorta extended ASCII, but not exactly). And while decoding with an ISO-8859 encoding will yield Unicode data, that's about the only thing you can say about it; the actual result will probably be mojibake either way.
No, not at all. It *is* text. I *know* it's text. I know that it is encoded in an ASCII-superset (because I can read it in a text editor and *see* that it is). What I *don't* know is what those funny bits of mojibake I see in the text editor are. But I don't really care.

Yes, I could do some analysis based on the surrounding text and confirm whether it's latin-1, utf-8, or something similar. But it honestly doesn't matter to me, as all I care about is parsing the file to find the change authors, and printing their names (to re-use the "manipulating a ChangeLog file" example). And even if it did matter, the next file might be in a different ASCII-superset encoding, but I *still* won't care because the parsing code will be exactly the same.

Saying "it's bytes" is even more of a lie than "it's latin-1". The honest truth is "it's an ASCII superset", and that's all I need to know to do the job manually, so I'd like to write code to do the same job without needing to lie about what I know.

I'm now 100% convinced that encoding="ascii", errors="surrogateescape" is the way to say this in code.

Paul.
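
For illustration, a minimal sketch of the encoding="ascii", errors="surrogateescape" approach applied to the ChangeLog example. The header layout ("date, author, <email>"), the sample data, and the helper name change_authors are assumptions made up for this sketch, not something taken from the thread:

```python
import io
import re

# Assumed (hypothetical) ChangeLog header layout: "YYYY-MM-DD  Author Name  <email>"
HEADER = re.compile(r"^\d{4}-\d{2}-\d{2}\s+(.+?)\s+<[^>]*>\s*$")


def change_authors(stream):
    """Return author names from header lines, in order of first appearance."""
    authors = []
    for line in stream:
        m = HEADER.match(line)
        if m and m.group(1) not in authors:
            authors.append(m.group(1))
    return authors


# Made-up sample: one name encoded as latin-1, one as UTF-8, in the same data.
raw = (
    b"2012-02-11  Ren\xe9 Dupont  <rene@example.org>\n"
    b"\t* foo.c: fix.\n"
    b"2012-02-10  J\xc3\xbcrgen Beispiel  <jb@example.org>\n"
    b"\t* bar.c: fix.\n"
)

# Decode as ASCII; every non-ASCII byte becomes a lone surrogate (U+DC80..U+DCFF)
# instead of raising UnicodeDecodeError.
text = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", errors="surrogateescape")
authors = change_authors(text)

# Encoding with the same settings restores the original bytes exactly.
round_tripped = [a.encode("ascii", errors="surrogateescape") for a in authors]
```

With a real file, open(path, encoding="ascii", errors="surrogateescape") would take the place of the TextIOWrapper. One caveat: the escaped characters are lone surrogates, so printing the names to a strictly encoded stream (for example a UTF-8 terminal) raises UnicodeEncodeError. The output has to go through the same error handler, or be written as the re-encoded bytes.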