On 2012-02-11, at 20:35 , Stefan Behnel wrote:
Yes, but now instead of just ignoring that stuff you have to actively and knowingly lie to Python to get it to shut up.
The advantage is that it becomes explicit what you are doing. In Python 2, without any encoding, you are implicitly assuming that the encoding is Latin-1, because that's how you are processing it. You're just not spelling it out anywhere, thus leaving it to the innocent reader to guess what's happening. In Python 3, and in better Python 2 code (using codecs.open(), for example), you'd make it clear right in the open() call that Latin-1 is the way you are going to process the data.
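To make that concrete, here is a minimal sketch (with a made-up temp file standing in for the real data) of both spellings of "this is Latin-1": the Python 2 style via codecs.open() and the Python 3 style where the encoding goes right in the open() call.

```python
import codecs
import os
import tempfile

# Hypothetical data file: some Latin-1 encoded text.
data = "café".encode("latin-1")
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(data)

# Python 2 style: codecs.open() spells out the encoding.
with codecs.open(path, encoding="latin-1") as f:
    text_via_codecs = f.read()

# Python 3 style: the encoding is declared right in the open() call.
with open(path, encoding="latin-1") as f:
    text_via_open = f.read()

assert text_via_codecs == text_via_open == "café"
os.unlink(path)
```

Either way, the reader no longer has to guess how the bytes are being interpreted.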
I'm not sure going from "ignoring it" to "explicitly lying about it" is a great step forward. latin-1 is not "the way you are going to process the data" in this case, it's just the easiest way to get Python to shut up and open the damn thing.
Besides, it's perfectly possible to process bytes in Python 3. You just have to open the file in binary mode and do the processing at the byte string level.
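For example, something like this (a toy sketch, counting comment lines in a hypothetical file whose encoding is unknown) never decodes anything and so never raises a UnicodeDecodeError:

```python
import os
import tempfile

# Hypothetical file containing arbitrary, possibly non-text bytes.
raw = b"# header\n\xff\xfe arbitrary bytes\n# another comment\n"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)

comments = 0
with open(path, "rb") as f:        # binary mode: no decoding happens
    for line in f:
        if line.startswith(b"#"):  # byte-string literals throughout
            comments += 1

os.unlink(path)
# comments == 2
```

The price is that all the literals and helpers have to stay at the bytes level (b"#", bytes.split(), etc.).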
I think that's the route that should be taken.
Oh, absolutely not. When it's text, it's best to process it as Unicode.
Except it's not processed as text, it's processed as "stuff with ASCII characters in it". It might just as well be cp1252, or UTF-8, or Shift JIS (which is kinda-sorta extended ASCII, but not exactly). Using an ISO-8859 codec will yield unicode data, but that's about the only thing you can say about it, and the actual result will probably be mojibake either way. By processing it as bytes, it's made explicit that this is not known, decoded text (which is what unicode strings imply), but some semi-arbitrary ASCII-compatible encoding, and that that's the extent of the developer's knowledge of, and interest in, the data.