On 11 February 2012 17:00, Masklinn <masklinn@masklinn.net> wrote:
Good example. I believe adding ", encoding='latin-1'" to open() is sufficient.
Why not open the file in binary mode instead? (and replace `'*'` with `b'*'` in the startswith call)
In my view, that's less scalable to more complex cases. If you work in bytes, you're likely to hit things that don't translate easily sooner than if you stick to a string-only world. A simple example: checking for a regex rather than a simple starting character.

The problem I have with encoding="latin-1" is that in many cases I *know* that's a lie. From what's been said in this discussion so far, I think the "better" way to say "I know this file contains mostly ASCII, but there are some other bits I'm not sure about and don't care too much about, as long as they round-trip cleanly" is encoding="ascii", errors="surrogateescape".

But as we've seen here, that's not the idiom that everyone recommends (the "One Obvious Way", if you like). I suspect that if the community did embrace a "one obvious way", it would reduce the "Python 3 makes me need to know Unicode" FUD that's around. But as long as people get 3 different answers when they ask the question, there's going to be uncertainty and doubt (and hence, probably, fear...)

Paul.

PS I'm pretty confident that I have *my* answer now (ascii/surrogateescape). So this thread was of benefit to me, if nothing else, and my thanks for that.
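
For reference, here is a minimal sketch of the ascii/surrogateescape idiom, assuming a file that is mostly ASCII with a few bytes of unknown encoding. The names filter_lines and demo, the sample data, and the default pattern are made up for illustration; only the open() arguments come from the discussion. Each non-ASCII byte is decoded to a lone surrogate on read and turned back into the original byte on write, so the data round-trips while the code works on str (including with a regex).

```python
import os
import re
import tempfile


def filter_lines(src_path, dst_path, pattern=r"\*"):
    """Copy lines whose start does NOT match `pattern` (hypothetical helper).

    Both files use ascii + surrogateescape, so bytes outside ASCII survive
    unchanged. newline="" disables newline translation in both directions.
    """
    regex = re.compile(pattern)
    with open(src_path, encoding="ascii", errors="surrogateescape",
              newline="") as src, \
         open(dst_path, "w", encoding="ascii", errors="surrogateescape",
              newline="") as dst:
        for line in src:
            # regex.match anchors at the start of the line, like startswith
            if not regex.match(line):
                dst.write(line)


def demo():
    """Write sample bytes, filter them, and return the resulting bytes."""
    # Mostly ASCII, plus bytes that are not valid ASCII (and \xff\xfe is
    # not valid UTF-8 either).
    raw = b"* comment \xe9\n" + b"caf\xe9 = 1\n" + b"name = \xff\xfe\n"
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.txt")
        dst = os.path.join(tmp, "out.txt")
        with open(src, "wb") as f:
            f.write(raw)
        filter_lines(src, dst)
        with open(dst, "rb") as f:
            return f.read()


if __name__ == "__main__":
    print(demo())
```

One caveat: the decoded strings contain lone surrogates, so encoding them with a strict error handler (for example, printing them to a UTF-8 terminal) raises UnicodeEncodeError. The idiom suits pass-through processing, not display.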