On Mon, Feb 13, 2012 at 2:50 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Masklinn writes:
> Why not open the file in binary mode instead? (and replace `'*'` by `b'*'` in the startswith call)
This will often work, but it's task-dependent. In particular, I believe not just `.startswith()`, but general regexps work with either bytes or str in Python 3. But other APIs may not, and you're going to need to prefix *all* literals (including those in modules your code imports!) with `b`. So you may import a module that does exactly what you want, and be stymied by a TypeError because the module wants Unicode.
This would not happen with Python 2, and there's the rub.
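
A minimal sketch of the point above, assuming Python 3 and a made-up input line (the `mixing_error` helper is hypothetical, there only to capture the exception). bytes methods and `re` both work on bytes as long as every literal is bytes too; mixing in a str raises TypeError rather than coercing as Python 2 did:

```python
import re

# What reading a line from a file opened with mode "rb" would yield
# (illustrative data, not from the original thread).
line = b"* heading\n"

# bytes methods accept bytes arguments...
assert line.startswith(b"*")

# ...and re works on bytes, provided the pattern is bytes as well.
match = re.match(rb"\*\s+(\w+)", line)
assert match.group(1) == b"heading"


def mixing_error(func):
    """Hypothetical helper: return the name of the TypeError raised, if any."""
    try:
        func()
    except TypeError as exc:
        return type(exc).__name__
    return None


# A str literal against bytes data is an error in Python 3.
str_prefix_error = mixing_error(lambda: line.startswith("*"))
str_pattern_error = mixing_error(lambda: re.match(r"\*", line))
```

Both calls at the end fail, which is the situation a third-party module puts you in when its literals are str and your data is bytes.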
The other trap is APIs like urllib.parse, which explicitly refuse the temptation to guess when it comes to bytes data and decode it as "ascii+strict". If you want it to do something else that's more permissive (e.g. "latin-1" or "ascii+surrogateescape") then you *have* to decode it to Unicode yourself before handing it over.

Really, Python 3 forces programmers to learn enough about Unicode to be able to make the choice between the 4 possible options for processing ASCII-compatible encodings:

1. Process them as binary data. This is often *not* going to be what you want, since many text processing APIs will either only accept Unicode, or only pure ASCII, or require you to supply encoding+errors if you want them to process binary data.

2. Process them as "latin-1". This is the answer that completely bypasses all Unicode integrity checks. If you get fed non-ASCII data, you *will* silently produce gibberish as output.

3. Process them as "ascii+surrogateescape". This is the *right* answer if you plan solely to manipulate the text and then write it back out in the same encoding as was originally received. You will get errors if you try to write a string with escaped characters out to a non-ascii channel or an ascii channel without surrogateescape enabled. To write such strings to non-ascii channels (e.g. sys.stdout), you need to remember to use something like "ascii+replace" to mask out the values with unknown encoding first. You may still get hard to debug UnicodeEncodeError exceptions when handed data in a non-ASCII compatible encoding (like UTF-16 or UTF-32), but your odds of silently corrupting data are fairly low.

4. Get a third party encoding guessing library and use that instead of waving away the problem of ASCII-incompatible encodings.

Cheers,
Nick.

-- 
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
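
A rough sketch of options 2 and 3 above, plus the urllib.parse behaviour, assuming Python 3. The sample bytes and the example.com URL are invented for illustration; the codec names and error handlers are the standard ones:

```python
from urllib.parse import urlsplit

# Hypothetical input: ASCII-compatible bytes with one non-ASCII byte (0xE9).
raw = b"name: caf\xe9\n"

# urllib.parse decodes bytes input as ASCII with strict errors, so a
# non-ASCII byte is rejected rather than guessed at.
try:
    urlsplit(b"http://example.com/caf\xe9")
    urlsplit_failed = False
except UnicodeDecodeError:
    urlsplit_failed = True

# Option 2: "latin-1" never fails, because every byte maps to a code point.
# If the data was really UTF-8, the result is silent mojibake.
mojibake = "caf\u00e9".encode("utf-8").decode("latin-1")

# Option 3: "ascii+surrogateescape" smuggles unknown bytes through as lone
# surrogates, so the text can be edited and written back unchanged.
text = raw.decode("ascii", errors="surrogateescape")
edited = text.replace("name", "NAME")
round_trip = edited.encode("ascii", errors="surrogateescape")

# Writing the escaped string to a channel without surrogateescape fails...
try:
    edited.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# ...so mask the unknown values first, e.g. with "ascii+replace".
printable = edited.encode("ascii", errors="replace").decode("ascii")
```

Here `round_trip` preserves the original 0xE9 byte, while `printable` replaces it with `?` so it is safe to send to something like sys.stdout.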