[Python-Dev] Python3 "complexity"

Thu Jan 9 14:24:53 CET 2014

On 9 January 2014 13:00, Kristján Valur Jónsson <kristjan at ccpgames.com> wrote:
>> You don't say what problems, but I assume encoding/decoding errors. So the
>> files apparently weren't in the system encoding. OK, at that point I'd
>> probably say to heck with it and use latin-1. Assuming I was sure that (a) I'd
>> never hit a non-ascii compatible file (e.g., UTF16) and
>> (b) I didn't have a decent means of knowing the encoding.
> Right.  But even latin-1, or better, cp1252 (on Windows) does not solve it because these have undefined
> code points.  So you need 'surrogateescape' error handling as well.  Something that I didn't know at
> the time, having just come from Python 2 and knowing its Unicode model well.

>>> bin = bytes(range(256))
>>> bin
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\
x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\
x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x
9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb
8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4
\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\
xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> bin.decode('latin-1')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x
1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x
80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9
c\x9d\x9e\x9f\xa0¡¢£\xa4¥\xa6\xa7\xa8\xa9ª«¬\xad\xae\xaf°±²\xb3\xb4µ\xb6·\xb8\xb9º»¼½\xbe¿\xc0\xc1\xc2\xc3ÄÅÆÇ\xc
8É\xca\xcb\xcc\xcd\xce\xcf\xd0Ñ\xd2\xd3\xd4\xd5Ö\xd7\xd8\xd9\xda\xdbÜ\xdd\xdeßàáâ\xe3äåæçèéêëìíîï\xf0ñòóô\xf5ö÷\x
f8ùúûü\xfd\xfeÿ'

No undefined bytes there. If you mean that latin-1 can't encode all of
the Unicode code points, then how did those code points get in there?
Presumably you put them in, and so you're not just playing with the
ASCII text parts. And you *do* need to understand encodings.
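
The same check can be scripted for any codec. What follows is only a
sketch: undefined_bytes is a made-up helper name, not something from
the standard library. It lists the byte values a strict decode
rejects, then shows 'surrogateescape' carrying such bytes through a
round trip when the codec (cp1252 here) does have gaps.

```python
def undefined_bytes(encoding):
    """Return the byte values that a strict decode with `encoding` rejects.

    Hypothetical helper for illustration; intended for single-byte codecs.
    """
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


data = bytes(range(256))

# latin-1 maps each byte to the code point with the same number,
# so decoding and re-encoding gives back the original bytes.
assert data.decode("latin-1").encode("latin-1") == data
print(undefined_bytes("latin-1"))

# Python's cp1252 codec leaves a handful of byte values unmapped.
print([hex(b) for b in undefined_bytes("cp1252")])

# With surrogateescape, undecodable bytes become lone surrogates on
# decode and are turned back into the same bytes on encode.
text = data.decode("cp1252", errors="surrogateescape")
assert text.encode("cp1252", errors="surrogateescape") == data
```

So latin-1 on its own never fails to decode; the error handler only
matters for codecs with unmapped bytes, such as cp1252.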

>> One thing that genuinely is difficult is that because disk files don't have any
>> out-of-band data defining their encoding, it *can* be hard to know what
>> encoding to use in an environment where more than one encoding is
>> common. But this isn't really a Python issue - as I say, I've hit it with GNU
>> tools, and I've had to explain the issue to colleagues using Java on many
>> occasions. The key difference is that with grep, people blame the file,
>> whereas with Python people blame the language :-) (Of course, with Java,
>> people expect this sort of problem so they blame the perverseness of the
>> universe as a whole... ;-))
>
> Which reminds me, can Python3 read text files with BOM automatically yet?

If by "automatically" you mean "reads the BOM and chooses an
appropriate encoding based on it" then I don't know, but I suspect
not. But unless you're worried about 2-byte encodings (see! you need
to understand encodings again!) latin-1 will still work.
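
For what it's worth, the standard library does ship BOM-aware codecs
('utf-8-sig', 'utf-16', 'utf-32'), but you still have to pick one
yourself. A minimal sketch of doing that choice by hand follows; the
names sniff_bom and decode_with_bom are invented for this example, and
the latin-1 fallback is just the assumption discussed above.

```python
import codecs

# Order matters: the UTF-32 LE BOM begins with the UTF-16 LE BOM,
# so the UTF-32 marks have to be tested first.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw, default="latin-1"):
    """Return a codec name based on a leading BOM, or `default` if none."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return encoding
    return default


def decode_with_bom(raw, default="latin-1"):
    """Decode bytes, letting a BOM (if present) select the codec."""
    # The utf-8-sig, utf-16 and utf-32 codecs consume the BOM themselves.
    return raw.decode(sniff_bom(raw, default))


# Usage: read the file in binary mode, then decode.
# with open("some_file.txt", "rb") as f:   # hypothetical file name
#     text = decode_with_bom(f.read())
```

This does nothing for files without a BOM, which is the common case
for 8-bit encodings.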

It sounds to me like what you *really* want is something that
autodetects encodings on Windows in the same sort of way as other
Windows tools like Notepad do. That's a fair thing to want, but no,
Python doesn't provide it (nor did Python 2). I suspect that it would
be possible to write a codec to do this, though. Maybe there's even
one on PyPI.
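
As a very rough illustration of what such guessing could look like
(this is not how Notepad does it, and guess_decode is a hypothetical
name), one could try a strict UTF-8 decode first and fall back to an
8-bit encoding that accepts every byte:

```python
def guess_decode(raw, fallback="latin-1"):
    """Return (text, encoding_used) for a bytes object.

    Heuristic sketch: text that is not UTF-8 rarely passes a strict
    UTF-8 decode by accident, so a successful decode is taken as
    UTF-8. Otherwise fall back to `fallback`, which should be a codec
    that can decode any byte (latin-1 can).
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback
```

A real detector would need to handle 2-byte encodings and choose
between the various 8-bit code pages, which this does not attempt.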

Paul