[Python-Dev] Python3 "complexity"

Thu Jan 9 15:13:33 CET 2014

On Thu, Jan 09, 2014 at 01:00:59PM +0000, Kristján Valur Jónsson wrote:

> Which reminds me, can Python3 read text files with BOM automatically yet?

I'm not sure what you mean by that. If you mean, can Python3 distinguish 
between UTF-16BE and UTF-16LE on the basis of a BOM, then it's been able 
to do that for a long time:

steve@orac:~$ hexdump sample-utf-16.txt
0000000 feff 0048 0065 006c 006c 006f 0020 0057
0000010 006f 0072 006c 0064 0021 000a 00a2 00a3
0000020 00a7 2022 00b6 00df 03c0 2248 2206 000a
0000030
steve@orac:~$ python3.1 -c "print(open('sample-utf-16.txt', encoding='utf-16').read())"
Hello World!
¢£§•¶ßπ≈∆
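
The same behaviour can be sketched without a file on disk. This is only
an illustration: the byte strings are built by hand here rather than
read from sample-utf-16.txt, and the variable names are made up for the
example. The point is that the plain 'utf-16' codec reads the BOM,
picks the byte order from it, and strips it from the result.

```python
import codecs

text = "Hello World!\n"

# Build the same text in both byte orders, each prefixed with its BOM.
little_endian = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The generic 'utf-16' codec uses the BOM to choose the byte order
# and does not include the BOM in the decoded string.
decoded_le = little_endian.decode("utf-16")
decoded_be = big_endian.decode("utf-16")

print(decoded_le == decoded_be == text)
```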

If you mean, "Will Python assume that the presence of bytes FEFF or FFFE
at the start of a file means that it is encoded in UTF-16?", then as 
far as I know, the answer is "No":

[steve@ando ~]$ python3.3 -c "print(open('sample-utf-16.txt').read())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.3/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
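
A rough way to reproduce that failure in isolation, assuming (as in the
traceback above) that the default encoding is UTF-8. The sketch decodes
hand-built bytes explicitly as 'utf-8' instead of calling open(), so it
does not depend on the locale; the names are made up for the example.

```python
import codecs

# UTF-16-LE data with a BOM, so the first byte is 0xff.
data = codecs.BOM_UTF16_LE + "Hello World!\n".encode("utf-16-le")

error = None
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xff can never begin a UTF-8 sequence, so decoding stops at byte 0.
    error = exc
    print(exc)
```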

I wouldn't want it to guess the encoding by default. See the Zen of 
Python on ambiguity: "In the face of ambiguity, refuse the temptation 
to guess."
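
If you do want BOM detection, it can be made an explicit opt-in step
rather than a default. What follows is a minimal sketch, not anything
in the standard library: sniff_bom and read_text_with_bom are
hypothetical helper names, and the fallback to 'utf-8' is an
assumption. Only the codecs.BOM_* constants and the codec names
('utf-8-sig', 'utf-16', 'utf-32') are real.

```python
import codecs

# Order matters: the UTF-32-LE BOM starts with the UTF-16-LE BOM bytes,
# so the longer UTF-32 marks must be checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw, default="utf-8"):
    """Return a codec name based on a leading BOM, else the default."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default


def read_text_with_bom(path, default="utf-8"):
    """Read a text file, choosing the codec from its BOM if it has one."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is four bytes
    with open(path, encoding=sniff_bom(head, default)) as f:
        return f.read()
```

Even this is still a guess: a UTF-16-LE file whose first character
after the BOM is U+0000 begins with the same four bytes as a UTF-32-LE
BOM, which is the kind of ambiguity that argues against doing it
silently.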

-- 
Steven