[Python-Dev] Improve open() to support reading file starting with a Unicode BOM

Fri Jan 8 22:14:59 CET 2010

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin v. Löwis wrote:

>>> It *is* crazy, but unfortunately rather common.  Wikipedia has a good
>>> description of the issues:
>>> <http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark>.  Basically, some
>>> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as
>>> being UTF-8, so it's become a convention to do that.  That's not good
>>> enough, so you need to guess the encoding as well to make sure, but if there
>>> is a BOM and you can otherwise verify that the file is probably UTF-8
>>> encoded, you should discard it.
>> That doesn't make sense. If the file isn't UTF-8 you can't see the
>> BOM, because the BOM itself is UTF-8-encoded.
> 
> I think what Glyph meant is this: if a file starts with the UTF-8
> signature, assume it's UTF-8. Then validate the assumption against the
> rest of the file also, and then process it as UTF-8. If the rest clearly
> is not UTF-8, assume that the UTF-8 signature is bogus.

If the programmer opens the file using a "guess using the BOM" encoding,
Python should *not* attempt to verify that the file is properly
encoded:  it should check for (and consume) any BOM, and then return a
stream which uses the encoding inferred from the BOM.  Any errors should
be handled later, when characters are read, exactly as if the file had
been opened with the same encoding guessed from the BOM.
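
A rough sketch of the behaviour I mean, assuming a hypothetical helper
named open_guess_bom() (not an existing or proposed API) and a fallback
of UTF-8 when no BOM is found (also my assumption):

```python
import codecs
import io

# The 4-byte UTF-32 marks are tested first because the UTF-32-LE mark
# begins with the same two bytes as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def open_guess_bom(path, default="utf-8"):
    """Hypothetical helper: infer the encoding from a leading BOM.

    Only the first few bytes are inspected.  The rest of the file is
    not validated here; decoding errors surface on read, as usual.
    """
    raw = open(path, "rb")
    head = raw.read(4)
    encoding, skip = default, 0
    for bom, name in _BOMS:
        if head.startswith(bom):
            encoding, skip = name, len(bom)
            break
    raw.seek(skip)  # consume the BOM, or rewind if there was none
    return io.TextIOWrapper(raw, encoding=encoding)
```

With this, a file holding the UTF-8 signature followed by a stray 0xFF
byte opens without complaint, and the UnicodeDecodeError only appears
once read() reaches the bad byte.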

> I understood this proposal as a general processing guideline, not
> something the io library should do (but, say, a text editor).
> 
> FWIW, I'm personally in favor of using the UTF-8 signature. If people
> consider them crazy talk, that may be because UTF-8 can't possibly have
> a byte order - hence I call it a signature, not the BOM. As a signature,
> I don't consider it crazy at all. There is a long tradition of having
> magic bytes in files (executable files, PostScript, PDF, ... - see
> /etc/magic). Having a magic byte sequence for plain text to denote the
> encoding is useful and helps reduce mojibake. This is the reason it's
> used on Windows: Notepad would normally assume that text is in the ANSI
> code page, and for compatibility, it can't stop doing that. So the UTF-8
> signature gives them an exit strategy.

Agreed.  Having that marker at the start of the file makes interop with
other tools *much* easier.
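
For what it's worth, the existing "utf-8-sig" codec already treats the
marker this way for the UTF-8 case.  A small illustration (the sample
string is just made up):

```python
import codecs

text = "héllo"

# Encoding with utf-8-sig prepends the three-byte signature.
data = text.encode("utf-8-sig")
assert data.startswith(codecs.BOM_UTF8)

# Decoding with utf-8-sig consumes the signature if it is present...
assert data.decode("utf-8-sig") == text

# ...and does not mind if it is absent.
assert text.encode("utf-8").decode("utf-8-sig") == text

# Plain utf-8 leaves the signature in the text as U+FEFF.
assert data.decode("utf-8") == "\ufeff" + text
```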

Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAktHoFMACgkQ+gerLs4ltQ73dACffwUfyh6Q9vUnKYf367QFjNcU
RRMAoNuKCWEx7j+MSdTv+UjhAPynBc14
=uAX6
-----END PGP SIGNATURE-----