[Python-Dev] Improve open() to support reading file starting with an unicode BOM
"Martin v. Löwis"
martin at v.loewis.de
Fri Jan 8 10:05:17 CET 2010
>> It *is* crazy, but unfortunately rather common. Wikipedia has a good
>> description of the issues:
>> <http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark>. Basically, some
>> Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as
>> being UTF-8, so it's become a convention to do that. That's not good
>> enough, so you need to guess the encoding as well to make sure, but if there
>> is a BOM and you can otherwise verify that the file is probably UTF-8
>> encoded, you should discard it.
> That doesn't make sense. If the file isn't UTF-8 you can't see the
> BOM, because the BOM itself is UTF-8-encoded.
I think what Glyph meant is this: if a file starts with the UTF-8
signature, assume it's UTF-8. Then validate the assumption against the
rest of the file also, and then process it as UTF-8. If the rest clearly
is not UTF-8, assume that the UTF-8 signature is bogus.
I understood this proposal as a general processing guideline for
applications (say, a text editor), not something the io library should do.
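
As an illustration only, the guideline could be sketched in application
code roughly as follows. The function name and the "latin-1" fallback are
assumptions made for this example (a real application would pick its own
legacy encoding); only codecs.BOM_UTF8 and the standard decode machinery
come from Python itself.

```python
import codecs


def decode_with_signature_check(data, fallback_encoding="latin-1"):
    """Return (text, encoding_used) for a bytes object.

    Hypothetical helper: the name and the fallback encoding are
    assumptions for illustration, not part of any library.
    """
    if data.startswith(codecs.BOM_UTF8):
        try:
            # Signature present: assume UTF-8 and validate the rest strictly.
            text = data[len(codecs.BOM_UTF8):].decode("utf-8")
            return text, "utf-8"
        except UnicodeDecodeError:
            # The rest is clearly not UTF-8, so the signature is bogus.
            pass
    # No signature, or a bogus one: decode everything with the fallback.
    return data.decode(fallback_encoding), fallback_encoding
```

Note that in the bogus-signature case the three signature bytes are kept
and decoded as ordinary characters of the fallback encoding, since they
are then presumably part of the text.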
FWIW, I'm personally in favor of using the UTF-8 signature. If people
consider it crazy talk, that may be because UTF-8 can't possibly have
a byte order - hence I call it a signature, not the BOM. As a signature,
I don't consider it crazy at all. There is a long tradition of having
magic bytes in files (executable files, PostScript, PDF, ... - see
/etc/magic). Having a magic byte sequence for plain text to denote the
encoding is useful and helps reduce mojibake. This is the reason it's
used on Windows: Notepad would normally assume that text is in the ANSI
code page, and for compatibility, it can't stop doing that. So the UTF-8
signature gives them an exit strategy.
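
For reference, Python already exposes the signature as codecs.BOM_UTF8
and ships a "utf-8-sig" codec that writes it on encoding and skips it, if
present, on decoding. A small sketch of that behaviour (the sample string
is arbitrary):

```python
import codecs

text = "na\u00efve"

# Encoding with "utf-8-sig" prepends the three signature bytes.
signed = text.encode("utf-8-sig")
starts_with_signature = signed.startswith(codecs.BOM_UTF8)

# Decoding with "utf-8-sig" drops the signature if it is there ...
round_trip = signed.decode("utf-8-sig")
# ... and also accepts UTF-8 data that has no signature at all.
unsigned_ok = text.encode("utf-8").decode("utf-8-sig")

# The plain "utf-8" codec keeps the signature as a leading U+FEFF.
plain_decode = signed.decode("utf-8")
```

The last line shows why the signature is visible to applications that
open such a file as plain UTF-8: the text then starts with U+FEFF.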