[Python-Dev] Improve open() to support reading file starting with a Unicode BOM

Guido van Rossum guido at python.org
Fri Jan 8 16:52:48 CET 2010


On Thu, Jan 7, 2010 at 11:55 PM, Glyph Lefkowitz
<glyph at twistedmatrix.com> wrote:
> I'm saying that the BOM itself isn't enough to detect that the file is actually UTF-8.

And I'm saying that it is, with as much certainty as we can ever guess
the encoding of a file.

> If (for whatever reason: explicitly specified, guessed in some other
> way) the file's encoding is determined to be something else, the bytes
> comprising the BOM should be decoded as normal.  It's just that the
> UTF-8 decoding of the BOM at the start of a file should be "".

Sure, a Latin-1-encoded file could start with the same byte pattern as
a UTF-8-encoded BOM. But by that logic, a UTF-16-encoded file is also
valid Latin-1.
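
To make the ambiguity concrete, here is a small sketch (not part of the
original message) of how the same three bytes decode under different
codecs. It assumes only the standard library's codecs module; the
variable names are illustrative.

```python
import codecs

# The UTF-8-encoded BOM is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# Latin-1 maps every byte to a character, so the BOM bytes decode
# without error into three visible characters.
as_latin1 = bom.decode("latin-1")

# The "utf-8-sig" codec treats a leading BOM as a signature and drops it.
as_utf8_sig = bom.decode("utf-8-sig")

# The plain "utf-8" codec keeps the BOM as the character U+FEFF.
as_utf8 = bom.decode("utf-8")

# Any byte string is valid Latin-1, including UTF-16 output.
utf16_bytes = "hi".encode("utf-16")
utf16_as_latin1 = utf16_bytes.decode("latin-1")
```

Because Latin-1 never fails to decode, a successful Latin-1 decode says
nothing about whether the file was really Latin-1.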

The question was in the context of encoding-guessing; if we're
guessing, a UTF-8-encoded BOM cannot signify anything else but UTF-8.
(Ditto for UTF-16 and UTF-32 BOMs.)
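
As a rough illustration of BOM-based guessing (a sketch only, not a
proposed implementation for open()), something like the following could
work. The function name guess_encoding_from_bom and the table _BOMS are
hypothetical; the BOM constants and codec names come from the standard
codecs module.

```python
import codecs

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones: the
# UTF-32-LE BOM (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
# A UTF-16-LE file whose first character is U+0000 would therefore be
# misread as UTF-32-LE; this sketch accepts that rare ambiguity.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(data, default=None):
    """Return a codec name if data starts with a known BOM, else default.

    The returned codecs ("utf-8-sig", "utf-16", "utf-32") consume the
    BOM while decoding, so it does not appear in the decoded text.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default


utf8_data = codecs.BOM_UTF8 + "héllo".encode("utf-8")
utf16_data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
utf32_data = codecs.BOM_UTF32_BE + "hi".encode("utf-32-be")
```

When no BOM is present the function returns the default, leaving the
caller to fall back on whatever encoding it would have used anyway.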

-- 
--Guido van Rossum (python.org/~guido)


