BOM should be ignored by Python

Mark Hammond mhammond at skippinet.com.au
Mon May 1 20:21:44 EDT 2000


"Neil Hodgson" <neilh at scintilla.org> wrote in message
news:YBoP4.9095$v85.58388 at news-server.bigpond.net.au...

Hi Neil...

>    Unicode files may contain an initial Byte Order Mark to describe the
> way that the file is encoded. In UTF-8 this is the byte sequence
> EF BB BF. One current editor, the Win2K version of Notepad adds this
> BOM to the front of files saved as UTF-8. I would like to see the
> Python interpreter accept but ignore this at the start of a file. The
> current behaviour is to throw a SyntaxError.

I believe this was discussed on python-dev, and the decision was that
Python itself should not handle BOMs at all - simply leave them to the
app.  It would be a little painful to change the Python file read
semantics to handle this only when reading the first 2 bytes (3 for the
UTF-8 sequence you mention) of a disk-based file.  Further, Python would
need to remember the BOM it read for each stream, so that it can be
applied to later, potentially disjointed reads of the file.

So it was decided that this is purely an application issue.  The app
should open the file, read the first few bytes, and take whatever action
it needs.
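
As a rough sketch of what that application-side check could look like
(not anything decided on python-dev): the helper names sniff_bom and
read_text are hypothetical, the BOM constants come from the standard
codecs module, and treating "no BOM" as ASCII is an assumption taken
from the convention described in this thread.  UTF-32 BOMs are not
handled.

```python
import codecs

# Hypothetical lookup table: leading BOM bytes -> codec name.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
)


def sniff_bom(data, default="ascii"):
    """Return (encoding, bom_length) based on the leading bytes of data."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    # No BOM found: fall back to the assumed default encoding.
    return default, 0


def read_text(path):
    """Read a file, skip any recognised BOM, and decode the remainder."""
    with open(path, "rb") as f:
        raw = f.read()
    encoding, skip = sniff_bom(raw)
    return raw[skip:].decode(encoding)
```

Because the app reads the whole file in one go here, it sidesteps the
problem of remembering the BOM across separate reads of the same stream.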

FWIW, some other MS documentation says this is the "official" way to
determine if a text file is Unicode or ASCII.  So it really would be a big
ask to expect Python to have 3 modes for reading a file, all based on the
first 2 bytes - no BOM == ASCII, and the 2 BOM values...

>    In the future, the BOM could also be used to change the behaviour of
> the interpreter.

How would this work?  I could see that it could change the parser (and I
guess the compiler), but how the interpreter?  Read the BOM from stdin?

Mark.




