BOM should be ignored by Python
mhammond at skippinet.com.au
Mon May 1 20:21:44 EDT 2000
"Neil Hodgson" <neilh at scintilla.org> wrote in message
news:YBoP4.9095$v85.58388 at news-server.bigpond.net.au...
> Unicode files may contain an initial Byte Order Mark to describe the way
> that the file is encoded. In UTF-8 this is the byte sequence EF BB BF. My
> current editor, the Win2K version of Notepad, adds this BOM to the front of
> files saved as UTF-8. I would like to see the Python interpreter accept and
> ignore this at the start of a file. The current behaviour is to throw a
I believe this was discussed on python-dev, and it was decided that Python
itself should not handle BOMs at all - simply leave them to the app. It
would be a little painful to change the Python file read semantics to
handle this only when reading the first 2 bytes of a disk-based file.
Further, Python would need to remember the BOM read for a particular
stream, so it can be applied to later, potentially disjointed reads of the
stream.

So it was decided that this is purely an application issue. The app
should open the file, read the first 2 bytes, and take whatever action it
needs.
FWIW, some other MS documentation says this is the "official" way to
determine if a text file is unicode or ascii. So it really would be a big
ask to expect Python to be able to have 3 modes for reading a file, all
based on the first 2 bytes - no BOM == ascii, and the 2 BOM values...
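
As a rough sketch of what "leave it to the app" could look like: the
snippet below peeks at the leading bytes and picks one of the modes
described above. It assumes a modern Python 3 and the BOM constants in the
standard codecs module; sniff_bom and read_text are hypothetical names
made up for this illustration. Note that the UTF-8 BOM from the quoted
message is 3 bytes long, so the check is a prefix match rather than a
fixed 2-byte read.

```python
import codecs

# Known BOMs and the codec each one implies. UTF-8 is included because
# the quoted message mentions Notepad writing EF BB BF.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data):
    """Return (encoding, bom_length) for the leading bytes of a file."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    # No BOM: treat the data as ASCII, as the message suggests.
    return "ascii", 0


def read_text(path):
    """Read a whole file, dropping any BOM and decoding accordingly."""
    with open(path, "rb") as f:
        raw = f.read()
    encoding, skip = sniff_bom(raw)
    return raw[skip:].decode(encoding)
```

Because the app reads the file in one go here, it never has to carry the
BOM state across later reads of the same stream, which is the part that
would be awkward to build into the file object itself.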
> In the future, the BOM could also be used to change the behaviour of
How would this work? I could see that it could change the parser (and I
guess the compiler), but how the interpreter? Read the BOM from stdin?