BOM should be ignored by Python

machin_sj at my-deja.com machin_sj at my-deja.com
Tue May 2 08:33:17 EDT 2000


In article <scpP4.9110$v85.58552 at news-server.bigpond.net.au>,
  "Mark Hammond" <mhammond at skippinet.com.au> wrote:
> "Neil Hodgson" <neilh at scintilla.org> wrote in message
> news:YBoP4.9095$v85.58388 at news-server.bigpond.net.au...
>
> Hi Neil...
>
> >    Unicode files may contain an initial Byte Order Mark to describe
> > the way that the file is encoded. In UTF-8 this is the byte sequence
> > EF BB BF. One current editor, the Win2K version of Notepad adds this
> > BOM to the front of files saved as UTF-8. I would like to see the
> > Python interpreter accept but ignore this at the start of a file.
> > The current behaviour is to throw a SyntaxError.
>
> I believe this was discussed on python-dev, and decided that Python
> itself should not handle BOM markers at all - simply leave them to the
> app. It would be a little painful to change the Python file read
> semantics to handle this only when reading the first 2 bytes of a
> disk-based file. Further, Python would need to maintain the BOM read
> for a particular stream, so it can be applied to later, potentially
> disjointed reads of the file.
>
> So it was decided that this is purely an application issue.  The app
> should open the file, read the first 2 bytes, and take whatever action
> it needs.
>
> FWIW, some other MS documentation says this is the "official" way to
> determine if a text file is unicode or ascii.  So it really would be a
> big ask to expect Python to be able to have 3 modes for reading a
> file, all based on the first 2 bytes - no BOM == ascii, and the 2 BOM
> values...
>
>
This is all rather confusing to a Unicode neophyte. In my innocence I
thought that (1) a BOM was a Byte Order Mark that could be used in
UCS-2 data (say, at the start of a disk file) to let the reader know
whether the data was big-endian or little-endian, and (2) a BOM is
otherwise a meaningless, ignorable character (a zero-width non-breaking
space, by definition). Further, I thought that (3) it was pointless
having a BOM in UTF-8, which is an encoding of 8-bit units where
endianness is not an issue; (4) when converting from UCS-2 to UTF-8,
the BOM could be included or omitted at the whim of the writer; and (5)
a reader of UTF-8 data should be prepared to regard a BOM as legal, not
a "syntax error". I also thought (6) that, by careful design of UTF-8,
ASCII data when "converted" to UTF-8 is unchanged, so I don't see the
point (for an application that is going to use Unicode internally) in
knowing or caring whether an input file is in ASCII or UTF-8. If there
are other possibilities (like ISO 8859-1 Latin-1, for example), then
the application can't rely on the first few bytes to tell it anything.
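
To make points (1), (3), (5) and (6) concrete, here is a minimal sketch
in present-day Python 3 (an assumption; it is not the Python of this
thread). The helper names sniff_bom and decode_ignoring_bom are
hypothetical and exist only for illustration; the BOM constants come
from the standard codecs module. It shows how U+FEFF serialises in each
encoding, that ASCII text is byte-identical in UTF-8, and one way an
application might accept but ignore a leading BOM.

```python
import codecs

# U+FEFF is the byte order mark; its bytes depend on the encoding used.
BOM = "\ufeff"

def sniff_bom(data: bytes):
    """Guess (encoding, bom_length) from a leading BOM, else (None, 0).

    Hypothetical helper: only UTF-8 and UTF-16 marks are checked, so
    UTF-32 input is deliberately out of scope for this sketch.
    """
    for name, mark in (
        ("utf-8", codecs.BOM_UTF8),          # EF BB BF
        ("utf-16-be", codecs.BOM_UTF16_BE),  # FE FF
        ("utf-16-le", codecs.BOM_UTF16_LE),  # FF FE
    ):
        if data.startswith(mark):
            return name, len(mark)
    return None, 0

def decode_ignoring_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode data, dropping a leading BOM instead of rejecting it."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or default)

# Point (6): ASCII text is unchanged when encoded as UTF-8.
source = "print 'hello'\n"
same_bytes = source.encode("ascii") == source.encode("utf-8")

# Point (5): a reader can treat the UTF-8 BOM as legal and skip it.
text = decode_ignoring_bom(codecs.BOM_UTF8 + source.encode("utf-8"))
```

For the UTF-8 case alone, the standard "utf-8-sig" codec gives the same
result: it drops a leading BOM on decoding if one is present.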

So, where have I gone wrong? Please enlighten me!

Regards,
John Machin




