[I18n-sig] UTF-8 and BOM

M.-A. Lemburg mal@lemburg.com
Wed, 16 May 2001 21:59:50 +0200

Paul Prescod wrote:
> "M.-A. Lemburg" wrote:
> >
> >...
> >
> > BOMs are standard Unicode char points, so they are legal in all
> > Unicode encodings.
> My point is that it is legal to interpret it as a BOM and not just a
> character.

That's correct (and it is also the reasoning behind adding BOMs
to files or streams and being allowed to remove them at
will).
> >...
> > Uhm, I can't follow you here... BOMs in UTF-8 look like this:
> >
> > >>> u'\ufeff'.encode('utf-8')
> > '\xef\xbb\xbf'
> >
> > which is somewhat different from '\xff\xfe' or '\xfe\xff'.
> That's what's great about it!

Ok, now I get it: you want to use '\xef\xbb\xbf' as a file encoding
identifier. Sounds like a good idea!
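
A minimal sketch of that idea, written for present-day Python 3
rather than the interpreter used in this thread. The helper name
sniff_encoding is hypothetical; the codecs.BOM_* constants are the
standard library's. It only looks at UTF-8 and UTF-16 signatures and
returns None when no signature is found:

```python
import codecs

# Signatures to look for, paired with the encoding each one suggests.
_SIGNATURES = (
    (codecs.BOM_UTF8, "utf-8"),         # b'\xef\xbb\xbf'
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # b'\xff\xfe'
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # b'\xfe\xff'
)

def sniff_encoding(data: bytes):
    """Guess an encoding from a leading BOM, or return None (hypothetical helper)."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

Since the three byte sequences do not share a prefix, the order of
the checks does not matter here.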
> >...
> > >>> u'\ufeff'.encode('utf-16')
> > '\xff\xfe\xff\xfe'
> It is curious that decoding this removes both FEFF characters. Is it
> right that the decoder removes all BOM sequences?
> >>> codecs.utf_16_decode(  codecs.BOM*10 + "a".encode("UTF-16") + codecs.BOM*10)
> (u'a', 44)

Yes. The codec is smart enough to even handle input streams
with mixed byte orders (it switches dynamically based on what
it finds in the stream).

Note that BYTE ORDER MARK is only a comment for code point
'\ufeff'. The real name is ZERO WIDTH NO-BREAK SPACE. Adding
or removing these will not cause any visible effect in the
text or change the formatting. That's why you can add or
remove them at will.
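
The character name can be checked with the unicodedata module. This
is a small sketch in present-day Python 3 syntax; the variable names
are only for illustration:

```python
import unicodedata

BOM_CHAR = "\ufeff"

# Formal character name recorded in the Unicode character database.
bom_name = unicodedata.name(BOM_CHAR)

# General category; "Cf" marks an invisible format character.
bom_category = unicodedata.category(BOM_CHAR)

print(bom_name)      # ZERO WIDTH NO-BREAK SPACE
print(bom_category)  # Cf
```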

So what do you want to see in 2.2? ... Have the UTF-8 codec remove
all BOMs from its input, add BOMs in some places, or add a
utf-8-bom codec which prepends a BOM to the start of all
encoded strings?
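
To make the options concrete, here is a rough sketch in present-day
Python 3 syntax. All three function names are hypothetical and none
of this is an existing codec from this thread. The first pair shows
the "utf-8-bom" idea (prepend one BOM on encoding, drop one leading
BOM on decoding). The last function shows the "remove all BOMs from
the input" variant:

```python
import codecs

def encode_utf8_bom(text: str) -> bytes:
    """Encode to UTF-8 and prepend a single BOM (hypothetical helper)."""
    return codecs.BOM_UTF8 + text.encode("utf-8")

def decode_utf8_bom(data: bytes) -> str:
    """Decode UTF-8, dropping one leading BOM if present (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

def decode_utf8_strip_all_boms(data: bytes) -> str:
    """Decode UTF-8 and remove every U+FEFF, wherever it occurs (hypothetical helper)."""
    return data.decode("utf-8").replace("\ufeff", "")
```

The two decoders differ only for input with U+FEFF somewhere other
than the very start: the first keeps those characters, the second
drops them.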

Marc-Andre Lemburg
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/