[I18n-sig] UTF-8 and BOM

M.-A. Lemburg mal@lemburg.com
Thu, 17 May 2001 00:20:49 +0200

Paul Prescod wrote:
> "M.-A. Lemburg" wrote:
> >
> >...
> >
> > Note that BYTE ORDER MARK is only a comment for char point
> > '\ufeff'. The real name is: ZERO WIDTH NO-BREAK SPACE. Adding
> > or removing these will not cause any visible effect in the
> > text or change the formatting. That's why you can add or
> > remove them at your own will.
> I'm not sure I buy that, but one could argue that a Zero width no-break
> space character is a legitimate character whether you can see it on a
> computer screen or not...but I don't care enough to make that argument.

Text data is different from binary data. Unicode text
which uses combining characters (e.g. 'e' plus a combining accent
to produce 'é') is equivalent to text which uses the precomposed
code point directly. This corner of Unicode is not well covered yet
in Python's Unicode implementation. The two major missing
items are normalization and collation support.
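
To illustrate the equivalence, here is a minimal sketch. It assumes a
later Python (3.x) in which unicodedata.normalize() is available, and it
assumes the accent in question is the acute one; the variable names are
only for illustration.

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# Compared code point by code point, the two strings differ.
raw_equal = decomposed == precomposed

# After normalizing both to the same form they compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
```

Without a normalization step a plain string comparison treats the two
spellings as different text, which is the gap described above.
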
> > So what do you want to see in 2.2 ? ... Have the UTF-8 codec remove
> > all BOM marks from its input, or add BOM marks in some places
> > or add a codec utf-8-bom which prepends BOM to the start of
> > all encoded strings ?
> I'd like the UTF-8 codec to treat BOMs (especially leading BOMs) as the
> UTF-16 one does. Probably BOM_UTF8 should be added to codecs.py. I'm not
> sure whether we need another codec. Probably not...

You have to be careful here: the UTF-16 codec prepends a BOM to
every string pushed through it -- even small snippets.
You certainly don't want to make that the default for the
much more common UTF-8, which has no real requirement to include
BOMs at all. Having the decoder automatically remove
BOMs is easy to implement and won't cause any harm,
but carelessly adding them will get us into trouble.
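
A rough sketch of both points, assuming a later Python (3.x) where
codecs.BOM_UTF8 exists. decode_utf8_strip_bom is a hypothetical helper,
not an existing codec function, and it only drops a single leading BOM,
which is one possible reading of "remove BOM marks".

```python
import codecs


def decode_utf8_strip_bom(data: bytes) -> str:
    """Hypothetical helper: decode UTF-8, dropping one leading BOM if present."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


# Removing a leading BOM on decode is harmless: both inputs give 'abc'.
with_bom = decode_utf8_strip_bom(codecs.BOM_UTF8 + b"abc")
without_bom = decode_utf8_strip_bom(b"abc")

# The generic UTF-16 codec writes a BOM in front of every encoded string.
utf16_a = "a".encode("utf-16")
utf16_b = "b".encode("utf-16")
starts_with_bom = utf16_a[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Concatenating two encoded snippets leaves a BOM in the middle, which
# decodes as a stray U+FEFF character inside the text.
joined = (utf16_a + utf16_b).decode("utf-16")
```

The stray U+FEFF in joined is the kind of trouble meant above: once
every small snippet carries its own BOM, they end up scattered through
any data assembled from those snippets.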

Marc-Andre Lemburg
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/