[I18n-sig] UTF-8 and BOM

Paul Prescod paulp@ActiveState.com
Wed, 16 May 2001 14:57:06 -0700

"M.-A. Lemburg" wrote:
> Note that BYTE ORDER MARK is only a comment for char point
> '\ufeff'. The real name is: ZERO WIDTH NO-BREAK SPACE. Adding
> or removing these will not cause any visible effect in the
> text or change the formatting. That's why you can add or
> remove them at your own will.

I'm not sure I buy that. One could argue that a ZERO WIDTH NO-BREAK
SPACE is a legitimate character whether or not you can see it on a
computer screen, but I don't care enough to make that argument.

> So what do you want to see in 2.2 ? ... Have the UTF-8 codec remove
> all BOM marks from its input, or add BOM marks in some places
> or add a codec utf-8-bom which prepends BOM to the start of
> all encoded strings ?

I'd like the UTF-8 codec to treat BOMs (especially leading BOMs) as the
UTF-16 one does. Probably BOM_UTF8 should be added to codecs.py. I'm not
sure whether we need another codec. Probably not...
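
To make the request concrete, here is a rough sketch of the asymmetry I
mean. It is written against a later Python 3, and it assumes that
codecs.BOM_UTF8 and a BOM-aware "utf-8-sig" codec are available, as they
are in later releases; neither is something the 2.1 library offers, so
read it as an illustration and not as a description of the current
codecs.

```python
import codecs

text = "hello"

# UTF-16: the generic codec writes a BOM when encoding and consumes a
# leading BOM when decoding, so the round trip gives back the bare text.
utf16_bytes = text.encode("utf-16")
has_utf16_bom = utf16_bytes[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
utf16_roundtrip = utf16_bytes.decode("utf-16")

# UTF-8: the plain codec treats a leading BOM as an ordinary character,
# so U+FEFF survives decoding and ends up at the start of the string.
utf8_with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
plain_decoded = utf8_with_bom.decode("utf-8")

# Assumption: a BOM-aware variant ("utf-8-sig" in later Pythons) strips
# one leading BOM on decode and prepends one on encode.
sig_decoded = utf8_with_bom.decode("utf-8-sig")
sig_encoded = text.encode("utf-8-sig")
```

With the plain UTF-8 codec the caller has to strip the leading U+FEFF
by hand; the BOM-aware variant handles a leading BOM much as the UTF-16
codec does.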

Take a recipe. Leave a recipe.  
Python Cookbook!  http://www.ActiveState.com/pythoncookbook