[I18n-sig] UTF-8 and BOM
Paul Prescod
paulp@ActiveState.com
Wed, 16 May 2001 12:26:41 -0700
"M.-A. Lemburg" wrote:
>
>...
>
> BOMs are standard Unicode char points, so they are legal in all
> Unicode encodings.
My point is that it is also legal to interpret U+FEFF as a BOM, and not
just as a character.
>...
> Uhm, I can't follow you here... BOMs in UTF-8 look like this:
>
> >>> u'\ufeff'.encode('utf-8')
> '\xef\xbb\xbf'
>
> which is somewhat different from '\xff\xfe' or '\xfe\xff'.
That's what's great about it!
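
Because the three byte sequences differ, a reader can tell the encodings apart from the first bytes alone. A minimal sketch of that idea follows, assuming a current Python 3 interpreter rather than the 2001 one used in this thread. The helper name sniff_bom is hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

# Hypothetical helper: guess an encoding from a leading BOM.
# Only UTF-8 and UTF-16 are considered; UTF-32 is out of scope here.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # b'\xef\xbb\xbf'
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # b'\xff\xfe'
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # b'\xfe\xff'
)


def sniff_bom(data):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    raw = codecs.BOM_UTF8 + "hello".encode("utf-8")
    encoding, skip = sniff_bom(raw)
    # Decode the payload after the BOM with the detected encoding.
    print(encoding, raw[skip:].decode(encoding))
```

A caller that finds no BOM still has to fall back on some default or on out-of-band information; the sketch only covers the case where the signature is present.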
>...
> >>> u'\ufeff'.encode('utf-16')
> '\xff\xfe\xff\xfe'
It is curious that decoding this removes both FEFF characters. Is it
right that the decoder removes all BOM sequences?
>>> import codecs
>>> codecs.utf_16_decode( codecs.BOM*10 + "a".encode("UTF-16") + codecs.BOM*10)
(u'a', 44)
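
For anyone who wants to repeat the experiment, here is a rough equivalent for a current Python 3 interpreter (an assumption; the session above is from the 2001 interpreter, and later decoders need not behave the same way). It uses the explicit codecs.BOM_UTF16_LE constant and the "utf-16-le" codec so that the bytes do not depend on the host's byte order, and it counts how many U+FEFF characters survive decoding instead of assuming a particular answer. The function name count_surviving_boms is hypothetical.

```python
import codecs


def count_surviving_boms(leading=10, trailing=10):
    """Decode BOM-padded UTF-16 data and report what the decoder kept.

    Returns (text, bytes_consumed, number_of_U+FEFF_characters_in_text).
    """
    bom = codecs.BOM_UTF16_LE            # two bytes: FF FE
    payload = "a".encode("utf-16-le")    # two bytes, no BOM prepended
    # Same layout as the session above: 10 BOMs, one BOM + 'a', 10 BOMs.
    data = bom * leading + bom + payload + bom * trailing
    text, consumed = codecs.utf_16_decode(data)
    return text, consumed, text.count("\ufeff")


if __name__ == "__main__":
    text, consumed, survivors = count_surviving_boms()
    print(repr(text), consumed, survivors)
```

If the decoder treats only the first U+FEFF as a signature, the remaining ones come back as ordinary characters in the result; if it strips every one, the count is zero, as in the output quoted above.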
--
Take a recipe. Leave a recipe.
Python Cookbook! http://www.ActiveState.com/pythoncookbook