[I18n-sig] UTF-8 and BOM

M.-A. Lemburg mal@lemburg.com
Wed, 16 May 2001 23:27:13 +0200


"Martin v. Loewis" wrote:
> 
> > Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading
> > character. The UTF-16 decoder removes it. I recognize that the BOM is
> > not useful as a "byte order mark" for UTF-8 data but I would still
> > suggest that the UTF-8 decoder should remove it for these reasons:
> 
> I think it is good to remove the BOM when decoding UTF-8. Most likely,
> the only reason that this is not done is that nobody thought that
> there might be one.
> 
> I disagree that putting the BOM into a file is a good thing - I think
> it is stupid to do so. First of all, auto-detection can always be
> fooled, so there should be a higher-level protocol for reliable data
> processing. UTF-8 is relatively easy to auto-detect if you believe in
> auto-detection - it's just that looking at the first few bytes is not
> sufficient.
> 
> OTOH, UTF-8 is concatenation-safe: you can reliably concatenate two
> UTF-8 files to get another UTF-8 file. That property is lost if there
> is a BOM in the file.

Why should a BOM behave any differently from any other Unicode
character? BOMs can be added and deleted in pretty much all
places of a Unicode text -- that's their intent after all, so
I don't see how they could break any property of an encoding.
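
To make the concatenation point concrete, here is a minimal sketch.
It assumes a modern Python 3 interpreter (not the 2.1 decoder under
discussion), and the two byte strings merely stand in for files. The
joined data still decodes as UTF-8; the second BOM simply shows up as
an ordinary U+FEFF character in the middle of the text.

```python
import codecs

# Two byte strings standing in for files that each start with a UTF-8 BOM.
first = codecs.BOM_UTF8 + "abc".encode("utf-8")
second = codecs.BOM_UTF8 + "def".encode("utf-8")

# Concatenating them still yields well-formed UTF-8 ...
combined = first + second
text = combined.decode("utf-8")

# ... but both BOMs survive as U+FEFF characters in the decoded text.
bom_count = text.count("\ufeff")

# The 'utf-8-sig' codec drops only a BOM at the very start of the data.
text_sig = combined.decode("utf-8-sig")

# Dropping every U+FEFF recovers the plain text.
plain = text.replace("\ufeff", "")
```

Whether the leftover U+FEFF counts as harmless or as breaking the
property depends on whether the consumer treats it as an ordinary
character.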

Or did you have the same misunderstanding as I did? ...
Paul is talking about the UTF-8 encoding of the BOM ('\xef\xbb\xbf'),
not the FF FE or FE FF byte sequences seen in UTF-16 streams.
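
The byte sequences in question can be checked with a short sketch.
It assumes a modern Python 3 interpreter, whose codecs may behave
differently from the 2.1 decoders discussed in this thread, and the
variable names are only illustrative.

```python
import codecs

bom = "\ufeff"  # the BOM code point, U+FEFF

# The same code point serialized by three encodings.
as_utf8 = bom.encode("utf-8")          # three bytes: EF BB BF
as_utf16_le = bom.encode("utf-16-le")  # two bytes: FF FE
as_utf16_be = bom.encode("utf-16-be")  # two bytes: FE FF

# Decoding data that starts with a BOM:
utf8_data = codecs.BOM_UTF8 + b"hi"
utf16_data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

kept = utf8_data.decode("utf-8")          # plain 'utf-8' keeps U+FEFF
stripped = utf8_data.decode("utf-8-sig")  # 'utf-8-sig' removes a leading BOM
utf16_text = utf16_data.decode("utf-16")  # 'utf-16' consumes the BOM
```

In current Python the plain 'utf-8' codec still returns the leading
U+FEFF, while 'utf-16' uses the BOM to pick the byte order and then
discards it.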

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/