[I18n-sig] UTF-8 and BOM

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 16 May 2001 23:07:49 +0200


> Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading
> character. The UTF-16 decoder removes it. I recognize that the BOM is
> not useful as a "byte order mark" for UTF-8 data but I would still
> suggest that the UTF-8 decoder should remove it for these reasons:

I think it is good to remove the BOM when decoding UTF-8. Most likely,
the only reason that this is not done is that nobody thought that
there might be one.
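
A minimal sketch of what removing the BOM at decode time could look like,
written for present-day Python rather than 2.1. The helper name
decode_utf8_strip_bom is hypothetical and is not part of any codec; only
codecs.BOM_UTF8 and bytes.decode are standard library calls.

```python
import codecs

def decode_utf8_strip_bom(data):
    """Decode UTF-8 bytes, dropping a single leading BOM if one is present.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Skip the three bytes EF BB BF before decoding.
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

with_bom = codecs.BOM_UTF8 + "abc".encode("utf-8")
plain = decode_utf8_strip_bom(with_bom)   # BOM removed
raw = with_bom.decode("utf-8")            # BOM kept as U+FEFF
```

Later Python versions also provide a "utf-8-sig" codec that skips a leading
BOM on decoding.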

I disagree that putting the BOM into a file is a good thing - I think
it is stupid to do so. First of all, auto-detection can always be
fooled, so there should be a higher-level protocol for reliable data
processing. UTF-8 is relatively easy to auto-detect if you believe in
auto-detection - it's just that looking at the first few bytes is not
sufficient.
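
To illustrate both points, here is a rough sketch that assumes "detection"
simply means attempting a strict decode of the whole input. The function
name looks_like_utf8 and the sample byte strings are made up for this
example.

```python
def looks_like_utf8(data):
    """Return True if the whole byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# ASCII at the start, a lone Latin-1 byte (0xE9) at the very end.
sample = b"ASCII start, then caf\xe9"
prefix_ok = looks_like_utf8(sample[:10])   # the first few bytes look fine
whole_ok = looks_like_utf8(sample)         # the full data does not

# Detection can still be fooled: these two Latin-1 characters
# (U+00C3, U+00A9) happen to form a valid UTF-8 sequence for U+00E9.
ambiguous = "\u00c3\u00a9".encode("latin-1")
fooled = looks_like_utf8(ambiguous)
```

Here the prefix check accepts data that is not UTF-8 as a whole, and the
last case is accepted even though the bytes were produced as Latin-1.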

OTOH, UTF-8 is concatenation-safe: you can reliably concatenate two
UTF-8 files to get another UTF-8 file. That property is lost if there
is a BOM in the file.
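
A small sketch of the concatenation problem, using present-day Python and
its "utf-8-sig" codec (which did not exist in 2.1). The two in-memory
"files" are invented for the example.

```python
import codecs

# Two "files" without a BOM: concatenation is just more UTF-8.
a = "first\n".encode("utf-8")
b = "second\n".encode("utf-8")
clean = (a + b).decode("utf-8")

# The same two "files", each written with a leading BOM.
a_bom = codecs.BOM_UTF8 + a
b_bom = codecs.BOM_UTF8 + b

# Even a BOM-aware decoder only strips the first BOM; the second one
# survives as a stray U+FEFF in the middle of the text.
joined = (a_bom + b_bom).decode("utf-8-sig")
```

The concatenated data still decodes, but it no longer contains just the text
of the two files.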

Regards,
Martin