[I18n-sig] UTF-8 and BOM

Paul Prescod paulp@ActiveState.com
Thu, 17 May 2001 09:46:12 -0700


"Martin v. Loewis" wrote:
> 
>...
> 
> There probably is none, although giving them a .txt extension is a
> good starting point. What is the standard for tagging KOI8-R documents
> on the Windows file system?

There isn't one. But UTF-8 is an encoding that is growing in popularity,
while KOI8-R is one that is shrinking. The unreliability of "code pages"
is a big part of what Unicode is supposed to fix.

> > So what if there is a BOM in the middle of the data stream. MAL's
> > decoder will just remove it anyhow. :)
> 
> Yes, and I think this is a bug.

Nevertheless, I don't see how concatenating two BOM-prefixed UTF-8
streams is any more or less problematic than concatenating two
BOM-prefixed UTF-16 streams.
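
To make the comparison concrete, here is a minimal sketch of what happens
when two BOM-prefixed streams are joined and decoded. It assumes a modern
Python 3 and its "utf-8-sig" codec, which is not the decoder discussed in
this thread; bom_prefixed() is a hypothetical helper written only for this
illustration. Under those assumptions both encodings behave the same way:
the leading BOM is consumed and the interior one survives as U+FEFF.

```python
import codecs

def bom_prefixed(text, encoding):
    """Encode text and prepend the BOM for that encoding (illustrative helper)."""
    boms = {"utf-8": codecs.BOM_UTF8, "utf-16-le": codecs.BOM_UTF16_LE}
    return boms[encoding] + text.encode(encoding)

# Concatenate two BOM-prefixed streams in each encoding.
utf8_joined = bom_prefixed("a", "utf-8") + bom_prefixed("b", "utf-8")
utf16_joined = bom_prefixed("a", "utf-16-le") + bom_prefixed("b", "utf-16-le")

# "utf-8-sig" strips only a leading BOM; "utf-16" uses the leading BOM to
# pick the byte order and then drops it. Neither touches the interior BOM.
utf8_text = utf8_joined.decode("utf-8-sig")
utf16_text = utf16_joined.decode("utf-16")

print(repr(utf8_text), repr(utf16_text))
```

In both cases the application receives a stray U+FEFF between the two
pieces and has to decide what to do with it.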

I'll repeat that I'm not saying that the UTF-8 encoder should add a BOM.
Until this convention is more common, we shouldn't try to be innovative.
But I still think that BOMs on UTF-8 are a good idea.
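
As a rough sketch of that position, assuming a modern Python 3 (whose
"utf-8-sig" codec did not exist when this was written): the plain "utf-8"
encoder emits no BOM, writing one is an explicit opt-in, and a tolerant
decoder accepts the data either way.

```python
import codecs

plain = "text".encode("utf-8")        # default encoder: no BOM added
signed = "text".encode("utf-8-sig")   # opt-in variant: BOM, then the same bytes

# A decoder that tolerates an optional leading BOM reads both forms alike.
from_plain = plain.decode("utf-8-sig")
from_signed = signed.decode("utf-8-sig")

print(plain, signed, from_plain == from_signed)
```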