[I18n-sig] UTF-8 and BOM

M.-A. Lemburg mal@lemburg.com
Tue, 22 May 2001 10:57:43 +0200

"M.-A. Lemburg" wrote:
> "Martin v. Loewis" wrote:
> >
> > > That's hard to implement... how would the codec know where the
> > > stream starts -- it only interfaces to the underlying stream
> > > using .read() and .write() ?
> >
> > The stream readers and writers should assume that the first read and
> > write operation use the ZWNBSP as the BOM, so they should stop giving
> > a byte-order meaning to the BOM once they have seen the first chunk of
> > data. That is best implemented by replacing the .encode function with
> > utf_16_be/le_encode (as appropriate).
> Patches are welcome :-)
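
To make the stream proposal above concrete, here is a rough sketch of the
behaviour being asked for: only the first write carries a BOM, and later
writes use a fixed byte order. It assumes a current Python 3 interpreter
and its standard codecs module, not the 2001 CVS tree, so treat it as an
illustration rather than the code under discussion.

```python
import codecs
import io
import sys

# Assumption: modern Python 3 stdlib codecs, not the 2001 code base.
buf = io.BytesIO()
writer = codecs.getwriter("utf-16")(buf)

writer.write("a")   # first write: BOM plus payload in native byte order
writer.write("b")   # later writes: payload only, no second BOM

# The byte order the generic UTF-16 writer settles on after the first write.
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
written = buf.getvalue()

# Reading back: the stream reader consumes the leading BOM to pick the
# byte order and does not return it as a character.
reader = codecs.getreader("utf-16")(io.BytesIO(written))
text = reader.read()
```

Here written holds exactly one BOM followed by both characters, and text
comes back without a U+FEFF in front.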

> > > Note that this only happens in the UTF-16 codec. All other codecs
> > > pass through the BOMs as-is. Perhaps I should modify the UTF-16
> > > codec to only remove BOMs when used in UTF-16 mode (without byte
> > > order indication) and not in UTF-16-LE/UTF-16-BE mode ?!
> >
> > You may want to study the RFC just to be sure, but I think this is how
> > UTF-16-[BL]E are defined.
> According to the Unicode FAQ, BOM marks should only be used
> where the byte order is not immediately clear. In the case of -LE
> and -BE, this information is available, which is why the codecs
> don't prepend a BOM mark.
> Ok, I will modify the UTF-16-LE and -BE decoders to not remove
> BOMs anymore and fix the UTF-16 decoder to only remove BOMs at
> the start of the string. With these changes you should be able
> to fix the UTF-16 stream codec to be more RFC compliant.

Done. See the CVS versions of Misc/NEWS and Include/unicodeobject.h
for details.
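
For illustration, the decoder behaviour described above can be checked
roughly as follows. This is a sketch run against a current Python 3
interpreter (an assumption; it is not the CVS code referred to here), and
the variable names are made up for the example.

```python
import codecs

# A little-endian payload with U+FEFF in the middle, preceded by a BOM.
payload = "a\ufeffb".encode("utf-16-le")
data = codecs.BOM_UTF16_LE + payload

# Generic UTF-16: only the BOM at the start of the string is removed;
# the U+FEFF in the middle is passed through as ZWNBSP.
generic = data.decode("utf-16")

# UTF-16-LE: the byte order is already known, so nothing is removed.
explicit = data.decode("utf-16-le")

# The -LE encoder does not prepend a BOM either.
encoded_le = "hi".encode("utf-16-le")
```

With these inputs, generic keeps the inner U+FEFF but drops the leading
one, while explicit keeps both.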

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/