[Python-Dev] Unicode byte order mark decoding

M.-A. Lemburg mal at egenix.com
Tue Apr 5 12:19:49 CEST 2005

Martin v. Löwis wrote:
> Stephen J. Turnbull wrote:
>> So there is a standard for the UTF-8 signature, and I know of
>> applications which produce it.  While I agree with you that Python's
>> codecs shouldn't produce it (by default), providing an option to strip
>> is a good idea.
> I would personally like to see an "utf-8-bom" codec (perhaps better
> named "utf-8-sig", which strips the BOM on reading (if present)
> and generates it on writing.


>> However, this option should be part of the initialization of an IO
>> stream which produces Unicodes, _not_ an operation on arbitrary
>> internal strings (whether raw or Unicode).
> With the UTF-8-SIG codec, it would apply to all operation modes of
> the codec, whether stream-based or from strings. Whether or not to
> use the codec would be the application's choice.

I'd suggest to use the same mode of operation as we have in
the UTF-16 codec: it removes the BOM mark on the first call
to the StreamReader .decode() method and writes a BOM mark
on the first call to .encode() on a StreamWriter.

Note that the UTF-16 codec is strict w/r to the presence
of the BOM mark: you get a UnicodeError if a stream does
not start with a BOM mark. For the UTF-8-SIG codec, this
should probably be relaxed to not require the BOM.

Marc-Andre Lemburg

Professional Python Services directly from the Source  (#1, Apr 05 2005)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

More information about the Python-Dev mailing list