[Python-Dev] Unicode byte order mark decoding

Tue Apr 5 21:33:00 CEST 2005

Walter Dörwald sagte:

> M.-A. Lemburg wrote:
>
>>> [...]
>>>With the UTF-8-SIG codec, it would apply to all operation
>>> modes of the codec, whether stream-based or from strings. Whether
>>>or not to use the codec would be the application's choice.
>>
>> I'd suggest to use the same mode of operation as we have in
>> the UTF-16 codec: it removes the BOM mark on the first call
>> to the StreamReader .decode() method and writes a BOM mark
>> on the first call to .encode() on a StreamWriter.
>>
>> Note that the UTF-16 codec is strict w/r to the presence
>> of the BOM mark: you get a UnicodeError if a stream does
>> not start with a BOM mark. For the UTF-8-SIG codec, this
>> should probably be relaxed to not require the BOM.
>
> I've started writing such a codec. Making the BOM optional
> on decoding definitely simplifies the implementation.

OK, here is the patch: http://www.python.org/sf/1177307

The stateful decoder has a little problem: At least three bytes
have to be available from the stream until the StreamReader
decides whether these bytes are a BOM that has to be skipped.
This means that if the file only contains "ab", the user will
never see these two characters.

A solution for this would be to add an argument named final to
the decode and read methods that tells the decoder that the
stream has ended and the remaining buffered bytes have to be
handled now.

Bye,
   Walter Dörwald