[Python-Dev] Unicode byte order mark decoding

Tue Apr 5 22:43:03 CEST 2005

Evan Jones said:
> On Apr 5, 2005, at 15:33, Walter Dörwald wrote:
>> The stateful decoder has a little problem: At least three bytes
>> have to be available from the stream until the StreamReader
>> decides whether these bytes are a BOM that has to be skipped.
>> This means that if the file only contains "ab", the user will
>> never see these two characters.
>
> Shouldn't the decoder be capable of doing a partial match and quitting
> early? After all, "ab" is encoded in UTF8 as <61> <62> but the BOM is
> <ef> <bb> <bf>. If it did this type of partial matching, this issue
> would be avoided except in rare situations.
>
>> A solution for this would be to add an argument named final to
>> the decode and read methods that tells the decoder that the
>> stream has ended and the remaining buffered bytes have to be
>> handled now.
>
> This functionality is provided by a flush() method on similar objects,
> such as the zlib compression objects.

In theory the name is unimportant, but read(..., final=True), flush()
or close() should subject the pending bytes to normal error handling and
must return the result of decoding these pending bytes, just like the
other methods do. With a separate method, this would mean that we would
have to implement a decodeclose(), a readclose() and a readlineclose().
IMHO it would be best to add this argument to decode, read and readline
directly. But I'm not sure what this would mean for iterating through a
StreamReader.
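
For illustration only, here is a minimal sketch of how partial BOM
matching and a final argument could fit together. The class
BomSkippingDecoder and the helper iter_decode are hypothetical names
made up for this example, not an existing or proposed codecs API. The
sketch assumes UTF-8 input and delegates the actual decoding to the
standard incremental UTF-8 decoder.

```python
import codecs

BOM = codecs.BOM_UTF8  # b'\xef\xbb\xbf'


class BomSkippingDecoder:
    """Hypothetical sketch: skips a leading UTF-8 BOM, buffering only
    while the bytes seen so far could still be the start of a BOM."""

    def __init__(self, errors="strict"):
        self._inner = codecs.getincrementaldecoder("utf-8")(errors)
        self._pending = b""
        self._bom_checked = False

    def decode(self, data, final=False):
        if not self._bom_checked:
            data = self._pending + data
            self._pending = b""
            if not final and len(data) < len(BOM) and BOM.startswith(data):
                # Too few bytes to decide, and they match a BOM prefix:
                # hold them back until more input (or final=True) arrives.
                self._pending = data
                return ""
            # Either the bytes cannot be a BOM (partial match failed),
            # there are enough bytes to decide, or the stream has ended.
            self._bom_checked = True
            if data.startswith(BOM):
                data = data[len(BOM):]
        # final=True makes the inner decoder apply normal error handling
        # to whatever incomplete bytes remain.
        return self._inner.decode(data, final)


def iter_decode(chunks, errors="strict"):
    """Hypothetical helper: iterate over decoded text, signalling the
    end of the stream with final=True once the chunks are exhausted."""
    decoder = BomSkippingDecoder(errors)
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

With this sketch, decode(b"ab") returns "ab" immediately, because
<61> is not a prefix of the BOM. A stream that ends after <ef> <bb>
raises UnicodeDecodeError on the final call instead of silently
dropping the bytes. For iteration, one possible answer is the one shown
in iter_decode: the iterator itself makes the final call when the
underlying stream is exhausted.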

Bye,
    Walter Dörwald