[Python-Dev] Unicode byte order mark decoding

Tue Apr 5 22:37:24 CEST 2005

Martin v. Löwis said:
> Walter Dörwald wrote:
>> The stateful decoder has a little problem: At least three bytes
>> have to be available from the stream before the StreamReader
>> can decide whether these bytes are a BOM that has to be skipped.
>> This means that if the file only contains "ab", the user will
>> never see these two characters.
>
> This can be improved, of course: If the first byte is "a",
> it most definitely is *not* a UTF-8 signature.
>
> So we only need a second byte for the characters between U+F000
> and U+FFFF, and a third byte only for the characters
> U+FEC0...U+FEFF. But with the first byte being \xef, we need
> three bytes *anyway*, so we can always decide with the first
> byte only whether we need to wait for three bytes.

OK, I've updated the patch so that the first bytes will only be kept
in the buffer if they are a prefix of the BOM.
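
Roughly, the prefix check could look like the sketch below. This is
not the code from the patch: the class name BomSkippingDecoder and its
attributes are made up for illustration, and it leans on
codecs.getincrementaldecoder from today's standard library for the
actual UTF-8 decoding.

```python
import codecs


class BomSkippingDecoder:
    """Hypothetical sketch: hold bytes back only while they could be a BOM."""

    def __init__(self):
        self._pending = b""    # bytes kept because they are a prefix of the BOM
        self._checked = False  # True once the BOM question is settled
        self._utf8 = codecs.getincrementaldecoder("utf-8")()

    def decode(self, data):
        if self._checked:
            return self._utf8.decode(data)
        data = self._pending + data
        if len(data) < 3 and codecs.BOM_UTF8.startswith(data):
            # Fewer than three bytes and still a prefix of the BOM: wait.
            self._pending = data
            return ""
        # Either the bytes cannot be a BOM or we have enough to decide.
        self._checked = True
        self._pending = b""
        if data.startswith(codecs.BOM_UTF8):
            data = data[len(codecs.BOM_UTF8):]
        return self._utf8.decode(data)
```

With this, a stream containing only b"ab" yields "ab" on the first
call, because "a" is not a prefix of the BOM; only a leading \xef (or
\xef\xbb) is held back until more bytes arrive.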

>> A solution for this would be to add an argument named final to
>> the decode and read methods that tells the decoder that the
>> stream has ended and the remaining buffered bytes have to be
>> handled now.
>
> Shouldn't an empty read from the underlying stream be taken
> as an EOF?

Not necessarily. There are situations where the byte stream is only
temporarily exhausted, e.g. an XML parser that tries to support the
IncrementalParser interface, or when you want to decode encoded
data piecewise in order to give a progress report. In those cases
an empty read does not mean EOF.
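
To illustrate the difference between "no data right now" and "end of
stream", here is a small sketch using the final flag as the
incremental decoders in today's codecs module expose it. That is not
necessarily the interface of the patch discussed here, and the helper
name decode_piecewise is an assumption for the example.

```python
import codecs


def decode_piecewise(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks; an empty chunk is not EOF."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for chunk in chunks:
        # final=False: an incomplete sequence is buffered, not an error.
        parts.append(decoder.decode(chunk, final=False))
    # Only the caller knows that the stream has really ended.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)
```

An empty chunk in the middle is harmless, since the decoder simply
keeps its buffered bytes. Only the last call with final=True forces
the leftovers to be handled, so a truncated sequence at the real end
of the stream is reported instead of being silently dropped.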

Bye,
   Walter Dörwald