Martin v. Löwis wrote:
Walter Dörwald wrote:
The stateful decoder has a little problem: At least three bytes have to be available from the stream before the StreamReader can decide whether these bytes are a BOM that has to be skipped. This means that if the file only contains "ab", the user will never see these two characters.
This can be improved, of course: If the first byte is "a", it most definitely is *not* a UTF-8 signature.
So we only need a second byte for the characters between U+F000 and U+FFFF, and a third byte only for the characters U+FEC0...U+FEFF. But with the first byte being \xef, we need three bytes *anyway*, so we can always decide with the first byte only whether we need to wait for three bytes.
OK, I've updated the patch so that the first bytes will only be kept in the buffer if they are a prefix of the BOM.
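For illustration, the prefix check could look roughly like the sketch below. This is not the code from the patch; the helper names must_wait and strip_bom_if_decided are made up for this example, and only codecs.BOM_UTF8 comes from the standard library.

```python
import codecs

BOM = codecs.BOM_UTF8  # b"\xef\xbb\xbf"


def must_wait(buffered: bytes) -> bool:
    """Return True if `buffered` is a proper prefix of the BOM.

    Only in that case is it still undecided whether a BOM follows,
    so only then do the bytes have to stay in the buffer.
    (Hypothetical helper, not taken from the patch.)
    """
    return len(buffered) < len(BOM) and BOM.startswith(buffered)


def strip_bom_if_decided(buffered: bytes):
    """Return the bytes that can be handed to the decoder, or None if undecided."""
    if must_wait(buffered):
        return None                  # keep buffering
    if buffered.startswith(BOM):
        return buffered[len(BOM):]   # complete BOM seen: skip it
    return buffered                  # not a BOM: pass everything through


print(strip_bom_if_decided(b"ab"))        # decided at once, nothing is held back
print(strip_bom_if_decided(b"\xef\xbb"))  # still undecided
```

With this check, a file containing only "ab" is passed through immediately, because "a" is not a prefix of the signature.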
A solution for this would be to add an argument named final to the decode and read methods that tells the decoder that the stream has ended and the remaining buffered bytes have to be handled now.
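A minimal sketch of how such a final argument might behave, assuming a hand-written wrapper class (SigDecoder is a made-up name, not part of the patch). It delegates the actual UTF-8 decoding to codecs.getincrementaldecoder from the standard library of current Python versions, whose decode method accepts a final flag.

```python
import codecs


class SigDecoder:
    """Hypothetical UTF-8 decoder that skips a leading BOM (illustration only)."""

    def __init__(self):
        self.pending = b""        # bytes held back while the BOM question is open
        self.bom_checked = False
        self.decoder = codecs.getincrementaldecoder("utf-8")()

    def decode(self, data: bytes, final: bool = False) -> str:
        data = self.pending + data
        self.pending = b""
        if not self.bom_checked:
            bom = codecs.BOM_UTF8
            undecided = len(data) < len(bom) and bom.startswith(data)
            if undecided and not final:
                self.pending = data   # wait for more input
                return ""
            self.bom_checked = True
            if data.startswith(bom):
                data = data[len(bom):]
        # With final=True, leftover incomplete bytes are handled now:
        # in strict mode the underlying decoder raises UnicodeDecodeError.
        return self.decoder.decode(data, final)


d = SigDecoder()
print(repr(d.decode(b"\xef\xbb")))   # held back, stream may continue
print(repr(d.decode(b"\xbfhi")))     # BOM completed and skipped
```

Without final=True the decoder keeps waiting; with it, a truncated signature at the end of the stream is reported as an error instead of being silently dropped.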
Shouldn't an empty read from the underlying stream be taken as an EOF?
There are situations where the byte stream might be temporarily exhausted, e.g. an XML parser that tries to support the IncrementalParser interface, or when you want to decode encoded data piecewise, because you want to give a progress report.
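As a rough illustration of the piecewise case, the sketch below decodes chunks as they arrive and reports progress after each one. The function name decode_piecewise is invented for this example; it relies on the incremental decoder interface of current Python versions. An empty chunk is treated as "nothing available right now", and only the explicit final call marks the end of the stream.

```python
import codecs


def decode_piecewise(chunks, encoding="utf-8"):
    """Yield (bytes_seen, text) pairs for each chunk (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder(encoding)()
    total = 0
    for chunk in chunks:
        total += len(chunk)
        # final defaults to False: an incomplete character stays buffered,
        # and an empty chunk simply produces no text.
        yield total, decoder.decode(chunk)
    # Only here is the decoder told that the stream has really ended.
    yield total, decoder.decode(b"", final=True)


# The two-byte character is split across chunks, with an empty read in between.
for seen, text in decode_piecewise([b"a\xc3", b"", b"\xa4b"]):
    print(seen, repr(text))
```

If the empty read were taken as EOF, the pending byte \xc3 would be reported as a truncated character even though the rest of it arrives in the next chunk.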
Bye, Walter Dörwald