[Python-Dev] Unicode byte order mark decoding

Walter Dörwald walter at livinglogic.de
Wed Apr 6 10:32:59 CEST 2005


Martin v. Löwis sagte:
> Walter Dörwald wrote:
>> There are situations where the byte stream might be temporarily
>> exhausted, e.g. an XML parser that tries to support the
>> IncrementalParser interface, or when you want to decode
>> encoded data piecewise, because you want to give a progress
>> report.
>
> Yes, but these are not file-like objects.

True, on the outside there are no file-like objects. But the
IncrementalParser gets passed the XML bytes in chunks,
so it has to use a stateful decoder for decoding. Unfortunately
this means that is has to use a stream API. (See
http://www.python.org/sf/1101097 for a patch that somewhat
fixes that.)

(Another option would be to completely ignore the stateful API
and handcraft stateful decoding (or only support stateless
decoding), like most XML parsers for Python do now.)

> In the IncrementalParser,
> it is *not* the case that a read operation returns an empty
> string. Instead, the application repeatedly feeds data explicitly.

That's true, but the parser has to wrap this data into an object
that can be passed to the StreamReader constructor. (See the
Queue class in Lib/test/test_codecs.py for an example.)

> For a file-like object, returning "" indicates EOF.

Not neccassarily. In the example above the IncrementalParser
gets fed a chunk of data, it stuffs this data into the Queue,
so that the StreamReader can decode it. Once the data
from the Queue is exhausted, there won't any further
data until the user calls feed() on the IncrementalParser again.

Bye,
   Walter Dörwald





More information about the Python-Dev mailing list