M.-A. Lemburg wrote:
Martin v. Löwis wrote:
M.-A. Lemburg wrote:
I've thought about this some more. Perhaps I'm still missing something, but wouldn't it be possible to add a feeding mode to the existing stream codecs by creating a new queue data type (much like the queue you have in the test cases of your patch) and using the stream codecs on these?
Here is the problem. In UTF-8, how does the actual algorithm tell (the application) that the bytes it got on decoding provide for three fully decodable characters, and that 2 bytes are left undecoded, and that those bytes are not inherently ill-formed, but lack a third byte to complete the multi-byte sequence?
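A minimal sketch of the distinction being asked for, assuming current CPython, where the undocumented helper codecs.utf_8_decode accepts a final flag and returns how many bytes it consumed. The classify() function and its status strings are hypothetical names used only for illustration.

```python
import codecs

def classify(data):
    """Return (text, consumed, status) for a chunk of UTF-8 bytes.

    status is "complete", "incomplete", or "ill-formed".
    """
    try:
        # final=False: a truncated trailing sequence is not an error,
        # it is simply left unconsumed.
        text, consumed = codecs.utf_8_decode(data, "strict", False)
    except UnicodeDecodeError:
        # Bytes that can never become valid UTF-8, whatever follows.
        # (The exception object carries the exact position.)
        return "", 0, "ill-formed"
    status = "complete" if consumed == len(data) else "incomplete"
    return text, consumed, status

euro = "\u20ac".encode("utf-8")                      # 3 bytes: e2 82 ac
chunk = "a\u00e9\u20ac".encode("utf-8") + euro[:2]   # 3 characters + 2 stray bytes
result = classify(chunk)
```

Here result reports three decoded characters, six consumed bytes, and the status "incomplete"; the two trailing bytes are neither decoded nor rejected.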
This state can be stored in the stream codec instance, e.g. by using a special state object that is stored in the instance and passed to the encode/decode APIs of the codec or by implementing the stream codec itself in C.
That's exactly what my patch does. The state (the bytes that have already been read from the input stream but couldn't be decoded, and so have to be used on the next call to read()) is stored in the bytebuffer attribute of the StreamReader. Most stateful decoders use this type of state; the only one I can think of that uses more than this is the UTF-7 decoder, which decodes partial +xxxx- sequences but then has to keep the current shift state and the partially consumed bits and bytes. This decoder could be changed so that it works with only a byte buffer too, but that would mean the decoder doesn't enter incomplete +xxxx- sequences; instead it retains them in the byte buffer and only decodes them once the "-" is encountered.
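The byte-buffer idea can be sketched roughly as follows. This is not the patch itself: BufferingReader and its feed() method are hypothetical names, and the sketch assumes current CPython's codecs.utf_8_decode with its final flag. Only the bytebuffer attribute name is taken from the description above.

```python
import codecs

class BufferingReader:
    """Hypothetical stand-in for a StreamReader that keeps a bytebuffer."""

    def __init__(self):
        # Bytes already read from the stream but not yet decodable.
        self.bytebuffer = b""

    def feed(self, data, final=False):
        # Prepend the leftovers from the previous call.
        data = self.bytebuffer + data
        text, consumed = codecs.utf_8_decode(data, "strict", final)
        # Whatever was not consumed waits for the next call.
        self.bytebuffer = data[consumed:]
        return text

reader = BufferingReader()
parts = [
    reader.feed(b"a\xe2"),               # "a" decodes, e2 is held back
    reader.feed(b"\x82"),                # still incomplete, nothing emitted
    reader.feed(b"\xacb", final=True),   # sequence completes
]
```

Each call decodes as much as it can, and the undecoded tail is the only state carried between calls.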
In fact a trivial implementation of any stateful decoder could put *everything* it reads into the bytebuffer when final==False and decode it in one go once read() is called with final==True.
But IMHO each decoder should decode as much as possible.
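For illustration, the trivial variant could look like the sketch below. DeferringDecoder is a hypothetical name; it relies only on bytes.decode() and so works for any encoding, at the cost of producing no output until the final call.

```python
class DeferringDecoder:
    """Hypothetical decoder that buffers everything until final=True."""

    def __init__(self, encoding):
        self.encoding = encoding
        self.bytebuffer = b""

    def decode(self, data, final=False):
        self.bytebuffer += data
        if not final:
            # Nothing is decoded yet; all input is just stored.
            return ""
        data, self.bytebuffer = self.bytebuffer, b""
        # Decode the whole accumulated input in one go.
        return data.decode(self.encoding)

decoder = DeferringDecoder("utf-8")
first = decoder.decode(b"a\xe2\x82")
second = decoder.decode(b"\xac", final=True)
```

This is correct but keeps the whole input in memory and delays all output, which is the argument for decoding as much as possible on every call.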
We do need to extend the API between the stream codec and the encode/decode functions, no doubt about that. However, this is an extension that is well hidden from the user of the codec and won't break code.
Exactly: this shouldn't be anything officially documented, because what kind of data is passed around depends on the codec. And when the stream reader is implemented in C there isn't any API anyway.
On top of that, you can implement whatever queuing or streaming APIs you want, but you *need* an efficient way to communicate incompleteness.
Bye, Walter Dörwald