
M.-A. Lemburg wrote:
Walter Dörwald wrote:
Martin v. Löwis wrote:
[...] We already have an efficient way to communicate incompleteness: the decode method returns the number of decoded bytes.
The questions remaining are
1) communicate to whom? IMHO the info should only be used internally by the StreamReader.
Handling incompleteness should be something for the codec to deal with.
Absolutely. This means that decode() should not be called by the user. (But the implementation of read() (and feed(), if we have it) calls it.)
The queue doesn't have to know about it in any way. However, the queue should have interfaces allowing the codec to tell whether there are more bytes waiting to be processed.
This won't work when the byte stream wrapped by the StreamReader is not a queue. (Or do you want to wrap the byte stream in a queue? This would be three wrapping layers.) And the information is not really useful, because it might change (e.g. when the user puts additional data into the queue/stream).
2) When is incompleteness OK? Incompleteness is of course not OK in the stateless API. For the stateful API, incompleteness has to be OK even when the input stream is (temporarily) exhausted, because otherwise a feed mode wouldn't work anyway. But then incompleteness is always OK, because the StreamReader can't distinguish a temporarily exhausted input stream from a permanently exhausted one. The only fix for this I can think of is the final argument.
A final argument may be the way to go. But it should be an argument for the .read() method (not only the .decode() method) since that's the method reading the data from the queue.
Yes. E.g. the low level charmap decode function doesn't need the final argument, because there is zero state to be kept between calls.
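To illustrate the difference, here is a minimal sketch, assuming a current Python 3 where the low-level `codecs.utf_8_decode` function accepts a `final` flag and `codecs.charmap_decode` does not. The cp1252 decoding table is only a convenient example of a charmap; the variable names are made up for illustration.

```python
import codecs
from encodings import cp1252

# The first two bytes of a three-byte UTF-8 sequence (EURO SIGN).
truncated = "\u20ac".encode("utf-8")[:2]

# Non-final mode: the incomplete sequence is not an error; the function
# simply reports that it consumed fewer bytes than it was given.
partial_result = codecs.utf_8_decode(truncated, "strict", False)

# Final mode: the same input is an error, because no more bytes will follow.
try:
    codecs.utf_8_decode(truncated, "strict", True)
    final_raised = False
except UnicodeDecodeError:
    final_raised = True

# A charmap decoder maps each byte on its own, so there is nothing that
# could be left incomplete and no final argument is needed.
charmap_result = codecs.charmap_decode(b"\xe4b", "strict", cp1252.decoding_table)
```

In non-final mode the caller learns about the incompleteness only through the returned byte count, which is the mechanism mentioned at the top of this thread.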
I'd suggest that we extend the existing encode and decode codec APIs to take an extra state argument that holds the codec state in whatever format the codec needs (e.g. this could be a tuple or a special object):
    encode(data, errors='strict', state=None)
    decode(data, errors='strict', state=None)
We don't need a specification for that. The stateless API doesn't need an explicit state (the state is just a bunch of variables at the C level) and for the stateful API the state can be put into StreamReader attributes. How this state looks is totally up to the StreamReader itself (see the UTF-7 reader in the patch for an example). If the stream reader passes on this state to a low level decoding function implemented in C, how this state info looks is again totally up to the codec. So I think we don't have to specify anything in this area.
In the case of the .read() method, decode() would be called. If the returned length_consumed does not match the length of the data input, the remaining items would have to be placed back onto the queue in non-final mode. In final mode an exception would be raised to signal the problem.
Yes, in non-final mode the bytes would have to be retained and in final mode an exception is raised (except when the error handling callback does something else). But I don't think we should put a queue between the byte stream and the StreamReader (at least not in the sense of a queue as another file-like object). The remaining items can be kept in an attribute of the StreamReader instance, that's what
---
data = self.bytebuffer + newdata
object, decodedbytes = self.decode(data, self.errors)
self.bytebuffer = data[decodedbytes:]
---
does in the patch. The first line combines the items retained from the last call with those read from the stream (or passed to the feed method). The second line does semi-stateful decoding of those bytes. The third line puts the new remaining items back into the buffer. The decoding is "semi-stateful", because the info about the remaining bytes is not stored by decode itself, but by the caller of decode. feed() is the method that does fully stateful decoding of byte chunks.
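The three lines from the patch can be tried out in isolation. What follows is only a sketch, assuming a current Python 3 and UTF-8 as the encoding: the class name `BufferingUTF8Reader` and the `final` parameter on `feed()` are hypothetical and not taken from the patch, and the real StreamReader has more to it.

```python
import codecs


class BufferingUTF8Reader:
    """Hypothetical stand-in for a StreamReader; not the class from the patch."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.bytebuffer = b""  # bytes retained from the previous call

    def decode(self, data, errors, final=False):
        # Semi-stateful: returns (text, bytes_consumed) and remembers nothing.
        return codecs.utf_8_decode(data, errors, final)

    def feed(self, newdata, final=False):
        # Combine the retained bytes with the newly supplied ones.
        data = self.bytebuffer + newdata
        # Decode as much as possible; in final mode leftovers raise an error.
        text, decodedbytes = self.decode(data, self.errors, final)
        # Keep whatever was not consumed for the next call.
        self.bytebuffer = data[decodedbytes:]
        return text


reader = BufferingUTF8Reader()
euro = "\u20ac".encode("utf-8")  # three bytes

# The chunk boundary falls in the middle of the EURO SIGN.
first = reader.feed(b"a" + euro[:2])
retained = reader.bytebuffer
second = reader.feed(euro[2:] + b"b")
```

Here the caller (`feed()`) owns the leftover bytes, while `decode()` itself stays free of state, which is the "semi-stateful" split described above.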
I think it's PEP time for this new extension. If time permits I'll craft an initial version over the weekend.
I'm looking forward to the results. Bye, Walter Dörwald