Mailman 3 Decoding incomplete unicode - Python-Dev

newer
Dealing with test__locale failure...

Decoding incomplete unicode

Walter Dörwald

July 27, 2004

8:39 p.m.

Pythons unicode machinery currently has problems when decoding incomplete input. When codecs.StreamReader.read() encounters a decoding error it reads more bytes from the input stream and retries decoding. This is broken for two reasons: 1) The error might be due to a malformed byte sequence in the input, a problem that can't be fixed by reading more bytes. 2) There may be no more bytes available at this time. Once more data is available decoding can't continue because bytes from the input stream have already been read and thrown away. (sio.DecodingInputFilter has the same problems) I've uploaded a patch that fixes these problems to SF: http://www.python.org/sf/998993 The patch implements a few additional features: - read() has an additional argument chars that can be used to specify the number of characters that should be returned. - readline() is supported on all readers derived from codecs.StreamReader(). - readline() and readlines() have an additional option for dropping the u"\n". The patch is still missing changes to the escape codecs ("unicode_escape" and "raw_unicode_escape") and I haven't touched the CJK codecs, but it has test cases that check the new functionality for all affected codecs (UTF-7, UTF-8, UTF-16, UTF-16-LE, UTF-16-BE). Could someone take a look at the patch? Bye, Walter Dörwald

Show replies by date

M.-A. Lemburg

July 2004

8:59 p.m.

Walter Dörwald wrote:

...

Just did... please see the comments in the SF tracker. I like the idea, but don't think the implementation is the right way to do it. Instead, I'd suggest using a new error handling strategy "break" ( = break processing as soon as errors are found). The advantage of this approach is twofold: * no new APIs or API changes are required * other codecs (including third-party ones) can easily implement the same strategy -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 27 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

"Martin v. Löwis"

9:15 p.m.

M.-A. Lemburg wrote:

...

Can you demonstrate this approach in a patch? I think it is unimplementable: the codec cannot communicate to the error callback that it ran out of data. Regards, Martin

M.-A. Lemburg

9:43 p.m.

Martin v. Löwis wrote:

...

The codec can: the callback gets all the necessary information and can even manipulate the objects being worked on. But you have a point: the current implementations of the various encode/decode functions don't provide interfaces to report back the number of bytes read at C level - the codecs module wrappers add these numbers assuming that all bytes were read. The error callbacks could, however, raise an exception which includes all the needed information, including any state that may be needed in order to continue with coding operation. We may then need to allow additional keyword arguments on the encode/decode functions in order to preset a start state. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 27 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

9:19 a.m.

M.-A. Lemburg wrote:

...

Martin v. Löwis wrote:

...
M.-A. Lemburg wrote:

...
I like the idea, but don't think the implementation is the right way to do it. Instead, I'd suggest using a new error handling strategy "break" ( = break processing as soon as errors are found).

Can you demonstrate this approach in a patch? I think it is unimplementable: the codec cannot communicate to the error callback that it ran out of data.

The codec can: the callback gets all the necessary information and can even manipulate the objects being worked on.

But you have a point: the current implementations of the various encode/decode functions don't provide interfaces to report back the number of bytes read at C level - the codecs module wrappers add these numbers assuming that all bytes were read.

This is the correct thing to do for the stateless decoders: any incomplete byte sequence at the end of the input is an error. But then it doesn't make sense to return the number of bytes decoded for the stateless decoder, because this is always the size of the input. For the stateful decoder this is just some sort of state common to all decoders: the decoder keeps the incomplete byte sequence to be used in the next call. But then this state should be internal to the decoder and not part of the public API. This would make the decode() method more usable: When you want to implement an XML parser that supports the xml.sax.xmlreader.IncrementalParser interface, you have an API mismatch. The parser has to use the stateful decoding API (i.e. read()), which means the input is in the form of a byte stream, but this interface expects it's input as byte chunks passed to multiple calls to the feed() method. If StreamReader.decode() simply returned the decoded unicode object and keep the remaining undecoded bytes as an internal state then StreamReader.decode() would be directly usable. But this isn't really a "StreamReader" any more, so the best solution would probably be to have a three level API: * A stateless decoding function (what codecs.getdecoder returns now); * A stateful "feed reader", which keeps internal state (including undecoded byte sequences) and gets passed byte chunks (should this feed reader have a error attribute or should this be an argument to the feed method?); * A stateful stream reader that reads its input from a byte stream. The functionality for the stream reader could be implemented once using the underlying functionality of the feed reader (in fact we could implement something similar to sio's stacking streams: the stream reader would use the feed reader to wrap the byte input stream and add only a read() method. The line reading methods (readline(), readlines() and __iter__() could be added by another stream filter)

...

The error callbacks could, however, raise an exception which includes all the needed information, including any state that may be needed in order to continue with coding operation.

This makes error callbacks effectively unusable with stateful decoders.

...

We may then need to allow additional keyword arguments on the encode/decode functions in order to preset a start state.

As those decoding functions are private to the decoder that's probably OK. I wouldn't want to see additional keyword arguments on str.decode (which uses the stateless API anyway). BTW, that's exactly what I did for codecs.utf_7_decode_stateful, but I'm not really comfortable with the internal state of the UTF-7 decoder being exposed on the Python level. It would be better to encapsulate the state in a feed reader implemented in C, so that the state is inaccessible from the Python level. Bye, Walter Dörwald

M.-A. Lemburg

9:42 a.m.

Walter Dörwald wrote:

...

The reason why stateless encode and decode APIs return the number of input items consumed is to accomodate for error handling situations like these where you want to stop coding and leave the remaining work to another step. The C implementation currently doesn't make use of this feature.

...

Please don't mix "StreamReader" with "decoder". The codecs module returns 4 different objects if you ask it for a codec set: two stateless APIs for encoding and decoding and two factory functions for creating possibly stateful objects which expose a stream interface. Your "stateful decoder" is actually part of a StreamReader implementation and doesn't have anything to do with the stateless decoder. I see two possibilities here: 1. you write a custom StreamReader/Writer implementation for each of the codecs which takes care of keeping state and encoding/decoding as much as possible 2. you extend the existing stateless codec implementations to allow communicating state on input and output; the stateless operation would then be a special case

...

Why make things more complicated ? If you absolutely need a feed interface, you can feed your data to a StringIO instance which is then read from by StreamReader.

...

Could you explain ?

...

See above: possibility 1 would be the way to go then. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 28 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

4:55 p.m.

M.-A. Lemburg wrote:

...

Which in most cases is the read method.

...

I know. I'd just like to have a stateful decoder that doesn't use a stream interface. The stream interface could be built on top of that without any knowlegde of the encoding. I wonder whether the decode method is part of the public API for StreamReader.

...

But I'd like to reuse at least some of the functionality from PyUnicode_DecodeUTF8() etc. Would a version of PyUnicode_DecodeUTF8() with an additional PyUTF_DecoderState * be OK?

...

This doesn't work, because a StringIO has only one file position:

...

But something like the Queue class from the tests in the patch might work.

...

If you have to call the decode function with errors='break', you will only get the break error handling and nothing else.

...

I might give this a try. Bye, Walter Dörwald

M.-A. Lemburg

6:31 p.m.

Walter Dörwald wrote:

...

M.-A. Lemburg wrote:

...
Walter Dörwald wrote:

...
This is the correct thing to do for the stateless decoders: any incomplete byte sequence at the end of the input is an error. But then it doesn't make sense to return the number of bytes decoded for the stateless decoder, because this is always the size of the input.

The reason why stateless encode and decode APIs return the number of input items consumed is to accomodate for error handling situations like these where you want to stop coding and leave the remaining work to another step.

Which in most cases is the read method.

The read method only happens to use the stateless encode and decode methods. There nothing in the design spec that mandates this, though.

...

...
The C implementation currently doesn't make use of this feature.

...
For the stateful decoder this is just some sort of state common to all decoders: the decoder keeps the incomplete byte sequence to be used in the next call. But then this state should be internal to the decoder and not part of the public API. This would make the decode() method more usable: When you want to implement an XML parser that supports the xml.sax.xmlreader.IncrementalParser interface, you have an API mismatch. The parser has to use the stateful decoding API (i.e. read()), which means the input is in the form of a byte stream, but this interface expects it's input as byte chunks passed to multiple calls to the feed() method. If StreamReader.decode() simply returned the decoded unicode object and keep the remaining undecoded bytes as an internal state then StreamReader.decode() would be directly usable.

Please don't mix "StreamReader" with "decoder". The codecs module returns 4 different objects if you ask it for a codec set: two stateless APIs for encoding and decoding and two factory functions for creating possibly stateful objects which expose a stream interface.

Your "stateful decoder" is actually part of a StreamReader implementation and doesn't have anything to do with the stateless decoder.

I know. I'd just like to have a stateful decoder that doesn't use a stream interface. The stream interface could be built on top of that without any knowlegde of the encoding.

I wonder whether the decode method is part of the public API for StreamReader.

It is: StreamReader/Writer are "sub-classes" of the Codec class. However, there's nothing stating that .read() or .write() *must* use these methods to do their work and that's intentional.

...

...
I see two possibilities here:

1. you write a custom StreamReader/Writer implementation for each of the codecs which takes care of keeping state and encoding/decoding as much as possible

But I'd like to reuse at least some of the functionality from PyUnicode_DecodeUTF8() etc.

Would a version of PyUnicode_DecodeUTF8() with an additional PyUTF_DecoderState * be OK?

Before you start putting more work into this, let's first find a good workable approach.

...

...
2. you extend the existing stateless codec implementations to allow communicating state on input and output; the stateless operation would then be a special case

...
But this isn't really a "StreamReader" any more, so the best solution would probably be to have a three level API: * A stateless decoding function (what codecs.getdecoder returns now); * A stateful "feed reader", which keeps internal state (including undecoded byte sequences) and gets passed byte chunks (should this feed reader have a error attribute or should this be an argument to the feed method?); * A stateful stream reader that reads its input from a byte stream. The functionality for the stream reader could be implemented once using the underlying functionality of the feed reader (in fact we could implement something similar to sio's stacking streams: the stream reader would use the feed reader to wrap the byte input stream and add only a read() method. The line reading methods (readline(), readlines() and __iter__() could be added by another stream filter)

Why make things more complicated ?

If you absolutely need a feed interface, you can feed your data to a StringIO instance which is then read from by StreamReader.

This doesn't work, because a StringIO has only one file position:

...
...
...
import cStringIO s = cStringIO.StringIO() s.write("x") s.read() ''

Ah, you wanted to do both feeding and reading at the same time ?!

...

But something like the Queue class from the tests in the patch might work.

Right... I don't think that we need a third approach to codecs just to implement feed based parsers.

...

...
...
...
The error callbacks could, however, raise an exception which includes all the needed information, including any state that may be needed in order to continue with coding operation.

This makes error callbacks effectively unusable with stateful decoders.

Could you explain ?

If you have to call the decode function with errors='break', you will only get the break error handling and nothing else.

Yes and ... ? What else do you want it to do ?

...

...
...
...
We may then need to allow additional keyword arguments on the encode/decode functions in order to preset a start state.

As those decoding functions are private to the decoder that's probably OK. I wouldn't want to see additional keyword arguments on str.decode (which uses the stateless API anyway). BTW, that's exactly what I did for codecs.utf_7_decode_stateful, but I'm not really comfortable with the internal state of the UTF-7 decoder being exposed on the Python level. It would be better to encapsulate the state in a feed reader implemented in C, so that the state is inaccessible from the Python level.

See above: possibility 1 would be the way to go then.

I might give this a try.

Again, please wait until we have found a good solution to this. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 28 2004)

...

...
...
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

8:30 p.m.

M.-A. Lemburg wrote:

...

But then this turns into a stateful decoder. What would happen when stateless decoders suddenly started to decode less than the complete string? Every user would have to check whether decoder(foo)[1] == len(foo).

...

Any read() method can be implemented on top of a stateful decode() method.

...

I agree that we need a proper design for this that gives us the most convenient codec API with breaking backwards compatibility (at least not for codec users). Breaking compatibility for codec implementers shouldn't be an issue. I'll see if I can come up with something over the weekend.

...

There is no other way. You pass the feeder byte string chunks and it returns the chunks of decoded objects. With the StreamReader the reader itself will read those chunks from the underlying stream. Implementing a stream reader interface on top of a feed interface is trivial: Basically our current decode method *is* the feed interface, the only problem is that the user has to keep state (the undecoded bytes that have to be passed to the next call to decode). Move that state into an attribute of the instance and drop it from the return value and you have a feed interface.

...

We already have most of the functionality in the decode method.

...

The user can pass any value for the errors argument in the StreamReader constructor. The StreamReader should always honor this error handling strategy. Example. import codecs, cStringIO count = 0 def countandreplace(exc): global count if not isinstance(exc, UnicodeDecodeError): raise TypeError("can handle error") count += 1 return (u"\ufffd", exc.end) codecs.register_error("countandreplace", countandreplace) s = cStringIO.StringIO("\xc3foo\xffbar\xc3") us = codecs.getreader("utf-8")(s) The first \xc3 and the \xff are real errors, the trailing \xc3 might be a transient one. To handle this with the break handler strategy the StreamReader would have to call the decode() method with errors="break" instead of errors="countandreplace". break would then have to decide whether it's a transient error or a real one (presumably from some info in the exception). If it's a real one it would have to call the original error handler, but it doesn't have any way of knowing what the original error handler was. If it's a transient error, it would have to comunicate this fact to the caller, which could be done by changing an attribute in the exception object. But the decoding function still has to put the retained bytes into the StreamReader so that part doesn't get any simpler. Alltogether I find this method rather convoluted, especially since we have most of the machinery in place. What is missing is the implementation of real stateful decoding functions.

...

OK. Bye, Walter Dörwald

Walter Dörwald

August 2004

7:24 p.m.

OK, here a my current thoughts on the codec problem: The optimal solution (ignoring backwards compatibility) would look like this: codecs.lookup() would return the following stuff (this could be done by replacing the 4 entry tuple with a real object): * decode: The stateless decoding function * encode: The stateless encocing function * chunkdecoder: The stateful chunk decoder * chunkencoder: The stateful chunk encoder * streamreader: The stateful stream decoder * streamwriter: The stateful stream encoder The functions and classes look like this: Stateless decoder: decode(input, errors='strict'): Function that decodes the (str) input object and returns a (unicode) output object. The decoder must decode the complete input without any remaining undecoded bytes. Stateless encoder: encode(input, errors='strict'): Function that encodes the complete (unicode) input object and returns a (str) output object. Stateful chunk decoder: chunkdecoder(errors='strict'): A factory function that returns a stateful decoder with the following method: decode(input, final=False): Decodes a chunk of input and return the decoded unicode object. This method can be called multiple times and the state of the decoder will be kept between calls. This includes trailing incomplete byte sequences that will be retained until the next call to decode(). When the argument final is true, this is the last call to decode() and trailing incomplete byte sequences will not be retained, but a UnicodeError will be raised. Stateful chunk encoder: chunkencoder(errors='strict'): A factory function that returns a stateful encoder with the following method: encode(input, final=False): Encodes a chunk of input and returns the encoded str object. When final is true this is the last call to encode(). Stateful stream decoder: streamreader(stream, errors='strict'): A factory function that returns a stateful decoder for reading from the byte stream stream, with the following methods: read(size=-1, chars=-1, final=False): Read unicode characters from the stream. When data is read from the stream it should be done in chunks of size bytes. If size == -1 all the remaining data from the stream is read. chars specifies the number of characters to read from the stream. read() may return less then chars characters if there's not enough data available in the byte stream. If chars == -1 as much characters are read as are available in the stream. Transient errors are ignored and trailing incomplete byte sequence are retained when final is false. Otherwise a UnicodeError is raised in the case of incomplete byte sequences. readline(size=-1): ... next(): ... __iter__(): ... Stateful stream encoder: streamwriter(stream, errors='strict'): A factory function that returns a stateful encoder for writing unicode data to the byte stream stream, with the following methods: write(data, final=False): Encodes the unicode object data and writes it to the stream. If final is true this is the last call to write(). writelines(data): ... I know that this is quite a departure from the current API, and I'm not sure if we can get all of the functionality without sacrificing backwards compatibility. I don't particularly care about the "bytes consumed" return value from the stateless codec. The codec should always have returned only the encoded/decoded object, but I guess fixing this would break too much code. And users who are only interested in the stateless functionality will probably use unicode.encode/str.decode anyway. For the stateful API it would be possible to combine the chunk and stream decoder/encode into one class with the following methods (for the decoder): __init__(stream, errors='strict'): Like the current StreamReader constructor, but stream may be None, if only the chunk API is used. decode(input, final=False): Like the current StreamReader (i.e. it returns a (unicode, int) tuple.) This does not keep the remaining bytes in a buffer. This is the job of the caller. feed(input, final=False): Decodes input and returns a decoded unicode object. This method calls decode() internally and manages the byte buffer. read(size=-1, chars=-1, final=False): readline(size=-1): next(): __iter__(): See above. As before implementers of decoders only need to implement decode(). To be able to support the final argument the decoding functions in _codecsmodule.c could get an additional argument. With this they could be used for the stateless codecs too and we can reduce the number of functions again. Unfortunately adding the final argument breaks all of the current codecs, but dropping the final argument requires one of two changes: 1) When the input stream is exhausted, the bytes read are parsed as if final=True. That's the way the CJK codecs currently handle it, but unfortunately this doesn't work with the feed decoder. 2) Simply ignore any remaing undecoded bytes at the end of the stream. If we really have to drop the final argument, I'd prefer 2). I've uploaded a second version of the patch. It implements the final argument, adds the feed() method to StreamReader and again merges the duplicate decoding functions in the codecs module. Note that the patch isn't really finished (the final argument isn't completely supported in the encoders and the CJK and escape codecs are unchanged), but it should be sufficient as a base for discussion. Bye, Walter Dörwald

M.-A. Lemburg

1:15 p.m.

Hi Walter, I don't have time to comment on this this week, I'll respond next week. Overall, I don't like the idea of adding extra APIs breaking the existing codec API. I believe that we can extend stream codecs to also work in a feed mode without breaking the API. Walter Dörwald wrote:

...

OK, here a my current thoughts on the codec problem:

The optimal solution (ignoring backwards compatibility) would look like this: codecs.lookup() would return the following stuff (this could be done by replacing the 4 entry tuple with a real object):

* decode: The stateless decoding function * encode: The stateless encocing function * chunkdecoder: The stateful chunk decoder * chunkencoder: The stateful chunk encoder * streamreader: The stateful stream decoder * streamwriter: The stateful stream encoder

The functions and classes look like this:

Stateless decoder: decode(input, errors='strict'): Function that decodes the (str) input object and returns a (unicode) output object. The decoder must decode the complete input without any remaining undecoded bytes.

Stateless encoder: encode(input, errors='strict'): Function that encodes the complete (unicode) input object and returns a (str) output object.

Stateful chunk decoder: chunkdecoder(errors='strict'): A factory function that returns a stateful decoder with the following method:

decode(input, final=False): Decodes a chunk of input and return the decoded unicode object. This method can be called multiple times and the state of the decoder will be kept between calls. This includes trailing incomplete byte sequences that will be retained until the next call to decode(). When the argument final is true, this is the last call to decode() and trailing incomplete byte sequences will not be retained, but a UnicodeError will be raised.

Stateful chunk encoder: chunkencoder(errors='strict'): A factory function that returns a stateful encoder with the following method: encode(input, final=False): Encodes a chunk of input and returns the encoded str object. When final is true this is the last call to encode().

Stateful stream decoder: streamreader(stream, errors='strict'): A factory function that returns a stateful decoder for reading from the byte stream stream, with the following methods:

read(size=-1, chars=-1, final=False): Read unicode characters from the stream. When data is read from the stream it should be done in chunks of size bytes. If size == -1 all the remaining data from the stream is read. chars specifies the number of characters to read from the stream. read() may return less then chars characters if there's not enough data available in the byte stream. If chars == -1 as much characters are read as are available in the stream. Transient errors are ignored and trailing incomplete byte sequence are retained when final is false. Otherwise a UnicodeError is raised in the case of incomplete byte sequences. readline(size=-1): ... next(): ... __iter__(): ...

Stateful stream encoder: streamwriter(stream, errors='strict'): A factory function that returns a stateful encoder for writing unicode data to the byte stream stream, with the following methods:

write(data, final=False): Encodes the unicode object data and writes it to the stream. If final is true this is the last call to write(). writelines(data): ...

I know that this is quite a departure from the current API, and I'm not sure if we can get all of the functionality without sacrificing backwards compatibility.

I don't particularly care about the "bytes consumed" return value from the stateless codec. The codec should always have returned only the encoded/decoded object, but I guess fixing this would break too much code. And users who are only interested in the stateless functionality will probably use unicode.encode/str.decode anyway.

For the stateful API it would be possible to combine the chunk and stream decoder/encode into one class with the following methods (for the decoder):

__init__(stream, errors='strict'): Like the current StreamReader constructor, but stream may be None, if only the chunk API is used. decode(input, final=False): Like the current StreamReader (i.e. it returns a (unicode, int) tuple.) This does not keep the remaining bytes in a buffer. This is the job of the caller. feed(input, final=False): Decodes input and returns a decoded unicode object. This method calls decode() internally and manages the byte buffer. read(size=-1, chars=-1, final=False): readline(size=-1): next(): __iter__(): See above.

As before implementers of decoders only need to implement decode().

To be able to support the final argument the decoding functions in _codecsmodule.c could get an additional argument. With this they could be used for the stateless codecs too and we can reduce the number of functions again.

Unfortunately adding the final argument breaks all of the current codecs, but dropping the final argument requires one of two changes: 1) When the input stream is exhausted, the bytes read are parsed as if final=True. That's the way the CJK codecs currently handle it, but unfortunately this doesn't work with the feed decoder. 2) Simply ignore any remaing undecoded bytes at the end of the stream.

If we really have to drop the final argument, I'd prefer 2).

I've uploaded a second version of the patch. It implements the final argument, adds the feed() method to StreamReader and again merges the duplicate decoding functions in the codecs module. Note that the patch isn't really finished (the final argument isn't completely supported in the encoders and the CJK and escape codecs are unchanged), but it should be sufficient as a base for discussion.

Bye, Walter Dörwald

_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/mal%40egenix.com

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 12 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

8:04 p.m.

M.-A. Lemburg wrote:

...

Hi Walter,

I don't have time to comment on this this week, I'll respond next week.

OK.

...

Overall, I don't like the idea of adding extra APIs breaking the existing codec API.

Adding a final argument that defaults to False doesn't break the API for the callers, only for the implementors. And if we drop the final argument, the API is completely backwards compatible both for users and implementors. The only thing that gets added is the feed() method, that implementors don't have to overwrite.

...

I believe that we can extend stream codecs to also work in a feed mode without breaking the API.

Abandoning the final argument and adding a feed() method would IMHO be the simplest way to do this. But then there's no way to make sure that every byte from the input stream is really consumed. Bye, Walter Dörwald

M.-A. Lemburg

10:25 p.m.

Walter Dörwald wrote:

...

I've thought about this some more. Perhaps I'm still missing something, but wouldn't it be possible to add a feeding mode to the existing stream codecs by creating a new queue data type (much like the queue you have in the test cases of your patch) and using the stream codecs on these ? I think such a queue would be generally useful in other contexts as well, e.g. for implementing fast character based pipes between threads, non-Unicode feeding parsers, etc. Using such a type you could potentially add a feeding mode to stream or file-object API based algorithms very easily. We could then have a new class, e.g. FeedReader, which wraps the above in a nice API, much like StreamReaderWriter and StreamRecoder. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 17 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

"Martin v. Löwis"

4:57 a.m.

M.-A. Lemburg wrote:

...

Here is the problem. In UTF-8, how does the actual algorithm tell (the application) that the bytes it got on decoding provide for three fully decodable characters, and that 2 bytes are left undecoded, and that those bytes are not inherently ill-formed, but lack a third byte to complete the multi-byte sequence? On top of that, you can implement whatever queuing or streaming APIs you want, but you *need* an efficient way to communicate incompleteness. Regards, Martin

M.-A. Lemburg

8:36 a.m.

Martin v. Löwis wrote:

...

This state can be stored in the stream codec instance, e.g. by using a special state object that is stored in the instance and passed to the encode/decode APIs of the codec or by implementing the stream codec itself in C. We do need to extend the API between the stream codec and the encode/decode functions, no doubt about that. However, this is an extension that is well hidden from the user of the codec and won't break code.

...

Agreed. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 18 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

8:12 p.m.

M.-A. Lemburg wrote:

...

That's exactly what my patch does. The state (the bytes that have already been read from the input stream, but couldn't be decoded and have to be used on the next call to read()) are stored in the bytebuffer attribute of the StreamReader. Most stateful decoder use this type of state, the only one I can think of that uses more than this is the UTF-7 decoder, where the decoder decodes partial +xxxx- sequences, but then has to keep the current shift state and the partially consumed bits and bytes. This decoder could be changed, so that it works with only a byte buffer too, but that would mean that the decoder doesn't enter incomplete +xxxx- sequences, but retains them in the byte buffer and only decodes them once the "-" is encountered. In fact a trivial implementation of any stateful decoder could put *everything* it reads into the bytebuffer when final==False and decode itin one go once read() is called with final==True. But IMHO each decoder should decode as much as possible.

...

Exactly: this shouldn't be anything officially documented, because what kind of data is passed around depends on the codec. And when the stream reader is implemented in C there isn't any API anyway.

...

Bye, Walter Dörwald

"Martin v. Löwis"

8:30 p.m.

...

So you agree to the part of Walter's change that introduces new C functions (PyUnicode_DecodeUTF7Stateful etc)? I think most of the patch can be discarded: there is no need for .encode and .decode to take an additional argument. It is only necessary that the StreamReader and StreamWriter are stateful, and that only for a selected subset of codecs. Marc-Andre, if the original patch (diff.txt) was applied: What *specific* change in that patch would break code? What *specific* code (C or Python) would break under that change? I believe the original patch can be applied as-is, and does not cause any breakage. It also introduces a change between the codec and the encode/decode functions that is well hidden from the user of the codec. Regards, Martin

Walter Dörwald

9:17 p.m.

Martin v. Löwis wrote:

...

But then a file that contains the two bytes 0x61, 0xc3 will never generate an error when read via an UTF-8 reader. The trailing 0xc3 will just be ignored. Another option we have would be to add a final() method to the StreamReader, that checks if all bytes have been consumed. Maybe this should be done by StreamReader.close()?

...

The first version has a broken implementation of the UTF-7 decoder. When decoding the byte sequence "+-" in two calls to decode() (i.e. pass "+" in one call and "-" in the next), no character got generated, because inShift (as a flag) couldn't remember whether characters where encountered between the "+" and the "-". Now inShift counts the number of characters (and the shortcut for a "+-" sequence appearing together has been removed.

...

Would a version of the patch without a final argument but with a feed() method be accepted? I'm imagining implementing an XML parser that uses Python's unicode machinery and supports the xml.sax.xmlreader.IncrementalParser interface. With a feed() method in the stream reader this is rather simple: init() { PyObject *reader = PyCodec_StreamReader(encoding, Py_None, NULL); self.reader = PyObject_CallObject(reader, NULL); } int feed(char *bytes) { parse(PyObject_CallMethod(self.reader, "feed", "s", bytes); } The feed method itself is rather simple (see the second version of the patch). Without the feed method(), we need the following: 1) A StreamQueue class that a) supports writing at one end and reading at the other end b) has a method for pushing back unused bytes to be returned in the next call to read() 2) A StreamQueueWrapper class that a) gets passed the StreamReader factory in the constructor, creates a StreamQueue instance, puts it into an attribute and passes it to the StreamReader factory (which must also be put into an attribute). b) has a feed() method that calls write() on the stream queue and read() on the stream reader and returns the result Then the C implementation of the parser looks something like this: init() { PyObject *module = PyImport_ImportModule("whatever"); PyObject *wclass = PyObject_GetAttr(module, "StreamQueueWrapper"); PyObject *reader = PyCodec_StreamReader(encoding, Py_None, NULL); self.wrapper = PyObject_CallObject(wclass, reader); } int feed(char *bytes) { parse(PyObject_CallMethod(self.wrapper, "feed", "s", bytes); } I find this neither easier to implement nor easier to explain. Bye, Walter Dörwald

"Martin v. Löwis"

9:57 p.m.

Walter Dörwald wrote:

...

But then a file that contains the two bytes 0x61, 0xc3 will never generate an error when read via an UTF-8 reader. The trailing 0xc3 will just be ignored.

Another option we have would be to add a final() method to the StreamReader, that checks if all bytes have been consumed.

Alternatively, we could add a .buffer() method that returns any data that are still pending (either a Unicode string or a byte string).

...

Maybe this should be done by StreamReader.close()?

No. There is nothing wrong with only reading a part of a file.

...

Now inShift counts the number of characters (and the shortcut for a "+-" sequence appearing together has been removed.

Ok. I didn't actually check the correctness of the individual methods. OTOH, I think time spent on UTF-7 is wasted, anyway.

...

Would a version of the patch without a final argument but with a feed() method be accepted?

I don't see the need for a feed method. .read() should just block until data are available, and that's it.

...

I'm imagining implementing an XML parser that uses Python's unicode machinery and supports the xml.sax.xmlreader.IncrementalParser interface.

I think this is out of scope of this patch. The incremental parser could implement a regular .read on a StringIO file that also supports .feed.

...

Without the feed method(), we need the following:

1) A StreamQueue class that

Why is that? I thought we are talking about "Decoding incomplete unicode"? Regards, Martin

Walter Dörwald

4:49 p.m.

Martin v. Löwis wrote:

...

Walter Dörwald wrote:

...
But then a file that contains the two bytes 0x61, 0xc3 will never generate an error when read via an UTF-8 reader. The trailing 0xc3 will just be ignored.

Another option we have would be to add a final() method to the StreamReader, that checks if all bytes have been consumed.

Alternatively, we could add a .buffer() method that returns any data that are still pending (either a Unicode string or a byte string).

Both approaches have one problem: Error handling won't work for them. If the error handling is "replace", the decoder should return U+FFFD for the final trailing incomplete sequence in read(). This won't happen when read() never reads those bytes from the input stream.

...

...
Maybe this should be done by StreamReader.close()?

No. There is nothing wrong with only reading a part of a file.

Yes, but if read() is called without arguments then everything from the input stream should be read and used.

...

...
Now inShift counts the number of characters (and the shortcut for a "+-" sequence appearing together has been removed.

Ok. I didn't actually check the correctness of the individual methods.

OTOH, I think time spent on UTF-7 is wasted, anyway.

;) But it's a good example of how complicated state management can get.

...

...
Would a version of the patch without a final argument but with a feed() method be accepted?

I don't see the need for a feed method. .read() should just block until data are available, and that's it.

There are situations where this can never work: Take a look at xml.sax.xmlreader.IncrementalParser. This interface has a feed() method which the user can call multiple times to pass byte string chunks to the XML parser. These chunks have to be decoded by the parser. Now if the parser wants to use Python's StreamReader it has to wrap the bytes passed to the feed() method into a stream interface. This looks something like the Queue class from the patch: class Queue(object): def __init__(self): self._buffer = "" def write(self, chars): self._buffer += chars def read(self, size=-1): if size<0: s = self._buffer self._buffer = "" return s else: s = self._buffer[:size] self._buffer = self._buffer[size:] return s The parser creates such an object and passes it to the StreamReader constructor. Now when feed() is called for the XML parser the parser calls queue.write(bytes) to put the bytes into the queue. Now the parser can call read() on the StreamReader (which in turn will read from the queue (on the other end)) to get decoded data. But this will fail when StreamReader.read() block until more data is available. This will never happen, because the data will be put in the queue explicitely by calls to the feed() method of the XML parser. Or take a look at sio.DecodingInputFilter. This should be an alternative implementation of reading a stream an decoding bytes to unicode. But the current implementation is broken because it uses the stateless API. But once we switch to the stateful API DecodingInputFilter becomes useless: DecodingInputFilter.read() just looks like this: def read(): return self.stream.read() (with stream being the stateful stream reader from codecs.getreader()), because DecodingInputFilter is forced to use the stream API of StreamReader.

...

...
I'm imagining implementing an XML parser that uses Python's unicode machinery and supports the xml.sax.xmlreader.IncrementalParser interface.

I think this is out of scope of this patch. The incremental parser could implement a regular .read on a StringIO file that also supports .feed.

This adds to much infrastructure, when the alternative implementation is trivial. Take a look at the first version of the patch. Implementing a feed() method just mean factoring the lines: data = self.bytebuffer + newdata object, decodedbytes = self.decode(data, self.errors) self.bytebuffer = data[decodedbytes:] into a separate method named feed(): def feed(newdata): data = self.bytebuffer + newdata object, decodedbytes = self.decode(data, self.errors) self.bytebuffer = data[decodedbytes:] return object So the feed functionality does already exist. It's just not in a usable form. A using StringIO wouldn't work because we need both a read and a write position.

...

...
Without the feed method(), we need the following:

1) A StreamQueue class that

Why is that? I thought we are talking about "Decoding incomplete unicode"?

Well, I had to choose a subject. ;) Bye, Walter Dörwald

"Martin v. Löwis"

7:05 p.m.

Walter Dörwald wrote:

...

Ok. So it really looks like a final flag on read is necessary.

...

Well, I had to choose a subject. ;)

I still would prefer if the two issues were discussed separately. Regards, Martin

Walter Dörwald

8:41 p.m.

Martin v. Löwis wrote:

...

OK, let's come up with a patch that fixes the incomplete byte sequences problem and then discuss non-stream APIs. So, what should the next step be? Bye, Walter Dörwald

"Martin v. Löwis"

5:13 a.m.

Walter Dörwald wrote:

...

I think your first patch should be taken as a basis for that. Add the state-supporting decoding C functions, and change the stream readers to use them. That still leaves the issue of the last read operation, which I'm tempted to leave unresolved for Python 2.4. No matter what the solution is, it would likely require changes to all codecs, which is not good. Regards, Martin

M.-A. Lemburg

9:36 a.m.

Martin v. Löwis wrote:

...

Walter Dörwald wrote:

...
OK, let's come up with a patch that fixes the incomplete byte sequences problem and then discuss non-stream APIs.

So, what should the next step be?

I think your first patch should be taken as a basis for that.

We do need a way to communicate state between the codec and Python. However, I don't like the way that the patch implements this state handling: I think we should use a generic "state" object here which is passed to the stateful codec and returned together with the standard return values on output: def decode_stateful(data, state=None): ... decode and modify state ... return (decoded_data, length_consumed, state) where the object type and contents of the state variable is defined per codec (e.g. could be a tuple, just a single integer or some other special object). Otherwise we'll end up having different interface signatures for all codecs and extending them to accomodate for future enhancement will become unfeasable without introducing yet another set of APIs. Let's discuss this some more and implement it for Python 2.5. For Python 2.4, I think we can get away with what we already have: If we leave out the UTF-7 codec changes in the patch, the only state that the UTF-8 and UTF-16 codecs create is the number of bytes consumed. We already have the required state parameter for this in the standard decode API, so no extra APIs are needed for these two codecs. So the patch boils down to adding a few new C APIs and using the consumed parameter in the standard _codecs module APIs instead of just defaulting to the input size (we don't need any new APIs in _codecs).

...

Add the state-supporting decoding C functions, and change the stream readers to use them.

The buffer logic should only be used for streams that do not support the interface to push back already read bytes (e.g. .unread()). From a design perspective, keeping read data inside the codec is the wrong thing to do, simply because it leaves the input stream in an undefined state in case of an error and there's no way to correlate the stream's read position to the location of the error. With a pushback method on the stream, all the stream data will be stored on the stream, not the codec, so the above would no longer be a problem. However, we can always add the .unread() support to the stream codecs at a later stage, so it's probably ok to default to the buffer logic for Python 2.4.

...

That still leaves the issue of the last read operation, which I'm tempted to leave unresolved for Python 2.4. No matter what the solution is, it would likely require changes to all codecs, which is not good.

We could have a method on the codec which checks whether the codec buffer or the stream still has pending data left. Using this method is an application scope consideration, not a codec issue. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 24 2004)

...

...
...
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

8:15 p.m.

M.-A. Lemburg wrote:

...

Another option might be that the decode function changes the state object in place.

...

If a tuple is passed and returned this makes it possible from Python code to mangle the state. IMHO this should be avoided if possible.

...

We already have slightly different decoding functions: utf_16_ex_decode() takes additional arguments.

...

OK, I've updated the patch.

...

On the other hand this requires special stream. Data already read is part of the codec state, so why not put it into the codec?

...

OK.

...

But this mean that the normal error handling can't be used for those trailing bytes. Bye, Walter Dörwald

"Martin v. Löwis"

5:54 a.m.

Walter Dörwald wrote:

...

Not necessarily. We are all consenting adults. So if the code checks whether the state is sane, from a typing point of view, i.e. if you can't crash the interpreter, then there is no need to hide the state from the user. Regards, Martin

M.-A. Lemburg

8:32 a.m.

Martin v. Löwis wrote:

...

Martin, there are two reasons for hiding away these details: 1. we need to be able to change the codec state without breaking the APIs 2. we don't want the state to be altered by the user A single object serves this best and does not create a whole plethora of new APIs in the _codecs module. This is not over-design, but serves a reason. Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 25 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

"Martin v. Löwis"

7:56 p.m.

M.-A. Lemburg wrote:

...

Martin, there are two reasons for hiding away these details:

1. we need to be able to change the codec state without breaking the APIs

That will be possible with the currently-proposed patch. The _codecs methods are not public API, so changing them would not be an API change.

...

2. we don't want the state to be altered by the user

We are all consenting adults, and we can't *really* prevent it, anyway. For example, the user may pass an old state, or a state originating from a different codec (instance). We need to support this gracefully (i.e. with a proper Python exception).

...

A single object serves this best and does not create a whole plethora of new APIs in the _codecs module. This is not over-design, but serves a reason.

It does put a burden on codec developers, which need to match the "official" state representation policy. Of course, if they are allowed to return a tuple representing their state, that would be fine with me. Regards, Martin

M.-A. Lemburg

3:28 p.m.

Martin v. Löwis wrote:

...

Uhm, I wasn't talking about the builtin codecs only (of course, we can change those to our liking). I'm after a generic interface for stateful codecs.

...

True, but the codec writer should be in control of the state object, its format and what the user can or cannot change.

...

They can use any object they like to keep the state in whatever format they choose. I think this makes it easier on the codec writer, rather than harder. Furthermore, they can change the way they store state e.g. to accomodate for new features they may want to add to the codec, without breaking the interface. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 26 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

8:19 p.m.

M.-A. Lemburg wrote:

...

But that interface is only between the StreamReader and any helper function that the codec implementer might want to use. If there ise no helper function there is no interface.

...

Yes, we should not dictate, how, why or if the codec implementer has to pass around any state. The only thing we have to dictate is that StreamReaders have to keep their state between calls to read().

...

That's basically the current state of the codec machinery, so we don't have to change anything in the specification. BTW, I hope that I can add updated documentation to the patch tomorrow (for PyUnicode_DecodeUTF8Stateful() and friends and for the additional arguments to read()), because I'll be on vacation the next week. Bye, Walter Dörwald

"Martin v. Löwis"

9 p.m.

M.-A. Lemburg wrote:

...

But we already have that! The StreamReader/StreamWriter interface is perfectly suited for stateful codecs, and we have been supporting it for several years now. The only problem is that a few builtin codecs failed to implement that correctly, and Walter's patch fixes that bug. There is no API change necessary whatsoever.

...

And indeed, with the current API, he is. Regards, Martin

Walter Dörwald

8:13 p.m.

Martin v. Löwis wrote:

...

Exactly.

...

The state communicated in the UTF-7 decoder is just a bunch of values. Checking the type is done via PyArg_ParseTuple().

...

Looking at the UTF-7 decoder this seems to be the simplest option. Bye, Walter Dörwald

Walter Dörwald

8:06 p.m.

Martin v. Löwis wrote:

...

OK, that's true. Unfortunately I'm not really that familiar with the bit twiddling in the UTF7 decoder to know it this could break anything or not. Bye, Walter Dörwald

M.-A. Lemburg

8:41 a.m.

Walter Dörwald wrote:

...

Good idea.

...

Right - it was a step in the wrong direction. Let's not use a different path for the future.

...

Ideally, the codec should not store data, but only reference it. It's better to keep things well separated which is why I think we need the .unread() interface and eventually a queue interface to support the feeding operation.

...

Right, but then: missing data (which usually causes the trailing bytes) is really something for the application to deal with, e.g. by requesting more data from the user, another application or trying to work around the problem in some way. I don't think that the codec error handler can practically cover these possibilities. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 25 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

8:10 p.m.

M.-A. Lemburg wrote:

...

But that's totally up to the implementor.

...

utf_16_ex_decode() serves a purpose: it help implement the UTF16 decoder, which has to switch to UTF-16-BE or UTF-16-LE according to the BOM, so utf_16_ex_decode() needs a way to comunicate that back to the caller.

...

I consider the remaining undecoded bytes to be part of the codec state once the have been read from the stream.

...

But in many cases the user might want to use "ignore" or "replace" error handling. Bye, Walter Dörwald

"Martin v. Löwis"

8:38 p.m.

M.-A. Lemburg wrote:

...

Why is that better? Practicality beats purity. This is useless over-generalization.

...

What is "a codec" here? A class implementing the StreamReader and/or Codec interface? Walter's patch does not change the API of any of these. It just adds a few functions to some module, which are not meant to be called directly.

...

Where precisely is the number of decoded bytes in the API? Regards, Martin

Walter Dörwald

8:04 p.m.

Martin v. Löwis wrote:

...

Walter Dörwald wrote:

...
OK, let's come up with a patch that fixes the incomplete byte sequences problem and then discuss non-stream APIs.

So, what should the next step be?

I think your first patch should be taken as a basis for that. Add the state-supporting decoding C functions, and change the stream readers to use them.

OK, I've updated the patch.

...

That still leaves the issue of the last read operation, which I'm tempted to leave unresolved for Python 2.4.

Agreed! This shouldn't be done for Python 2.4.

...

No matter what the solution is, it would likely require changes to all codecs, which is not good.

True, but the changes should be rather trivial for most. Bye, Walter Dörwald

M.-A. Lemburg

10:29 a.m.

Walter Dörwald wrote:

...

Right. It also needs a method giving the number of pending bytes in the queue or just an API .has_pending_data() that returns True/False. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 19 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Hye-Shik Chang

12:21 p.m.

On Thu, 19 Aug 2004 12:29:12 +0200, M.-A. Lemburg <mal@egenix.com> wrote:

...

+1 for adding .has_pending_data() stuff. But it'll need a way to flush pending data out for encodings where incomplete sequence not always invalid. <wink> This is true for JIS X 0213 encodings.

...

Hye-Shik

M.-A. Lemburg

12:25 p.m.

Hye-Shik Chang wrote:

...

I'm not sure I understand. The queue will also have an .unread() method (or similiar) to write data back into the queue at the reading head position. Are you suggesting that we add a .truncate() method to truncate the read buffer at the current position ? Since the queue will be in memory, we can also add .writeseek() and .readseek() if that helps. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 19 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

6:12 p.m.

M.-A. Lemburg wrote:

...

As already stated in a previous post, I don't think StreamQueue needs a pushback() method. trailing bytes should all be stored in the StreamReader.

...

What would this method be used for? Bye, Walter Dörwald

M.-A. Lemburg

8:03 p.m.

Walter Dörwald wrote:

...

I'd leave that to the StreamReader implementor. I would always push the data back onto the queue, simply because it's unprocessed data.

...

If you push the data back onto the queue, you will probably want to check whether there's pending data left. That's what this method is intended to tell you. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 19 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

M.-A. Lemburg

9:41 a.m.

Martin v. Löwis wrote:

...

We'll have to see... all this is still very much brainstorming.

...

Let's flesh this out some more and get a better understanding of what is needed and how the separation between the stream queue, the stream codec and the underlying codec implementation can be put to good use. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 19 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

"Martin v. Löwis"

6:54 p.m.

M.-A. Lemburg wrote:

...

That really didn't answer the question: What would be technically wrong with accepting Walter's patches? I smell over-designing: there is a specific problem at hand (incomplete reads resulting in partial decoding); any solution should attempt to *only* solve this problem, not any other problem. Regards, Martin

M.-A. Lemburg

7:59 p.m.

Martin v. Löwis wrote:

...

The specific problem is that of providing a codec that can run in feeding mode where you can feed in data and read it in a way that allows incomplete input data to fail over nicely. Since this requires two head positions (one for the writer, one for the reader), a queue implementation is the right thing to use. We are having this discussion to find a suitable design to provide this functionality in a nice and clean way. I don't see anything wrong with this. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 19 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

"Martin v. Löwis"

8:42 p.m.

M.-A. Lemburg wrote:

...

The problem is that this has nothing to do with decoding incomplete unicode. Regards, Martin

Walter Dörwald

8:11 p.m.

Martin v. Löwis wrote:

...

We already have an efficient way to communicate incompleteness: the decode method returns the number of decoded bytes. The questions remaining are 1) communicate to whom? IMHO the info should only be used internally by the StreamReader. 2) When is incompleteness OK? Incompleteness is of course not OK in the stateless API. For the stateful API, incompleteness has to be OK even when the input stream is (temporarily) exhausted, because otherwise a feed mode wouldn't work anyway. But then incompleteness is always OK, because the StreamReader can't distinguish a temporarily exhausted input stream from a permanently exhausted one. The only fix for this I can think of is the final argument. Bye, Walter Dörwald

"Martin v. Löwis"

8:39 p.m.

Walter Dörwald wrote:

...

I don't think the final argument is needed. Methinks that the .encode/.decode should not be used by an application. Instead, applications should only use the file API on a reader/writer. If so, stateful readers/writers can safely implement encode/decode to take whatever state they have into account, creating new state as they see fit. Of course, stateful writers need to implement their own .close function, which flushes the remaining bytes, and need to make sure that .close is automatically invoked if the object goes away. Regards, Martin

M.-A. Lemburg

9:34 a.m.

Walter Dörwald wrote:

...

Handling incompleteness should be something for the codec to deal with. The queue doesn't have to know about it in an way. However, the queue should have interfaces allowing the codec to tell whether there are more bytes waiting to be processed.

...

A final argument may be the way to go. But it should be an argument for the .read() method (not only the .decode() method) since that's the method reading the data from the queue. I'd suggest that we extend the existing encode and decode codec APIs to take an extra state argument that holds the codec state in whatever format the codec needs (e.g. this could be a tuple or a special object): encode(data, errors='strict', state=None) decode(data, errors='strict', state=None) In the case of the .read() method, decode() would be called. If the returned length_consumed does not match the length of the data input, the remaining items would have to be placed back onto the queue in non-final mode. In final mode an exception would be raised to signal the problem. I think it's PEP time for this new extension. If time permits I'll craft an initial version over the weekend. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 19 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

5:34 p.m.

M.-A. Lemburg wrote:

...

Walter Dörwald wrote:

...
Martin v. Löwis wrote:

[...] We already have an efficient way to communicate incompleteness: the decode method returns the number of decoded bytes.

The questions remaining are

1) communicate to whom? IMHO the info should only be used internally by the StreamReader.

Handling incompleteness should be something for the codec to deal with.

Absolutely. This means that decode() should not be called by the user. (But the implementation of read() (and feed(), if we have it) calls it.)

...

The queue doesn't have to know about it in an way. However, the queue should have interfaces allowing the codec to tell whether there are more bytes waiting to be processed.

This won't work when the byte stream wrapped by the StreamReader is not a queue. (Or do you want the wrap the byte stream in a queue? This would be three wrapping layers.) And the information is not really useful, because it might change (e.g. when the user puts additional data into the queue/stream.)

...

...
2) When is incompleteness OK? Incompleteness is of course not OK in the stateless API. For the stateful API, incompleteness has to be OK even when the input stream is (temporarily) exhausted, because otherwise a feed mode wouldn't work anyway. But then incompleteness is always OK, because the StreamReader can't distinguish a temporarily exhausted input stream from a permanently exhausted one. The only fix for this I can think of is the final argument.

A final argument may be the way to go. But it should be an argument for the .read() method (not only the .decode() method) since that's the method reading the data from the queue.

Yes. E.g. the low level charmap decode function doesn't need the final argument, because there is zero state to be kept between calls.

...

I'd suggest that we extend the existing encode and decode codec APIs to take an extra state argument that holds the codec state in whatever format the codec needs (e.g. this could be a tuple or a special object):

encode(data, errors='strict', state=None) decode(data, errors='strict', state=None)

We don't need a specification for that. The stateless API doesn't need an explicit state (the state is just a bunch of variables at the C level) and for the stateful API the state can be put into StreamReader attributes. How this state looks is totally up to the StreamReader itself (see the UTF-7 reader in the patch for an example). If the stream reader passes on this state to a low level decoding function implemented in C, how this state info looks is again totally up to the codec. So I think we don't have to specify anything in this area.

...

In the case of the .read() method, decode() would be called. If the returned length_consumed does not match the length of the data input, the remaining items would have to be placed back onto the queue in non-final mode. In final mode an exception would be raised to signal the problem.

Yes, in non-final mode the bytes would have to be retained and in final mode an exception is raised (except when the error handling callback does something else). But I don't think we should put a queue between the byte stream and the StreamReader (at least not in the sense of a queue as another file like object). The remaining items can be kept in an attribute of the StreamReader instance, that's what --- data = self.bytebuffer + newdata object, decodedbytes = self.decode(data, self.errors) self.bytebuffer = data[decodedbytes:] --- does in the patch. The first line combines the items retained from the last call with those read from the stream (or passed to the feed method). The second line does semi-stateful decoding of those bytes. The third line puts the new remaining items back into the buffer. The decoding is "semi-stateful", because the info about the remaining bytes is not stored by decode itself, but by the caller of decode. feed() is the method that does fully stateful decoding of byte chunks.

...

I think it's PEP time for this new extension. If time permits I'll craft an initial version over the weekend.

I'm looking forward to the results. Bye, Walter Dörwald

Walter Dörwald

7:46 p.m.

M.-A. Lemburg wrote:

...

No, because when the decode method encounters an incomplete chunk (and so return a size that is smaller then size of the input) read() would have to push the remaining bytes back into the queue. This would be code similar in functionality to the feed() method from the patch, with the difference that the buffer lives in the queue not the StreamReader. So we won't gain any code simplification by going this route.

...

Yes, so we could put this Queue class into a module with string utilities. Maybe string.py?

...

But why should we, when decode() does most of what we need, and the rest has to be implemented in both versions? Bye, Walter Dörwald

M.-A. Lemburg

8:07 p.m.

Walter Dörwald wrote:

...

Maybe not code simplification, but the APIs will be well- separated. If we require the queue type for feeding mode operation we are free to define whatever APIs are needed to communicate between the codec and the queue type, e.g. we could define a method that pushes a few bytes back onto the queue end (much like ungetc() in C).

...

Hmm, I think a separate module would be better since we could then recode the implementation in C at some point (and after the API has settled). We'd only need a new name for it, e.g. StreamQueue or something.

...

To hide the details from the user. It should be possible to instantiate one of these StreamQueueReaders (named after the queue type) and simply use it in feeding mode without having to bother about the details behind the implementation. StreamReaderWriter and StreamRecoder exist for the same reason. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 18 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

8:35 p.m.

M.-A. Lemburg wrote:

...

They will not, because StreamReader.decode() already is a feed style API (but with state amnesia). Any stream decoder that I can think of can be (and most are) implemented by overwriting decode().

...

That would of course be a possibility.

...

Sounds reasonable.

...

Let's compare example uses: 1) Having feed() as part of the StreamReader API: --- s = u"???".encode("utf-8") r = codecs.getreader("utf-8")() for c in s: print r.feed(c) --- 2) Explicitely using a queue object: --- from whatever import StreamQueue s = u"???".encode("utf-8") q = StreamQueue() r = codecs.getreader("utf-8")(q) for c in s: q.write(c) print r.read() --- 3) Using a special wrapper that implicitely creates a queue: ---- from whatever import StreamQueueWrapper s = u"???".encode("utf-8") r = StreamQueueWrapper(codecs.getreader("utf-8")) for c in s: print r.feed(c) ---- I very much prefer option 1). "If the implementation is hard to explain, it's a bad idea." Bye, Walter Dörwald

"Martin v. Löwis"

8:51 p.m.

Walter Dörwald wrote:

...

I consider that an unfortunate implementation artefact. You either use the stateless encode/decode that you get from codecs.get(encoder/decoder) or you use the file API on the streams. You never ever use encode/decode on streams. I would have preferred if the default .write implementation would have called self._internal_encode, and the Writer would *contain* a Codec, rather than inheriting from Codec. Alas, for (I guess) simplicity, a more direct (and more confusing) approach was taken.

...

Isn't that a totally unrelated issue? Aren't we talking about short reads on sockets etc? I would very much prefer to solve one problem at a time. Regards, Martin

Walter Dörwald

3:45 p.m.

Martin v. Löwis wrote:

...

Walter Dörwald wrote:

...
They will not, because StreamReader.decode() already is a feed style API (but with state amnesia).

Any stream decoder that I can think of can be (and most are) implemented by overwriting decode().

I consider that an unfortunate implementation artefact. You either use the stateless encode/decode that you get from codecs.get(encoder/decoder) or you use the file API on the streams. You never ever use encode/decode on streams.

That is exactly the problem with the current API. StreamReader mixes two concepts: 1) The stateful API, which allows decoding a byte input in chunk and the state of the decoder is kept between calls. 2) A file API where the chunks to be decoded are read from a byte stream.

...

I would have preferred if the default .write implementation would have called self._internal_encode, and the Writer would *contain* a Codec, rather than inheriting from Codec.

This would separate the two concepts from above.

...

Alas, for (I guess) simplicity, a more direct (and more confusing) approach was taken.

...
1) Having feed() as part of the StreamReader API: --- s = u"???".encode("utf-8") r = codecs.getreader("utf-8")() for c in s: print r.feed(c)

Isn't that a totally unrelated issue? Aren't we talking about short reads on sockets etc?

We're talking about two problems: 1) The current implementation does not really support the stateful API, because trailing incomplete byte sequences lead to errors. 2) The current file API is not really convenient for decoding when the input is not read for a stream.

...

I would very much prefer to solve one problem at a time.

Bye, Walter Dörwald

M.-A. Lemburg

4:06 p.m.

Walter Dörwald wrote:

...

Note that StreamCodec only inherits from Codec for convenience reasons (you can define a StreamCodec using the stateless .encode() and .decode() methods you get from Codec) and for logical reasons: a StreamCodec happens to be a Codec as well, so isinstance(obj, Codec) should be true for a StreamCodec as well. There's nothing preventing you from overriding .encode() and .decode() in a StreamReader or adding new methods that implement a different approach to encode and decode. Users should always use the file API of StreamReader et al., not the .encode() and .decode() methods. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 19 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

"Martin v. Löwis"

6:58 p.m.

Walter Dörwald wrote:

...

Correct.

...

2) The current file API is not really convenient for decoding when the input is not read for a stream.

I don't see this problem. It is straight-forward to come up with a file-like object that converts the data from whatever source into a stream. If it wasn't a byte stream/string eventually, there would be no way to meaningfully decode it. Regards, Martin

M.-A. Lemburg

10:15 a.m.

Walter Dörwald wrote:

...

Let's compare example uses:

1) Having feed() as part of the StreamReader API: --- s = u"???".encode("utf-8") r = codecs.getreader("utf-8")() for c in s: print r.feed(c) ---

I consider adding a .feed() method to the stream codec bad design. .feed() is something you do on a stream, not a codec.

...

2) Explicitely using a queue object: --- from whatever import StreamQueue

s = u"???".encode("utf-8") q = StreamQueue() r = codecs.getreader("utf-8")(q) for c in s: q.write(c) print r.read() ---

This is probably how an advanced codec writer would use the APIs to build new stream interfaces.

...

3) Using a special wrapper that implicitely creates a queue: ---- from whatever import StreamQueueWrapper s = u"???".encode("utf-8") r = StreamQueueWrapper(codecs.getreader("utf-8")) for c in s: print r.feed(c) ----

This could be turned into something more straight forward, e.g. from codecs import EncodedStream # Load data s = u"???".encode("utf-8") # Write to encoded stream (one byte at a time) and print # the read output q = EncodedStream(input_encoding="utf-8", output_encoding="unicode") for c in s: q.write(c) print q.read() # Make sure we have processed all data: if q.has_pending_data(): raise ValueError, 'data truncated'

...

I very much prefer option 1).

I prefer the above example because it's easy to read and makes things explicit.

...

"If the implementation is hard to explain, it's a bad idea."

The user usually doesn't care about the implementation, only it's interfaces. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 19 2004)

...

...
...
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

6:09 p.m.

M.-A. Lemburg wrote:

...

I don't care about the name, we can call it stateful_decode_byte_chunk() or whatever. (In fact I'd prefer to call it decode(), but that name is already taken by another method. Of course we could always rename decode() to _internal_decode() like Martin suggested.)

...

This is confusing, because there is no encoding named "unicode". This should probably read: q = EncodedQueue(encoding="utf-8", errors="strict")

...

This should be the job of the error callback, the last part should probably be: for c in s: q.write(c) print q.read() print q.read(final=True)

...

Bye, Walter Dörwald

M.-A. Lemburg

8:09 p.m.

Walter Dörwald wrote:

...

It's not that name that doesn't fit, it's the fact that you are mixing a stream action into a codec which I'd rather see well separated.

...

Fine. I was thinking of something similar to EncodedFile() which also has two separate encodings, one for the file side of things and one for the Python side.

...

Ok; both methods have their use cases. (You seem to be obsessed with this final argument ;-)

...

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 19 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

July 2004

8:36 a.m.

Martin v. Löwis wrote:

...

We would need a special attribute in the exception for that, but the problem IMHO is a different one. This makes it impossible to use other error handling schemes than "break" in stateful decoders. Bye, Walter Dörwald

M.-A. Lemburg

9:16 a.m.

Walter Dörwald wrote:

...

I don't understand... are you referring to some extra attribute for storing arbitrary state ? If so, why would adding such an attribute make it impossible to use other error handling schemes ? The problem with your patch is that you are adding a whole new set of decoders to the core which duplicate much of what the already existing decoders implement. I don't like that duplication and would like to find a way to only have *one* implementation per decode operation. Of course, encoders would have to provide the same interfaces for symmetry reasons. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 28 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

10:07 a.m.

M.-A. Lemburg wrote:

...

The position of the error is not sufficient to determine whether it is a truncated data error or a real one: both r"a\xf".decode("unicode-escape") and r"a\xfx".decode("unicode-escape") raise a UnicodeDecodeException with exc.end == len(exc.object), i.e. the error is at the end of the input. But in the first case the error will go away once more data is available, but in the second case it won't.

...

It doesn't, but it would make it possible for the callback to distinguish transient errors from real ones.

...

I don't like the duplication either. In fact we might need decoders that pass state, but do complain about truncated data at the end of the stream. I think it's possible to find other solutions. I would prefer stateful decoders implemented in C. But I hope you agree that this is a problem that should be fixed.

...

There are no encoders that have to keep state, except for UTF-16. Bye, Walter Dörwald

Josiah Carlson

1:26 a.m.

...

- readline() and readlines() have an additional option for dropping the u"\n".

I'll point out (since no one else has so far) that the reasons for keeping the linefeed at the end of lines returned by readline() and readlines() are documented here: http://docs.python.org/lib/bltin-file-objects.html#foot3271 Specifically it allows one to use the following and have it "do the right thing". while 1: line = obj.readline() if not line: break process(line) Having the option of readline() and readlines() being ambiguous anywhere sounds like a misfeature. Furthermore, since all other readline and readlines calls that do not inherit from StreamReader use the unambiguous "always include the line ending", changing StreamReader to be different is obviously the wrong thing to do. - Josiah

Bob Ippolito

2:20 a.m.

On Jul 27, 2004, at 9:26 PM, Josiah Carlson wrote:

...

While this reasoning makes sense for readline(), it most definitely does not for readlines() or __iter__(). unicode objects have a splitlines() function with this feature, which is probably what he's using in his implementation (I used it in mine), so it's pretty natural to expose that option to the external interface. -bob

Walter Dörwald

9:24 a.m.

Bob Ippolito wrote:

...

...
[...] Having the option of readline() and readlines() being ambiguous anywhere sounds like a misfeature. Furthermore, since all other readline and readlines calls that do not inherit from StreamReader use the unambiguous "always include the line ending", changing StreamReader to be different is obviously the wrong thing to do.

It isn't: the default is still keepends=True. (In fact changing it breaks the PEP 263 functionality).

...

While this reasoning makes sense for readline(), it most definitely does not for readlines() or __iter__(). unicode objects have a splitlines()

That's exactly where I got the idea from. And it frees the user from having to deal with the various possible line endings (\r, \n, \r\n, \u2028). But note that the patch doesn't implement this yet, it only breaks lines at \n.

...

function with this feature, which is probably what he's using in his implementation (I used it in mine), so it's pretty natural to expose that option to the external interface.

Bye, Walter Dörwald

Hye-Shik Chang

3:51 a.m.

On Tue, 27 Jul 2004 22:39:45 +0200, Walter Dörwald <walter@livinglogic.de> wrote:

...

Pythons unicode machinery currently has problems when decoding incomplete input.

When codecs.StreamReader.read() encounters a decoding error it reads more bytes from the input stream and retries decoding. This is broken for two reasons: 1) The error might be due to a malformed byte sequence in the input, a problem that can't be fixed by reading more bytes. 2) There may be no more bytes available at this time. Once more data is available decoding can't continue because bytes from the input stream have already been read and thrown away. (sio.DecodingInputFilter has the same problems)

StreamReaders and -Writers from CJK codecs are not suffering from this problems because they have internal buffer for keeping states and incomplete bytes of a sequence. In fact, CJK codecs has its own implementation for UTF-8 and UTF-16 on base of its multibytecodec system. It provides a "working" StreamReader/Writer already. :)

...

I've uploaded a patch that fixes these problems to SF: http://www.python.org/sf/998993

The patch implements a few additional features: - read() has an additional argument chars that can be used to specify the number of characters that should be returned. - readline() is supported on all readers derived from codecs.StreamReader().

I have no comment for these, yet.

...

- readline() and readlines() have an additional option for dropping the u"\n".

+1 I wonder whether we need to add optional argument for writelines() to add newline characters for each lines, then. Hye-Shik

Walter Dörwald

9:38 a.m.

Hye-Shik Chang wrote:

...

Seems you had the same problems with the builtin stream readers! ;) BTW, how do you solve the problem that incomplete byte sequences are retained in the middle of a stream, but should generate errors at the end?

...

This would probably be a nice convenient additional feature, but of course you could always pass a GE to writelines(): stream.writelines(line+u"\n" for line in lines) Bye, Walter Dörwald

Hye-Shik Chang

12:46 p.m.

On Wed, 28 Jul 2004 11:38:16 +0200, Walter Dörwald <walter@livinglogic.de> wrote:

...

Rough pseudo code here: (it's written in C in CJKCodecs) class StreamReader: pending = '' # incomplete def read(self, size=-1): while True: r = fp.read(size) if self.pending: r = self.pending + r self.pending = '' if r: try: outputbuffer = r.decode('utf-8') except MBERR_TOOFEW: # incomplete multibyte sequence pass except MBERR_ILLSEQ: # illegal sequence raise UnicodeDecodeError, "illseq" if not r or size == -1: # end of the stream if r have not consumed up for the output: raise UnicodeDecodeError, "toofew" if r have not consumed up for the output: self.pending = remainders of r if (size == -1 or # one time read up len(outputbuffer) > 0 or # output buffer isn't empty original length of r == 0): # the end of the stream break size = 1 # read 1 byte in next try return outputbuffer CJKcodecs' multibytecodec structure has distinguished internal error codes for "illegal sequence" and "incomplete sequence". And each internal codecs receive a flag that indicates if immediate flush is needed at time. (for the end of streams and simple decode functions) Regards, Hye-Shik

Walter Dörwald

3:13 p.m.

Hye-Shik Chang wrote:

...

Here's the problem: I'd like the streamreader to be able to continue even when there is no input available *now*. Perhaps there should be an additional argument to read() named final? If final is true, the stream reader makes sure that all pending bytes have been used up.

...

Bye, Walter Dörwald

M.-A. Lemburg

July 2004

8:59 p.m.

Walter Dörwald wrote:

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

"Martin v. Löwis"

9:15 p.m.

M.-A. Lemburg wrote:

...

Can you demonstrate this approach in a patch? I think it is unimplementable: the codec cannot communicate to the error callback that it ran out of data. Regards, Martin

M.-A. Lemburg

9:43 p.m.

Martin v. Löwis wrote:

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

9:19 a.m.

M.-A. Lemburg wrote:

...

Martin v. Löwis wrote:

...
M.-A. Lemburg wrote:

...
I like the idea, but don't think the implementation is the right way to do it. Instead, I'd suggest using a new error handling strategy "break" ( = break processing as soon as errors are found).

Can you demonstrate this approach in a patch? I think it is unimplementable: the codec cannot communicate to the error callback that it ran out of data.

The codec can: the callback gets all the necessary information and can even manipulate the objects being worked on.

But you have a point: the current implementations of the various encode/decode functions don't provide interfaces to report back the number of bytes read at C level - the codecs module wrappers add these numbers assuming that all bytes were read.

...

The error callbacks could, however, raise an exception which includes all the needed information, including any state that may be needed in order to continue with coding operation.

This makes error callbacks effectively unusable with stateful decoders.

...

We may then need to allow additional keyword arguments on the encode/decode functions in order to preset a start state.

M.-A. Lemburg

9:42 a.m.

Walter Dörwald wrote:

...

Why make things more complicated ? If you absolutely need a feed interface, you can feed your data to a StringIO instance which is then read from by StreamReader.

...

Could you explain ?

...

See above: possibility 1 would be the way to go then. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 28 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

4:55 p.m.

M.-A. Lemburg wrote:

...

Which in most cases is the read method.

...

But I'd like to reuse at least some of the functionality from PyUnicode_DecodeUTF8() etc. Would a version of PyUnicode_DecodeUTF8() with an additional PyUTF_DecoderState * be OK?

...

This doesn't work, because a StringIO has only one file position:

...

But something like the Queue class from the tests in the patch might work.

...

If you have to call the decode function with errors='break', you will only get the break error handling and nothing else.

...

I might give this a try. Bye, Walter Dörwald

M.-A. Lemburg

July 2004

6:31 p.m.

Walter Dörwald wrote:

...

M.-A. Lemburg wrote:

...
Walter Dörwald wrote:

...
This is the correct thing to do for the stateless decoders: any incomplete byte sequence at the end of the input is an error. But then it doesn't make sense to return the number of bytes decoded for the stateless decoder, because this is always the size of the input.

The reason why stateless encode and decode APIs return the number of input items consumed is to accomodate for error handling situations like these where you want to stop coding and leave the remaining work to another step.

Which in most cases is the read method.

The read method only happens to use the stateless encode and decode methods. There nothing in the design spec that mandates this, though.

...

...
The C implementation currently doesn't make use of this feature.

...
For the stateful decoder this is just some sort of state common to all decoders: the decoder keeps the incomplete byte sequence to be used in the next call. But then this state should be internal to the decoder and not part of the public API. This would make the decode() method more usable: When you want to implement an XML parser that supports the xml.sax.xmlreader.IncrementalParser interface, you have an API mismatch. The parser has to use the stateful decoding API (i.e. read()), which means the input is in the form of a byte stream, but this interface expects it's input as byte chunks passed to multiple calls to the feed() method. If StreamReader.decode() simply returned the decoded unicode object and keep the remaining undecoded bytes as an internal state then StreamReader.decode() would be directly usable.

Please don't mix "StreamReader" with "decoder". The codecs module returns 4 different objects if you ask it for a codec set: two stateless APIs for encoding and decoding and two factory functions for creating possibly stateful objects which expose a stream interface.

Your "stateful decoder" is actually part of a StreamReader implementation and doesn't have anything to do with the stateless decoder.

I know. I'd just like to have a stateful decoder that doesn't use a stream interface. The stream interface could be built on top of that without any knowlegde of the encoding.

I wonder whether the decode method is part of the public API for StreamReader.

It is: StreamReader/Writer are "sub-classes" of the Codec class. However, there's nothing stating that .read() or .write() *must* use these methods to do their work and that's intentional.

...

...
I see two possibilities here:

1. you write a custom StreamReader/Writer implementation for each of the codecs which takes care of keeping state and encoding/decoding as much as possible

But I'd like to reuse at least some of the functionality from PyUnicode_DecodeUTF8() etc.

Would a version of PyUnicode_DecodeUTF8() with an additional PyUTF_DecoderState * be OK?

Before you start putting more work into this, let's first find a good workable approach.

...

...
2. you extend the existing stateless codec implementations to allow communicating state on input and output; the stateless operation would then be a special case

...
But this isn't really a "StreamReader" any more, so the best solution would probably be to have a three level API: * A stateless decoding function (what codecs.getdecoder returns now); * A stateful "feed reader", which keeps internal state (including undecoded byte sequences) and gets passed byte chunks (should this feed reader have a error attribute or should this be an argument to the feed method?); * A stateful stream reader that reads its input from a byte stream. The functionality for the stream reader could be implemented once using the underlying functionality of the feed reader (in fact we could implement something similar to sio's stacking streams: the stream reader would use the feed reader to wrap the byte input stream and add only a read() method. The line reading methods (readline(), readlines() and __iter__() could be added by another stream filter)

Why make things more complicated ?

If you absolutely need a feed interface, you can feed your data to a StringIO instance which is then read from by StreamReader.

This doesn't work, because a StringIO has only one file position:

...
...
...
import cStringIO s = cStringIO.StringIO() s.write("x") s.read() ''

Ah, you wanted to do both feeding and reading at the same time ?!

...

But something like the Queue class from the tests in the patch might work.

Right... I don't think that we need a third approach to codecs just to implement feed based parsers.

...

...
...
...
The error callbacks could, however, raise an exception which includes all the needed information, including any state that may be needed in order to continue with coding operation.

This makes error callbacks effectively unusable with stateful decoders.

Could you explain ?

If you have to call the decode function with errors='break', you will only get the break error handling and nothing else.

Yes and ... ? What else do you want it to do ?

...

...
...
...
We may then need to allow additional keyword arguments on the encode/decode functions in order to preset a start state.

As those decoding functions are private to the decoder that's probably OK. I wouldn't want to see additional keyword arguments on str.decode (which uses the stateless API anyway). BTW, that's exactly what I did for codecs.utf_7_decode_stateful, but I'm not really comfortable with the internal state of the UTF-7 decoder being exposed on the Python level. It would be better to encapsulate the state in a feed reader implemented in C, so that the state is inaccessible from the Python level.

See above: possibility 1 would be the way to go then.

I might give this a try.

Again, please wait until we have found a good solution to this. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 28 2004)

...

...
...
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

8:30 p.m.

M.-A. Lemburg wrote:

...

Any read() method can be implemented on top of a stateful decode() method.

...

We already have most of the functionality in the decode method.

...

OK. Bye, Walter Dörwald

Walter Dörwald

August 2004

7:24 p.m.

M.-A. Lemburg

1:15 p.m.

...

OK, here a my current thoughts on the codec problem:

The optimal solution (ignoring backwards compatibility) would look like this: codecs.lookup() would return the following stuff (this could be done by replacing the 4 entry tuple with a real object):

* decode: The stateless decoding function * encode: The stateless encocing function * chunkdecoder: The stateful chunk decoder * chunkencoder: The stateful chunk encoder * streamreader: The stateful stream decoder * streamwriter: The stateful stream encoder

The functions and classes look like this:

Stateless decoder: decode(input, errors='strict'): Function that decodes the (str) input object and returns a (unicode) output object. The decoder must decode the complete input without any remaining undecoded bytes.

Stateless encoder: encode(input, errors='strict'): Function that encodes the complete (unicode) input object and returns a (str) output object.

Stateful chunk decoder: chunkdecoder(errors='strict'): A factory function that returns a stateful decoder with the following method:

decode(input, final=False): Decodes a chunk of input and return the decoded unicode object. This method can be called multiple times and the state of the decoder will be kept between calls. This includes trailing incomplete byte sequences that will be retained until the next call to decode(). When the argument final is true, this is the last call to decode() and trailing incomplete byte sequences will not be retained, but a UnicodeError will be raised.

Stateful chunk encoder: chunkencoder(errors='strict'): A factory function that returns a stateful encoder with the following method: encode(input, final=False): Encodes a chunk of input and returns the encoded str object. When final is true this is the last call to encode().

Stateful stream decoder: streamreader(stream, errors='strict'): A factory function that returns a stateful decoder for reading from the byte stream stream, with the following methods:

read(size=-1, chars=-1, final=False): Read unicode characters from the stream. When data is read from the stream it should be done in chunks of size bytes. If size == -1 all the remaining data from the stream is read. chars specifies the number of characters to read from the stream. read() may return less then chars characters if there's not enough data available in the byte stream. If chars == -1 as much characters are read as are available in the stream. Transient errors are ignored and trailing incomplete byte sequence are retained when final is false. Otherwise a UnicodeError is raised in the case of incomplete byte sequences. readline(size=-1): ... next(): ... __iter__(): ...

Stateful stream encoder: streamwriter(stream, errors='strict'): A factory function that returns a stateful encoder for writing unicode data to the byte stream stream, with the following methods:

write(data, final=False): Encodes the unicode object data and writes it to the stream. If final is true this is the last call to write(). writelines(data): ...

I know that this is quite a departure from the current API, and I'm not sure if we can get all of the functionality without sacrificing backwards compatibility.

I don't particularly care about the "bytes consumed" return value from the stateless codec. The codec should always have returned only the encoded/decoded object, but I guess fixing this would break too much code. And users who are only interested in the stateless functionality will probably use unicode.encode/str.decode anyway.

For the stateful API it would be possible to combine the chunk and stream decoder/encode into one class with the following methods (for the decoder):

__init__(stream, errors='strict'): Like the current StreamReader constructor, but stream may be None, if only the chunk API is used. decode(input, final=False): Like the current StreamReader (i.e. it returns a (unicode, int) tuple.) This does not keep the remaining bytes in a buffer. This is the job of the caller. feed(input, final=False): Decodes input and returns a decoded unicode object. This method calls decode() internally and manages the byte buffer. read(size=-1, chars=-1, final=False): readline(size=-1): next(): __iter__(): See above.

As before implementers of decoders only need to implement decode().

To be able to support the final argument the decoding functions in _codecsmodule.c could get an additional argument. With this they could be used for the stateless codecs too and we can reduce the number of functions again.

Unfortunately adding the final argument breaks all of the current codecs, but dropping the final argument requires one of two changes: 1) When the input stream is exhausted, the bytes read are parsed as if final=True. That's the way the CJK codecs currently handle it, but unfortunately this doesn't work with the feed decoder. 2) Simply ignore any remaing undecoded bytes at the end of the stream.

If we really have to drop the final argument, I'd prefer 2).

I've uploaded a second version of the patch. It implements the final argument, adds the feed() method to StreamReader and again merges the duplicate decoding functions in the codecs module. Note that the patch isn't really finished (the final argument isn't completely supported in the encoders and the CJK and escape codecs are unchanged), but it should be sufficient as a base for discussion.

Bye, Walter Dörwald

_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/mal%40egenix.com

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 12 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

8:04 p.m.

M.-A. Lemburg wrote:

...

Hi Walter,

I don't have time to comment on this this week, I'll respond next week.

OK.

...

Overall, I don't like the idea of adding extra APIs breaking the existing codec API.

...

I believe that we can extend stream codecs to also work in a feed mode without breaking the API.

M.-A. Lemburg

10:25 p.m.

Walter Dörwald wrote:

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

"Martin v. Löwis"

August 2004

4:57 a.m.

M.-A. Lemburg wrote:

...

M.-A. Lemburg

8:36 a.m.

Martin v. Löwis wrote:

...

Agreed. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 18 2004)

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

8:12 p.m.

M.-A. Lemburg wrote:

...

Exactly: this shouldn't be anything officially documented, because what kind of data is passed around depends on the codec. And when the stream reader is implemented in C there isn't any API anyway.

...

Bye, Walter Dörwald

"Martin v. Löwis"

8:30 p.m.

...

Walter Dörwald

9:17 p.m.

Martin v. Löwis wrote:

...

"Martin v. Löwis"

9:57 p.m.

Walter Dörwald wrote:

...

But then a file that contains the two bytes 0x61, 0xc3 will never generate an error when read via an UTF-8 reader. The trailing 0xc3 will just be ignored.

Another option we have would be to add a final() method to the StreamReader, that checks if all bytes have been consumed.

Alternatively, we could add a .buffer() method that returns any data that are still pending (either a Unicode string or a byte string).

...

Maybe this should be done by StreamReader.close()?

No. There is nothing wrong with only reading a part of a file.

...

Now inShift counts the number of characters (and the shortcut for a "+-" sequence appearing together has been removed.

Ok. I didn't actually check the correctness of the individual methods. OTOH, I think time spent on UTF-7 is wasted, anyway.

...

Would a version of the patch without a final argument but with a feed() method be accepted?

I don't see the need for a feed method. .read() should just block until data are available, and that's it.

...

I'm imagining implementing an XML parser that uses Python's unicode machinery and supports the xml.sax.xmlreader.IncrementalParser interface.

I think this is out of scope of this patch. The incremental parser could implement a regular .read on a StringIO file that also supports .feed.

...

Without the feed method(), we need the following:

1) A StreamQueue class that

Why is that? I thought we are talking about "Decoding incomplete unicode"? Regards, Martin

Walter Dörwald

August 2004

4:49 p.m.

Martin v. Löwis wrote:

...

Walter Dörwald wrote:

...
But then a file that contains the two bytes 0x61, 0xc3 will never generate an error when read via an UTF-8 reader. The trailing 0xc3 will just be ignored.

Another option we have would be to add a final() method to the StreamReader, that checks if all bytes have been consumed.

Alternatively, we could add a .buffer() method that returns any data that are still pending (either a Unicode string or a byte string).

...

...
Maybe this should be done by StreamReader.close()?

No. There is nothing wrong with only reading a part of a file.

Yes, but if read() is called without arguments then everything from the input stream should be read and used.

...

...
Now inShift counts the number of characters (and the shortcut for a "+-" sequence appearing together has been removed.

Ok. I didn't actually check the correctness of the individual methods.

OTOH, I think time spent on UTF-7 is wasted, anyway.

;) But it's a good example of how complicated state management can get.

...

...
Would a version of the patch without a final argument but with a feed() method be accepted?

I don't see the need for a feed method. .read() should just block until data are available, and that's it.

...

...
I'm imagining implementing an XML parser that uses Python's unicode machinery and supports the xml.sax.xmlreader.IncrementalParser interface.

I think this is out of scope of this patch. The incremental parser could implement a regular .read on a StringIO file that also supports .feed.

...

...
Without the feed method(), we need the following:

1) A StreamQueue class that

Why is that? I thought we are talking about "Decoding incomplete unicode"?

Well, I had to choose a subject. ;) Bye, Walter Dörwald

"Martin v. Löwis"

7:05 p.m.

Walter Dörwald wrote:

...

Ok. So it really looks like a final flag on read is necessary.

...

Well, I had to choose a subject. ;)

I still would prefer if the two issues were discussed separately. Regards, Martin

Walter Dörwald

8:41 p.m.

Martin v. Löwis wrote:

...

OK, let's come up with a patch that fixes the incomplete byte sequences problem and then discuss non-stream APIs. So, what should the next step be? Bye, Walter Dörwald

"Martin v. Löwis"

5:13 a.m.

Walter Dörwald wrote:

...

M.-A. Lemburg

9:36 a.m.

Martin v. Löwis wrote:

...

Walter Dörwald wrote:

...
OK, let's come up with a patch that fixes the incomplete byte sequences problem and then discuss non-stream APIs.

So, what should the next step be?

I think your first patch should be taken as a basis for that.

...

Add the state-supporting decoding C functions, and change the stream readers to use them.

...

That still leaves the issue of the last read operation, which I'm tempted to leave unresolved for Python 2.4. No matter what the solution is, it would likely require changes to all codecs, which is not good.

...

...
...
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

8:15 p.m.

M.-A. Lemburg wrote:

...

Another option might be that the decode function changes the state object in place.

...

If a tuple is passed and returned this makes it possible from Python code to mangle the state. IMHO this should be avoided if possible.

...

We already have slightly different decoding functions: utf_16_ex_decode() takes additional arguments.

...

OK, I've updated the patch.

...

On the other hand this requires special stream. Data already read is part of the codec state, so why not put it into the codec?

...

OK.

...

But this mean that the normal error handling can't be used for those trailing bytes. Bye, Walter Dörwald

"Martin v. Löwis"

August 2004

5:54 a.m.

Walter Dörwald wrote:

...

M.-A. Lemburg

8:32 a.m.

Martin v. Löwis wrote:

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

"Martin v. Löwis"

7:56 p.m.

M.-A. Lemburg wrote:

...

Martin, there are two reasons for hiding away these details:

1. we need to be able to change the codec state without breaking the APIs

That will be possible with the currently-proposed patch. The _codecs methods are not public API, so changing them would not be an API change.

...

2. we don't want the state to be altered by the user

...

A single object serves this best and does not create a whole plethora of new APIs in the _codecs module. This is not over-design, but serves a reason.

M.-A. Lemburg

3:28 p.m.

Martin v. Löwis wrote:

...

Uhm, I wasn't talking about the builtin codecs only (of course, we can change those to our liking). I'm after a generic interface for stateful codecs.

...

True, but the codec writer should be in control of the state object, its format and what the user can or cannot change.

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

8:19 p.m.

M.-A. Lemburg wrote:

...

But that interface is only between the StreamReader and any helper function that the codec implementer might want to use. If there ise no helper function there is no interface.

...

"Martin v. Löwis"

9 p.m.

M.-A. Lemburg wrote:

...

And indeed, with the current API, he is. Regards, Martin

Walter Dörwald

August 2004

8:13 p.m.

Martin v. Löwis wrote:

...

Exactly.

...

The state communicated in the UTF-7 decoder is just a bunch of values. Checking the type is done via PyArg_ParseTuple().

...

Looking at the UTF-7 decoder this seems to be the simplest option. Bye, Walter Dörwald

Walter Dörwald

8:06 p.m.

Martin v. Löwis wrote:

...

OK, that's true. Unfortunately I'm not really that familiar with the bit twiddling in the UTF7 decoder to know it this could break anything or not. Bye, Walter Dörwald

M.-A. Lemburg

8:41 a.m.

Walter Dörwald wrote:

...

Good idea.

...

Right - it was a step in the wrong direction. Let's not use a different path for the future.

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Walter Dörwald

8:10 p.m.

M.-A. Lemburg wrote:

...

But that's totally up to the implementor.

...

I consider the remaining undecoded bytes to be part of the codec state once the have been read from the stream.

...

But in many cases the user might want to use "ignore" or "replace" error handling. Bye, Walter Dörwald

"Martin v. Löwis"

8:38 p.m.

M.-A. Lemburg wrote:

...

Why is that better? Practicality beats purity. This is useless over-generalization.

...

Where precisely is the number of decoded bytes in the API? Regards, Martin

Walter Dörwald

8:04 p.m.

Martin v. Löwis wrote:

...

Walter Dörwald wrote:

...
OK, let's come up with a patch that fixes the incomplete byte sequences problem and then discuss non-stream APIs.

So, what should the next step be?

I think your first patch should be taken as a basis for that. Add the state-supporting decoding C functions, and change the stream readers to use them.

OK, I've updated the patch.

...

That still leaves the issue of the last read operation, which I'm tempted to leave unresolved for Python 2.4.

Agreed! This shouldn't be done for Python 2.4.

...

No matter what the solution is, it would likely require changes to all codecs, which is not good.

True, but the changes should be rather trivial for most. Bye, Walter Dörwald

M.-A. Lemburg

August 2004

10:29 a.m.

Walter Dörwald wrote:

...

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Hye-Shik Chang

12:21 p.m.

On Thu, 19 Aug 2004 12:29:12 +0200, M.-A. Lemburg <mal@egenix.com> wrote:

...

+1 for adding .has_pending_data() stuff. But it'll need a way to flush pending data out for encodings where incomplete sequence not always invalid. <wink> This is true for JIS X 0213 encodings.

...

Hye-Shik