[Python-3000] [Python-3000-checkins] r54742 - in python/branches/p3yk/Lib: io.py test/test_io.py

Thu Apr 12 21:32:08 CEST 2007

Guido van Rossum wrote:

> On 4/12/07, Walter Dörwald <walter at livinglogic.de> wrote:
>> Guido van Rossum wrote:
>> > On 4/11/07, Walter Dörwald <walter at livinglogic.de> wrote:
>> >> Would it make sense to make the state of the decoder public, e.g. by
>> >> adding setstate() and getstate() methods? This would give a cleaner
>> >> API.
>> >
>> > I've been thinking of the same thing!
>> >
>> > I wonder if it would be possible to return the state as a pair
>> > (unread, flags) where unread is a (byte) string of unprocessed bytes
>> > and flags is some other state, with the constraint that in the initial
>> > state the flags must be zero. Then I can optimize the case where flags
>> > is returned as zero by subtracting len(unread) from the current
>> > position and that'd be the correct seek position.
>>
>> I'd say that bytestream.tell() is the correct position.
>>
>> Or should seek() return to the last position where the codec was in a
>> default state without anything buffered? (This can't work for UTF-16,
>> because the codec almost never is in the default state.)
> 
> That was my hope, yes (and I realize that UTF-16 is an exception).

We could designate the machine's native endianness as the default state,
but then a codec state couldn't be transferred to a machine with a
different byte order. Alternatively, we could declare a fixed endianness
(little or big) to be the default state.
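
To make the UTF-16 problem concrete, here is a small sketch using the incremental decoder API as it exists in the Python 3 standard library today (not the patch under discussion). It only shows that the state changes once a BOM has been consumed and that the saved state is enough to resume decoding; the exact flag values are deliberately not relied on.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-16")()
initial_state = decoder.getstate()     # nothing read yet, byte order unknown

decoder.decode(b"\xff\xfe")            # little-endian BOM, yields no text
after_bom = decoder.getstate()         # nothing buffered, but no longer "initial"

# A fresh decoder given the saved state can continue without seeing the BOM.
resumed = codecs.getincrementaldecoder("utf-16")()
resumed.setstate(after_bom)
text = resumed.decode(b"a\x00")        # "a" in little-endian UTF-16
```

After the BOM the buffer is empty, yet the decoder still carries information that a bare byte offset cannot express.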

> Consider UTF-8 though. If the chunk we read from the byte stream ended
> in the middle of a multi-byte character, the codec will have the first
> part of that character buffered. In general we want to subtract
> buffered data from the byte stream's position when reporting the
> position of the text stream. The idea is that if we later seek to the
> reported position, we should be reading the same character data. This
> can be accomplished in two ways: by backing up the byte stream to the
> previous character boundary, and resetting the decoder to neutral; or
> by positioning the byte stream to where it was originally and setting
> the state of the decoder to what it was before. However, backing up
> the byte stream has the advantage that no decoder state needs to be
> encoded in the position cookie.

OK, so for decoders getstate() should always return a tuple, with the 
first entry being the buffered byte string (or bytes object?) and the 
second being additional state data.
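
A minimal sketch of that (buffered bytes, additional state) shape, using the UTF-8 incremental decoder from the Python 3 standard library as it exists today. The names `chunk`, `seek_pos` and `fresh` are only illustrative; the point is that when the flags are zero, subtracting the number of buffered bytes gives a position on a character boundary.

```python
import codecs

data = "a€".encode("utf-8")            # b"a\xe2\x82\xac"
chunk = data[:3]                       # ends in the middle of the euro sign

decoder = codecs.getincrementaldecoder("utf-8")()
text = decoder.decode(chunk)           # only "a" can be decoded so far
unread, flags = decoder.getstate()     # buffered bytes plus additional state

# With flags == 0, backing up over the unread bytes is sufficient.
byte_pos = len(chunk)
seek_pos = byte_pos - len(unread)

# Re-reading from seek_pos with a neutral decoder gives the same characters.
fresh = codecs.getincrementaldecoder("utf-8")()
rest = fresh.decode(data[seek_pos:], final=True)
```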

Do we need any specification for encoders?

>> > I imagine most
>> > decoders have only very few flags they care about. (The worst might be
>> > the utf-16 decoder which must have a flag to remember whether it
>> > already saw a byte order marker, and another indicating the byte
>> > order. Maybe we'll have to special-case that one, so don't worry too
>> > much about it.)
>> >
>> >> Should I work on a patch?
>> >
>> > That would be great!
>>
>> OK, here's the patch: http://bugs.python.org/1698994
>>
>> The state returned from getstate() should be treated as an opaque value
>> (e.g. for the buffered incremental codecs it is the buffered string, for
>> the UTF-16 encoder it's the flag indicating whether a BOM has been
>> written etc.). The codecs try to return None, if they are in some kind
>> of default state (e.g. there's nothing buffered).
> 
> I would like to await completion of those unit tests;

The second version of the patch includes the unit tests (and fixes the 
utf-8-sig codec).

> there seem to be
> some subtle issues.

Can you be more concrete?

> I wonder if setstate() should call self.reset()
> first.

Calling reset() and calling setstate() with the initial state should 
have the same effect.
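
As a sketch of that equivalence, assuming the initial state of a buffered decoder is an empty buffer with zero flags (which is what the UTF-8 decoder in today's Python 3 standard library reports):

```python
import codecs

def make_dirty_decoder():
    """Return a UTF-8 decoder holding two bytes of an incomplete character."""
    dec = codecs.getincrementaldecoder("utf-8")()
    dec.decode(b"\xe2\x82")
    return dec

via_reset = make_dirty_decoder()
via_reset.reset()

via_setstate = make_dirty_decoder()
via_setstate.setstate((b"", 0))        # assumed initial state: nothing buffered

# Both decoders report the same state and decode new input identically.
same_state = via_reset.getstate() == via_setstate.getstate()
out_reset = via_reset.decode(b"abc")
out_setstate = via_setstate.decode(b"abc")
```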

> I'd also like to ask if setstate() could default to "" only if
> the argument is None, not if it is empty; I'd like to use it to change
> the buffer to be a bytes object.

I'd say for Python 3000 it should always be a bytes object. Will this 
interoperate seamlessly with the C part of the codec machinery?

> (And yes, I need to maintain more
> hacks for that, alas).

I'll try to update the patch tomorrow or over the weekend.

Servus,
    Walter