[Python-3000] [Python-3000-checkins] r54742 - in python/branches/p3yk/Lib: io.py test/test_io.py

Thu Apr 12 19:30:04 CEST 2007

On 4/12/07, Walter Dörwald <walter at livinglogic.de> wrote:
> Guido van Rossum wrote:
> > On 4/11/07, Walter Dörwald <walter at livinglogic.de> wrote:
> >> Would it make sense to make the state of the decoder public, e.g. by
> >> adding setstate() and getstate() methods? This would give a cleaner API.
> >
> > I've been thinking of the same thing!
> >
> > I wonder if it would be possible to return the state as a pair
> > (unread, flags) where unread is a (byte) string of unprocessed bytes
> > and flags is some other state, with the constraint that in the initial
> > state the flags must be zero. Then I can optimize the case where flags
> > is returned as zero by subtracting len(unread) from the current
> > position and that'd be the correct seek position.
>
> I'd say that bytestream.tell() is the correct position.
>
> Or should seek() return to the last position where the codec was in a
> default state without anything buffered? (This can't work for UTF-16,
> because the codec almost never is in the default state.)

That was my hope, yes (and I realize that UTF-16 is an exception).
Consider UTF-8 though. If the chunk we read from the byte stream ended
in the middle of a multi-byte character, the codec will have the first
part of that character buffered. In general we want to subtract
buffered data from the byte stream's position when reporting the
position of the text stream. The idea is that if we later seek to the
reported position, we should be reading the same character data. This
can be accomplished in two ways: by backing up the byte stream to the
previous character boundary, and resetting the decoder to neutral; or
by positioning the byte stream to where it was originally and setting
the state of the decoder to what it was before. However, backing up
the byte stream has the advantage that no decoder state needs to be
encoded in the position cookie.
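
A minimal sketch of the first option (back up the byte stream, reset
the decoder). It assumes getstate() returns the (unread, flags) pair
discussed above, which is the shape it has in the codecs module of
later Python 3 releases; the variable names are illustrative only.

```python
import codecs
import io

data = "héllo".encode("utf-8")            # b'h\xc3\xa9llo'
raw = io.BytesIO(data)
decoder = codecs.getincrementaldecoder("utf-8")()

# Read a chunk that ends in the middle of the two-byte character 'é'.
chunk = raw.read(2)                       # b'h\xc3'
text = decoder.decode(chunk)              # only 'h' is complete so far

# The decoder holds the dangling lead byte; flags == 0 means "neutral".
unread, flags = decoder.getstate()

# Text position = byte position minus whatever the decoder has buffered.
text_pos = raw.tell() - len(unread)

# Seeking to that position with a reset decoder reads the same characters.
raw.seek(text_pos)
decoder.reset()
rest = decoder.decode(raw.read(), final=True)
```

Here the position cookie is just the integer text_pos; no decoder state
has to be stored in it.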

> > I imagine most
> > decoders have only very few flags they care about. (The worst might be
> > the utf-16 decoder which must have a flag to remember whether it
> > already saw a byte order marker, and another indicating the byte
> > order. Maybe we'll have to special-case that one, so don't worry too
> > much about it.)
> >
> >> Should I work on a patch?
> >
> > That would be great!
>
> OK, here's the patch: http://bugs.python.org/1698994
>
> The state returned from getstate() should be treated as an opaque value
> (e.g. for the buffered incremental codecs it is the buffered string, for
> the UTF-16 encoder it's the flag indicating whether a BOM has been
> written etc.). The codecs try to return None, if they are in some kind
> of default state (e.g. there's nothing buffered).

I would like to await completion of those unit tests; there seem to be
some subtle issues. I wonder if setstate() should call self.reset()
first. I'd also like to ask if setstate() could fall back to "" only
when the argument is None, not whenever it is empty; I'd like to be
able to pass an empty bytes object to change the buffer to bytes. (And
yes, I need to maintain more hacks for that, alas.)
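
For the second option (leave the byte stream where it was and restore
the decoder), a hedged sketch of a state round trip. It uses
getstate()/setstate() as they exist in the codecs module of later
Python 3 releases, which may differ from the patch under review here.
UTF-16 is used because it cannot simply be reset to neutral after the
BOM.

```python
import codecs

data = "hi".encode("utf-16")              # BOM followed by two code units

first = codecs.getincrementaldecoder("utf-16")()
head = first.decode(data[:3])             # BOM plus half of 'h': no text yet
saved = first.getstate()                  # treat as opaque: (unread, flags)

# A fresh decoder given the saved state continues where the first stopped,
# without seeing the BOM again.
second = codecs.getincrementaldecoder("utf-16")()
second.setstate(saved)
tail = second.decode(data[3:], final=True)
```

The saved state carries both the dangling byte and the byte order, so
it would have to be encoded in the position cookie.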

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)