[Python-3000] [Python-3000-checkins] r54742 - in python/branches/p3yk/Lib: io.py test/test_io.py

Guido van Rossum guido at python.org
Thu Apr 12 23:48:21 CEST 2007

> >> > I wonder if it would be possible to return the state as a pair
> >> > (unread, flags) where unread is a (byte) string of unprocessed bytes
> >> > and flags is some other state, with the constraint that in the initial
> >> > state the flags must be zero. Then I can optimize the case where flags
> >> > is returned as zero by subtracting len(unread) from the current
> >> > position and that'd be the correct seek position.
> >>
> >> I'd say that bytestream.tell() is the correct position.
> >>
> >> Or should seek() return to the last position where the codec was in a
> >> default state without anything buffered? (This can't work for UTF-16,
> >> because the codec almost never is in the default state.)
> >
> > That was my hope, yes (and I realize that UTF-16 is an exception).
> We could designate natural endianness as the default state, but that
> would mean that a codec state can't be transferred to a different
> machine (or we could declare little (or big) endianness to be the
> default state).

I think it's okay for file positions involving codec states not to be
transferable between platforms. I think they wouldn't even be
guaranteed between subsequent runs of the same program.
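
To illustrate why the UTF-16 state depends on the platform, here is a
minimal sketch. It assumes the getstate()/setstate() methods on
incremental decoders in the shape proposed in this thread (a pair of
buffered bytes and an integer), as exposed by the codecs module. The
variable names are illustrative only.

```python
import codecs

# A big-endian UTF-16 stream: explicit BOM followed by the payload.
data = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")

decoder = codecs.getincrementaldecoder("utf-16")()
bom_text = decoder.decode(data[:2])   # consumes the BOM, yields no text
state = decoder.getstate()            # (buffered bytes, integer flags)

# Restoring the state lets another decoder continue after the BOM
# with the right byte order, without seeing the BOM itself.
resumed = codecs.getincrementaldecoder("utf-16")()
resumed.setstate(state)
rest = resumed.decode(data[2:], final=True)

print(repr(bom_text), state, repr(rest))
```

The integer in the state is only meaningful to the decoder that
produced it. If it describes byte order relative to the host's native
order, as I believe the UTF-16 decoder does, the same stream position
gives a different value on a machine with the opposite endianness, so
a saved position should not be carried across platforms.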

> > Consider UTF-8 though. If the chunk we read from the byte stream ended
> > in the middle of a multi-byte character, the codec will have the first
> > part of that character buffered. In general we want to subtract
> > buffered data from the byte stream's position when reporting the
> > position of the text stream. The idea is that if we later seek to the
> > reported position, we should be reading the same character data. This
> > can be accomplished in two ways: by backing up the byte stream to the
> > previous character boundary, and resetting the decoder to neutral; or
> > by positioning the byte stream to where it was originally and setting
> > the state of the decoder to what it was before. However, backing up
> > the byte stream has the advantage that no decoder state needs to be
> > encoded in the position cookie.
> OK, so for decoders getstate() should always return a tuple, with the
> first entry being the buffered byte string (or bytes object?) and the
> second being additional state data.
> Do we need any specification for encoders?

I don't need this for encoders at all -- we don't use incremental
encoders, only incremental decoders.
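
The position arithmetic discussed above can be sketched as follows.
This is only an illustration under the assumption that a decoder's
getstate() returns the (unread, flags) pair proposed in this thread;
byte_pos and seek_pos are hypothetical names, not part of io.py.

```python
import codecs

# "a", the euro sign (3 bytes in UTF-8), then "b".
data = "a\u20acb".encode("utf-8")

decoder = codecs.getincrementaldecoder("utf-8")()
chunk = data[:3]                  # the chunk ends inside the euro sign
text = decoder.decode(chunk)      # only "a" is complete so far
byte_pos = len(chunk)             # where the byte stream now stands

unread, flags = decoder.getstate()
# With flags == 0, backing up over the unread bytes gives a position
# at a character boundary where a neutral decoder can restart.
seek_pos = byte_pos - len(unread) if flags == 0 else None

# Seeking there and decoding from a neutral state yields the same text.
fresh = codecs.getincrementaldecoder("utf-8")()
rest = fresh.decode(data[seek_pos:], final=True)

print(repr(text), unread, flags, seek_pos, repr(rest))
```

When flags is nonzero this shortcut does not apply, and the decoder
state would have to be encoded in the position cookie instead.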

> >> The state returned from getstate() should be treated as an opaque value
> >> (e.g. for the buffered incremental codecs it is the buffered string, for
> >> the UTF-16 encoder it's the flag indicating whether a BOM has been
> >> written etc.). The codecs try to return None, if they are in some kind
> >> of default state (e.g. there's nothing buffered).
> >
> > I would like to await completion of those unit tests;
> The second version of the patch includes the unit tests (and fixes the
> utf-8-sig codec).
> > there seem to be
> > some subtle issues.
> Can you be more concrete?

I think I just meant the str/bytes issue I already mentioned.

> > I wonder if setstate() should call self.reset()
> > first.
> Calling reset() and calling setstate() with the initial state should
> have the same effect.

OK, I should do that anyway. (I wasn't aware of reset() until I saw
your patch. ;-)
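
A small check of that equivalence, as a sketch that assumes the
decoder API from the patch (getstate(), setstate() and reset() on an
incremental decoder):

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
initial = decoder.getstate()          # state of a freshly created decoder

decoder.decode(b"\xe2\x82")           # leave an incomplete character buffered
dirty = decoder.getstate()

decoder.setstate(initial)             # restore via setstate()
after_setstate = decoder.getstate()

decoder.decode(b"\xe2\x82")           # buffer the partial character again
decoder.reset()                       # restore via reset()
after_reset = decoder.getstate()

print(dirty != initial, after_setstate == initial, after_reset == initial)
```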

> > I'd also like to ask if setstate() could default to "" only if
> > the argument is None, not if it is empty; I'd like to use it to change
> > the buffer to be a bytes object.
> I'd say for Python 3000 it should always be a bytes object.

Eventually, yes. But right now we're in a world where sometimes there
are bytes and sometimes there are (8-bit) strings -- and I'd like to
get as many tests passing with the new IO library without making it
the default first.

> Will this
> interoperate seamlessly with the C part of the codec machinery?

It should if it uses the buffer API as it should. When I encounter
places where it requires 8-bit strings I'll fix them.

> > (And yes, I need to maintain more
> > hacks for that, alas).
> I'll try to update the patch tomorrow or over the weekend.


--Guido van Rossum (home page: http://www.python.org/~guido/)
