Walter Dörwald walter at livinglogic.de
Fri Apr 13 13:49:47 CEST 2007

Guido van Rossum wrote:

>> >> > I wonder if it would be possible to return the state as a pair
>> >> > (unread, flags) where unread is a (byte) string of unprocessed bytes
>> >> > and flags is some other state, with the constraint that in the
>> >> > initial state the flags must be zero. Then I can optimize the case
>> >> > where flags is returned as zero by subtracting len(unread) from the
>> >> > current position and that'd be the correct seek position.
>> >>
>> >> I'd say that bytestream.tell() is the correct position.
>> >>
>> >> Or should seek() return to the last position where the codec was in a
>> >> default state without anything buffered? (This can't work for UTF-16,
>> >> because the codec almost never is in the default state.)
>> >
>> > That was my hope, yes (and I realize that UTF-16 is an exception).
>> We could designate natural endianness as the default state, but that
>> would mean that a codec state can't be transferred to a different
>> machine (or we could declare little (or big) endianness to be the
>> default state).
> I think it's okay for file positions involving codec states not to be
> transferable between platforms. I think they wouldn't even be
> guaranteed between subsequent runs of the same program.

OK, done in the third version of the patch.
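
For illustration, a minimal sketch of what "natural endianness as the
default state" looks like from the caller's side. It assumes the
incremental decoder API as it exists in current Python 3 (the patch
discussed here may have differed in detail), and the variable names are
only illustrative:

```python
import codecs

# "a" encoded with a BOM in this machine's native byte order.
data = "a".encode("utf-16")

decoder = codecs.getincrementaldecoder("utf-16")()
text = decoder.decode(data)

# Nothing is left unread, and because the stream is in natural order the
# extra state is zero, so a file position needs no codec state attached.
unread, flags = decoder.getstate()
```

The same bytes read on a machine with the opposite byte order would be
expected to report a non-zero flag, which is why such positions are not
portable between platforms.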

>> > Consider UTF-8 though. If the chunk we read from the byte stream ended
>> > in the middle of a multi-byte character, the codec will have the first
>> > part of that character buffered. In general we want to subtract
>> > buffered data from the byte stream's position when reporting the
>> > position of the text stream. The idea is that if we later seek to the
>> > reported position, we should be reading the same character data. This
>> > can be accomplished in two ways: by backing up the byte stream to the
>> > previous character boundary, and resetting the decoder to neutral; or
>> > by positioning the byte stream to where it was originally and setting
>> > the state of the decoder to what it was before. However, backing up
>> > the byte stream has the advantage that no decoder state needs to be
>> > encoded in the position cookie.
>> OK, so for decoders getstate() should always return a tuple, with the
>> first entry being the buffered byte string (or bytes object?) and the
>> second being additional state data.
>> Do we need any specification for encoders?
> I don't need this for encoders at all -- we don't use incremental
> encoders, only incremental decoders.
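
The (unread, flags) idea for decoders can be sketched as follows. This
is only an illustration, assuming the getstate() behaviour of the UTF-8
incremental decoder in current Python 3; the names raw, seek_pos and
rest are made up for the example:

```python
import codecs
import io

raw = io.BytesIO("a\u00e4".encode("utf-8"))   # b'a\xc3\xa4'
decoder = codecs.getincrementaldecoder("utf-8")()

# Read a chunk that ends in the middle of the two-byte character.
text = decoder.decode(raw.read(2))

# The decoder keeps the first byte of the character buffered.
unread, flags = decoder.getstate()

# With flags == 0, backing up over the unread bytes gives a position
# from which a decoder in its neutral state can resume.
seek_pos = raw.tell() - len(unread)

raw.seek(seek_pos)
decoder.reset()
rest = decoder.decode(raw.read(), final=True)
```

Backing up this way means the position cookie is just a byte offset and
carries no decoder state.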

True for reading, but what about writing?
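
To illustrate why encoder state might matter when writing: the UTF-16
encoder emits a BOM only once, so its state has to record whether the
BOM has already gone out. A small sketch, assuming the incremental
encoder API of current Python 3 (variable names are illustrative):

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()
fresh = encoder.getstate()       # state before anything was written

first = encoder.encode("a")      # BOM plus one code unit
second = encoder.encode("b")     # one code unit, no BOM
written = encoder.getstate()     # state after the BOM went out

# Restoring the fresh state makes the encoder emit the BOM again.
encoder.setstate(fresh)
again = encoder.encode("a")
```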

>> >> The state returned from getstate() should be treated as an opaque
>> >> value (e.g. for the buffered incremental codecs it is the buffered
>> >> string, for the UTF-16 encoder it's the flag indicating whether a BOM
>> >> has been written etc.). The codecs try to return None, if they are in
>> >> some kind of default state (e.g. there's nothing buffered).
>> >
>> > I would like to await completion of those unit tests;
>> The second version of the patch includes the unit tests (and fixes the
>> utf-8-sig codec).
>> > there seem to be
>> > some subtle issues.
>> Can you be more concrete?
> I think I just meant the str/bytes issue I already mentioned.

Since the new version never sets the buffer to an explicit value except
in the constructor, this problem should have disappeared.

>> > I wonder if setstate() should call self.reset()
>> > first.
>> Calling reset() and calling setstate() with the initial state should
>> have the same effect.
> OK, I should do that anyway. (I wasn't aware of reset() until I saw
> your patch. ;-)
>> > I'd also like to ask if setstate() could default to "" only if
>> > the argument is None, not if it is empty; I'd like to use it to change
>> > the buffer to be a bytes object.
>> I'd say for Python 3000 it should always be a bytes object.
> Eventually, yes. But right now we're in a world where sometimes there
> are bytes and sometimes there are (8-bit) strings -- and I'd like to
> get as many tests passing with the new IO library without making it
> the default first.
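
The claim that reset() and setstate() with the initial state have the
same effect can be checked with a short sketch. It assumes the UTF-8
incremental decoder as found in current Python 3, where the buffer is
always a bytes object; the variable names are illustrative:

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
initial = decoder.getstate()        # state of a fresh decoder

decoder.decode(b"\xe2\x82")         # first two bytes of U+20AC stay buffered
pending = decoder.getstate()

decoder.setstate(initial)           # should behave like reset()
after_setstate = decoder.getstate()

decoder.setstate(pending)           # bring the buffered bytes back
decoder.reset()
after_reset = decoder.getstate()

# Restoring the pending state lets decoding pick up mid-character.
decoder.setstate(pending)
char = decoder.decode(b"\xac", final=True)
```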


>> Will this
>> interoperate seamlessly with the C part of the codec machinery?
> It should if it uses the buffer API as it should. When I encounter
> places where it requires 8-bit strings I'll fix them
> opportunistically.
>> > (And yes, I need to maintain more
>> > hacks for that, alas).
>> I'll try to update the patch tomorrow or over the weekend.
> Thanks!

Done. I've also added documentation (the description of the constraints
on the decoder state sounds quite esoteric ;)).


