[Python-3000] Heaptypes

Guido van Rossum guido at python.org
Thu Jul 19 20:32:14 CEST 2007


On 7/19/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> >> __reduce__ currently does (O(s#)) with (ob_type, ob_bytes, ob_size).
> >> Now, s# creates a Unicode object, and the pickling fails to round-trip
> >> correctly.
> >
> > I thought that before your patch a bytes object roundtripped correctly
> > with all three protocols. Or maybe it got broken when s# was changed?
>
> It did, and it got. s# used to return a str8, which then was pickled
> byte-for-byte. When s# started to return Unicode strings, bytes
> above 128 got widened to Py_UNICODE (which is what currently
> PyUnicode_FromString does), so b'\xFF' became bytes('\uFFFF').

Ouch!!! This turns out to be a bug in PyUnicode_FromStringAndSize()
due to signed characters. It can even cause a segfault:

Python 3.0x (py3k-struni, Jul 18 2007, 11:01:59)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> b"\x80".__reduce__()
Segmentation fault

Fixed by applying Py_CHARMASK() to all occurrences of *u in that function.
Committed revision 56460.
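
To illustrate the sign-extension effect, here is a rough Python model. It
is only a sketch: it assumes a platform where plain char is signed and
Py_UNICODE is 16 bits wide, and the helper names are made up for this
example (they are not CPython functions).

```python
def as_signed_char(byte_value):
    """Reinterpret a 0..255 byte as a signed C char (-128..127)."""
    return byte_value - 256 if byte_value >= 128 else byte_value


def widen_unmasked(c, bits=16):
    """Store a (possibly negative) char into an unsigned slot of `bits` bits.

    Conversion to an unsigned C type wraps modulo 2**bits (assumed width).
    """
    return c % (1 << bits)


def widen_masked(c):
    """Model of masking with 0xff first, as Py_CHARMASK() does."""
    return c & 0xFF


c = as_signed_char(0xFF)          # -1 when char is signed
print(hex(widen_unmasked(c)))     # 0xffff
print(hex(widen_masked(c)))       # 0xff
```

Under those assumptions the unmasked path turns the byte 0xFF into the
code point U+FFFF, which matches the '\uffff' seen in the quoted output.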

> That got pickled and unpickled; then bytes('\uFFFF') is
> b'\xef\xbf\xbf' (because it applies the default encoding to
> the unicode argument), and it failed to roundtrip to b'\xFF'.
>
> It's actually not possible to generate b'\xFF' using
> a unicode string argument, as the default encoding will
> never return s'\xFF' (as that's not valid UTF-8).

But you can do it using bytes('\xff', 'latin-1'). I think that's a
reasonable thing for bytes.__reduce__() to return.
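
A quick check of the encoding behaviour discussed here, written for a
released Python 3 interpreter (which may differ in detail from the
py3k-struni branch at the time):

```python
# UTF-8 never produces a lone 0xFF byte for any code point.
assert "\uffff".encode("utf-8") == b"\xef\xbf\xbf"
assert "\xff".encode("utf-8") == b"\xc3\xbf"

# Latin-1 maps code points 0..255 one-to-one onto byte values 0..255,
# so every possible byte value survives a decode/encode round trip.
assert bytes("\xff", "latin-1") == b"\xff"

every_byte = bytes(range(256))
as_text = every_byte.decode("latin-1")
assert bytes(as_text, "latin-1") == every_byte
```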

> > An additional requirement might be that if bytes are introduced in
> > 2.6, a pickle containing bytes written by 3.0 should be readable by
> > 2.6.
>
> Sure: whatever we decide now needs to be applied to 2.6 also.

Right.

> >> If __reduce__ returns a Unicode object, what encoding should be assumed?
> >> (which then needs to be symmetric with bytes())
> >>
> >> If __reduce__ returns a str8 object, you will have to keep str8 (or
> >> else you cannot pickle bytes).
> >
> > When __reduce__ returns a string at all, that means it's the name of a
> > global. I guess that should be encoded using UTF-8, so that as long as
> > the name is ASCII, 2.x can unpickle it. But I'm not sure if that's
> > what you were asking.
>
> No.
> py> b'foo'.__reduce__()
> (<type 'bytes'>, ('foo',))
> py> b'\xff'.__reduce__()
> (<type 'bytes'>, ('\uffff',))
>
> It returns one string each time, as the first element of a one-element
> tuple (that is then passed to the bytes() constructor on unpickling)

I see. It returns a tuple containing a string. I was confused. Sorry.
(But the \uffff is due to the bug above.)

> > Anyway, one reason this is such a mess is clearly that the pickle
> > protocol has no independent spec -- it's grown organically in code.
> > Reverse-engineering the intent of the code is a pain.
>
> That's also true, but I don't see it much as a problem here. If it
> had a spec, that spec would have said that b'S', b'T' and b'U'
> have a str payload. That spec would break if str8 goes away, and
> the spec would be changed to explain how these codes act in 2.x
> and 3.x. It would not talk at all about the bytes type, and that
> its __reduce__ might return different things in 2.x and 3.x
> (unless bytes gets a primitive code for pickle).

How about the following? It's not perfect, but it's the best I can
think of that doesn't break any pickles.

In 3.0, when an S, T or U pickle code is encountered, the returned
value is a Unicode string decoded from the bytes using Latin-1. This
means that all S, T or U pickle codes return Unicode objects. In
those cases where this was really meant to transfer binary data, the
application running under 3.0 can fix this by calling bytes(X,
'latin-1'). If it was meant to be UTF-8-encoded text, the app can call
str(Y, 'utf-8') after that.
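
A minimal sketch of that recovery path. The payload below is an assumed
example of what a 2.x str might have contained; the Latin-1 decode stands
in for what the unpickler would do under this proposal and is not the
actual pickle code.

```python
# Bytes that a 2.x program stored in a str: here, UTF-8-encoded text.
payload = "caf\xe9".encode("utf-8")        # b'caf\xc3\xa9'

# Proposed 3.0 behaviour: the S/T/U payload is decoded as Latin-1.
loaded = payload.decode("latin-1")         # a str, one char per byte

# Application-side fixup if the data was really binary:
raw = bytes(loaded, "latin-1")
assert raw == payload

# ...and one more step if it was really UTF-8-encoded text:
text = str(raw, "utf-8")
assert text == "caf\xe9"
```

Because Latin-1 decoding cannot fail and loses nothing, the application can
always get back to the original bytes.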

But 3.0 should only *generate* the S, T or U pickle codes for str8
values (as long as that type exists) or for str values containing only
7-bit ASCII bytes; for all else it should use the unicode pickle
codes.
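
The generation rule for str values could be expressed roughly as below.
The function name is hypothetical and this is not how the pickle module
is implemented; it only restates the 7-bit condition above.

```python
def may_use_str_opcode(value):
    """Return True if a str contains only 7-bit ASCII characters.

    Under the proposal, only such values (or str8 values) would be
    written with the S, T or U codes; anything else would use the
    unicode codes.
    """
    return all(ord(ch) < 128 for ch in value)


assert may_use_str_opcode("spam.eggs")
assert not may_use_str_opcode("ab\xff")
```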

For bytes, I propose that b"ab\xff".__reduce__() return (bytes,
("ab\xff", "latin-1")).
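
A sketch of how that return value would round-trip. Both helper names are
made up for illustration; the real bytes.__reduce__ in any given release
may return something different.

```python
def proposed_reduce(data):
    """Build the (callable, args) pair proposed above for a bytes value."""
    return (bytes, (data.decode("latin-1"), "latin-1"))


def rebuild(reduced):
    """Do what unpickling does with a reduce tuple: call callable(*args)."""
    func, args = reduced
    return func(*args)


reduced = proposed_reduce(b"ab\xff")
assert reduced == (bytes, ("ab\xff", "latin-1"))
assert rebuild(reduced) == b"ab\xff"
```

The string argument contains only code points below 256, and the explicit
encoding argument removes any dependence on the default encoding.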

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

