[Python-3000] Heaptypes

Thu Jul 19 09:06:58 CEST 2007

>> __reduce__ currently does (O(s#)) with (ob_type, ob_bytes, ob_size).
>> Now, s# creates a Unicode object, and the pickling fails to round-trip
>> correctly.
> 
> I thought that before your patch a bytes object roundtripped correctly
> with all three protocols. Or maybe it got broken when s# was changed?

It did, and it got broken. s# used to return a str8, which then was
pickled byte-for-byte. When s# started to return Unicode strings,
bytes of 128 and above got widened to Py_UNICODE (which is what
PyUnicode_FromString currently does), so b'\xFF' became
bytes('\uFFFF'). That got pickled and unpickled; bytes('\uFFFF')
then yields b'\xef\xbf\xbf' (because it applies the default encoding
to the unicode argument), so it failed to roundtrip to b'\xFF'.
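
The last step of that failure can be reproduced on a current
Python 3 as a minimal sketch. It assumes the default encoding is
UTF-8 and takes '\uffff' as the widened value reported above; the
variable names are only illustrative.

```python
# The byte string that was supposed to survive pickling.
original = b"\xff"

# The value the thread reports after s# widened the byte (assumption:
# taken from the message, not recomputed here).
widened = "\uffff"

# bytes() applied the default encoding to its unicode argument;
# UTF-8 is assumed to be that default.
restored = widened.encode("utf-8")

print(restored)              # three bytes instead of one
print(restored == original)  # the roundtrip does not give b'\xff' back
```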

It's actually not possible to generate b'\xFF' using
a unicode string argument, as the default encoding will
never return s'\xFF' (as that's not valid UTF-8).
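
A small sketch of why that is, written for a current Python 3. The
helper utf8_can_produce is a hypothetical name, and the Latin-1
check is only an illustration of an encoding that is symmetric for
every byte value, not something this thread has settled on.

```python
def utf8_can_produce(target: bytes) -> bool:
    """Return True if some str encodes to `target` under UTF-8."""
    try:
        # Only valid UTF-8 can be the output of a UTF-8 encode.
        target.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(utf8_can_produce(b"foo"))   # plain ASCII is valid UTF-8
print(utf8_can_produce(b"\xff"))  # a lone 0xFF byte is not

# Illustration only: Latin-1 maps each byte value to one code point
# and back, so every single byte roundtrips.
latin1_symmetric = all(
    bytes([i]).decode("latin-1").encode("latin-1") == bytes([i])
    for i in range(256)
)
print(latin1_symmetric)
```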

> An additional requirement might be that if bytes are introduced in
> 2.6, a pickle containing bytes written by 3.0 should be readable by
> 2.6.

Sure: whatever we decide now needs to be applied to 2.6 also.

>> If __reduce__ returns a Unicode object, what encoding should be assumed?
>> (which then needs to be symmetric with bytes())
>>
>> If __reduce__ returns a str8 object, you will have to keep str8 (or
>> else you cannot pickle bytes).
> 
> When __reduce__ returns a string at all, that means it's the name of a
> global. I guess that should be encoded using UTF-8, so that as long as
> the name is ASCII, 2.x can unpickle it. But I'm not sure if that's
> what you were asking.

No.
py> b'foo'.__reduce__()
(<type 'bytes'>, ('foo',))
py> b'\xff'.__reduce__()
(<type 'bytes'>, ('\uffff',))

It returns one string each time, as the sole element of a one-element
tuple (which is then passed to the bytes() constructor on unpickling).
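
The shape of that return value can be sketched with a stand-in
class. Payload is hypothetical and is not the bytes type; the point
is only the (callable, args) pair and how the callable gets applied
to the one-element tuple when the object is rebuilt.

```python
class Payload:
    """Hypothetical stand-in for a type that pickles via __reduce__."""

    def __init__(self, text):
        self.text = text

    def __reduce__(self):
        # (callable, args): args is a one-element tuple holding one string.
        return (Payload, (self.text,))


obj = Payload("foo")
func, args = obj.__reduce__()

# Roughly what unpickling does with the pair: call func(*args).
rebuilt = func(*args)
print(func.__name__, args, rebuilt.text)
```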

> Anyway, one reason this is such a mess is clearly that the pickle
> protocol has no independent spec -- it's grown organically in code.
> Reverse-engineering the intent of the code is a pain.

That's also true, but I don't see it much as a problem here. If it
had a spec, that spec would have said that b'S', b'T' and b'U'
have a str payload. That spec would break if str8 goes away, and
the spec would be changed to explain how these codes act in 2.x
and 3.x. It would not talk at all about the bytes type, or about
the fact that its __reduce__ might return different things in 2.x
and 3.x
(unless bytes gets a primitive code for pickle).
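
For reference, the three codes can be inspected with the pickle and
pickletools modules of a current CPython 3. This is a sketch of
what a present-day interpreter does, which is an assumption outside
this thread: there, protocol 2 pickles bytes through a reduce call
and protocol 3 uses a dedicated bytes opcode. The helper
opcode_names is hypothetical.

```python
import pickle
import pickletools

# The opcode constants for the three str-payload codes named above.
print(pickle.STRING, pickle.BINSTRING, pickle.SHORT_BINSTRING)


def opcode_names(data):
    """List the opcode names that appear in a pickle byte string."""
    return [op.name for op, arg, pos in pickletools.genops(data)]


# Protocol 2 has no primitive code for bytes; protocol 3 has one.
names_p2 = opcode_names(pickle.dumps(b"\xff", protocol=2))
names_p3 = opcode_names(pickle.dumps(b"\xff", protocol=3))
print(names_p2)
print(names_p3)

# Either way the value roundtrips on a current interpreter.
roundtrip_p2 = pickle.loads(pickle.dumps(b"\xff", protocol=2))
roundtrip_p3 = pickle.loads(pickle.dumps(b"\xff", protocol=3))
```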

Regards,
Martin