[Python-3000] Heaptypes

Thu Jul 19 22:26:35 CEST 2007

> But you can do it using bytes('\xff', 'latin-1'). I think that's a
> reasonable thing for bytes.__reduce__() to return.

That's certainly a choice. Another choice is for bytes to default to
latin-1 when it is passed a string, rather than to the system default
encoding. This is roughly equivalent, and gives a slightly more compact
pickle result.

> How about the following. It's not perfect but it's the best I can
> think of that doesn't break any pickles.
> 
> In 3.0, when an S, T or U pickle code is encountered, the returned
> value is a Unicode string decoded from the bytes using Latin-1. This
> means that all S, T or U pickle codes return Unicode objects. In
> those cases where this was really meant to transfer binary data, the
> application running under 3.0 can fix this by calling bytes(X,
> 'latin-1'). If it was meant to be UTF-8-encoded text, the app can call
> str(Y, 'utf-8') after that.

It would actually have to be Y.encode('latin-1').decode('utf-8')
(assuming Y is what you get from unpickling). Y is already a Unicode
string, and str() refuses to decode one:

py> str('\xc3\xb6', 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoding Unicode is not supported
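
A minimal sketch of the two fix-ups discussed above. It assumes Y is the
Latin-1-decoded string that the proposed unpickler would hand back; the
names raw, as_bytes and as_text and the sample value are illustrative
only.

```python
# Raw payload as an old S/T/U pickle might have stored it:
# the UTF-8 encoding of U+00F6 (o with diaeresis).
raw = b'\xc3\xb6'

# Assumed unpickler behaviour: every byte becomes the code point
# with the same numeric value (Latin-1 decoding).
Y = raw.decode('latin-1')

# Case 1: the payload was really binary data. Recover the bytes.
as_bytes = bytes(Y, 'latin-1')          # same result as Y.encode('latin-1')

# Case 2: the payload was UTF-8 text. Go back to bytes first,
# then decode with the encoding that was actually intended.
as_text = Y.encode('latin-1').decode('utf-8')
```

The intermediate encode step is what turns the already-decoded string
back into something that can be decoded a second time.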

> But 3.0 should only *generate* the S, T or U pickle codes for str8
> values (as long as that type exists) or for str values containing only
> 7-bit ASCII bytes; for all else it should use the unicode pickle
> codes.

Sounds fine to me.
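
As a rough illustration of the "7-bit ASCII only" condition for str
values, the check could look like the sketch below. The function name
is hypothetical and is not part of the pickle module.

```python
def is_seven_bit_ascii(value: str) -> bool:
    """Return True if every character fits in 7 bits (code point < 128).

    Hypothetical helper: under the quoted proposal, only such str values
    would still be written with the S, T or U pickle codes.
    """
    return all(ord(ch) < 128 for ch in value)


plain = is_seven_bit_ascii("abc")        # only ASCII characters
accented = is_seven_bit_ascii("ab\xff")  # contains U+00FF
```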

> For bytes, I propose that b"ab\xff".__reduce__() return (bytes,
> ("ab\xff", "latin-1")).

See above. Unless somebody objects, I'd rather make latin-1 the
default encoding for bytes when a string is passed (I'm not sure
myself how far "explicit is better than implicit" applies here).

I'll look into implementing that strategy.
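
The reason latin-1 is a workable choice here is that it maps the byte
values 0..255 one-to-one onto the code points 0..255, so any byte
string survives the round trip. A small sketch; the func/args tuple
only mirrors the value proposed in the quoted text and is not taken
from an actual implementation.

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Latin-1 decoding gives a string whose code points equal the byte values.
text = all_bytes.decode('latin-1')
code_points = [ord(ch) for ch in text]

# Encoding back with Latin-1 restores the original bytes exactly.
round_tripped = bytes(text, 'latin-1')

# The reduce value proposed in the quoted text: a callable plus its
# arguments, from which the original object can be rebuilt.
func, args = bytes, ("ab\xff", "latin-1")
rebuilt = func(*args)
```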

Regards,
Martin