[issue2980] Pickle stream for unicode object may contain non-ASCII characters.

Dan Dibagh report at bugs.python.org
Fri Oct 24 13:48:44 CEST 2008


Dan Dibagh <dddibagh at lavabit.com> added the comment:

> Which PEP specifically? PEP 263 only mentions the unicode-escape
> encoding in its problem statement, i.e. as a pre-existing thing.
> It doesn't specify it, nor does it give a rationale for why it behaves
> the way it does.

PEP 100 and PEP 263. What I was looking for was a description of the
functional intention and a technical definition of raw unicode escape.
The term "raw" tends to have different meanings depending on the context
in which it appears. PEP 263 is of interest for understanding the overall
intention of raw unicode escape. If its purpose is to convert Python
source into unicode strings, then decoding raw unicode escape strings
depends on the source code encoding, and that in turn would perhaps give
an idea of what the encoding part is supposed to do. PEP 100 is of
interest for the technical description: it points to the section
"unicode constructors" as the definition.

> What code are you looking at, and where do you find it difficult to
> follow it? Maybe you get confused between the "unicode-escape" codec,
> and the "raw-unicode-escape" codec, also.

Since I am looking at the issue of non-ASCII characters in pickle
output, raw-unicode-escape is the codec in focus. For the decoding part,
the distinction between unicode-escape and raw-unicode-escape is very
clear.

I am looking at the function PyUnicode_EncodeRawUnicodeEscape in
Objects/unicodeobject.c. At the point of the comment "/* Copy everything
else as-is */", given the perceived intentions of the encoding type, I
am trying to figure out why there isn't a "/* Map non-printable US ASCII
to '\xhh' */" section like in the unicodeescape_string function. The
background in older Pythons that you explained is essentially what I
guessed.
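
To make the difference between the two codecs concrete, here is a small
sketch. It assumes a Python 3 interpreter (where both codecs still
exist under the same names), and the three sample characters are my own
arbitrary choices, not taken from the C source:

```python
# Assumption: Python 3; the sample characters are arbitrary.
control = "\x01"   # non-printable US-ASCII character
latin1 = "\xe9"    # U+00E9, inside Latin-1 but outside ASCII
euro = "\u20ac"    # U+20AC, outside Latin-1

samples = (control, latin1, euro)

# unicode-escape maps non-printable and non-ASCII characters to escapes.
unicode_escaped = [s.encode("unicode-escape") for s in samples]

# raw-unicode-escape copies everything below U+0100 as-is.
raw_escaped = [s.encode("raw-unicode-escape") for s in samples]

print(unicode_escaped)
print(raw_escaped)
```

Only the character outside Latin-1 comes out the same from both codecs;
the other two stay as raw bytes in the raw-unicode-escape output.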

> The raw-unicode-escape codec? It was designed to support parsing of
> Python 2.0 source code, and of "raw" unicode strings (ur"") in
> particular. In Python 2.0, you only needed to escape characters above
> U+0100; Latin-1 characters didn't need escaping. Python, itself, only
> relied on the decoding direction. That the codec chooses not to escape
> Latin-1 characters on encoding is an arbitrary choice (I guess); it's
> still symmetric with decoding.

I suppose you mean symmetric with decoding as long as you stick to the
Latin-1 character set, as raw unicode escaping isn't a one-to-one mapping.
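
A minimal illustration of what I mean by "not one-to-one", assuming
Python 3 and the built-in raw-unicode-escape codec (the variable names
are only for this example):

```python
codec = "raw-unicode-escape"

# Two different byte strings decode to the same character.
plain = b"A".decode(codec)
escaped = b"\\u0041".decode(codec)

# A Latin-1 character round-trips through the codec unescaped ...
encoded_latin1 = "\xe9".encode(codec)
round_trip = encoded_latin1.decode(codec)

# ... but the escaped spelling decodes to the same character as well.
from_escape = b"\\u00e9".decode(codec)

print(plain == escaped, round_trip == from_escape)
```

So decoding accepts more spellings than encoding ever produces.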

When PEP 263 came into the picture, wouldn't it have made sense to
change PyUnicode_EncodeRawUnicodeEscape to produce ASCII-only output, or
perhaps output conforming to the current default encoding? Given the
intention of the raw unicode escape, encoding something with it means
producing Python source code. But the output is in Latin-1, while the
rest of Python has moved on to use ASCII by default, or whatever is
configured in the source. I tried to shed light on that problem in my
previous example.
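
As a rough sketch of what ASCII-only output could look like, written in
Python rather than C: the helper name ascii_raw_unicode_escape is
hypothetical, this is not the actual codec, and it assumes Python 3.
Backslashes in the input are deliberately left untouched, as the
existing encoder does.

```python
def ascii_raw_unicode_escape(text):
    """Hypothetical ASCII-only variant of raw-unicode-escape encoding."""
    pieces = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 0x80:
            pieces.append(ch)                      # ASCII: copy as-is
        elif code_point <= 0xFFFF:
            pieces.append("\\u%04x" % code_point)  # includes 128-255
        else:
            pieces.append("\\U%08x" % code_point)  # beyond the BMP
    return "".join(pieces).encode("ascii")


encoded = ascii_raw_unicode_escape("caf\xe9 \u20ac")
# The existing decoder is used unchanged to read the result back.
decoded = encoded.decode("raw-unicode-escape")
print(encoded, decoded == "caf\xe9 \u20ac")
```

The point of the sketch is that the existing decoder already accepts
this output, so only the encoding direction would change.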

> Even though the choice was arbitrary, you shouldn't change it now,
> because people may rely on how this codec works.

> Applications might rely on what was implemented rather than what was
> specified. If they had implemented their own pickle readers, such
> readers might break if the pickle format is changed. In principle, 
> even the old pickle readers of Python 2.0..2.6 might break if the
> format changes in 2.7 - we would have to go back and check that they don't
> break (although I do believe that they would work fine).

Then let me ask: how far-reaching is the aim of maintaining compatibility
with programs which depend on Python internals? Even if the internal
thing is a bug and the thing which depends on the bug is also a bug?
Maybe that is a provocative question, so let me explain. The questions
apply to some extent to the workings of the codec, but it is really the
pickle problem I am thinking of. In the case of older Python releases, it
is just a matter of testing, as you say. It is boring and perhaps
tedious, but there is nothing special which prevents it from being done.
If there are many versions, there ought to be a way to write a program
which does it automatically.

In the case of those who have implemented their own pickle readers, the
source and the comments in pickletools.py clearly state that unicode
strings are raw unicode escaped in protocol 0. Now, raw unicode escape
isn't a canonical format. The letter A can be represented either as
\u0041 or as itself, A. If a hypothetical implementor gets the idea
that characters in the range 0-255 cannot be represented by \u00xx
sequences, then the fact that pickle replaces \ with \u005c and \n with
\u000a should give a hint that he is wrong. So if characters in the
range 128-255 get escaped as \u00xx, any pickle reader should handle
it. I've tried to come up with some sensible way to write a pickle
implementation which fails to understand \u00xx characters without
calling it a bug. I cannot. Can you? So it seems that the worry about
changing protocol 0 is about buggy programs depending on a pickle bug.
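
The argument can be checked against the standard pickle module. The
sketch below assumes Python 3, where protocol 0 still writes unicode
strings with the V opcode; the hand-written stream is my own
construction for illustration, not the output of any patch:

```python
import pickle

# What pickle writes today for a Latin-1 character under protocol 0.
stream = pickle.dumps("\xe9", protocol=0)
has_raw_byte = b"\xe9" in stream        # the non-ASCII byte is present

# Backslash and newline are already written as \u00xx escapes.
backslash_stream = pickle.dumps("\\", protocol=0)
newline_stream = pickle.dumps("\n", protocol=0)

# A hand-written, ASCII-only stream: V opcode, escaped text, STOP.
ascii_stream = b"V\\u00e9\n."
loaded = pickle.loads(ascii_stream)

print(has_raw_byte, loaded == pickle.loads(stream))
```

The reader accepts the escaped form of a character in the range
128-255 and returns the same string as for the unescaped form.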

At the other end of the spectrum there are correct programs which depend
on Python externals, i.e. programs depending on ASCII-conformant pickle
output (even if there are some base64 ...ehm... fundamentalists who
think it is the wrong way to do it -- I can think of at least one good
reason to do it).

> So contributions are welcome. If you find that the patch meets
> resistance, you also need to write a PEP, and ask for BDFL
> pronouncement.

I am considering writing a patch. I also understand that for the patch
to be accepted it must fit into the Python framework. That's why I
ask all these questions.

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue2980>
_______________________________________

