[issue2980] Pickle stream for unicode object may contain non-ASCII characters.

Tue Oct 21 11:46:38 CEST 2008

Marc-Andre Lemburg <mal at egenix.com> added the comment:

On 2008-10-21 11:22, Dan Dibagh wrote:
> Your reasoning shows a lack of understanding how Python is actually used
> from a programmers point of view.

Hmm, I've been using Python for almost 15 years now and do believe
that I have an idea of how Python is being used.

Note that we cannot change the pickle format in retrospective, since
this would break pickle data exchange between different Python versions
relying on the same format (but using different pickle implementations).

What we could do is add a new pickle format which then escapes all
non-ASCII data. However, people have been more keen on getting
compact and fast loading pickles than pickles in ASCII which is why
all new versions of the pickle format are binary formats, so I don't
think it's worth the effort.

Note that the common way of dealing with binary data in ASCII streams
is using a base64 encoding and possibly also apply compression. The
pickle 0 format is really only useful for debugging purposes.

> Perhaps it is the raw-unicode-escape encoding that should be fixed? I
> failed to find exact information about what raw-unicode-escape means. In
> particular, where is the information which states that
> raw-unicode-escape is always an 8-bit format? The closest I've come is
> PEP 100 and PEP 263 (which I notice is written by you guys), which
> describes how to decode raw unicode escape strings from Python source
> and how to define encoding formats for python source code. The sole
> original purpose of both unicode-escape and raw-unicode-escape appears
> to be representing unicode strings in Python source code as u' and ur'
> strings respectively. 

Right.

> It is clear that the decoding of a raw unicode
> escaped or unicode escaped string depends on the actual encoding of the
> python source, but how goes the logic that when something is _encoded_
> into a raw unicode string then the target source must be of some 8-bit
> encoding. Especially considering that the default python source encoding
> is ASCII. For unicode-escape this makes sense:
> 
>>>> f = file("test.py", "wb")
>>>> f.write('s = u"%s"\n' % u"\u0080".encode("unicode-escape"))
>>>> f.close()
>>>> ^Z
> 
> python test.py (executes silently without errors)
> 
> But for raw-unicode-escape the outcome is a different thing:
> 
>>>> f = file("test.py", "wb")
>>>> f.write('s = ur"%s"\n' % u"\u0080".encode("raw-unicode-escape"))
>>>> f.close()
>>>> ^Z
> 
> python test.py
> 
>   File "test.py", line 1
> SyntaxError: Non-ASCII character '\x80' in file test.py on line 1, but
> no encoding declared; see http://www.python.org/peps/pep-0263.html for
> details
> 
> Huh? For someone who trusts the Standard Encodings section Python
> Library reference this isn't what one would expect. If the documentation
> states "Produce a string that is suitable as raw Unicode literal in
> Python source code" then why isn't it suitable?

Because the raw-unicode-escape codec won't escape the \x80 character,
hence the name. As a result, the generated source code is not ASCII,
which is why you see the exception.

But this is off-topic w/r to the issue in question.

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue2980>
_______________________________________