[issue2980] Pickle stream for unicode object may contain non-ASCII characters.

Tue Oct 21 11:22:24 CEST 2008

Dan Dibagh <dddibagh at lavabit.com> added the comment:

Your reasoning shows a lack of understanding of how Python is actually
used from a programmer's point of view.

Why do you think that "noticing" a problem is the same thing as entering
it as a Python bug report? In practice there are several steps between
noticing a problem in a Python program and entering it as a bug report
in the Python development system. It is very difficult to see why any of
these steps would happen automatically. Believe me, people have had real
problems due to this bug. They have just chosen other solutions than
reporting it.

You are yourself reluctant to seek out the roots of this problem and fix
it. Why should other people behave differently and report it? A not so
uncommon "fix" to pickle problems out there is to not use pickle at
all. There are Python programmers who give the advice to avoid pickle
since "it's too shaky". It is a solution, but is it the solution you
desire?

The capability to serialize stuff into ASCII strings isn't just an
implementation detail that happens to be nice for human readability. It
is a feature people need for technical reasons. If the data is ASCII, it
can be dealt with in any ASCII-compatible context, such as network
protocols, file formats and database interfaces. That is the real use.
Programs depend on it to work properly.
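
To make the dependency concrete, here is a minimal sketch of the check a
program relying on ASCII output would implicitly be making. The helper
name is_ascii is made up for illustration, and the snippet is written to
run on a current Python 3 interpreter as well. On interpreters that show
the behaviour reported in this issue, the second value is expected to be
False while the other two are True.

```python
import pickle


def is_ascii(data):
    """Return True if every byte in data is below 0x80 (hypothetical helper)."""
    return all(b < 0x80 for b in bytearray(data))


plain = pickle.dumps(u"abc", 0)      # only ASCII characters
latin = pickle.dumps(u"\u0080", 0)   # a character in the range U+0080..U+00FF
wide = pickle.dumps(u"\u20ac", 0)    # a character above U+00FF

# Print the raw pickle streams next to the result of the ASCII check.
for stream in (plain, latin, wide):
    print(repr(stream), is_ascii(stream))
```

Only the middle case is the problematic one: characters above U+00FF are
written as \uXXXX escapes, so the stream stays ASCII for them.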

The proposed solution of changing the documentation in practice breaks
compatibility (which programming language designers normally try to
avoid, or do in a very controlled manner). How is a documentation fix
going to help all the code out there written with the assumption that
pickle protocol 0 is always ASCII? Is there a better solution around
than changing pickle to meet actual expectations?

Well, nobody has reported it as a bug in 8 years. How long do you think
code based on the ASCII assumption will stay around? 8 years? 16
years? 24 years? Maybe all the time in the world for this to become an
issue again and again and again?

It is difficult to grasp why there is "no way to fix it now". From a
programmer's point of view an obvious "fix" is to ditch pickle and use
something that delivers a consistent result rather than hours of
debugging. When I try to see it from the Python library developer's
point of view I see code implemented in C which produces a result with
reasonable performance. It is perfectly possible to write code which
produces the expected result with reasonable performance. What is the
problem?
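
As a rough illustration of what "the expected result" could look like,
the sketch below escapes every code point from U+0080 upwards, so the
output is pure ASCII. It is not the pickle implementation and not a
proposed patch. The function name ascii_raw_unicode_escape is made up,
the code assumes Python 3, and escaping the backslash as \u005c is an
assumption made here so that the existing raw-unicode-escape decoder can
read the result back unchanged.

```python
def ascii_raw_unicode_escape(text):
    """Hypothetical helper: raw-unicode-escape style output, ASCII only."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Code points outside the BMP use the 8-digit form.
            out.append("\\U%08x" % cp)
        elif cp >= 0x80 or ch == "\\":
            # Escape all non-ASCII characters and the backslash itself.
            out.append("\\u%04x" % cp)
        else:
            out.append(ch)
    return "".join(out).encode("ascii")


sample = "a\u0080\u20ac\\"
encoded = ascii_raw_unicode_escape(sample)
print(encoded)
# The stock decoder is enough to get the original text back.
print(encoded.decode("raw-unicode-escape") == sample)
```

The point of the sketch is only that ASCII-only output and the existing
decoder are compatible; whether the same change is acceptable inside the
C pickler is a separate question.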

Perhaps it is the raw-unicode-escape encoding that should be fixed? I
failed to find exact information about what raw-unicode-escape means. In
particular, where is the information which states that
raw-unicode-escape is always an 8-bit format? The closest I've come is
PEP 100 and PEP 263 (which I notice are written by you guys), which
describe how to decode raw unicode escape strings from Python source
and how to define encoding formats for Python source code. The sole
original purpose of both unicode-escape and raw-unicode-escape appears
to be representing unicode strings in Python source code as u' and ur'
strings respectively. It is clear that the decoding of a raw unicode
escaped or unicode escaped string depends on the actual encoding of the
Python source, but what is the logic by which something _encoded_ into
a raw unicode string requires the target source to be in some 8-bit
encoding? Especially considering that the default Python source encoding
is ASCII. For unicode-escape this makes sense:

>>> f = file("test.py", "wb")
>>> f.write('s = u"%s"\n' % u"\u0080".encode("unicode-escape"))
>>> f.close()
>>> ^Z

python test.py (executes silently without errors)

But for raw-unicode-escape the outcome is a different thing:

>>> f = file("test.py", "wb")
>>> f.write('s = ur"%s"\n' % u"\u0080".encode("raw-unicode-escape"))
>>> f.close()
>>> ^Z

python test.py

  File "test.py", line 1
SyntaxError: Non-ASCII character '\x80' in file test.py on line 1, but
no encoding declared; see http://www.python.org/peps/pep-0263.html for
details

Huh? For someone who trusts the Standard Encodings section of the Python
Library Reference this isn't what one would expect. If the documentation
states "Produce a string that is suitable as raw Unicode literal in
Python source code" then why isn't it suitable?
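
The difference between the two sessions above can be seen without
writing a file, by looking at the bytes each codec emits for U+0080.
This is a small sketch that should behave the same on Python 2 and
Python 3. The helper name only_ascii is made up for illustration.

```python
def only_ascii(data):
    """Return True if every byte in data is below 0x80 (hypothetical helper)."""
    return all(b < 0x80 for b in bytearray(data))


escaped = u"\u0080".encode("unicode-escape")
raw = u"\u0080".encode("raw-unicode-escape")

# unicode-escape writes a backslash escape, raw-unicode-escape writes the byte.
print(repr(escaped), only_ascii(escaped))
print(repr(raw), only_ascii(raw))
```

The raw-unicode-escape result is the single byte 0x80, which is exactly
the character the SyntaxError complains about.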

----------
nosy: +dddibagh

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue2980>
_______________________________________