[Python-bugs-list] [Bug #123634] Pickle broken on Unicode strings

noreply@sourceforge.net
Thu, 28 Dec 2000 01:37:40 -0800


Bug #123634, was updated on 2000-Nov-27 14:03
Here is a current snapshot of the bug.

Project: Python
Category: Unicode
Status: Closed
Resolution: Fixed
Bug Group: None
Priority: 5
Submitted by: tlau
Assigned to: gvanrossum
Summary: Pickle broken on Unicode strings

Details: Two one-liners that produce incorrect output:

>>> cPickle.loads(cPickle.dumps(u''))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
cPickle.UnpicklingError: pickle data was truncated
>>> cPickle.loads(cPickle.dumps(u'\u03b1 alpha\n\u03b2 beta'))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
cPickle.UnpicklingError: invalid load key, '\'.

The format of the Unicode string in the pickled representation is not
escaped, as it is with regular strings.  It should be.  The latter bug
occurs in both pickle and cPickle; the former is only a problem with
cPickle.

Follow-Ups:

Date: 2000-Dec-28 01:37
By: nobody

Comment:
Sorry, I looked at the fix proposed by "tlau". The CVS version
is just fine :-) 
--
Marc-Andre
-------------------------------------------------------

Date: 2000-Dec-27 14:40
By: gvanrossum

Comment:
Your fix is backwards incompatible. Mine is compatible for strings not
containing backslashes.

I don't understand your comment about avoiding eval(): the code doesn't use
eval() (and didn't before I changed it), while your patch *adds* use of
eval().
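
The thread does not show the checked-in change, so the following is only a
sketch of one approach consistent with the compatibility claim above, written
for Python 3 (an assumption; the report is against Python 2.0). The helper
names dump_text_unicode and load_text_unicode are hypothetical. The idea is to
rewrite backslashes and newlines as \uXXXX escapes before applying
raw-unicode-escape, so the reader needs no eval() and the output is unchanged
for strings that contain neither character.

```python
def dump_text_unicode(s):
    """Hypothetical writer for a text-mode Unicode record: b'V' + payload + b'\\n'."""
    # Escape the backslash first, otherwise the escapes added next would be re-escaped.
    s = s.replace('\\', '\\u005c')
    # Escape the newline so the payload always fits on one line.
    s = s.replace('\n', '\\u000a')
    return b'V' + s.encode('raw-unicode-escape') + b'\n'


def load_text_unicode(line):
    """Hypothetical reader: strip the opcode and the trailing newline, then decode."""
    if line[:1] != b'V' or line[-1:] != b'\n':
        raise ValueError('not a text-mode Unicode record')
    # Plain codec decoding; no eval() involved.
    return line[1:-1].decode('raw-unicode-escape')


samples = ['', '\u03b1 alpha\n\u03b2 beta', 'back\\slash', '\\u0041']
results = [load_text_unicode(dump_text_unicode(s)) for s in samples]
```

For a string with no backslash and no newline, such as u'\u03b1 alpha', the
bytes written are identical to what the old code wrote, which is the sense in
which this style of fix stays compatible.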

-------------------------------------------------------

Date: 2000-Dec-20 11:18
By: nobody

Comment:
About your fix: this is not the solution I had in mind. I wanted
to avoid the problems and performance hit by not using an
encoding which requires eval() to build the Unicode object.

Wouldn't the solution I proposed be both easier to implement
and save us from adding eval() to pickle et al.?!

--
Marc-Andre
-------------------------------------------------------

Date: 2000-Dec-18 18:10
By: gvanrossum

Comment:
Fixed in both pickle.py (rev. 1.41) and cPickle.c (rev. 2.54).

I've also checked in tests for these and similar endcases.
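
The checked-in tests are not quoted in this thread. A round-trip check of that
general kind might look like the sketch below, which uses Python 3's pickle
module (an assumption; the original tests targeted Python 2.0's pickle and
cPickle). The CASES list and the roundtrip_failures helper are hypothetical.

```python
import pickle

# The two strings from the report, plus two backslash cases.
CASES = ['', '\u03b1 alpha\n\u03b2 beta', 'back\\slash', '\\u0041']


def roundtrip_failures(cases=CASES, protocols=(0, 1)):
    """Return (protocol, string) pairs that do not survive dumps/loads."""
    failures = []
    for proto in protocols:  # 0 is the text format, 1 is the old binary format
        for s in cases:
            try:
                ok = pickle.loads(pickle.dumps(s, proto)) == s
            except Exception:
                ok = False
            if not ok:
                failures.append((proto, s))
    return failures


failures = roundtrip_failures()
```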
-------------------------------------------------------

Date: 2000-Nov-27 14:36
By: tlau

Comment:
One more comment: binary-format pickles are not affected, only text-format
pickles.  Thus the part of my patch that applies to the binary section of
the save_unicode function should not be applied.
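
To see why the binary format is immune, note that it frames the string by
length instead of by line. The sketch below uses Python 3's pickle module (an
assumption; the thread concerns Python 2.0) and checks that a protocol 1
pickle starts with the BINUNICODE opcode, a 4-byte little-endian length, and
the UTF-8 bytes, with the embedded newline left as-is.

```python
import pickle
import struct

text = '\u03b1 alpha\n\u03b2 beta'
utf8 = text.encode('utf-8')

# Protocol 1 is the original binary format.
binary = pickle.dumps(text, protocol=1)

# BINUNICODE is the opcode b'X', followed by the payload length and the payload.
expected_prefix = b'X' + struct.pack('<I', len(utf8)) + utf8

has_prefix = binary.startswith(expected_prefix)
restored = pickle.loads(binary)
```

Because the reader knows how many bytes to consume, a raw newline inside the
payload is harmless.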
-------------------------------------------------------

Date: 2000-Nov-27 14:35
By: lemburg

Comment:
Some background (no time to fix this myself):

When I added the Unicode handlers, I wanted to avoid the
problems that the string dump mechanism has with
quoted strings. The encodings used either carry length information
(in binary mode: UTF-8) or do not include the \n character
(in ascii mode: raw-unicode-escape encoding). 

Unfortunately, the raw-unicode-escape codec does not
escape the newline character which is used by pickle
to break the input into tokens.... 

Proposed fix: change the encoding to "unicode-escape"
which doesn't have this problem. This will break code,
but only code that is already broken :-/
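
The difference between the two codecs can be checked directly. This is a
Python 3 sketch (an assumption; the thread concerns Python 2.0's unicode
type), and the split on b'\n' only imitates a line-oriented reader, it is not
the pickle code itself.

```python
text = '\u03b1 alpha\n\u03b2 beta'

raw = text.encode('raw-unicode-escape')
esc = text.encode('unicode-escape')

# raw-unicode-escape leaves the newline as a literal byte, so a reader
# that stops at the first newline only gets the first part of the string.
raw_has_newline = b'\n' in raw
first_line = raw.split(b'\n', 1)[0]
truncated = first_line.decode('raw-unicode-escape')

# unicode-escape writes the newline as backslash + 'n', so the whole
# payload stays on one line and decodes back to the original.
esc_has_newline = b'\n' in esc
restored = esc.decode('unicode-escape')
```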

-------------------------------------------------------

Date: 2000-Nov-27 14:20
By: tlau

Comment:
Here's my proposed patch to Lib/pickle.py (cPickle should be changed
similarly):

--- /scratch/tlau/Python-2.0/Lib/pickle.py      Mon Oct 16 14:49:51 2000
+++ pickle.py   Mon Nov 27 14:07:01 2000
@@ -286,9 +286,9 @@
             encoding = object.encode('utf-8')
             l = len(encoding)
             s = mdumps(l)[1:]
-            self.write(BINUNICODE + s + encoding)
+            self.write(BINUNICODE + `s` + encoding)
         else:
-            self.write(UNICODE + object.encode('raw-unicode-escape') + '\n')
+            self.write(UNICODE + `object.encode('raw-unicode-escape')` + '\n')
 
         memo_len = len(memo)
         self.write(self.put(memo_len))
@@ -627,7 +627,12 @@
     dispatch[BINSTRING] = load_binstring
 
     def load_unicode(self):
-        self.append(unicode(self.readline()[:-1],'raw-unicode-escape'))
+        rep = self.readline()[:-1]
+        if not self._is_string_secure(rep):
+            raise ValueError, "insecure string pickle"
+        rep = eval(rep,
+                   {'__builtins__': {}}) # Let's be careful
+        self.append(unicode(rep, 'raw-unicode-escape'))
     dispatch[UNICODE] = load_unicode
 
     def load_binunicode(self):

-------------------------------------------------------

Date: 2000-Nov-27 14:14
By: gvanrossum

Comment:
Jim, do you have time to look into this?
-------------------------------------------------------

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=123634&group_id=5470