[Python-bugs-list] [Bug #123634] Pickle broken on Unicode strings
noreply@sourceforge.net
Thu, 28 Dec 2000 01:37:40 -0800
Bug #123634, was updated on 2000-Nov-27 14:03
Here is a current snapshot of the bug.
Project: Python
Category: Unicode
Status: Closed
Resolution: Fixed
Bug Group: None
Priority: 5
Submitted by: tlau
Assigned to : gvanrossum
Summary: Pickle broken on Unicode strings
Details: Two one-liners that produce incorrect output:
>>> cPickle.loads(cPickle.dumps(u''))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
cPickle.UnpicklingError: pickle data was truncated
>>> cPickle.loads(cPickle.dumps(u'\u03b1 alpha\n\u03b2 beta'))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
cPickle.UnpicklingError: invalid load key, '\'.
The format of the Unicode string in the pickled representation is not
escaped, as it is with regular strings. It should be. The latter bug
occurs in both pickle and cPickle; the former is only a problem with
cPickle.
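The two one-liners can be turned into a small regression check for a current interpreter. This is only a sketch and assumes Python 3, where cPickle no longer exists as a separate module and the text format is selected with protocol=0. The helper name roundtrip is hypothetical, not part of the report.

```python
import pickle


def roundtrip(value, protocol=0):
    """Pickle and unpickle value; protocol 0 is the text format discussed here."""
    return pickle.loads(pickle.dumps(value, protocol=protocol))


# The two strings from the report: the empty string, and a string that
# mixes non-ASCII characters with an embedded newline.
samples = ["", "\u03b1 alpha\n\u03b2 beta"]

for text in samples:
    assert roundtrip(text) == text
```

Both assertions should pass on a fixed implementation. On the affected release, the first case was reported to fail only in cPickle and the second in both modules.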
Follow-Ups:
Date: 2000-Dec-28 01:37
By: nobody
Comment:
Sorry, I looked at the fix proposed by "tlau". The CVS version
is just fine :-)
--
Marc-Andre
-------------------------------------------------------
Date: 2000-Dec-27 14:40
By: gvanrossum
Comment:
Your fix is backwards incompatible. Mine is compatible for strings not
containing backslashes.
I don't understand your comment about avoiding eval(): the code doesn't use
eval() (and didn't before I changed it), while your patch *adds* use of
eval().
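A sketch of how a fix can stay compatible for strings without backslashes: it assumes the writer rewrites only the backslash and the newline as \u005c and \u000a before applying raw-unicode-escape, so the reader needs no eval(). The comment above does not spell out the mechanism, so treat this as an assumption. The function names are hypothetical and the code is written for Python 3.

```python
def encode_text_unicode(text):
    """Encode text for a line-oriented pickle stream (hypothetical helper)."""
    # Escape backslashes first; otherwise the backslash introduced by the
    # newline escape below would itself be escaped a second time.
    text = text.replace("\\", "\\u005c")
    text = text.replace("\n", "\\u000a")
    return text.encode("raw-unicode-escape")


def decode_text_unicode(data):
    """Reverse encode_text_unicode with a plain codec call, no eval()."""
    return data.decode("raw-unicode-escape")


original = "\u03b1 alpha\n\u03b2 beta \\ done"
encoded = encode_text_unicode(original)

assert b"\n" not in encoded                      # safe to terminate with '\n'
assert decode_text_unicode(encoded) == original  # lossless round trip

# Strings without backslashes or newlines encode exactly as before.
plain = "\u03b1 alpha"
assert encode_text_unicode(plain) == plain.encode("raw-unicode-escape")
```

Because the escapes are ordinary \uXXXX sequences, the unchanged raw-unicode-escape decoder already understands them.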
-------------------------------------------------------
Date: 2000-Dec-20 11:18
By: nobody
Comment:
About your fix: this is not the solution I had in mind. I wanted
to avoid the problems and performance hit by not using an
encoding which requires eval() to build the Unicode object.
Wouldn't the solution I proposed be both easier to implement
and save us from adding eval() to pickle et al. ?!
--
Marc-Andre
-------------------------------------------------------
Date: 2000-Dec-18 18:10
By: gvanrossum
Comment:
Fixed in both pickle.py (rev. 1.41) and cPickle.c (rev. 2.54).
I've also checked in tests for these and similar endcases.
-------------------------------------------------------
Date: 2000-Nov-27 14:36
By: tlau
Comment:
One more comment: binary-format pickles are not affected, only text-format
pickles. Thus the part of my patch that applies to the binary section of
the save_unicode function should not be applied.
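Binary pickles are unaffected because the payload is length-prefixed, so the reader never scans for a newline. The sketch below assumes Python 3 and protocol 1, which writes the BINUNICODE opcode followed by a 4-byte little-endian length and the UTF-8 bytes. The variable names are illustrative only.

```python
import pickle
import struct

text = "\u03b1 alpha\n\u03b2 beta"
payload = text.encode("utf-8")

data = pickle.dumps(text, protocol=1)

# BINUNICODE opcode, 4-byte little-endian byte count, then the raw UTF-8
# payload; the embedded newline is stored as-is.
expected_prefix = pickle.BINUNICODE + struct.pack("<I", len(payload)) + payload

assert data.startswith(expected_prefix)
assert b"\n" in data
assert pickle.loads(data) == text
```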
-------------------------------------------------------
Date: 2000-Nov-27 14:35
By: lemburg
Comment:
Some background (no time to fix this myself):
When I added the Unicode handlers, I wanted to avoid the
problems that the string dump mechanism has with
quoted strings. The encodings used either carry length information
(in binary mode: UTF-8) or do not include the \n character
(in ascii mode: raw-unicode-escape encoding).
Unfortunately, the raw-unicode-escape codec does not
escape the newline character which is used by pickle
to break the input into tokens....
Proposed fix: change the encoding to "unicode-escape"
which doesn't have this problem. This will break code,
but only code that is already broken :-/
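The difference between the two codecs is easy to check directly. This is a minimal sketch assuming Python 3, where str.encode returns bytes; the variable names are illustrative only.

```python
text = "\u03b1 alpha\n\u03b2 beta"

raw = text.encode("raw-unicode-escape")
esc = text.encode("unicode-escape")

# raw-unicode-escape escapes the Greek letters but passes the newline
# through, so a line-oriented reader sees two lines instead of one.
assert b"\n" in raw
assert len(raw.split(b"\n")) == 2

# unicode-escape also escapes the newline, so the value stays on one line
# and still decodes back to the original text.
assert b"\n" not in esc
assert esc.decode("unicode-escape") == text
```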
-------------------------------------------------------
Date: 2000-Nov-27 14:20
By: tlau
Comment:
Here's my proposed patch to Lib/pickle.py (cPickle should be changed
similarly):
--- /scratch/tlau/Python-2.0/Lib/pickle.py Mon Oct 16 14:49:51 2000
+++ pickle.py Mon Nov 27 14:07:01 2000
@@ -286,9 +286,9 @@
             encoding = object.encode('utf-8')
             l = len(encoding)
             s = mdumps(l)[1:]
-            self.write(BINUNICODE + s + encoding)
+            self.write(BINUNICODE + `s` + encoding)
         else:
-            self.write(UNICODE + object.encode('raw-unicode-escape') + '\n')
+            self.write(UNICODE + `object.encode('raw-unicode-escape')` + '\n')
 
         memo_len = len(memo)
         self.write(self.put(memo_len))
@@ -627,7 +627,12 @@
     dispatch[BINSTRING] = load_binstring
 
     def load_unicode(self):
-        self.append(unicode(self.readline()[:-1],'raw-unicode-escape'))
+        rep = self.readline()[:-1]
+        if not self._is_string_secure(rep):
+            raise ValueError, "insecure string pickle"
+        rep = eval(rep,
+                   {'__builtins__': {}}) # Let's be careful
+        self.append(unicode(rep, 'raw-unicode-escape'))
     dispatch[UNICODE] = load_unicode
 
     def load_binunicode(self):
-------------------------------------------------------
Date: 2000-Nov-27 14:14
By: gvanrossum
Comment:
Jim, do you have time to look into this?
-------------------------------------------------------
For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=123634&group_id=5470