[Python-checkins] python/nondist/sandbox/pickletools pickletools.py,1.6,1.7
tim_one@users.sourceforge.net
Sat, 25 Jan 2003 19:58:08 -0800
Update of /cvsroot/python/python/nondist/sandbox/pickletools
In directory sc8-pr-cvs1:/tmp/cvs-serv20206
Modified Files:
pickletools.py
Log Message:
Added general blurbs about the pickle machine, and about pickle protocols.
Added an UP_TO_NEWLINE "number of bytes" value, for use in
ArgumentDescriptor objects.
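
For readers who want to poke at the ideas in this checkin, here is a minimal
sketch. It is not part of the commit. It assumes the standard-library
pickletools module of a current Python exposes the same names as this sandbox
file (UP_TO_NEWLINE, and opcode objects whose .arg is an ArgumentDescriptor
with an .n attribute). The pickles are written by hand so the opcodes are
visible, and the variable names are made up for illustration.

```python
import pickle
import pickletools
import struct

# Protocol 0 ("text mode"): INT opcode, a decimal literal, a newline, STOP.
text_pickle = b"I42\n."
# Protocol 1 ("binary mode"): BININT opcode, exactly 4 little-endian bytes, STOP.
binary_pickle = b"J" + struct.pack("<i", 42) + b"."

# Walk each pickle and show every opcode with the declared width of its
# embedded argument (None when the opcode takes no embedded argument).
for data in (text_pickle, binary_pickle):
    for opcode, arg, pos in pickletools.genops(data):
        width = opcode.arg.n if opcode.arg is not None else None
        print(pos, opcode.name, arg, width)

# First opcode of each pickle, kept for inspection.
int_op = next(pickletools.genops(text_pickle))[0]
binint_op = next(pickletools.genops(binary_pickle))[0]

# The memo at work. PUT (p) stores the top of the stack in the memo without
# popping it; GET (g) pushes a memo entry back onto the stack.
# Builds [a, a] where both slots hold the same inner list.
shared = pickle.loads(b"(lp0\n(lp1\nag1\na.")
# Builds the recursive list from "L = []; L.append(L)".
recursive = pickle.loads(b"(lp0\ng0\na.")

print(shared[0] is shared[1], recursive[0] is recursive)
```

The INT argument reports a width of UP_TO_NEWLINE because its length is only
known once the newline is found, while the BININT argument reports a fixed
width of 4.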
Index: pickletools.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/pickletools/pickletools.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** pickletools.py 26 Jan 2003 03:06:46 -0000 1.6
--- pickletools.py 26 Jan 2003 03:58:06 -0000 1.7
***************
*** 12,15 ****
--- 12,98 ----
+ """
+ "A pickle" is a program for a virtual pickle machine (PM, but more accurately
+ called an unpickling machine). It's a sequence of opcodes, interpreted by the
+ PM, building an arbitrarily complex Python object.
+
+ For the most part, the PM is very simple: there are no looping, testing, or
+ conditional instructions, no arithmetic and no function calls. Opcodes are
+ executed once each, from first to last, until a STOP opcode is reached.
+
+ The PM has two data areas, "the stack" and "the memo".
+
+ Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
+ integer object on the stack, whose value is gotten from a decimal string
+ literal immediately following the INT opcode in the pickle bytestream. Other
+ opcodes take Python objects off the stack. The result of unpickling is
+ whatever object is left on the stack when the final STOP opcode is executed.
+
+ The memo is simply an array of objects, or it can be implemented as a dict
+ mapping little integers to objects. The memo serves as the PM's "long term
+ memory", and the little integers indexing the memo are akin to variable
+ names. Some opcodes pop a stack object into the memo at a given index,
+ and others push a memo object at a given index onto the stack again.
+
+ At heart, that's all the PM has. Subtleties arise for these reasons:
+
+ + Object identity. Objects can be arbitrarily complex, and subobjects
+ may be shared (for example, the list [a, a] refers to the same object a
+ twice). It can be vital that unpickling recreate an isomorphic object
+ graph, faithfully reproducing sharing.
+
+ + Recursive objects. For example, after "L = []; L.append(L)", L is a
+ list, and L[0] is the same list. This is related to the object identity
+ point, and some sequences of pickle opcodes are subtle in order to
+ get the right result in all cases.
+
+ + Things pickle doesn't know everything about. Examples of things pickle
+ does know everything about are Python's builtin scalar and container
+ types, like ints and tuples. They generally have opcodes dedicated to
+ them. For things like module references and instances of user-defined
+ classes, pickle's knowledge is limited. Historically, many enhancements
+ have been made to the pickle protocol in order to do a better (faster,
+ and/or more compact) job on those.
+
+ + Backward compatibility and micro-optimization. As explained below,
+ pickle opcodes never go away, not even when better ways to do a thing
+ get invented. The repertoire of the PM just keeps growing over time.
+ So, e.g., there are now six distinct opcodes for building a Python integer,
+ five of them devoted to "short" integers. Even so, the only way to pickle
+ a Python long int takes time quadratic in the number of digits, for both
+ pickling and unpickling. This isn't so much a subtlety as a source of
+ wearying complication.
+
+
+ Pickle protocols:
+
+ For compatibility, the meaning of a pickle opcode never changes. Instead new
+ pickle opcodes get added, and each version's unpickler can handle all the
+ pickle opcodes in all protocol versions to date. So old pickles continue to
+ be readable forever. The pickler can generally be told to restrict itself to
+ the subset of opcodes available under previous protocol versions too, so that
+ users can create pickles under the current version readable by older
+ versions. However, a pickle does not contain its version number embedded
+ within it. If an older unpickler tries to read a pickle using a later
+ protocol, the result is most likely an exception due to seeing an unknown (in
+ the older unpickler) opcode.
+
+ The original pickle used what's now called "protocol 0", and what was called
+ "text mode" before Python 2.3. The entire pickle bytestream is made up of
+ printable 7-bit ASCII characters, plus the newline character, in protocol 0.
+ That's why it was called text mode.
+
+ The second major set of additions is now called "protocol 1", and was called
+ "binary mode" before Python 2.3. This added many opcodes with arguments
+ consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
+ bytes. Binary mode pickles can be substantially smaller than equivalent
+ text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
+ int as 4 bytes following the opcode, which is cheaper to unpickle than the
+ (perhaps) 11-character decimal string attached to INT.
+
+ The third major set of additions came in Python 2.3, and is called "protocol
+ 2". XXX Write a short blurb when Guido figures out what they are <wink>. XXX
+ """
+
# Meta-rule: Descriptions are stored in instances of descriptor objects,
# with plain constructors. No meta-language is defined from which
***************
*** 20,24 ****
# Some pickle opcodes have an argument, following the opcode in the
# bytestream. An argument is of a specific type, described by an instance
! # of ArgumentDescriptor.
class ArgumentDescriptor(object):
--- 103,112 ----
# Some pickle opcodes have an argument, following the opcode in the
# bytestream. An argument is of a specific type, described by an instance
! # of ArgumentDescriptor. These are not to be confused with arguments taken
! # off the stack -- ArgumentDescriptor applies only to arguments embedded in
! # the opcode stream, immediately following an opcode.
!
! UP_TO_NEWLINE = -1 # represents the "number of bytes" consumed by an
! # argument delimited by the next newline character
class ArgumentDescriptor(object):
***************
*** 27,31 ****
'name',
! # length of argument, in bytes; an int; or None means variable-length
'n',
--- 115,120 ----
'name',
! # length of argument, in bytes; an int; UP_TO_NEWLINE means variable-
! # length, ending at the next occurrence of a newline character
'n',
***************
*** 43,47 ****
self.name = name
! assert n is None or (isinstance(n, int) and n >= 0)
self.n = n
--- 132,136 ----
self.name = name
! assert isinstance(n, int) and (n >= 0 or n == UP_TO_NEWLINE)
self.n = n
***************
*** 145,149 ****
stringnl = ArgumentDescriptor(
name='stringnl',
! n=None,
reader=read_stringnl,
doc="""A newline-terminated string.
--- 234,238 ----
stringnl = ArgumentDescriptor(
name='stringnl',
! n=UP_TO_NEWLINE,
reader=read_stringnl,
doc="""A newline-terminated string.
***************
*** 202,206 ****
decimalnl_short = ArgumentDescriptor(
name='decimalnl_short',
! n=None,
reader=read_decimalnl_short,
doc="""A newline-terminated decimal integer literal.
--- 291,295 ----
decimalnl_short = ArgumentDescriptor(
name='decimalnl_short',
! n=UP_TO_NEWLINE,
reader=read_decimalnl_short,
doc="""A newline-terminated decimal integer literal.
***************
*** 215,219 ****
decimalnl_long = ArgumentDescriptor(
name='decimalnl_long',
! n=None,
reader=read_decimalnl_long,
doc="""A newline-terminated decimal integer literal.
--- 304,308 ----
decimalnl_long = ArgumentDescriptor(
name='decimalnl_long',
! n=UP_TO_NEWLINE,
reader=read_decimalnl_long,
doc="""A newline-terminated decimal integer literal.
***************
*** 235,239 ****
floatnl = ArgumentDescriptor(
name='floatnl',
! n=None,
reader=read_floatnl,
doc="""A newline-terminated decimal floating literal.
--- 324,328 ----
floatnl = ArgumentDescriptor(
name='floatnl',
! n=UP_TO_NEWLINE,
reader=read_floatnl,
doc="""A newline-terminated decimal floating literal.