[Python-checkins] python/nondist/sandbox/pickletools pickletools.py,1.6,1.7

tim_one@users.sourceforge.net
Sat, 25 Jan 2003 19:58:08 -0800


Update of /cvsroot/python/python/nondist/sandbox/pickletools
In directory sc8-pr-cvs1:/tmp/cvs-serv20206

Modified Files:
	pickletools.py 
Log Message:
Added general blurbs about the pickle machine, and about pickle protocols.
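
As a rough illustration of the object-identity and recursion points made in
the new blurb, the sketch below round-trips a list with a shared subobject
and a self-referencing list. It assumes the present-day standard-library
pickle and pickletools modules (not the sandbox file in this checkin), and
the variable names are made up for the example.

```python
import pickle
import pickletools

# Shared subobject: the list [a, a] refers to the same object a twice.
a = ["shared"]
pair = [a, a]
text_pickle = pickle.dumps(pair, protocol=0)
restored = pickle.loads(text_pickle)
sharing_preserved = restored[0] is restored[1]

# Recursive object: after L.append(L), L[0] is the same list as L.
L = []
L.append(L)
M = pickle.loads(pickle.dumps(L, protocol=0))
recursion_preserved = M[0] is M

# For this data, the protocol 0 pickle contains only 7-bit ASCII bytes.
is_ascii = all(byte < 128 for byte in text_pickle)

# List the opcode names the unpickling machine would execute, in order.
opcode_names = [op.name for op, arg, pos in pickletools.genops(text_pickle)]
```

Here the second reference to a should be emitted as a memo fetch (a GET
opcode) rather than as a second copy of the list, which is how sharing
survives the round trip.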

Added an UP_TO_NEWLINE "number of bytes" value, for use in
ArgumentDescriptor objects.
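
A minimal sketch of how a reader might interpret the "number of bytes"
value, assuming a binary file-like stream. The helper read_argument is
hypothetical and is not part of pickletools.py; only the UP_TO_NEWLINE name
and its value of -1 come from this checkin.

```python
import io

UP_TO_NEWLINE = -1  # same sentinel value as in the checkin


def read_argument(stream, n):
    """Hypothetical helper: read an embedded argument from a byte stream.

    n >= 0 means exactly n bytes; UP_TO_NEWLINE means everything up to
    (but not including) the next newline character.
    """
    if n == UP_TO_NEWLINE:
        data = stream.readline()
        if not data.endswith(b"\n"):
            raise ValueError("no newline found before end of stream")
        return data[:-1]  # drop the terminating newline
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("expected %d bytes, got %d" % (n, len(data)))
    return data


# A newline-terminated decimal literal followed by a fixed 4-byte argument.
stream = io.BytesIO(b"1234\n\x01\x00\x00\x00")
decimal_text = read_argument(stream, UP_TO_NEWLINE)
four_bytes = read_argument(stream, 4)
```

Comparing against the sentinel with == rather than an identity test keeps
the check independent of how the interpreter caches small integers.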


Index: pickletools.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/pickletools/pickletools.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** pickletools.py	26 Jan 2003 03:06:46 -0000	1.6
--- pickletools.py	26 Jan 2003 03:58:06 -0000	1.7
***************
*** 12,15 ****
--- 12,98 ----
  
  
+ """
+ "A pickle" is a program for a virtual pickle machine (PM, but more accurately
+ called an unpickling machine).  It's a sequence of opcodes, interpreted by the
+ PM, building an arbitrarily complex Python object.
+ 
+ For the most part, the PM is very simple:  there are no looping, testing, or
+ conditional instructions, no arithmetic and no function calls.  Opcodes are
+ executed once each, from first to last, until a STOP opcode is reached.
+ 
+ The PM has two data areas, "the stack" and "the memo".
+ 
+ Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
+ integer object on the stack, whose value is gotten from a decimal string
+ literal immediately following the INT opcode in the pickle bytestream.  Other
+ opcodes take Python objects off the stack.  The result of unpickling is
+ whatever object is left on the stack when the final STOP opcode is executed.
+ 
+ The memo is simply an array of objects, or it can be implemented as a dict
+ mapping little integers to objects.  The memo serves as the PM's "long term
+ memory", and the little integers indexing the memo are akin to variable
+ names.  Some opcodes copy the top stack object into the memo at a given
+ index, and others push a memo object at a given index onto the stack again.
+ 
+ At heart, that's all the PM has.  Subtleties arise for these reasons:
+ 
+ + Object identity.  Objects can be arbitrarily complex, and subobjects
+   may be shared (for example, the list [a, a] refers to the same object a
+   twice).  It can be vital that unpickling recreate an isomorphic object
+   graph, faithfully reproducing sharing.
+ 
+ + Recursive objects.  For example, after "L = []; L.append(L)", L is a
+   list, and L[0] is the same list.  This is related to the object identity
+   point, and some sequences of pickle opcodes are subtle in order to
+   get the right result in all cases.
+ 
+ + Things pickle doesn't know everything about.  Examples of things pickle
+   does know everything about are Python's builtin scalar and container
+   types, like ints and tuples.  They generally have opcodes dedicated to
+   them.  For things like module references and instances of user-defined
+   classes, pickle's knowledge is limited.  Historically, many enhancements
+   have been made to the pickle protocol in order to do a better (faster,
+   and/or more compact) job on those.
+ 
+ + Backward compatibility and micro-optimization.  As explained below,
+   pickle opcodes never go away, not even when better ways to do a thing
+   get invented.  The repertoire of the PM just keeps growing over time.
+   So, e.g., there are now six distinct opcodes for building a Python integer,
+   five of them devoted to "short" integers.  Even so, the only way to pickle
+   a Python long int takes time quadratic in the number of digits, for both
+   pickling and unpickling.  This isn't so much a subtlety as a source of
+   wearying complication.
+ 
+ 
+ Pickle protocols:
+ 
+ For compatibility, the meaning of a pickle opcode never changes.  Instead new
+ pickle opcodes get added, and each version's unpickler can handle all the
+ pickle opcodes in all protocol versions to date.  So old pickles continue to
+ be readable forever.  The pickler can generally be told to restrict itself to
+ the subset of opcodes available under previous protocol versions too, so that
+ users can create pickles under the current version readable by older
+ versions.  However, a pickle does not contain its version number embedded
+ within it.  If an older unpickler tries to read a pickle using a later
+ protocol, the result is most likely an exception due to seeing an unknown (in
+ the older unpickler) opcode.
+ 
+ The original pickle used what's now called "protocol 0", and what was called
+ "text mode" before Python 2.3.  The entire pickle bytestream is made up of
+ printable 7-bit ASCII characters, plus the newline character, in protocol 0.
+ That's why it was called text mode.
+ 
+ The second major set of additions is now called "protocol 1", and was called
+ "binary mode" before Python 2.3.  This added many opcodes with arguments
+ consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
+ bytes.  Binary mode pickles can be substantially smaller than equivalent
+ text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
+ int as 4 bytes following the opcode, which is cheaper to unpickle than the
+ (perhaps) 11-character decimal string attached to INT.
+ 
+ The third major set of additions came in Python 2.3, and is called "protocol
+ 2".  XXX Write a short blurb when Guido figures out what they are <wink>. XXX
+ """
+ 
  # Meta-rule:  Descriptions are stored in instances of descriptor objects,
  # with plain constructors.  No meta-language is defined from which
***************
*** 20,24 ****
  # Some pickle opcodes have an argument, following the opcode in the
  # bytestream.  An argument is of a specific type, described by an instance
! # of ArgumentDescriptor.
  
  class ArgumentDescriptor(object):
--- 103,112 ----
  # Some pickle opcodes have an argument, following the opcode in the
  # bytestream.  An argument is of a specific type, described by an instance
! # of ArgumentDescriptor.  These are not to be confused with arguments taken
! # off the stack -- ArgumentDescriptor applies only to arguments embedded in
! # the opcode stream, immediately following an opcode.
! 
! UP_TO_NEWLINE = -1   # represents the "number of bytes" consumed by an
!                      # argument delimited by the next newline character
  
  class ArgumentDescriptor(object):
***************
*** 27,31 ****
          'name',
  
!         # length of argument, in bytes; an int; or None means variable-length
          'n',
  
--- 115,120 ----
          'name',
  
!         # length of argument, in bytes; an int; UP_TO_NEWLINE means variable-
!         # length, ending at the next occurrence of a newline character
          'n',
  
***************
*** 43,47 ****
          self.name = name
  
!         assert n is None or (isinstance(n, int) and n >= 0)
          self.n = n
  
--- 132,136 ----
          self.name = name
  
!         assert isinstance(n, int) and (n >= 0 or n == UP_TO_NEWLINE)
          self.n = n
  
***************
*** 145,149 ****
  stringnl = ArgumentDescriptor(
                 name='stringnl',
!                n=None,
                 reader=read_stringnl,
                 doc="""A newline-terminated string.
--- 234,238 ----
  stringnl = ArgumentDescriptor(
                 name='stringnl',
!                n=UP_TO_NEWLINE,
                 reader=read_stringnl,
                 doc="""A newline-terminated string.
***************
*** 202,206 ****
  decimalnl_short = ArgumentDescriptor(
                        name='decimalnl_short',
!                       n=None,
                        reader=read_decimalnl_short,
                        doc="""A newline-terminated decimal integer literal.
--- 291,295 ----
  decimalnl_short = ArgumentDescriptor(
                        name='decimalnl_short',
!                       n=UP_TO_NEWLINE,
                        reader=read_decimalnl_short,
                        doc="""A newline-terminated decimal integer literal.
***************
*** 215,219 ****
  decimalnl_long = ArgumentDescriptor(
                       name='decimalnl_long',
!                      n=None,
                       reader=read_decimalnl_long,
                       doc="""A newline-terminated decimal integer literal.
--- 304,308 ----
  decimalnl_long = ArgumentDescriptor(
                       name='decimalnl_long',
!                      n=UP_TO_NEWLINE,
                       reader=read_decimalnl_long,
                       doc="""A newline-terminated decimal integer literal.
***************
*** 235,239 ****
  floatnl = ArgumentDescriptor(
                name='floatnl',
!               n=None,
                reader=read_floatnl,
                doc="""A newline-terminated decimal floating literal.
--- 324,328 ----
  floatnl = ArgumentDescriptor(
                name='floatnl',
!               n=UP_TO_NEWLINE,
                reader=read_floatnl,
                doc="""A newline-terminated decimal floating literal.