[Python-ideas] An idea for a new pickling tool

Wed Apr 22 00:02:20 CEST 2009

Motivation
----------

Python's pickles use a custom format that has evolved over time,
but that format has five significant disadvantages:

    * it has lost its human readability and editability
    * it doesn't compress well
    * it isn't interoperable with other languages
    * it doesn't have the ability to enforce a schema
    * it is a major security risk for untrusted inputs

New idea
--------

Develop a solution using a mix of PyYAML, a Python-coded version of
Kwalify, optional compression using bz2, gzip, or zlib, and pretty
printing using Pygments.

YAML ( http://yaml.org/spec/1.2/ ) is a language independent standard
for data serialization.

PyYAML ( http://pyyaml.org/wiki/PyYAML ) is a full implementation of
the YAML standard.  It uses YAML's application-specific tags and
Python's own copy/reduce logic to provide the same power as pickle itself.

Kwalify ( http://www.kuwata-lab.com/kwalify/ruby/users-guide.01.html )
is a schema validator written in Ruby and Java.  It defines a
YAML/JSON based schema definition for enforcing tight constraints
on incoming data.

The bz2, gzip, and zlib compression libraries are already part of
Python's standard library.

Pygments ( http://pygments.org/ ) is a Python-based syntax highlighter
with built-in support for YAML.

Advantages
----------

* The format is simple enough to hand edit or to have lightweight
  applications emit valid pickles.  For example:

      print('Todo: [go to bank, pick up food, write code]')   # valid pickle

* To date, efforts to make pickles smaller have focused on creating new
  codes for every data type.  Instead, we can use the simple text formatting
  of YAML and let general purpose data compression utilities do their job
  (letting the user control the trade-offs between speed, space, and human
  readability):

      yaml.dump(data, compressor=None)  # fast, human readable, no compression
      yaml.dump(data, compressor=bz2)   # slowest, but best compression
      yaml.dump(data, compressor=zlib)  # medium speed and medium compression
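
The compressor= argument above is part of the proposal, not an existing
PyYAML parameter.  As a rough sketch of the same trade-off using only the
standard library, the hypothetical helpers below (dump_compressed and
load_compressed are names invented here) take text that has already been
serialized and pass it through whichever compression module the caller picks:

```python
import bz2
import zlib


def dump_compressed(text, compressor=None):
    """Encode serialized text as UTF-8 bytes, optionally compressing it.

    `compressor` is any module exposing compress()/decompress(), such as
    zlib or bz2.  This helper is illustrative, not part of PyYAML.
    """
    raw = text.encode("utf-8")
    if compressor is None:
        return raw                      # human readable, no compression
    return compressor.compress(raw)


def load_compressed(blob, compressor=None):
    """Reverse dump_compressed() using the same compressor module."""
    raw = blob if compressor is None else compressor.decompress(blob)
    return raw.decode("utf-8")


# A repetitive YAML document standing in for real serialized data.
document = "Todo: [go to bank, pick up food, write code]\n" * 200

for module in (None, zlib, bz2):
    blob = dump_compressed(document, module)
    assert load_compressed(blob, module) == document
    name = "none" if module is None else module.__name__
    print(name, len(blob))
```

Passing the module itself keeps the choice between speed, space, and
readability in the caller's hands, which is the point of the bullet above.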

* The current pickle tools make it easy to exchange object trees between
  two Python processes.  The new tool would make it equally easy
  to exchange object trees between processes running any of Python, Ruby,
  Java, C/C++, Perl, C#, PHP, OCaml, JavaScript, ActionScript, and Haskell.

* Ability to use a schema for enforcing a given object model and allowing
  full security.  Which would you rather run on untrusted data:

      data = yaml.load(myfile, schema=ListOfStrings)

  or

      data = pickle.load(myfile)
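
To make the risk concrete, here is a small, harmless sketch of why
unpickling untrusted data is dangerous.  The Payload class is invented for
illustration; it asks pickle to call a function of its choosing (the
built-in len here) when the data is loaded.  A hostile pickle could name a
far less innocent callable in the same way:

```python
import pickle


class Payload:
    """Illustrative class whose pickle tells the loader what to call."""

    def __reduce__(self):
        # The unpickler will call len("abc") and return the result
        # in place of a Payload instance.
        return (len, ("abc",))


blob = pickle.dumps(Payload())

# Loading runs the callable named inside the byte stream.
result = pickle.loads(blob)
print(result)  # 3, not a Payload instance
```

The loader never checks that the callable is one the receiving program
intended to allow, which is the gap a schema is meant to close.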

* Specification of a schema using YAML itself

  ListOfStrings (a schema written in yaml)
  ........................................
  type:   seq
  sequence:
    - type:   str

  Sample of valid input
  .....................
  - foo
  - bar
  - baz

  Note, schemas can be defined for very complex, nested object models and
  allow many kinds of constraints (unique items, enumerated list of allowable
  values, min/max allowable ranges for values, data type, maximum length,
  and names of regular Python classes that can be constructed).
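
As a minimal sketch of what a Python-coded validator might look like, the
function below checks already-decoded data against a schema given as a plain
dict (the form the ListOfStrings schema would take once loaded from YAML).
The validate function, its error format, and the small set of supported
types are assumptions made for illustration; they cover only a fraction of
what Kwalify defines:

```python
# Scalar type names mapped to the Python types they accept (illustrative subset).
SCALARS = {"str": str, "int": int, "float": float, "bool": bool}


def validate(data, schema, path=""):
    """Return a list of error strings; an empty list means the data is valid."""
    kind = schema["type"]
    where = path or "/"
    if kind == "seq":
        if not isinstance(data, list):
            return [f"{where}: expected a sequence"]
        item_schema = schema["sequence"][0]   # one schema shared by all items
        errors = []
        for index, item in enumerate(data):
            errors.extend(validate(item, item_schema, f"{path}/{index}"))
        return errors
    if kind == "map":
        if not isinstance(data, dict):
            return [f"{where}: expected a mapping"]
        errors = []
        for key, value in data.items():
            if key not in schema["mapping"]:
                errors.append(f"{path}/{key}: unexpected key")
            else:
                errors.extend(
                    validate(value, schema["mapping"][key], f"{path}/{key}"))
        return errors
    # Exact type match, so that True is not accepted where an int is required.
    if type(data) is not SCALARS[kind]:
        return [f"{where}: expected {kind}"]
    return []


list_of_strings = {"type": "seq", "sequence": [{"type": "str"}]}

print(validate(["foo", "bar", "baz"], list_of_strings))  # []
print(validate(["foo", 42], list_of_strings))            # ['/1: expected str']
```

Returning a list of errors rather than raising on the first one lets a
caller report every problem in a rejected document at once.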

* YAML is a superset of JSON, so the schema validation also works equally
  well with JSON encoded data.

What needs to be done
---------------------

* Combine the tools into a single, clean interface offering C-speed
  parsing of a data serialization standard, with optional compression,
  schema validation, and pretty printing.