[Python-ideas] An idea for a new pickling tool
Raymond Hettinger
python at rcn.com
Wed Apr 22 00:02:20 CEST 2009
Motivation
----------
Python's pickles use a custom format that has evolved over time,
but that format has five significant disadvantages:
* it has lost its human readability and editability
* it doesn't compress well
* it isn't interoperable with other languages
* it doesn't have the ability to enforce a schema
* it is a major security risk for untrusted inputs
New idea
--------
Develop a solution using a mix of PyYAML, a Python-coded version of
Kwalify, optional compression using bz2, gzip, or zlib, and pretty
printing using pygments.
YAML ( http://yaml.org/spec/1.2/ ) is a language independent standard
for data serialization.
PyYAML ( http://pyyaml.org/wiki/PyYAML ) is a full implementation of
the YAML standard. It uses YAML's application-specific tags and
Python's own copy/reduce logic to provide the same power as pickle itself.
Kwalify ( http://www.kuwata-lab.com/kwalify/ruby/users-guide.01.html )
is a schema validator written in Ruby and Java. It defines a
YAML/JSON based schema definition for enforcing tight constraints
on incoming data.
The bz2, gzip, and zlib compression libraries are already built into
the language.
Pygments ( http://pygments.org/ ) is a Python-based syntax highlighter
with built-in support for YAML.
Advantages
----------
* The format is simple enough to hand edit or to have lightweight
applications emit valid pickles. For example:
print('Todo: [go to bank, pick up food, write code]') # valid pickle
* To date, efforts to make pickles smaller have focused on creating new
codes for every data type. Instead, we can use the simple text formatting
of YAML and let general purpose data compression utilities do their job
(letting the user control the trade-offs between speed, space, and human
readability):
yaml.dump(data, compressor=None) # fast, human readable, no compression
yaml.dump(data, compressor=bz2) # slowest, but best compression
yaml.dump(data, compressor=zlib) # medium speed and medium compression
* The current pickle tools make it easy to exchange object trees between
two Python processes. The new tool would make it equally easy
to exchange object trees between processes running any of Python, Ruby,
Java, C/C++, Perl, C#, PHP, OCaml, JavaScript, ActionScript, and Haskell.
* Ability to use a schema for enforcing a given object model and allowing
full security. Which would you rather run on untrusted data:
data = yaml.load(myfile, schema=ListOfStrings)
or
data = pickle.load(myfile)
* Specification of a schema using YAML itself
ListOfStrings (a schema written in YAML)
........................................
type: seq
sequence:
- type: str
Sample of valid input
.....................
- foo
- bar
- baz
Note, schemas can be defined for very complex, nested object models and
allow many kinds of constraints (unique items, enumerated list of allowable
values, min/max allowable ranges for values, data type, maximum length,
and names of regular Python classes that can be constructed).
* YAML is a superset of JSON, so the schema validation works equally
well with JSON-encoded data.
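
To make the schema bullets concrete, here is a minimal sketch of what
Kwalify-style validation could look like in pure Python. It is not Kwalify
and not part of PyYAML: the names validate, load_validated, LIST_OF_STRINGS,
and SCALAR_TYPES are hypothetical, only the "seq" type and a few scalar
types are handled, and the stdlib json parser stands in for a YAML parser
(JSON text being valid YAML).

```python
import json

# Hypothetical schema mirroring the ListOfStrings example, written as a
# Python dict here instead of being parsed from YAML.
LIST_OF_STRINGS = {"type": "seq", "sequence": [{"type": "str"}]}

# Assumed mapping from schema type names to Python types; this covers only
# a small subset of the types a real validator would support.
SCALAR_TYPES = {"str": str, "int": int, "float": float, "bool": bool}


def validate(data, schema, path="/"):
    """Return a list of error strings; an empty list means data is valid."""
    kind = schema["type"]
    if kind == "seq":
        if not isinstance(data, list):
            return [f"{path}: expected seq, got {type(data).__name__}"]
        item_schema = schema["sequence"][0]
        errors = []
        for index, item in enumerate(data):
            errors.extend(validate(item, item_schema, f"{path}{index}/"))
        return errors
    expected = SCALAR_TYPES.get(kind)
    if expected is None:
        raise ValueError(f"unsupported schema type: {kind!r}")
    # Exact type match, so that True is not accepted where an int is required.
    if type(data) is not expected:
        return [f"{path}: expected {kind}, got {type(data).__name__}"]
    return []


def load_validated(text, schema):
    """Parse text (JSON standing in for YAML) and reject schema violations."""
    data = json.loads(text)
    errors = validate(data, schema)
    if errors:
        raise ValueError("; ".join(errors))
    return data


if __name__ == "__main__":
    print(load_validated('["foo", "bar", "baz"]', LIST_OF_STRINGS))
```

In this sketch the parser only ever builds plain lists and scalars, and
anything the schema does not name is rejected before the caller sees it.
That is roughly the property the proposal relies on for untrusted input.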
What needs to be done
---------------------
* Combine the tools into a single, clean interface to C-speed parsing
of a data serialization standard, with optional compression, schema
validation, and pretty printing.
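
As a rough illustration of the "text format plus optional general-purpose
compression" part of such an interface, the sketch below wraps a text
serializer with a pluggable compressor. The dump and load functions and
their compressor parameter are hypothetical and are not PyYAML's API. The
stdlib json module stands in for YAML so the example runs without
third-party packages, and schema validation and pretty printing are left
out.

```python
import bz2
import gzip
import json
import zlib


def dump(data, compressor=None):
    """Serialize data to UTF-8 text bytes, optionally compressed.

    compressor is any module or object exposing compress() and
    decompress(), such as the stdlib bz2, gzip, or zlib modules.
    """
    raw = json.dumps(data, indent=2).encode("utf-8")
    return raw if compressor is None else compressor.compress(raw)


def load(blob, compressor=None):
    """Reverse dump(); the same compressor must be passed back in."""
    raw = blob if compressor is None else compressor.decompress(blob)
    return json.loads(raw.decode("utf-8"))


if __name__ == "__main__":
    data = {"todo": ["go to bank", "pick up food", "write code"] * 50}
    for comp in (None, zlib, gzip, bz2):
        blob = dump(data, compressor=comp)
        assert load(blob, compressor=comp) == data
        name = comp.__name__ if comp is not None else "none"
        print(f"{name:>5}: {len(blob)} bytes")
```

Passing the compressor as a parameter keeps the on-disk format readable by
default and leaves the speed versus size trade-off to the caller.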