[Python-ideas] An idea for a new pickling tool
Jesse Noller
jnoller at gmail.com
Wed Apr 22 00:10:24 CEST 2009
On Apr 21, 2009, at 6:02 PM, "Raymond Hettinger" <python at rcn.com> wrote:
> Motivation
> ----------
>
> Python's pickle format is a custom one that has evolved over time,
> but it has five significant disadvantages:
>
> * it has lost its human readability and editability
> * it doesn't compress well
> * it isn't interoperable with other languages
> * it doesn't have the ability to enforce a schema
> * it is a major security risk for untrusted inputs
>
>
> New idea
> --------
>
> Develop a solution using a mix of PyYAML, a Python-coded version of
> Kwalify, optional compression using bz2, gzip, or zlib, and pretty
> printing using pygments.
>
> YAML ( http://yaml.org/spec/1.2/ ) is a language independent standard
> for data serialization.
>
> PyYAML ( http://pyyaml.org/wiki/PyYAML ) is a full implementation of
> the YAML standard. It uses YAML's application-specific tags and
> Python's own copy/reduce logic to provide the same power as pickle
> itself.
>
> Kwalify ( http://www.kuwata-lab.com/kwalify/ruby/users-guide.01.html )
> is a schema validator written in Ruby and Java. It defines a
> YAML/JSON based schema definition for enforcing tight constraints
> on incoming data.
>
> The bz2, gzip, and zlib compression libraries are already built into
> the language.
>
> Pygments ( http://pygments.org/ ) is a Python-based syntax highlighter
> with built-in support for YAML.
>
>
> Advantages
> ----------
>
> * The format is simple enough to hand edit or to have lightweight
> applications emit valid pickles. For example:
>
> print('Todo: [go to bank, pick up food, write code]')  # valid pickle
>
> * To date, efforts to make pickles smaller have focused on creating
> new codes for every data type. Instead, we can use the simple text
> formatting of YAML and let general purpose data compression utilities
> do their job (letting the user control the trade-offs between speed,
> space, and human readability):
>
> yaml.dump(data, compressor=None)  # fast, human readable, no compression
> yaml.dump(data, compressor=bz2)   # slowest, but best compression
> yaml.dump(data, compressor=zlib)  # medium speed and medium compression
>
> * The current pickle tools make it easy to exchange object trees
> between two Python processes. The new tool would make it equally easy
> to exchange object trees between processes running any of Python,
> Ruby, Java, C/C++, Perl, C#, PHP, OCaml, JavaScript, ActionScript,
> and Haskell.
>
> * Ability to use a schema for enforcing a given object model and
> allowing full security. Which would you rather run on untrusted data:
>
> data = yaml.load(myfile, schema=ListOfStrings)
>
> or
>
> data = pickle.load(myfile)
>
> * Specification of a schema using YAML itself
>
> ListOfStrings (a schema written in yaml)
> ........................................
> type: seq
> sequence:
> - type: str
>
> Sample of valid input
> .....................
> - foo
> - bar
> - baz
>
> Note, schemas can be defined for very complex, nested object models
> and allow many kinds of constraints (unique items, enumerated list of
> allowable values, min/max allowable ranges for values, data type,
> maximum length, and names of regular Python classes that can be
> constructed).
>
> * YAML is a superset of JSON, so the schema validation also works
> equally well with JSON encoded data.
>
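
To make the compression trade-off quoted above concrete, here is a minimal
sketch using only the standard library. The compressor= argument is part of
the proposal, not an existing PyYAML feature, so dump() and load() below are
hypothetical helpers, and json stands in for a YAML emitter (reasonable only
because, as noted above, YAML is a superset of JSON).

```python
import bz2
import json
import zlib


def dump(data, compressor=None):
    """Serialize data to text, then optionally compress the bytes.

    Hypothetical helper: json is a stand-in for a YAML emitter, and
    `compressor` is any module exposing compress()/decompress().
    """
    raw = json.dumps(data, indent=2, sort_keys=True).encode("utf-8")
    return raw if compressor is None else compressor.compress(raw)


def load(blob, compressor=None):
    """Reverse of dump(): optionally decompress, then parse the text."""
    raw = blob if compressor is None else compressor.decompress(blob)
    return json.loads(raw.decode("utf-8"))


if __name__ == "__main__":
    # Repetitive sample data, so the compressors have something to work on.
    data = {"Todo": ["go to bank", "pick up food", "write code"] * 50}
    for comp in (None, zlib, bz2):
        blob = dump(data, compressor=comp)
        assert load(blob, compressor=comp) == data  # round trip is lossless
        print(comp.__name__ if comp else "none", len(blob), "bytes")
```

The point of the design is that the serializer stays a plain text format and
the caller picks the speed/space trade-off by passing a different module.
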
> What needs to be done
> ---------------------
>
> * Combine the tools for a single, clean interface to C speed parsing
> of a data serialization standard, with optional compression, schema
> validation, and pretty printing.
>
A huge +1 from me. I've used YAML quite a bit, and as a cross-language
communications format it's quite nice.
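
For what the schema check might look like, here is a rough sketch covering
only the tiny subset of Kwalify-style rules needed for the ListOfStrings
example in the quoted proposal. The names validate() and load_checked() are
my own placeholders, not an existing PyYAML or Kwalify API, and json again
stands in for the YAML parser.

```python
import json

# The ListOfStrings schema from the proposal, as it would look once parsed.
LIST_OF_STRINGS = {"type": "seq", "sequence": [{"type": "str"}]}

# Scalar type names handled by this sketch (a small subset only).
SCALARS = {"str": str, "int": int, "float": float, "bool": bool}


def validate(data, schema, path="root"):
    """Return a list of error strings; an empty list means data conforms."""
    kind = schema.get("type", "str")
    if kind == "seq":
        if not isinstance(data, list):
            return [f"{path}: expected a sequence, got {type(data).__name__}"]
        item_schema = schema["sequence"][0]  # one rule applied to every item
        errors = []
        for i, item in enumerate(data):
            errors.extend(validate(item, item_schema, f"{path}[{i}]"))
        return errors
    expected = SCALARS.get(kind)
    if expected is None:
        return [f"{path}: unsupported schema type {kind!r}"]
    # bool is a subclass of int in Python, so reject it unless asked for.
    ok = isinstance(data, expected) and (
        expected is bool or not isinstance(data, bool)
    )
    if not ok:
        return [f"{path}: expected {kind}, got {type(data).__name__}"]
    return []


def load_checked(text, schema):
    """Parse text into plain data types, then enforce the schema."""
    data = json.loads(text)
    errors = validate(data, schema)
    if errors:
        raise ValueError("; ".join(errors))
    return data


if __name__ == "__main__":
    print(load_checked('["foo", "bar", "baz"]', LIST_OF_STRINGS))
    try:
        load_checked('["foo", 3]', LIST_OF_STRINGS)
    except ValueError as exc:
        print("rejected:", exc)
```

Because the parser here only ever builds plain lists, strings, and numbers,
and anything off-schema is rejected before the caller sees it, nothing in
the input gets a chance to construct arbitrary objects.
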
Jesse