[Python-ideas] An idea for a new pickling tool

Wed Apr 22 00:26:24 CEST 2009

On Tue, Apr 21, 2009 at 3:02 PM, Raymond Hettinger <python at rcn.com> wrote:
> Motivation
> ----------
>
> Python's pickles use a custom format that has evolved over time
> but they have five significant disadvantages:
>
>   * it has lost its human readability and editability
>   * it doesn't compress well

Really? Or do you mean "it doesn't have built-in compression support"?
I don't expect that running bzip2 over a pickle would produce
unsatisfactory results, and the API supports reading from and writing
to streams.

Or do you just mean "the representation is too repetitive and bulky"?
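
To make the streams point concrete, here is a minimal sketch that layers
bz2 over pickle using only the standard library. The sample records are
made up for illustration, and the sizes it prints depend on the data, so
treat it as a way to measure rather than as a benchmark.

```python
import bz2
import io
import pickle

# Repetitive sample data; the record layout is invented for this example.
data = [{"name": "item%d" % i, "tags": ["a", "b", "c"]} for i in range(1000)]

# Uncompressed pickle, for size comparison.
raw = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# pickle.dump() accepts any object with a write() method, so it can
# write straight into a bz2-compressing stream.
buf = io.BytesIO()
with bz2.BZ2File(buf, "wb") as stream:
    pickle.dump(data, stream, protocol=pickle.HIGHEST_PROTOCOL)
compressed = buf.getvalue()

# Reading works the same way in reverse.
with bz2.BZ2File(io.BytesIO(compressed), "rb") as stream:
    restored = pickle.load(stream)

print(len(raw), len(compressed))
```

Swapping bz2.BZ2File for gzip.GzipFile should give the same shape of code
with a different speed/size trade-off.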

>   * it isn't interoperable with other languages
>   * it doesn't have the ability to enforce a schema
>   * it is a major security risk for untrusted inputs

I agree that pickle doesn't satisfy these. But then again, #1, #3 and
#4 were never part of its design goals. #5 is indeed a problem.

But I think there are existing solutions already. For example, I'd say
that XML+bzip2 satisfies all these already. If you want something a
little less verbose, I recommend looking at Google Protocol Buffers
(http://code.google.com/apis/protocolbuffers/), which have both a
compact binary format (though not compressed -- but you can easily
layer that) and a verbose text format. There's a nice Python-specific
tutorial (http://code.google.com/apis/protocolbuffers/docs/pythontutorial.html)
that also explains why you would use this.

--Guido

> New idea
> --------
>
> Develop a solution using a mix of PyYAML, a Python-coded version of
> Kwalify, optional compression using bz2, gzip, or zlib, and pretty
> printing using pygments.
>
> YAML ( http://yaml.org/spec/1.2/ ) is a language independent standard
> for data serialization.
>
> PyYAML ( http://pyyaml.org/wiki/PyYAML ) is a full implementation of
> the YAML standard.  It uses YAML's application-specific tags and
> Python's own copy/reduce logic to provide the same power as pickle itself.
>
> Kwalify ( http://www.kuwata-lab.com/kwalify/ruby/users-guide.01.html )
> is a schema validator written in Ruby and Java.  It defines a
> YAML/JSON based schema definition for enforcing tight constraints
> on incoming data.
>
> The bz2, gzip, and zlib compression libraries are already built into
> the language.
>
> Pygments ( http://pygments.org/ ) is a Python-based syntax highlighter
> with built-in support for YAML.
>
>
> Advantages
> ----------
>
> * The format is simple enough to hand edit or to have lightweight
>  applications emit valid pickles.  For example:
>
>     print('Todo: [go to bank, pick up food, write code]')   # valid pickle
>
> * To date, efforts to make pickles smaller have focused on creating new
>  codes for every data type.  Instead, we can use the simple text formatting
>  of YAML and let general purpose data compression utilities do their job
>  (letting the user control the trade-offs between speed, space, and human
>  readability):
>
>     yaml.dump(data, compressor=None)  # fast, human readable, no compression
>     yaml.dump(data, compressor=bz2)   # slowest, but best compression
>     yaml.dump(data, compressor=zlib)  # medium speed and medium compression
>
> * The current pickle tools make it easy to exchange object trees between
>  two Python processes.  The new tool would make it equally easy to exchange
>  object trees between processes running any of Python, Ruby, Java, C/C++,
>  Perl, C#, PHP, OCaml, Javascript, ActionScript, and Haskell.
>
> * Ability to use a schema for enforcing a given object model and allowing
>  full security.  Which would you rather run on untrusted data:
>
>     data = yaml.load(myfile, schema=ListOfStrings)
>
>  or
>
>     data = pickle.load(myfile)
>
> * Specification of a schema using YAML itself
>
>  ListOfStrings (a schema written in yaml)
>  ........................................
>  type:   seq
>  sequence:
>   - type:   str
>
>  Sample of valid input
>  .....................
>  - foo
>  - bar
>  - baz
>
>  Note, schemas can be defined for very complex, nested object models and
>  allow many kinds of constraints (unique items, enumerated list of allowable
>  values, min/max allowable ranges for values, data type, maximum length,
>  and names of regular Python classes that can be constructed).
>
> * YAML is a superset of JSON, so the schema validation also works equally
>  well with JSON encoded data.
>
> What needs to be done
> ---------------------
>
> * Combine the tools for a single, clean interface to C speed parsing
>  of a data serialization standard, with optional compression, schema
>  validation, and pretty printing.
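
A minimal sketch of what the "compressor=" part of such an interface might
look like, using only the standard library. The dump() and load() functions
here are hypothetical, and json stands in for the YAML emitter and parser;
as far as I know PyYAML's own yaml.dump() takes no compressor argument, so
that keyword is part of the proposal rather than an existing API.

```python
import bz2
import json
import zlib

def dump(data, compressor=None):
    """Serialize to readable text, then optionally compress the bytes."""
    text = json.dumps(data, indent=2, sort_keys=True).encode("utf-8")
    return text if compressor is None else compressor.compress(text)

def load(blob, compressor=None):
    """Reverse of dump(); the caller must name the same compressor."""
    text = blob if compressor is None else compressor.decompress(blob)
    return json.loads(text.decode("utf-8"))

# Repeated on purpose so the compressors have something to work with.
todo = {"Todo": ["go to bank", "pick up food", "write code"] * 50}

for codec in (None, zlib, bz2):
    blob = dump(todo, compressor=codec)
    assert load(blob, compressor=codec) == todo
    print(getattr(codec, "__name__", "none"), len(blob))
```

Any module exposing compress() and decompress() functions on bytes would
fit the same slot, which is what lets the user pick the trade-off between
speed, space and readability.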

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)