[Python-ideas] An idea for a new pickling tool

Wed Apr 22 15:10:31 CEST 2009

On Tue, Apr 21, 2009 at 8:41 PM, Jesse Noller <jnoller at gmail.com> wrote:
> On Tue, Apr 21, 2009 at 6:02 PM, Raymond Hettinger <python at rcn.com> wrote:
>> Motivation
>> ----------
>>
>> Python's pickles use a custom format that has evolved over time
>> but they have five significant disadvantages:
>>
>>   * it has lost its human readability and editability
>>   * it doesn't compress well
>>   * it isn't interoperable with other languages
>>   * it doesn't have the ability to enforce a schema
>>   * it is a major security risk for untrusted inputs
>>
>>
>> New idea
>> --------
>>
>> Develop a solution using a mix of PyYAML, a Python-coded version of
>> Kwalify, optional compression using bz2, gzip, or zlib, and pretty
>> printing using Pygments.
>>
>> YAML ( http://yaml.org/spec/1.2/ ) is a language independent standard
>> for data serialization.
>>
>> PyYAML ( http://pyyaml.org/wiki/PyYAML ) is a full implementation of
>> the YAML standard.  It uses YAML's application-specific tags and
>> Python's own copy/reduce logic to provide the same power as pickle itself.
>>
>> Kwalify ( http://www.kuwata-lab.com/kwalify/ruby/users-guide.01.html )
>> is a schema validator written in Ruby and Java.  It defines a
>> YAML/JSON based schema definition for enforcing tight constraints
>> on incoming data.
>>
>> The bz2, gzip, and zlib compression libraries are already built into
>> the language.
>>
>> Pygments ( http://pygments.org/ ) is a Python-based syntax highlighter
>> with built-in support for YAML.
>>
>>
>> Advantages
>> ----------
>>
>> * The format is simple enough to hand edit or to have lightweight
>>  applications emit valid pickles.  For example:
>>
>>     print('Todo: [go to bank, pick up food, write code]')   # valid pickle
>>
>> * To date, efforts to make pickles smaller have focused on creating new
>>  codes for every data type.  Instead, we can use the simple text formatting
>>  of YAML and let general purpose data compression utilities do their job
>>  (letting the user control the trade-offs between speed, space, and human
>>  readability):
>>
>>     yaml.dump(data, compressor=None)  # fast, human readable, no compression
>>     yaml.dump(data, compressor=bz2)   # slowest, but best compression
>>     yaml.dump(data, compressor=zlib)  # medium speed and medium compression
>>
>> * The current pickle tools make it easy to exchange object trees between
>>  two Python processes.  The new tool would make it equally easy to exchange
>>  object trees between processes running any of Python, Ruby, Java, C/C++,
>>  Perl, C#, PHP, OCaml, JavaScript, ActionScript, and Haskell.
>>
>> * Ability to use a schema for enforcing a given object model and allowing
>>  full security.  Which would you rather run on untrusted data:
>>
>>     data = yaml.load(myfile, schema=ListOfStrings)
>>
>>  or
>>
>>     data = pickle.load(myfile)
>>
>> * Specification of a schema using YAML itself
>>
>>  ListOfStrings (a schema written in yaml)
>>  ........................................
>>  type:   seq
>>  sequence:
>>   - type:   str
>>
>>  Sample of valid input
>>  .....................
>>  - foo
>>  - bar
>>  - baz
>>
>>  Note, schemas can be defined for very complex, nested object models and
>>  allow many kinds of constraints (unique items, enumerated list of allowable
>>  values, min/max allowable ranges for values, data type, maximum length,
>>  and names of regular Python classes that can be constructed).
>>
>> * YAML is a superset of JSON, so the schema validation also works equally
>>  well with JSON encoded data.
>>
>> What needs to be done
>> ---------------------
>>
>> * Combine the tools for a single, clean interface to C speed parsing
>>  of a data serialization standard, with optional compression, schema
>>  validation, and pretty printing.
>
> Just to add to this, I remembered someone recently did a simple
> benchmark of Thrift/JSON/YAML/Protocol Buffers; here are the links:
>
> http://www.bouncybouncy.net/ramblings/posts/thrift_and_protocol_buffers/
> http://www.bouncybouncy.net/ramblings/posts/more_on_json_vs_thrift_and_protocol_buffers/
> http://www.bouncybouncy.net/ramblings/posts/json_vs_thrift_and_protocol_buffers_round_2/
>
> Without digging into the numbers too much, it's worth noting that
> PyYAML is written in pure Python but also has LibYAML
> (http://pyyaml.org/wiki/LibYAML) bindings for speed. When I get a
> chance, I can run the same test(s) with both the pure-Python
> implementation and the LibYAML one, and see how large the speedup
> is.
>

Speaking of benchmarks, last night I took the first benchmark cited in
the links above, and with some work I ran the same benchmark with
PyYAML (pure python) and PyYAML with libyaml (the C version). The
PyYAML -> libyaml bindings require Cython right now, but here are the
numbers.

I removed Thrift and Protocol Buffers, as I wanted to focus on YAML/JSON right now:

5000 total records (0.510s)

ser_json             (0.030s) 718147 bytes
ser_cjson            (0.030s) 718147 bytes
ser_yaml             (6.230s) 623147 bytes
ser_cyaml            (2.040s) 623147 bytes

ser_json_compressed    (0.100s) 292987 bytes
ser_cjson_compressed   (0.110s) 292987 bytes
ser_yaml_compressed    (6.310s) 291018 bytes
ser_cyaml_compressed   (2.140s) 291018 bytes

serde_json           (0.050s)
serde_cjson          (0.050s)
serde_yaml           (19.020s)
serde_cyaml          (4.460s)

Running the second benchmark (the integer one) I see:

10000 total records (0.130s)

ser_json             (0.040s) 680749 bytes
ser_cjson            (0.030s) 680749 bytes
ser_yaml             (8.250s) 610749 bytes
ser_cyaml            (3.040s) 610749 bytes

ser_json_compressed    (0.100s) 124924 bytes
ser_cjson_compressed   (0.090s) 124924 bytes
ser_yaml_compressed    (8.320s) 121090 bytes
ser_cyaml_compressed   (3.110s) 121090 bytes

serde_json           (0.060s)
serde_cjson          (0.070s)
serde_yaml           (24.190s)
serde_cyaml          (6.690s)

So yes, the pure python numbers for yaml (_yaml) are pretty bad; the
libyaml (_cyaml) numbers are significantly improved, but not as fast
as JSON/CJSON.
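
For anyone who wants to produce numbers of the same shape, a minimal
harness might look roughly like the sketch below. It is not the script
from the linked posts: the record layout in make_records() and the
helper names bench() and run() are made up for illustration, and the
YAML rows are only produced if PyYAML happens to be installed (the C
row only if PyYAML was built against LibYAML).

```python
import json
import time
import zlib

try:
    import yaml  # PyYAML; optional, the JSON row works without it
except ImportError:
    yaml = None


def make_records(n):
    """Build n small dict records (made-up stand-in data)."""
    return [{"id": i, "name": "user%d" % i, "tags": ["a", "b", "c"]}
            for i in range(n)]


def bench(label, dumps, loads, records):
    """Time serialization and a full round trip; report sizes in bytes."""
    start = time.perf_counter()
    text = dumps(records)
    ser_seconds = time.perf_counter() - start

    raw = text.encode("utf-8")
    packed = zlib.compress(raw)

    start = time.perf_counter()
    restored = loads(dumps(records))
    serde_seconds = time.perf_counter() - start

    # Sanity check: the round trip must give back the same data.
    assert restored == records
    return {
        "label": label,
        "ser_seconds": ser_seconds,
        "serde_seconds": serde_seconds,
        "bytes": len(raw),
        "compressed_bytes": len(packed),
    }


def run(n=1000):
    """Benchmark json, plus PyYAML and its LibYAML bindings if available."""
    records = make_records(n)
    results = [bench("json", json.dumps, json.loads, records)]
    if yaml is not None:
        results.append(bench("yaml", yaml.safe_dump, yaml.safe_load, records))
        # The C classes exist only when PyYAML was built against LibYAML.
        if hasattr(yaml, "CSafeDumper") and hasattr(yaml, "CSafeLoader"):
            results.append(bench(
                "cyaml",
                lambda obj: yaml.dump(obj, Dumper=yaml.CSafeDumper),
                lambda text: yaml.load(text, Loader=yaml.CSafeLoader),
                records,
            ))
    return results


if __name__ == "__main__":
    for row in run():
        print("%(label)-6s ser=%(ser_seconds).3fs serde=%(serde_seconds).3fs "
              "%(bytes)d bytes, %(compressed_bytes)d compressed" % row)
```

Absolute timings from a harness like this depend heavily on the record
shape and the machine, so only the relative ordering is worth reading
much into.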

One thing to note in this discussion, as others have pointed out: while
JSON itself is awfully fast/nice, it lacks some of the capabilities of
YAML; for example, certain objects cannot be represented in JSON.
Additionally, if we want to simply state "objects which you desire to
be compatible with JSON have the following restrictions", we can. This
means we can also leverage things within PyYAML which are
nice-to-haves, for example the !!python additions.
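
To make the representability point concrete, here is a small sketch
using only the standard library json module. The helper names
round_trip() and is_json_representable() are hypothetical, introduced
just for this illustration.

```python
import json


def round_trip(value):
    """Serialize to JSON and parse the result back."""
    return json.loads(json.dumps(value))


def is_json_representable(value):
    """Return True if json.dumps accepts the value at all."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


# Tuples are written as JSON arrays, so they come back as lists.
assert round_trip((1, 2)) == [1, 2]

# Non-string dict keys are coerced to strings on the way out.
assert round_trip({1: "one"}) == {"1": "one"}

# Sets and complex numbers have no JSON form at all.
assert not is_json_representable({1, 2})
assert not is_json_representable(1 + 2j)
```

Note that "accepted by json.dumps" is weaker than "survives a round
trip unchanged": the tuple and the integer key are accepted but come
back as different types.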

Picking YAML in this case means we get all of the YAML syntax,
objects, etc - and if consumers want to stick with JSON compatibility,
we could add a dump(canonical=True, compatibility=JSON) or somesuch
flag.
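
As a rough idea of what such a flag could enforce, the sketch below
walks an object tree and refuses anything JSON cannot hold losslessly
before emitting it. Nothing here is an existing PyYAML API: the names
json_incompatibilities() and dump_json_compatible() are hypothetical,
and the emitter is the standard library json module standing in for
whatever the real tool would use.

```python
import json


def json_incompatibilities(value, path="$"):
    """Yield a description of each part of value that JSON cannot
    represent without changing its type."""
    if value is None or isinstance(value, (bool, int, float, str)):
        return
    if isinstance(value, list):
        for index, item in enumerate(value):
            yield from json_incompatibilities(item, "%s[%d]" % (path, index))
    elif isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str):
                yield "%s: non-string key %r" % (path, key)
            yield from json_incompatibilities(item, "%s[%r]" % (path, key))
    else:
        # Tuples, sets, bytes, and arbitrary class instances all land here.
        yield "%s: unsupported type %s" % (path, type(value).__name__)


def dump_json_compatible(value):
    """Emit JSON text, raising ValueError if value is outside the
    JSON-compatible subset (hypothetical stand-in for the flag)."""
    problems = list(json_incompatibilities(value))
    if problems:
        raise ValueError("not JSON compatible: " + "; ".join(problems))
    # allow_nan=False also rejects NaN and infinities, which are not
    # valid JSON even though they are Python floats.
    return json.dumps(value, allow_nan=False)
```

Callers who do not ask for JSON compatibility would skip the check and
keep the full YAML feature set.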

jesse