[Python-ideas] An idea for a new pickling tool
Josiah Carlson
josiah.carlson at gmail.com
Fri Apr 24 07:19:59 CEST 2009
On Wed, Apr 22, 2009 at 6:10 AM, Jesse Noller <jnoller at gmail.com> wrote:
> On Tue, Apr 21, 2009 at 8:41 PM, Jesse Noller <jnoller at gmail.com> wrote:
>> On Tue, Apr 21, 2009 at 6:02 PM, Raymond Hettinger <python at rcn.com> wrote:
>>> Motivation
>>> ----------
>>>
>>> Python's pickles use a custom format that has evolved over time
>>> but the format has five significant disadvantages:
>>>
>>> * it has lost its human readability and editability
>>> * it doesn't compress well
>>> * it isn't interoperable with other languages
>>> * it doesn't have the ability to enforce a schema
>>> * it is a major security risk for untrusted inputs
>>>
>>>
>>> New idea
>>> --------
>>>
>>> Develop a solution using a mix of PyYAML, a Python-coded version of
>>> Kwalify, optional compression using bz2, gzip, or zlib, and pretty
>>> printing using pygments.
>>>
>>> YAML ( http://yaml.org/spec/1.2/ ) is a language independent standard
>>> for data serialization.
>>>
>>> PyYAML ( http://pyyaml.org/wiki/PyYAML ) is a full implementation of
>>> the YAML standard. It uses YAML's application-specific tags and
>>> Python's own copy/reduce logic to provide the same power as pickle itself.
>>>
>>> Kwalify ( http://www.kuwata-lab.com/kwalify/ruby/users-guide.01.html )
>>> is a schema validator written in Ruby and Java. It defines a
>>> YAML/JSON based schema definition for enforcing tight constraints
>>> on incoming data.
>>>
>>> The bz2, gzip, and zlib compression libraries are already built into
>>> the language.
>>>
>>> Pygments ( http://pygments.org/ ) is a Python-based syntax highlighter
>>> with builtin support for YAML.
>>>
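>>> For instance, a minimal sketch of pretty printing a pickle with the
>>> existing Pygments API (highlight, YamlLexer, and TerminalFormatter
>>> are real Pygments names):
>>>
>>>     from pygments import highlight
>>>     from pygments.lexers import YamlLexer
>>>     from pygments.formatters import TerminalFormatter
>>>
>>>     # Colorize a YAML pickle for display in a terminal.
>>>     doc = 'Todo: [go to bank, pick up food, write code]'
>>>     print(highlight(doc, YamlLexer(), TerminalFormatter()))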
>>>
>>> Advantages
>>> ----------
>>>
>>> * The format is simple enough to hand edit or to have lightweight
>>> applications emit valid pickles. For example:
>>>
>>> print('Todo: [go to bank, pick up food, write code]') # valid pickle
>>>
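>>> To make that concrete, here is a sketch of reading such a
>>> hand-written pickle back with the existing PyYAML API (safe_load is
>>> a real PyYAML function; it parses plain scalars, lists, and maps
>>> without constructing arbitrary objects):
>>>
>>>     import yaml  # PyYAML
>>>
>>>     doc = 'Todo: [go to bank, pick up food, write code]'
>>>     data = yaml.safe_load(doc)
>>>     # -> {'Todo': ['go to bank', 'pick up food', 'write code']}
>>>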
>>> * To date, efforts to make pickles smaller have focused on creating new
>>> codes for every data type. Instead, we can use the simple text formatting
>>> of YAML and let general purpose data compression utilities do their job
>>> (letting the user control the trade-offs between speed, space, and human
>>> readability):
>>>
>>> yaml.dump(data, compressor=None) # fast, human readable, no compression
>>> yaml.dump(data, compressor=bz2) # slowest, but best compression
>>> yaml.dump(data, compressor=zlib) # medium speed and medium compression
>>>
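>>> The compressor= keyword above is proposed, not something PyYAML has
>>> today; a minimal sketch of the same idea using only existing APIs
>>> (yaml.dump plus the module-level bz2.compress/zlib.compress) might
>>> look like:
>>>
>>>     import bz2
>>>     import yaml
>>>
>>>     def dump_compressed(data, compressor=None):
>>>         # Serialize with PyYAML, then optionally run the text
>>>         # through a general purpose compression module.
>>>         text = yaml.dump(data)
>>>         if compressor is None:
>>>             return text
>>>         return compressor.compress(text.encode('utf-8'))
>>>
>>>     blob = dump_compressed({'a': list(range(1000))}, compressor=bz2)
>>>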
>>> * The current pickle tools makes it easy to exchange object trees between
>>> two Python processes. The new tool would make it equally easy to exchange
>>> object trees between processes running any of Python, Ruby, Java, C/C++,
>>> Perl, C#, PHP, OCaml, Javascript, ActionScript, and Haskell.
>>>
>>> * Ability to use a schema for enforcing a given object model and allowing
>>> full security. Which would you rather run on untrusted data:
>>>
>>> data = yaml.load(myfile, schema=ListOfStrings)
>>>
>>> or
>>>
>>> data = pickle.load(myfile)
>>>
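>>> For anyone who hasn't seen why the pickle side of that comparison is
>>> scary, here is the standard demonstration: __reduce__ lets a pickle
>>> smuggle in an arbitrary callable, so loading untrusted data can
>>> execute arbitrary code.
>>>
>>>     import os
>>>     import pickle
>>>
>>>     class Evil(object):
>>>         def __reduce__(self):
>>>             # Tells pickle: "to rebuild me, call os.system(...)".
>>>             return (os.system, ('echo owned',))
>>>
>>>     payload = pickle.dumps(Evil())
>>>     pickle.loads(payload)   # runs the shell command
>>>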
>>> * Specification of a schema using YAML itself
>>>
>>> ListOfStrings (a schema written in yaml)
>>> ........................................
>>> type: seq
>>> sequence:
>>>   - type: str
>>>
>>> Sample of valid input
>>> .....................
>>> - foo
>>> - bar
>>> - baz
>>>
>>> Note, schemas can be defined for very complex, nested object models and
>>> allow many kinds of constraints (unique items, enumerated list of allowable
>>> values, min/max allowable ranges for values, data type, maximum length,
>>> and names of regular Python classes that can be constructed).
>>>
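>>> Since no Python port of Kwalify exists yet, here is a minimal
>>> hand-rolled sketch of what validating against the ListOfStrings
>>> schema above could look like (the validate function is made up for
>>> illustration; a real port would also handle nesting, ranges, enums,
>>> and the other constraints):
>>>
>>>     import yaml
>>>
>>>     # The ListOfStrings schema above, as PyYAML would parse it.
>>>     schema = {'type': 'seq', 'sequence': [{'type': 'str'}]}
>>>
>>>     def validate(data, schema):
>>>         if schema['type'] == 'seq':
>>>             assert isinstance(data, list)
>>>             for item in data:
>>>                 validate(item, schema['sequence'][0])
>>>         elif schema['type'] == 'str':
>>>             assert isinstance(data, str)
>>>
>>>     validate(yaml.safe_load('- foo\n- bar\n- baz'), schema)
>>>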
>>> * YAML is a superset of JSON, so the schema validation also works equally
>>> well with JSON encoded data.
>>>
>>> What needs to be done
>>> ---------------------
>>>
>>> * Combine the tools into a single, clean interface: C-speed parsing
>>> of a data serialization standard, plus optional compression, schema
>>> validation, and pretty printing.
>>
>> Just to add to this, I remembered someone recently did a simple
>> benchmark of Thrift/JSON/YAML/Protocol Buffers; here are the links:
>>
>> http://www.bouncybouncy.net/ramblings/posts/thrift_and_protocol_buffers/
>> http://www.bouncybouncy.net/ramblings/posts/more_on_json_vs_thrift_and_protocol_buffers/
>> http://www.bouncybouncy.net/ramblings/posts/json_vs_thrift_and_protocol_buffers_round_2/
>>
>> Without digging into the numbers too much, it's worth noting that
>> PyYAML is written in pure python but also has LibYAML
>> (http://pyyaml.org/wiki/LibYAML) bindings for speed. When I get a
>> chance, I can run the same test(s) with both the pure-python
>> implementation and the libyaml one to see how much of a speedup it
>> gives.
>>
>
> Speaking of benchmarks, last night I took the first benchmark cited in
> the links above, and with some work I ran the same benchmark with
> PyYAML (pure python) and PyYAML with libyaml (the C version). The
> PyYAML -> libyaml bindings require Cython right now, but here are the
> numbers.
>
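> (For anyone wanting to reproduce this: the usual way to pick up the
> libyaml-backed classes when they're available, falling back to pure
> python otherwise, is the CLoader/CDumper import fallback; these are
> real PyYAML names:
>
>     import yaml
>
>     try:
>         # Present only when PyYAML was built against libyaml.
>         from yaml import CLoader as Loader, CDumper as Dumper
>     except ImportError:
>         from yaml import Loader, Dumper
>
>     data = yaml.load('Todo: [go to bank, pick up food]', Loader=Loader)
>     text = yaml.dump(data, Dumper=Dumper)
>
> with your benchmark input in place of the sample document.)
>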
> I removed Thrift and Protocol Buffers, as I wanted to focus on YAML/JSON right now:
>
> 5000 total records (0.510s)
>
> ser_json (0.030s) 718147 bytes
> ser_cjson (0.030s) 718147 bytes
> ser_yaml (6.230s) 623147 bytes
> ser_cyaml (2.040s) 623147 bytes
>
> ser_json_compressed (0.100s) 292987 bytes
> ser_cjson_compressed (0.110s) 292987 bytes
> ser_yaml_compressed (6.310s) 291018 bytes
> ser_cyaml_compressed (2.140s) 291018 bytes
>
> serde_json (0.050s)
> serde_cjson (0.050s)
> serde_yaml (19.020s)
> serde_cyaml (4.460s)
>
> Running the second benchmark (the integer one) I see:
>
> 10000 total records (0.130s)
>
> ser_json (0.040s) 680749 bytes
> ser_cjson (0.030s) 680749 bytes
> ser_yaml (8.250s) 610749 bytes
> ser_cyaml (3.040s) 610749 bytes
>
> ser_json_compressed (0.100s) 124924 bytes
> ser_cjson_compressed (0.090s) 124924 bytes
> ser_yaml_compressed (8.320s) 121090 bytes
> ser_cyaml_compressed (3.110s) 121090 bytes
>
> serde_json (0.060s)
> serde_cjson (0.070s)
> serde_yaml (24.190s)
> serde_cyaml (6.690s)
>
>
> So yes, the pure python numbers for yaml (_yaml) are pretty bad; the
> libyaml (_cyaml) numbers are significantly improved, but not as fast
> as JSON/CJSON.
Saying "not as fast" is a bit misleading. Roughly 100x slower than
json is a more precise description, and is one of the major reasons
why a bunch of people stick with json rather than yaml.
> One thing to note in this discussion, as others have pointed out:
> while JSON itself is awfully fast and nice, it lacks some of the
> capabilities of YAML; for example, certain objects cannot be
> represented in JSON. Additionally, if we want to, we can simply state
> "objects which you desire to be compatible with JSON have the
> following restrictions" - this means we can also leverage the
> nice-to-haves within PyYAML, for example the !!python additions.
In fact, no custom objects are representable with json at
all...unless you use the custom encoder/decoder hooks that simplejson
offers. However, in the times when I've used json, getting it to
store arbitrary Python objects wasn't a huge chore. I just threw a
'to_json()' method on every object that I wanted to serialize; each
object knew about its contents and would call 'to_json()' on them as
necessary. Deserialization just meant passing the lists,
dictionaries, etc., to a base '.from_json()' classmethod, which did
all of the right stuff. It was trivial, it worked, and it was fast.
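
A minimal sketch of that pattern, using the json module's standard
default= and object_hook= hooks (the Task class and the '__task__'
tag are made up for illustration):

    import json

    class Task(object):
        def __init__(self, name, done=False):
            self.name = name
            self.done = done

        def to_json(self):
            # Reduce to plain types that json understands, with a
            # tag so the decoder knows what to rebuild.
            return {'__task__': True, 'name': self.name,
                    'done': self.done}

        @classmethod
        def from_json(cls, d):
            return cls(d['name'], d['done'])

    def default(obj):
        # Called by json.dumps for types it can't serialize itself.
        if hasattr(obj, 'to_json'):
            return obj.to_json()
        raise TypeError(repr(obj))

    def object_hook(d):
        # Called by json.loads for every decoded dict.
        return Task.from_json(d) if d.get('__task__') else d

    blob = json.dumps([Task('write code')], default=default)
    tasks = json.loads(blob, object_hook=object_hook)
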
> Picking YAML in this case means we get all of the YAML syntax,
> objects, etc - and if consumers want to stick with JSON compatibility,
> we could add a dump(canonical=True, compatibility=JSON) or somesuch
> flag.
My vote is to keep it simple and fast. JSON satisfies that. YAML
doesn't. While I appreciate the desire to be able to store recursive
references, I don't believe it's necessary in the general case.
- Josiah