An idea for a new pickling tool

Motivation
----------

Python's pickles use a custom format that has evolved over time, but it has five significant disadvantages:

* it has lost its human readability and editability
* it doesn't compress well
* it isn't interoperable with other languages
* it doesn't have the ability to enforce a schema
* it is a major security risk for untrusted inputs

New idea
--------

Develop a solution using a mix of PyYAML, a python-coded version of Kwalify, optional compression using bz2, gzip, or zlib, and pretty printing using Pygments.

YAML ( http://yaml.org/spec/1.2/ ) is a language-independent standard for data serialization.

PyYAML ( http://pyyaml.org/wiki/PyYAML ) is a full implementation of the YAML standard. It uses YAML's application-specific tags and Python's own copy/reduce logic to provide the same power as pickle itself.

Kwalify ( http://www.kuwata-lab.com/kwalify/ruby/users-guide.01.html ) is a schema validator written in Ruby and Java. It defines a YAML/JSON-based schema definition for enforcing tight constraints on incoming data.

The bz2, gzip, and zlib compression libraries are already built into the language.

Pygments ( http://pygments.org/ ) is a python-based syntax highlighter with built-in support for YAML.

Advantages
----------

* The format is simple enough to hand edit or to have lightweight applications emit valid pickles. For example:

      print('Todo: [go to bank, pick up food, write code]')   # valid pickle

* To date, efforts to make pickles smaller have focused on creating new codes for every data type. Instead, we can use the simple text formatting of YAML and let general-purpose data compression utilities do their job (letting the user control the trade-offs between speed, space, and human readability):

      yaml.dump(data, compressor=None)   # fast, human readable, no compression
      yaml.dump(data, compressor=bz2)    # slowest, but best compression
      yaml.dump(data, compressor=zlib)   # medium speed and medium compression

* The current pickle tools make it easy to exchange object trees between two Python processes. The new tool would make it equally easy to exchange object trees between processes running any of Python, Ruby, Java, C/C++, Perl, C#, PHP, OCaml, Javascript, ActionScript, and Haskell.

* Ability to use a schema for enforcing a given object model and allowing full security. Which would you rather run on untrusted data:

      data = yaml.load(myfile, schema=ListOfStrings)

  or

      data = pickle.load(myfile)

* Specification of a schema using YAML itself:

  ListOfStrings (a schema written in yaml)
  ........................................

      type: seq
      sequence:
          - type: str

  Sample of valid input
  .....................

      - foo
      - bar
      - baz

  Note, schemas can be defined for very complex, nested object models and allow many kinds of constraints (unique items, enumerated list of allowable values, min/max allowable ranges for values, data type, maximum length, and names of regular Python classes that can be constructed).

* YAML is a superset of JSON, so the schema validation also works equally well with JSON-encoded data.

What needs to be done
---------------------

* Combine the tools into a single, clean interface: C-speed parsing of a data serialization standard, with optional compression, schema validation, and pretty printing.
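[Editor's sketch] For concreteness, here is a minimal sketch of how the dump/load side could be emulated today with PyYAML plus the stdlib compressors. The compressor= keyword above is proposed, not an existing PyYAML argument, so the helpers below fake it by treating any module that exposes compress()/decompress() functions (bz2, zlib) as a codec:

    import bz2

    import yaml  # PyYAML, third-party; not in the stdlib

    def dumps(obj, compressor=None):
        # Serialize obj to YAML text, optionally compressed.  compressor
        # may be None or any module with compress()/decompress(), e.g.
        # bz2 or zlib.
        text = yaml.dump(obj).encode('utf-8')
        return text if compressor is None else compressor.compress(text)

    def loads(data, compressor=None):
        if compressor is not None:
            data = compressor.decompress(data)
        return yaml.safe_load(data)

    todo = {'Todo': ['go to bank', 'pick up food', 'write code']}
    blob = dumps(todo, compressor=bz2)
    assert loads(blob, compressor=bz2) == todo

The proposed tool would presumably push this below the API surface (and use a C emitter), but the division of labor is the same.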

On Tue, Apr 21, 2009 at 3:02 PM, Raymond Hettinger <python@rcn.com> wrote:
Really? Or do you mean "it doesn't have built-in compression support" ? I don't expect that running bzip2 over a pickle would produce unsatisfactory results, and the API supports reading from and writing to streams. Or do you just mean "the representation is too repetitive and bulky" ?
I agree that pickle doesn't satisfy these. But then again, #1, #3 and #4 were never part of its design goals. #5 is indeed a problem. But I think there are existing solutions already. For example, I'd say that XML+bzip2 satisfies all these already. If you want something a little less verbose, I recommend looking at Google Protocol Buffers (http://code.google.com/apis/protocolbuffers/), which have both a compact binary format (though not compressed -- but you can easily layer that) and a verbose text format. There's a nice Python-specific tutorial (http://code.google.com/apis/protocolbuffers/docs/pythontutorial.html) that also explains why you would use this. --Guido
-- --Guido van Rossum (home page: http://www.python.org/~guido/)

Pickle does well with its original design goal. It would be really nice if we also provided a builtin solution that incorporated the other design goals listed above and adopted a format based on a published standard.
But I think there are existing solutions already. For example, I'd say that XML+bzip2 satisfies all these already.
No doubt that would work. There is, however, a pretty high barrier to bringing together all the right tools (an xml pickler/unpickler providing the equivalent of pickle.dumps/pickle.loads, a fast xml parser, an xml schema validator, an xml pretty printer, and data compression). Even with the right tools brought together under a single convenient API, it wouldn't be any fun to write the DTDs for the validator. I think the barrier is so high that, in practice, these tools will rarely be brought together for this purpose; instead, efforts are focused on ad hoc approaches geared to a particular application.
That is a nice package. It seems to share several of the goals listed above (interlanguage data exchange, use of schemas, and security). I scanned through all of the docs but didn't see a pickle.dumps() style API; instead, it seems to be focused on making the user build up parts of a non-subclassable custom object that knows how to serialize itself. In contrast, pyyaml rides on our existing __reduce__ logic to fully emulate what pickle can do (meaning that most apps can add serialization with just a single line). It doesn't look like the actual data formatting is based on a published standard, so it requires the Google tool on each end (with support offered for Python, Java, and C++).

Hope you're not negative on the idea of a compressing, validating, pretty printing, yaml pickler. Without your support, the idea is dead before it can get started.

FWIW, I found some of the Kwalify examples to be compelling. Am attaching one for you guys to look at. I don't think an equivalent XML solution would come together as effortlessly or as beautifully. From a python point of view, the example boils down to: yaml.dump(donors, file) and donors = yaml.load(file, schema=donor_schema). No extra work is required. Human readability/editability comes for free, inter-language operability comes for free, and so do the security guarantees. I think it would be great if we took a batteries-included approach and offered something like this as part of the standard library.

Raymond

----------- donor_schema ----------

    type: seq
    sequence:
      - type: map
        mapping:
          "name":
            type: str
            required: yes
          "email":
            type: str
            required: yes
            pattern: /@/
          "password":
            type: text
            length: { max: 16, min: 8 }
          "age":
            type: int
            range: { max: 30, min: 18 }
            # or assert: 18 <= val && val <= 30
          "blood":
            type: str
            enum: [A, B, O, AB]
          "birth":
            type: date
          "deleted":
            type: bool
            default: false

----------- valid document ------------

    - name: foo
      email: foo@mail.com
      password: xxx123456
      age: 20
      blood: A
      birth: 1985-01-01

    - name: bar
      email: bar@mail.net
      age: 25
      blood: AB
      birth: 1980-01-01
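[Editor's sketch] To make the schema= idea concrete, here is a toy Kwalify-flavored validator. This is a sketch under stated assumptions: PyYAML has no schema argument, no Python port of Kwalify exists in the stdlib, and only the constraints used above (type, required, pattern, enum, range) are handled:

    import re

    import yaml  # PyYAML, third-party

    def validate(node, schema):
        # Recursively check a parsed YAML node against a Kwalify-style schema.
        kind = schema.get('type', 'str')
        if kind == 'seq':
            if not isinstance(node, list):
                raise ValueError('expected a sequence, got %r' % (node,))
            for item in node:
                validate(item, schema['sequence'][0])
        elif kind == 'map':
            if not isinstance(node, dict):
                raise ValueError('expected a mapping, got %r' % (node,))
            for key, rule in schema['mapping'].items():
                if key not in node:
                    if rule.get('required'):
                        raise ValueError('missing required key %r' % (key,))
                    continue
                validate(node[key], rule)
        else:
            # Scalar constraints; kinds not modeled here pass silently.
            if 'pattern' in schema and not re.search(schema['pattern'].strip('/'), node):
                raise ValueError('%r fails pattern %s' % (node, schema['pattern']))
            if 'enum' in schema and node not in schema['enum']:
                raise ValueError('%r not one of %s' % (node, schema['enum']))
            if 'range' in schema:
                lo, hi = schema['range'].get('min'), schema['range'].get('max')
                if (lo is not None and node < lo) or (hi is not None and node > hi):
                    raise ValueError('%r is out of range' % (node,))

    def load(stream, schema=None):
        # yaml.load() with the proposed schema= behavior bolted on.
        data = yaml.safe_load(stream)
        if schema is not None:
            validate(data, yaml.safe_load(schema))
        return data

With that in place, the one-liner above reads donors = load(open('donors.yaml'), schema=open('donor_schema.yaml').read()), where both file names are hypothetical.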

On Tue, Apr 21, 2009 at 5:56 PM, Raymond Hettinger <python@rcn.com> wrote:
Right. That is not one of the design goals. (It also generally is incompatible with several other design goals, like cross-language support and schema enforcement -- though I now realize you mean the latter to be optional.)
I'm not too worried about the "published standard" thing. Python itself doesn't have anything like it either. :-) If you want real enterprise-level standards compliance, I doubt that anything short of XML will satisfy those die-hard conservatives. (And they probably haven't even heard of bzip2.) I don't actually think there's such a thing as a YAML standard either.
You have my full support -- just very few of my cycles in getting something working. I think there are enough people around here to help. I'm skeptical that trying to create something new, whether standards-based or not, is going to be worth it -- you have to be careful to define your target audience and the design goals and see if your solution would actually be enticing for that audience compared to what they can do today. (Hence my plug for Protocol Buffers -- but I'll stop now.)
How easy is it to define a schema though? What about schema migration? (An explicit goal of Protocol Buffers BTW, and in my experience very important.)
Human readability/editability comes for free,
How important is that though?
First you have to have working code as a 3rd party package with a lot of happy users.
-- --Guido van Rossum (home page: http://www.python.org/~guido/)

On 2009-04-21 22:08, Guido van Rossum wrote:
On Tue, Apr 21, 2009 at 5:56 PM, Raymond Hettinger<python@rcn.com> wrote:
Human readability/editability comes for free,
How important is that though?
I've had to debug a number of pickle problems where answering "What's pulling in *that* object?" by a quick grep would have been really handy. pickletools.dis() goes part of the way, but a hierarchical text representation rather than bytecode would have been better. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
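[Editor's sketch] For anyone who hasn't hit this: the only textual view a pickle offers today is the opcode disassembly, which is fine for tooling but not something you can grep for an object by name. A two-line illustration (not from the original mail):

    import pickle
    import pickletools

    data = {'todo': ['go to bank', 'pick up food', 'write code']}

    # Prints an opcode-by-opcode listing rather than a hierarchical,
    # greppable representation of the data.
    pickletools.dis(pickle.dumps(data))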

On Tue, Apr 21, 2009 at 6:02 PM, Raymond Hettinger <python@rcn.com> wrote:
Just to add to this, I remembered someone recently did a simple benchmark of thrift/JSON/YAML/Protocol Buffers; here are the links:

http://www.bouncybouncy.net/ramblings/posts/thrift_and_protocol_buffers/
http://www.bouncybouncy.net/ramblings/posts/more_on_json_vs_thrift_and_proto...
http://www.bouncybouncy.net/ramblings/posts/json_vs_thrift_and_protocol_buff...

Without digging into the numbers too much, it's worth noting that PyYAML is written in pure python but also has libyaml (http://pyyaml.org/wiki/LibYAML) bindings for speed. When I get a chance, I can run the same test(s) with both the pure-python implementation and the libyaml one and see how much the speedup is. We would definitely need a c-based parser/emitter for something like this to really fly.

jesse

On Tue, Apr 21, 2009 at 8:41 PM, Jesse Noller <jnoller@gmail.com> wrote:
Speaking of benchmarks, last night I took the first benchmark cited in the links above, and with some work I ran the same benchmark with PyYAML (pure python) and PyYAML with libyaml (the C version). The PyYAML -> libyaml bindings require Cython right now, but here are the numbers. I removed thrift and protocol buffers, as I wanted to focus on YAML/JSON right now:

    5000 total records (0.510s)

    ser_json (0.030s) 718147 bytes
    ser_cjson (0.030s) 718147 bytes
    ser_yaml (6.230s) 623147 bytes
    ser_cyaml (2.040s) 623147 bytes
    ser_json_compressed (0.100s) 292987 bytes
    ser_cjson_compressed (0.110s) 292987 bytes
    ser_yaml_compressed (6.310s) 291018 bytes
    ser_cyaml_compressed (2.140s) 291018 bytes
    serde_json (0.050s)
    serde_cjson (0.050s)
    serde_yaml (19.020s)
    serde_cyaml (4.460s)

Running the second benchmark (the integer one) I see:

    10000 total records (0.130s)

    ser_json (0.040s) 680749 bytes
    ser_cjson (0.030s) 680749 bytes
    ser_yaml (8.250s) 610749 bytes
    ser_cyaml (3.040s) 610749 bytes
    ser_json_compressed (0.100s) 124924 bytes
    ser_cjson_compressed (0.090s) 124924 bytes
    ser_yaml_compressed (8.320s) 121090 bytes
    ser_cyaml_compressed (3.110s) 121090 bytes
    serde_json (0.060s)
    serde_cjson (0.070s)
    serde_yaml (24.190s)
    serde_cyaml (6.690s)

So yes, the pure python numbers for yaml (_yaml) are pretty bad; the libyaml (_cyaml) numbers are significantly improved, but not as fast as JSON/CJSON. One thing to note in this discussion, as others have pointed out: while JSON itself is awfully fast/nice, it lacks some of the capabilities of YAML; for example, certain objects cannot be represented in JSON. Additionally, if we want to simply state "objects which you desire to be compatible with JSON have the following restrictions" we can - this means we can also leverage things within PyYAML which are also nice-to-haves, for example the !!python additions. Picking YAML in this case means we get all of the YAML syntax, objects, etc - and if consumers want to stick with JSON compatibility, we could add a dump(canonical=True, compatibility=JSON) or somesuch flag.

jesse
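[Editor's sketch] For anyone who wants to reproduce a comparison like this, a minimal harness along the following lines works; the record shape here is invented, not the cited benchmark's data:

    import json
    import timeit

    import yaml  # PyYAML; pure-python emitter unless libyaml is wired in

    records = [{'id': i, 'name': 'user%d' % i, 'tags': ['a', 'b']}
               for i in range(1000)]

    for label, dump in [('json', json.dumps), ('yaml', yaml.dump)]:
        elapsed = timeit.timeit(lambda: dump(records), number=10)
        print('%-5s %.3fs for 10 dumps' % (label, elapsed))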

On Wed, Apr 22, 2009 at 9:10 AM, Jesse Noller <jnoller@gmail.com> wrote:
I think that someone should create a new pickle module and put it up on the Cheeseshop. I would prefer explicit dumpJSON and dumpYAML functions, but I realize that breaks the current interface. If people bite then start talking about using it as pickle's implementation. -- David blog: http://www.traceback.org twitter: http://twitter.com/dstanek

On Wed, Apr 22, 2009 at 6:10 AM, Jesse Noller <jnoller@gmail.com> wrote:
Saying "not as fast" is a bit misleading. Roughly 100x slower than json is a more precise description, and is one of the major reasons why a bunch of people stick with json rather than yaml.
In fact, custom objects generally aren't representable with json at all...unless you use custom encoders/decoders like simplejson has. However, in the times when I've used json, being able to store arbitrary Python objects wasn't a huge chore. I just threw a 'to_json()' method on every object that I wanted to serialize; each object knew about its contents and would check for 'to_json()' methods on them as necessary. Deserialization just meant passing the lists, dictionaries, etc., to a base '.from_json()' classmethod, which did all of the right stuff. It was trivial, it worked, and it was fast.
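[Editor's sketch] Josiah's code isn't shown in the thread, so the following is a guessed-at minimal version of the pattern he describes; the Task class and its fields are invented for illustration:

    import json

    class Task(object):
        def __init__(self, title, done=False):
            self.title = title
            self.done = done

        def to_json(self):
            # Reduce to plain structures the json module handles natively.
            return {'title': self.title, 'done': self.done}

        @classmethod
        def from_json(cls, d):
            return cls(d['title'], d['done'])

    blob = json.dumps([t.to_json() for t in [Task('write code'), Task('ship it')]])
    tasks = [Task.from_json(d) for d in json.loads(blob)]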
My vote is to keep it simple and fast. JSON satisfies that. YAML doesn't. While I appreciate the desire to be able to store recursive references, I don't believe it's necessary in the general case. - Josiah

On Apr 21, 2009, at 7:02 PM, Raymond Hettinger wrote:

For starters, I want to say that this is a great idea overall. But with the already existing fast json serializers (simplejson), why not use JSON instead of YAML? I like the YAML language, but JSON serializers exist for many more languages and are usually much more widely used than YAML ones. -- Leonardo Santagada santagada at gmail.com

On Wed, Apr 22, 2009 at 4:31 AM, Josiah Carlson <josiah.carlson@gmail.com>wrote:
Another +1 for json here. I use it to communicate with python and perl on a pretty regular basis, and never have a problem with it. It's a dead simple serialization protocol, very hard to get wrong, and very interoperable.

On 22/04/2009 03:02, Leonardo Santagada wrote:
+1 from me as well. YAML is still evolving, and MUCH more complex than JSON. JSON is simpler and has an IETF RFC to back it (much shorter than the YAML spec -- and the flow diagrams on json.org are really all you need). JSON schema might be used to do schemas [1]. Some kind of versioning would also be very useful [2,3]. I've had a lot of trouble with unpickling pickles for which the classes had changed; you often don't really get a useful error message.

Cheers,

Dirkjan

[1] http://www.json.com/json-schema-proposal/
[2] http://utcc.utoronto.ca/~cks/space/blog/python/PickleNotForSaving
[3] http://utcc.utoronto.ca/~cks/space/blog/python/VersioningPickle
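[Editor's sketch] The versioning point applies to any of the formats under discussion. A common hand-rolled approach (sketched here; not taken from the linked posts) is to stamp every document with a format version and fail loudly, or migrate, on mismatch:

    import json

    FORMAT_VERSION = 2

    def dumps(obj):
        return json.dumps({'version': FORMAT_VERSION, 'data': obj})

    def loads(text):
        doc = json.loads(text)
        version = doc.get('version')
        if version != FORMAT_VERSION:
            # A real system would dispatch to per-version migration code here.
            raise ValueError('unsupported format version: %r' % (version,))
        return doc['data']

At least then "the classes had changed" turns into an explicit version error instead of a confusing unpickling failure.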

On Tue, Apr 21, 2009 at 8:02 PM, Leonardo Santagada <santagada@gmail.com>wrote:
JSON's appeal is in its simplicity, but it's TOO simple to serve as a replacement for pickle. For example, it can't encode recursive objects. Since YAML is a superset of JSON, it's a very natural choice for those already familiar with JSON who need a little more power. -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>
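[Editor's sketch] The recursive-object limitation is easy to demonstrate with the stdlib json module; pickle, by contrast, handles the same structure through its memo:

    import json
    import pickle

    lst = []
    lst.append(lst)  # a list that contains itself

    pickle.loads(pickle.dumps(lst))  # round-trips fine
    json.dumps(lst)                  # raises ValueError: Circular reference detected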

On Tue, Apr 21, 2009 at 5:02 PM, Raymond Hettinger <python@rcn.com> wrote:
+1 on the general idea. I abandoned pickle for JSON/YAML long ago. To be useful, wouldn't a schema validator have to be built-in to the YAML parser? -- Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

Raymond Hettinger <python@...> writes:
* it doesn't compress well

Do you mean the binary representation is already memory-efficient enough? It doesn't sound like a disadvantage.
* it is a major security risk for untrusted inputs
Any untrusted input is a security risk. I don't see how enforcing that the values received are strings or numbers is enough to guarantee security. It all depends on the context. For example, if the strings are meant to be interpreted as filenames, you'd better check that the user doesn't try to mess with system files. Regards Antoine.

On Tue, Apr 21, 2009 at 6:02 PM, Raymond Hettinger <python@rcn.com> wrote:
This is not part of pickle's design goals. Also, I don't think the pickle protocol has ever been a human-friendly format. Even if protocol 0 is ASCII-based, it doesn't mean one would like to edit it by hand.
* it doesn't compress well
Do you have numbers to support this? The last time I tested compression on pickle data, it worked fairly well. In fact, I get a 2.70 compression ratio for some pickles using gzip.

From my experience with pickle, I doubt you can improve significantly on the size of pickled data without using static schemata (like Google Protocol Buffers and Thrift). The only inefficient thing in pickle I am aware of is the handling of PUT and GET opcodes.
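[Editor's sketch] Measuring such a ratio takes only a few lines; the sample data below is invented, and real ratios vary a lot with the content being pickled:

    import pickle
    import zlib

    data = [{'id': i, 'name': 'user%d' % i} for i in range(1000)]
    raw = pickle.dumps(data, protocol=2)
    packed = zlib.compress(raw)
    print('%.2fx' % (len(raw) / float(len(packed))))  # a few x is typical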
* it isn't interoperable with other languages * it doesn't have the ability to enforce a schema
Again, these are not part of pickle's design goals.
* it is a major security risk for untrusted inputs
There are ways to fix this without replacing pickle. See the recipe in the pickle documentation: http://docs.python.org/3.0/library/pickle.html#restricting-globals
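[Editor's sketch] The linked recipe boils down to overriding Unpickler.find_class so that only whitelisted globals can be loaded; roughly (the whitelist contents here are illustrative):

    import io
    import pickle

    SAFE = {('builtins', 'set'), ('builtins', 'frozenset')}

    class RestrictedUnpickler(pickle.Unpickler):
        def find_class(self, module, name):
            # Refuse everything not on the explicit whitelist.
            if (module, name) in SAFE:
                return super().find_class(module, name)
            raise pickle.UnpicklingError(
                'global %s.%s is forbidden' % (module, name))

    def restricted_loads(data):
        # Like pickle.loads(), but refuses non-whitelisted globals.
        return RestrictedUnpickler(io.BytesIO(data)).load()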
But how are you going to handle serialization of class instances in a language independent manner? Regards, -- Alexandre

* it has lost its human readability and editability
This is not part of pickle's design goals.
However, it's one of my design goals for something better than the pickle we have now. One benefit is that it eliminates the need for a pickle disassembler. Another benefit is that valid pickles can be created easily by something other than a pickler (experience with json has shown this to be useful). It's nice for a random javascript fragment or iPhone app to be able to just print a valid pickle for a particular use. It's all about loose coupling.

Also, human readability goes hand in hand with the new design goal of language independence. It's a lot easier to design and test two-way communication with Java, C++, and others if you can easily see what is in the pickle.

Raymond
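[Editor's sketch] A quick illustration of the loose-coupling point: any program that can print text can emit a "pickle" that PyYAML will load on the consuming side (assuming PyYAML is installed there):

    import yaml  # PyYAML, third-party

    # This line could equally have come from a shell script, a javascript
    # fragment, or a person with a text editor:
    text = 'Todo: [go to bank, pick up food, write code]\n'

    print(yaml.safe_load(text))
    # {'Todo': ['go to bank', 'pick up food', 'write code']}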

Alexandre Vassalotti wrote:
On reading that, I notice that it ends with "As our examples shows, you have to be careful with what you allow to be unpickled. Therefore if security is a concern, you may want to consider alternatives such as the marshalling API in xmlrpc.client or third-party solutions." Raymond's proposal is to integrate some third-party solutions with an eye to the product becoming a first-party solution.
