[Python-ideas] An idea for a new pickling tool

Raymond Hettinger python at rcn.com
Wed Apr 22 02:56:25 CEST 2009


>> Python's pickles use a custom format that has evolved over time
>> but they have five significant disadvantages:
>>
>> * it has lost its human readability and editability
>> * it doesn't compress well
>
> Really? Or do you mean "it doesn't have built-in compression support"
> ? I don't expect that running bzip2 over a pickle would produce
> unsatisfactory results, and the API supports reading from and writing
> to streams.
>
> Or do you just mean "the representation is too repetitive and bulky" ?
>
>> * it isn't interoperable with other languages
>> * it doesn't have the ability to enforce a schema
>> * it is a major security risk for untrusted inputs
>
> I agree that pickle doesn't satisfy these. But then again, #1, #3 and
> #4 were never part of its design goals. #5 is indeed a problem.

Pickle does well with its original design goal.

It would be really nice if we also provided a built-in solution that incorporated
the other design goals listed above and adopted a format based on a published
standard.
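
On the compression question quoted above: the bzip2 suggestion can be tried
with the standard library alone.  A minimal sketch, assuming Python 3 and a
made-up list of donor records (the data and variable names are illustrative,
not from the thread):

```python
import bz2
import pickle

# Hypothetical sample data: many similar records, so the pickle is repetitive.
donors = [
    {"name": "donor%d" % i, "age": 18 + i % 13, "blood": "AB"}
    for i in range(1000)
]

raw = pickle.dumps(donors)      # plain pickle bytes
packed = bz2.compress(raw)      # the same bytes run through bzip2

# Round trip: decompress first, then unpickle.
restored = pickle.loads(bz2.decompress(packed))

print(len(raw), len(packed))
```

How much is saved depends entirely on the data; repetitive records like these
compress far better than small or random payloads.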


>But I think there are existing solutions already. For example, I'd say
> that XML+bzip2 satisfies all these already.

No doubt that would work.  There is, however, a pretty high barrier to bringing
together all the right tools (an XML pickler/unpickler providing the equivalent of
pickle.dumps/pickle.loads, a fast XML parser, an XML schema validator, an XML
pretty printer, and data compression).  Even with the right tools brought
together under a single convenient API, it wouldn't be any fun to write the
DTDs for the validator.  I think the barrier is so high that in practice these tools
will rarely be brought together for this purpose; people will instead fall back on
ad hoc approaches geared to a particular application.


> If you want something a
> little less verbose, I recommend looking at Google Protocol Buffers
> (http://code.google.com/apis/protocolbuffers/), 

That is a nice package.  It seems to share several of the goals
listed above (interlanguage data exchange, use of schemas, and security).

I scanned through all of the docs but didn't see a pickle.dumps() style API;
instead, it seems to be focused on making the user build up parts of a non-subclassable
custom object that knows how to serialize itself.  In contrast, PyYAML rides on our
existing __reduce__ logic to fully emulate what pickle can do (meaning that
most apps can add serialization with just a single line).
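
For readers unfamiliar with the __reduce__ protocol mentioned above, here is
a minimal sketch using the standard pickle module.  The Donor class and its
fields are hypothetical, invented for this illustration; the point is only
that an object describes how to rebuild itself, and the serializer does the
rest in one call:

```python
import pickle


class Donor:
    """Hypothetical example class; not from the original post."""

    def __init__(self, name, blood):
        self.name = name
        self.blood = blood

    def __reduce__(self):
        # Tell the serializer how to rebuild this object:
        # call Donor(name, blood) with these saved arguments.
        return (Donor, (self.name, self.blood))


donor = Donor("foo", "A")
data = pickle.dumps(donor)   # the single line that adds serialization
clone = pickle.loads(data)   # rebuilt by calling Donor("foo", "A")

print(clone.name, clone.blood)
```

A serializer that honors the same protocol can reuse the class unchanged,
which is the sense in which a YAML pickler could ride on existing logic.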

It doesn't look like the actual data formatting is based on a published standard,
so it requires the Google tool on each end (with support offered for Python, Java, and C++).

Hope you're not negative on the idea of a compressing, validating, pretty-printing YAML pickler.
Without your support, the idea is dead before it can get started.

FWIW, I found some of the Kwalify examples to be compelling.  Am attaching one
for you guys to look at.  I don't think an equivalent XML solution would come
together as effortlessly or as beautifully.  From a Python point of view, the example boils
down to:  yaml.dump(donors, file) and donors = yaml.load(file, schema=donor_schema).
No extra work is required.  Human readability/editability comes for free, inter-language
operability comes for free, and so do the security guarantees.

I think it would be great if we took a batteries included approach and offered
something like this as part of the standard library.


Raymond


----------- donor_schema ----------
type:      seq
sequence:
  -
    type:      map
    mapping:
     "name":
        type:       str
        required:   yes
     "email":
        type:       str
        required:   yes
        pattern:    /@/
     "password":
        type:       text
        length:     { max: 16, min: 8 }
     "age":
        type:       int
        range:      { max: 30, min: 18 }
        # or assert: 18 <= val && val <= 30
     "blood":
        type:       str
        enum:       [A, B, O, AB]
     "birth":
        type:       date
     "deleted":
        type:       bool
        default:    false
----------- valid document ------------
- name:     foo
  email:    foo@mail.com
  password: xxx123456
  age:      20
  blood:    A
  birth:    1985-01-01
- name:     bar
  email:    bar@mail.net
  age:      25
  blood:    AB
  birth:    1980-01-01
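
To make the validation step concrete, here is a minimal sketch of how rules
like the ones in donor_schema could be checked against already-parsed data.
It is not Kwalify's implementation and not an existing yaml.load() option:
the validate() function, the TYPES table, and the subset of rules handled
(type, required, pattern, enum, range, length) are assumptions made for
illustration, and "default" is ignored.  The schema and documents are written
as the Python objects a YAML parser would be expected to produce, with the
addresses containing a literal @ so that the /@/ pattern can match.

```python
import datetime
import re

# Schema type names mapped to Python types (a subset chosen for this sketch).
TYPES = {"str": str, "text": (str, int, float), "int": int,
         "bool": bool, "date": datetime.date}


def validate(value, rule, path=""):
    """Return a list of error strings; an empty list means the value is valid."""
    kind = rule.get("type", "str")
    where = path or "/"
    if kind == "seq":
        if not isinstance(value, list):
            return ["%s: not a sequence" % where]
        errors = []
        for i, item in enumerate(value):
            errors += validate(item, rule["sequence"][0], "%s/%d" % (path, i))
        return errors
    if kind == "map":
        if not isinstance(value, dict):
            return ["%s: not a mapping" % where]
        errors = []
        for key, sub in rule["mapping"].items():
            if key in value:
                errors += validate(value[key], sub, "%s/%s" % (path, key))
            elif sub.get("required"):
                errors.append("%s/%s: required key is missing" % (path, key))
        for key in value:
            if key not in rule["mapping"]:
                errors.append("%s/%s: key is not in the schema" % (path, key))
        return errors
    # Scalars: bool is a subclass of int in Python, so treat it separately.
    if isinstance(value, bool) != (kind == "bool") or not isinstance(value, TYPES[kind]):
        return ["%s: expected type %s" % (where, kind)]
    errors = []
    # Patterns are written /like this/ in the schema; strip the slashes.
    if "pattern" in rule and not re.search(rule["pattern"].strip("/"), str(value)):
        errors.append("%s: does not match %s" % (where, rule["pattern"]))
    if "enum" in rule and value not in rule["enum"]:
        errors.append("%s: not one of %r" % (where, rule["enum"]))
    limits = rule.get("range", {})
    if ("min" in limits and value < limits["min"]) or \
       ("max" in limits and value > limits["max"]):
        errors.append("%s: out of range" % where)
    limits = rule.get("length", {})
    size = len(str(value))
    if ("min" in limits and size < limits["min"]) or \
       ("max" in limits and size > limits["max"]):
        errors.append("%s: bad length" % where)
    return errors


donor_schema = {"type": "seq", "sequence": [{"type": "map", "mapping": {
    "name": {"type": "str", "required": True},
    "email": {"type": "str", "required": True, "pattern": "/@/"},
    "password": {"type": "text", "length": {"max": 16, "min": 8}},
    "age": {"type": "int", "range": {"max": 30, "min": 18}},
    "blood": {"type": "str", "enum": ["A", "B", "O", "AB"]},
    "birth": {"type": "date"},
    "deleted": {"type": "bool", "default": False},
}}]}

donors = [
    {"name": "foo", "email": "foo@mail.com", "password": "xxx123456",
     "age": 20, "blood": "A", "birth": datetime.date(1985, 1, 1)},
    {"name": "bar", "email": "bar@mail.net", "age": 25, "blood": "AB",
     "birth": datetime.date(1980, 1, 1)},
]

errors = validate(donors, donor_schema)
# A deliberately invalid record: no email, age below 18, lowercase blood type.
bad = validate([{"name": "baz", "age": 15, "blood": "a"}], donor_schema)
print(errors)
print(bad)
```

Collecting every error instead of stopping at the first one is a design
choice of this sketch; it suits hand-edited files, where a person wants to
see all the problems in one pass.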



