[Python-ideas] An idea for a new pickling tool

Guido van Rossum guido at python.org
Wed Apr 22 05:08:18 CEST 2009

On Tue, Apr 21, 2009 at 5:56 PM, Raymond Hettinger <python at rcn.com> wrote:
>>> Python's pickles use a custom format that has evolved over time
>>> but they have five significant disadvantages:
>>> * it has lost its human readability and editability
>>> * it doesn't compress well
>> Really? Or do you mean "it doesn't have built-in compression support"
>> ? I don't expect that running bzip2 over a pickle would produce
>> unsatisfactory results, and the API supports reading from and writing
>> to streams.
>> Or do you just mean "the representation is too repetitive and bulky" ?
>>> * it isn't interoperable with other languages
>>> * it doesn't have the ability to enforce a schema
>>> * it is a major security risk for untrusted inputs
>> I agree that pickle doesn't satisfy these. But then again, #1, #3 and
>> #4 were never part of its design goals. #5 is indeed a problem.
> Pickle does well with its original design goal.
> It would be really nice if we also provided a builtin solution that
> incorporated the other design goals listed above and adopted a format based
> on a published standard.
>> But I think there are existing solutions already. For example, I'd say
>> that XML+bzip2 satisfies all these already.
> No doubt that would work.  There is however a pretty high barrier to
> bringing together all the right tools (an xml pickler/unpickler providing
> the equivalent of pickle.dumps/pickle.loads, a fast xml parser, an xml
> schema validator, an xml pretty printer, and data compression).  Even with
> the right tools brought together under a single convenient API, it wouldn't
> be any fun to write the DTDs for the validator.  I think the barrier is so
> high that in practice these tools will rarely be brought together for this
> purpose and instead are focused on ad hoc approaches geared to a particular
> application.
>> If you want something a
>> little less verbose, I recommend looking at Google Protocol Buffers
>> (http://code.google.com/apis/protocolbuffers/),
> That is a nice package.  It seems to have shared several of the goals
> listed above (interlanguage data exchange, use of schemas, and security).
> I scanned through all of the docs but didn't see a pickle.dumps() style API;
> instead, it seems to be focused on making the user build up parts of a
> non-subclassable custom object that knows how to serialize itself.  In
> contrast, pyyaml rides on our existing __reduce__ logic to fully emulate
> what pickle can do (meaning that most apps can add serialization with just
> a single line).

Right. That is not one of the design goals. (It also generally is
incompatible with several other design goals, like cross-language
support and schema enforcement -- though I now realize you mean the
latter to be optional.)
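
For readers who have not used it, the "single line" style under discussion looks roughly like the sketch below. It uses only the standard library; the donor record is a made-up example, and the last lines merely peek at what __reduce__ returns for a stdlib type (datetime.date) to show the hook that pickle relies on.

```python
import datetime
import pickle

# Hypothetical record, loosely modelled on the donor example in this thread.
donor = {"name": "foo", "age": 20, "birth": datetime.date(1985, 1, 1)}

# One call each way; no per-type serialization code has to be written.
payload = pickle.dumps(donor)
restored = pickle.loads(payload)  # only do this with trusted input

# __reduce__ returns a callable plus the arguments needed to rebuild the
# object; pickle stores a reference to the callable and the arguments.
rebuild, args = donor["birth"].__reduce__()[:2]
rebuilt_birth = rebuild(*args)
```

Because the stream names a Python callable to invoke on load, this convenience is hard to square with cross-language support or with safety on untrusted input.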

> It doesn't look like the actual data formatting is based on a published
> standard, so it requires the Google tool on each end (with support offered
> for Python, Java, and C++).

I'm not too worried about the "published standard" thing. Python
itself doesn't have anything like it either. :-) If you want real
enterprise-level standards compliance, I doubt that anything short of
XML will satisfy those die-hard conservatives. (And they probably
haven't even heard of bzip2.) I don't actually think there's such a
thing as a YAML standard either.
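
As a point of comparison, the pickle-plus-bzip2 combination raised earlier in the thread can be sketched with the standard library alone. This assumes a Python 3 bz2 module that accepts file objects; the record list is invented for illustration and is deliberately repetitive.

```python
import bz2
import io
import pickle

# Hypothetical, deliberately repetitive data set.
records = [{"name": "donor%d" % i, "blood": "AB", "age": 25}
           for i in range(1000)]

raw = pickle.dumps(records)

# pickle.dump/pickle.load accept any file-like object, so a BZ2File
# wrapped around an in-memory buffer can be used as the stream.
buffer = io.BytesIO()
with bz2.BZ2File(buffer, "wb") as stream:
    pickle.dump(records, stream)
compressed = buffer.getvalue()

with bz2.BZ2File(io.BytesIO(compressed), "rb") as stream:
    restored = pickle.load(stream)

print(len(raw), len(compressed))
```

How much is saved depends entirely on the data; the sketch only shows that the streaming API composes with a compressor without extra glue.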

> Hope you're not negative on the idea of a compressing, validating, pretty
> printing, yaml pickler.
> Without your support, the idea is dead before it can get started.

You have my full support -- just very few of my cycles in getting
something working. I think there are enough people around here to
help. I'm skeptical that trying to create something new, whether
standards-based or not, is going to be worth it -- you have to be
careful to define your target audience and the design goals and see if
your solution would actually be enticing for that audience compared to
what they can do today. (Hence my plug for Protocol Buffers -- but
I'll stop now.)

> FWIW, I found some of the Kwalify examples to be compelling.  Am attaching
> one for you guys to look at.  I don't think an equivalent XML solution
> would come together as effortlessly or as beautifully.  From a python point
> of view, the example boils down to:  yaml.dump(donors, file) and
> donors = yaml.load(file, schema=donor_schema).
> No extra work is required.

How easy is it to define a schema though? What about schema migration?
(An explicit goal of Protocol Buffers BTW, and in my experience very

> Human readability/editability comes for free,

How important is that though?

> inter-language operability comes for free, and so do the security
> guarantees.
> I think it would be great if we took a batteries included approach and
> offered something like this as part of the standard library.

First you have to have working code as a 3rd party package with a lot
of happy users.

> Raymond
> ----------- donor_schema ----------
> type:      seq
> sequence:
>  -
>   type:      map
>   mapping:
>    "name":
>       type:       str
>       required:   yes
>    "email":
>       type:       str
>       required:   yes
>       pattern:    /@/
>    "password":
>       type:       text
>       length:     { max: 16, min: 8 }
>    "age":
>       type:       int
>       range:      { max: 30, min: 18 }
>       # or assert: 18 <= val && val <= 30
>    "blood":
>       type:       str
>       enum:       [A, B, O, AB]
>    "birth":
>       type:       date
>    "deleted":
>       type:       bool
>       default:    false
> ----------- valid document ------------
> - name:     foo
>   email:    foo@mail.com
>   password: xxx123456
>   age:      20
>   blood:    A
>   birth:    1985-01-01
> - name:     bar
>   email:    bar@mail.net
>   age:      25
>   blood:    AB
>   birth:    1980-01-01
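
To make the kind of checking in the attached schema concrete, here is a minimal stdlib-only sketch. It is not Kwalify and not a PyYAML API: DONOR_SCHEMA is a hand transcription of part of the schema above into Python dicts, and validate() is a hypothetical helper covering only the required, type, pattern, range and enum rules.

```python
import re

# Subset of the donor schema above, transcribed by hand (an assumption
# made for illustration; a real tool would load the YAML schema itself).
DONOR_SCHEMA = {
    "name":  {"type": str, "required": True},
    "email": {"type": str, "required": True, "pattern": "@"},
    "age":   {"type": int, "range": (18, 30)},
    "blood": {"type": str, "enum": ["A", "B", "O", "AB"]},
}

def validate(record, schema):
    """Return a list of error strings; an empty list means the record passed."""
    errors = []
    for key, rule in schema.items():
        if key not in record:
            if rule.get("required"):
                errors.append("%s: required" % key)
            continue
        value = record[key]
        if not isinstance(value, rule["type"]):
            errors.append("%s: expected %s" % (key, rule["type"].__name__))
            continue
        if "pattern" in rule and not re.search(rule["pattern"], value):
            errors.append("%s: pattern mismatch" % key)
        if "range" in rule:
            low, high = rule["range"]
            if not low <= value <= high:
                errors.append("%s: out of range" % key)
        if "enum" in rule and value not in rule["enum"]:
            errors.append("%s: not in enum" % key)
    return errors

good = {"name": "bar", "email": "bar@mail.net", "age": 25, "blood": "AB"}
bad = {"name": "baz", "age": 40, "blood": "C"}

good_errors = validate(good, DONOR_SCHEMA)
bad_errors = validate(bad, DONOR_SCHEMA)
print(good_errors)
print(bad_errors)
```

The point of the sketch is that the rules are plain data, so the same schema could in principle be checked by a validator written in any language.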

--Guido van Rossum (home page: http://www.python.org/~guido/)
