[Python-ideas] An idea for a new pickling tool
Raymond Hettinger
python at rcn.com
Wed Apr 22 02:56:25 CEST 2009
>> Python's pickles use a custom format that has evolved over time
>> but they have five significant disadvantages:
>>
>> * it has lost its human readability and editability
>> * it doesn't compress well
>
> Really? Or do you mean "it doesn't have built-in compression support"
> ? I don't expect that running bzip2 over a pickle would produce
> unsatisfactory results, and the API supports reading from and writing
> to streams.
>
>Or do you just mean "the representation is too repetitive and bulky" ?
>
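For reference, the "bzip2 over a pickle" suggestion quoted above can be sketched with the standard library alone. This is only an illustration: the helper names dumps_compressed and loads_compressed and the sample data are made up here, and the result is still an ordinary pickle, so it inherits pickle's other limitations.

```python
import bz2
import pickle


def dumps_compressed(obj):
    """Pickle obj, then compress the resulting bytes with bzip2."""
    return bz2.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def loads_compressed(blob):
    """Inverse of dumps_compressed. Only appropriate for trusted input."""
    return pickle.loads(bz2.decompress(blob))


# Made-up, fairly repetitive sample data.
donors = [{"name": "donor%d" % i, "blood": "AB", "age": 20 + i % 10}
          for i in range(1000)]

raw = pickle.dumps(donors, protocol=pickle.HIGHEST_PROTOCOL)
packed = dumps_compressed(donors)
print(len(raw), len(packed))  # compare the uncompressed and compressed sizes
```

Compression addresses size only; the output is neither human-readable nor usable from other languages.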
>> * it isn't interoperable with other languages
>> * it doesn't have the ability to enforce a schema
>> * it is a major security risk for untrusted inputs
>
> I agree that pickle doesn't satisfy these. But then again, #1, #3 and
> #4 were never part of its design goals. #5 is indeed a problem.
Pickle does well with its original design goal.
It would be really nice if we also provided a built-in solution that incorporated
the other design goals listed above and adopted a format based on a published
standard.
>But I think there are existing solutions already. For example, I'd say
> that XML+bzip2 satisfies all these already.
No doubt that would work. There is, however, a pretty high barrier to bringing
together all the right tools (an XML pickler/unpickler providing the equivalent of
pickle.dumps/pickle.loads, a fast XML parser, an XML schema validator, an XML
pretty printer, and data compression). Even with the right tools brought
together under a single convenient API, it wouldn't be any fun to write the
DTDs for the validator. I think the barrier is so high that in practice these tools
will rarely be brought together for this purpose; instead, people fall back on
ad hoc approaches geared to a particular application.
> If you want something a
> little less verbose, I recommend looking at Google Protocol Buffers
> (http://code.google.com/apis/protocolbuffers/),
That is a nice package. It seems to share several of the goals
listed above (interlanguage data exchange, use of schemas, and security).
I scanned through all of the docs but didn't see a pickle.dumps() style API;
instead, it seems to be focused on making the user build up parts of a non-subclassable
custom object that knows how to serialize itself. In contrast, pyyaml rides on our
existing __reduce__ logic to fully emulate what pickle can do (meaning that
most apps can add serialization with just a single line).
It doesn't look like the Protocol Buffers data format is based on a published standard,
so it requires the Google tool on each end (with support offered for Python, Java, and C++).
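To make the __reduce__ point concrete, here is a minimal sketch of how a serializer can lean on that hook instead of asking the user to build a special message object. The Point class and the to_plain/from_plain helpers are hypothetical names invented for this illustration; they are not part of pickle or PyYAML.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __reduce__(self):
        # Tell any __reduce__-aware serializer how to rebuild this object:
        # call Point(self.x, self.y).
        return (Point, (self.x, self.y))


def to_plain(obj):
    """Hypothetical helper: flatten obj into (qualified factory name, args)."""
    factory, args = obj.__reduce__()[:2]
    name = factory.__module__ + "." + factory.__qualname__
    return (name, list(args))


def from_plain(plain, registry):
    """Hypothetical helper: rebuild an object, looking the factory up in an
    explicit allow-list rather than importing whatever name the data gives."""
    name, args = plain
    return registry[name](*args)


p = Point(3, 4)
plain = to_plain(p)               # plain data: a name plus a list of args
registry = {plain[0]: Point}      # the only factory this loader will call
q = from_plain(plain, registry)
```

The plain (name, args) pair is the kind of structure a text format can write out directly, and the allow-list shows one way a loader might avoid calling arbitrary factories.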
Hope you're not negative on the idea of a compressing, validating, pretty printing, yaml pickler.
Without your support, the idea is dead before it can get started.
FWIW, I found some of the Kwalify examples to be compelling. Am attaching one
for you guys to look at. I don't think an equivalent XML solution would come
together as effortlessly or as beautifully. From a Python point of view, the example boils
down to: yaml.dump(donors, file) and donors = yaml.load(file, schema=donor_schema).
No extra work is required. Human readability/editability comes for free, inter-language
operability comes for free, and so do the security guarantees.
I think it would be great if we took a batteries included approach and offered
something like this as part of the standard library.
Raymond
----------- donor_schema ----------
type: seq
sequence:
  -
    type: map
    mapping:
      "name":
        type: str
        required: yes
      "email":
        type: str
        required: yes
        pattern: /@/
      "password":
        type: text
        length: { max: 16, min: 8 }
      "age":
        type: int
        range: { max: 30, min: 18 }
        # or assert: 18 <= val && val <= 30
      "blood":
        type: str
        enum: [A, B, O, AB]
      "birth":
        type: date
      "deleted":
        type: bool
        default: false
----------- valid document -----------
- name: foo
  email: foo@mail.com
  password: xxx123456
  age: 20
  blood: A
  birth: 1985-01-01
- name: bar
  email: bar@mail.net
  age: 25
  blood: AB
  birth: 1980-01-01
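
As a rough illustration of what validating against the donor schema involves, the sketch below checks already-parsed data using only the standard library. It is an assumption-laden toy, not Kwalify or PyYAML: the schema is written out by hand as the Python structure a YAML parser would be expected to produce, the validate() function is hypothetical, "text" is simplified to mean a string or a number, and "default" and "assert" are ignored. The schema= argument to yaml.load mentioned in the message is the proposed spelling, not an existing PyYAML parameter.

```python
import datetime
import re

# Hand-written Python equivalent of the donor schema above.
DONOR_SCHEMA = {
    "type": "seq",
    "sequence": [{
        "type": "map",
        "mapping": {
            "name": {"type": "str", "required": True},
            "email": {"type": "str", "required": True, "pattern": "@"},
            "password": {"type": "text", "length": {"max": 16, "min": 8}},
            "age": {"type": "int", "range": {"max": 30, "min": 18}},
            "blood": {"type": "str", "enum": ["A", "B", "O", "AB"]},
            "birth": {"type": "date"},
            "deleted": {"type": "bool", "default": False},
        },
    }],
}

# Simplified mapping from schema type names to Python types.
TYPES = {"seq": list, "map": dict, "str": str, "text": (str, int, float),
         "int": int, "bool": bool, "date": datetime.date}


def validate(value, rule, path=""):
    """Return a list of error strings; an empty list means value conforms."""
    where = path or "/"
    kind = rule["type"]
    # bool is a subclass of int in Python, so reject it unless asked for.
    if not isinstance(value, TYPES[kind]) or (
            isinstance(value, bool) and kind != "bool"):
        return ["%s: expected %s" % (where, kind)]
    errors = []
    if kind == "seq":
        for i, item in enumerate(value):
            errors += validate(item, rule["sequence"][0], "%s/%d" % (path, i))
    elif kind == "map":
        for key, sub in rule["mapping"].items():
            if key in value:
                errors += validate(value[key], sub, "%s/%s" % (path, key))
            elif sub.get("required"):
                errors.append("%s/%s: required key missing" % (path, key))
        for key in value:
            if key not in rule["mapping"]:
                errors.append("%s/%s: unknown key" % (path, key))
    else:
        if "pattern" in rule and not re.search(rule["pattern"], value):
            errors.append("%s: does not match pattern" % where)
        if "enum" in rule and value not in rule["enum"]:
            errors.append("%s: not one of %r" % (where, rule["enum"]))
        for name, measure in (("range", value), ("length", len(str(value)))):
            limits = rule.get(name, {})
            if "min" in limits and measure < limits["min"]:
                errors.append("%s: %s below minimum" % (where, name))
            if "max" in limits and measure > limits["max"]:
                errors.append("%s: %s above maximum" % (where, name))
    return errors


# The valid document above, as a YAML parser would be expected to load it.
donors = [
    {"name": "foo", "email": "foo@mail.com", "password": "xxx123456",
     "age": 20, "blood": "A", "birth": datetime.date(1985, 1, 1)},
    {"name": "bar", "email": "bar@mail.net",
     "age": 25, "blood": "AB", "birth": datetime.date(1980, 1, 1)},
]
errors = validate(donors, DONOR_SCHEMA)
```

A record such as {"name": "baz", "age": 45, "blood": "C"} would be reported with three problems: the missing email, the out-of-range age, and the unknown blood type.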