Re: [Python-ideas] Adding a safe alternative to pickle in the standard library

23 Feb 2013

On Sat, Feb 23, 2013 at 7:37 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
...
> Devin Jeanpierre writes:
> Nobody's saying we shouldn't bother with security.  Any answer needs
> to be informed by the recognition that nothing we can design is proof
> against the Sufficiently Stupid/Lazy User, that's all I'm trying to
> say.
Sorry. Fair enough.
...
> But security probably does have a cost in terms of inconvenience and
> restriction on capabilities.  My question is "given that people can
> and will do stupid things with relatively safe libraries like json,
> what is the point of providing something intermediate between json and
> pickle?"  In more detail, what features can we provide that don't
> involve the known risks of pickle that would be sufficiently
> attractive to users that they don't go to pickle anyway?
I believe that the features I'm suggesting meet that criterion (but
see below for discussion of risk).

Nothing will ever be sufficient to drive away all unwarranted use of
pickle, but I feel like these two features are really big ones that
would go a long way towards making the secure option almost as easy in
almost every circumstance. They cover everything I've ever personally
wanted, although I can't speak for others.
...
> You mention handling cycles, which adds minimal risk (unprepared code
> could infloop on the unpacked data, but that's not the serializer's
> fault), but also "new" types which isn't clear to me.  If you mean new
> built-in types, can't the json module be extended?  (That would apply
> to cycles as well, since we know it's possible it should be
> automatable.)
It can. This brings up an interesting point. YAML already extends JSON
with cycle support (via aliases) and support for a notation for
marking up nonstandard types (via tagging). For example:

    >>> yaml.load('&mydict {"a": !!python/tuple ["b", *mydict]}')
    {'a': ('b', {...})}

PyYAML is useless security-wise, but if we're going to extend the json
module, this would probably be the direction to go.
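
To make the cycle point concrete, here is a minimal sketch assuming
Python 3 and only the standard library. It uses a list rather than a
tuple for the self-reference, since a self-containing tuple cannot be
built directly in Python; the variable names are just for illustration.

```python
import json
import pickle

# Build a dict that (indirectly) contains itself, similar in shape to
# the YAML example above.
cyclic = {}
cyclic["a"] = ["b", cyclic]

# The json module detects the cycle and refuses to serialize it.
try:
    json.dumps(cyclic)
    json_error = None
except ValueError as exc:
    json_error = exc

# pickle round-trips the same structure and preserves the self-reference.
restored = pickle.loads(pickle.dumps(cyclic))
restored_is_cyclic = restored["a"][1] is restored
```

So plain json gives up on this input today, and any extension of the
json module would need some aliasing notation along the lines of the
YAML one.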
...
> If you mean user-defined types, we're back where we
> started, with merely unpacking data running code whose provenance we
> don't know.
That actually isn't where we started. We started with a serialization
format that includes such data as
"c__builtin__\neval\n(c__builtin__\nraw_input\n(S'py> '\ntRtR." (try
running pickle.loads on that in Python 2).
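
For anyone who would rather not run that, here is a deliberately
harmless sketch of the same mechanism, assuming Python 3. The payload
is hand-written protocol-0 pickle data that names builtins.len instead
of eval, so loading it only computes a length; the point is that the
input, not the program, chooses which callable gets invoked.

```python
import pickle

# Hand-written protocol-0 payload:
#   c builtins\nlen\n   push the global builtins.len
#   ( S'abc'\n t        build the argument tuple ('abc',)
#   R                   call len(*('abc',))
#   .                   stop and return the result
payload = b"cbuiltins\nlen\n(S'abc'\ntR."

# Loading the data calls the function named inside the data.
result = pickle.loads(payload)
print(result)
```

Swapping len for a more dangerous global is all it takes to turn this
into the payload quoted above.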

What I had in mind from the start was something where only whitelisted
constructors are used to reconstitute Python values from the
serialized data. That moves us from trusting the input to trusting
the competence of the authors of the objects in modules that we import.
In cerealizer there is a global registry of classes that profess to
handle input securely. Obviously, they might be wrong, and maybe a
user of a serialization library would want to provide a much smaller
whitelist. Maybe even the bigger whitelist should be disabled by
default, if we really want to be careful, and there should be a
security warning in the docs if you try to use the global registry.

So, for example, there would be the following options:

    # nominally safe; module authors only register if they believe
    # their deserialization code is safe even with untrusted input
    my_unserializer.loads(
        "...", whitelist=my_unserializer.PSEUDOSAFE_GLOBAL_REGISTRY)

    # nominally safe; if it is not, that is a security bug in Python
    my_unserializer.loads("...", whitelist=set())
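
To give a rough idea of what such an interface could look like, here is
a minimal sketch assuming Python 3. It layers a whitelist on top of
pickle by overriding pickle.Unpickler.find_class, which is a real hook;
everything else (WhitelistUnpickler, loads, and the contents of
PSEUDOSAFE_GLOBAL_REGISTRY) is hypothetical and only mirrors the names
used above. It is an illustration of the whitelist idea, not a vetted
sandbox.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical registry of (module, name) pairs whose authors claim that
# reconstruction is safe with untrusted input. The single entry here is
# only an example.
PSEUDOSAFE_GLOBAL_REGISTRY = {("collections", "OrderedDict")}


class WhitelistUnpickler(pickle.Unpickler):
    """Unpickler that resolves only globals named in the whitelist."""

    def __init__(self, file, whitelist):
        super().__init__(file)
        self._whitelist = frozenset(whitelist)

    def find_class(self, module, name):
        # Called whenever the pickle stream asks for a global.
        if (module, name) not in self._whitelist:
            raise pickle.UnpicklingError(
                "global %s.%s is not whitelisted" % (module, name))
        return super().find_class(module, name)


def loads(data, whitelist):
    """Deserialize data, allowing only whitelisted constructors."""
    return WhitelistUnpickler(io.BytesIO(data), whitelist).load()


# Usage: the same bytes load or fail depending on the whitelist.
data = pickle.dumps(OrderedDict(a=1))
allowed = loads(data, whitelist=PSEUDOSAFE_GLOBAL_REGISTRY)

try:
    loads(data, whitelist=set())
    rejected = False
except pickle.UnpicklingError:
    rejected = True

# Plain built-in containers need no globals, so an empty whitelist works.
plain = loads(pickle.dumps({"a": [1, 2]}), whitelist=set())
```

With the empty whitelist, only values the format itself knows how to
build get through, which is the "security bug in Python if it breaks"
case; the registry variant is only as safe as the registered classes.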

-- Devin