Adding a safe alternative to pickle in the standard library

I've been noticing a lot of security-related issues being discussed in the Python world since the Ruby YAML problem came out. Is it time to consider adding an alternative to pickle that is safe(r) by default?
Pickle is usable in situations few other things are, because it can handle cyclic references and virtually any Python object. The only stdlib alternative I'm aware of is json, which can do neither of those things. (Or at least, not without significant extra serialization code.) I would imagine that any alternative supplied should be easy enough to use that pickle users would seriously consider switching, and include at least those features.
The benefit of using a secure alternative to pickle is that it increases the difficulty of creating an insecure application, even for those that are aware of the risks of the pickle module. With the pickle module, you are one mistake away from an insecure program: all you need is to have a way for the attacker to influence input to pickle. With a secure alternative, even if you make that mistake, it doesn't immediately result in a compromised application. You would need another mistake on top of that that results in the deserialized input being used improperly.
The only third party library I'm aware of that attempts to be a safe/usable pickle replacement is cerealizer[1]_. Would it be worth considering adding cerealizer, or something like it, to the stdlib?
.. [1]: http://home.gna.org/oomadness/en/cerealizer/index.html
-- Devin

On Thu, 21 Feb 2013 06:01:19 -0500, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
I've been noticing a lot of security-related issues being discussed in the Python world since the Ruby YAML problem came out. Is it time to consider adding an alternative to pickle that is safe(r) by default?
There's already json. Is something else needed? Regards Antoine.

On Thu, Feb 21, 2013 at 6:11 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I've been noticing a lot of security-related issues being discussed in the Python world since the Ruby YAML problem came out. Is it time to consider adding an alternative to pickle that is safe(r) by default?
There's already json. Is something else needed?
json can't handle cyclic references, and can't handle arbitrary python types. Even if you pass in a custom default and object_pairs_hook to json.dump and json.load respectively, it is impossible to serialize a subclass of (e.g.) dict as anything except the way dict is serialized, which will generally be incorrect. Even if this is changed, creating custom hooks in default and object_pairs_hook is a lot of work compared to using pickle (or, indeed, cerealizer), which handles this automatically. In some circumstances using pickle is clearly the wrong choice (e.g. storing data in cookies), but at the same time it is easier to do the wrong thing than the right thing. -- Devin
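The two limitations described above can be reproduced with a short sketch. The `Tagged` class and the `default` hook below are hypothetical names made up for illustration; only `json.dumps` and `json.loads` are real stdlib calls.

```python
import json


class Tagged(dict):
    """Hypothetical dict subclass whose type we would like to preserve."""


def default(obj):
    # json only calls this hook for objects it cannot serialize natively.
    raise TypeError(f"unhandled type: {type(obj).__name__}")


# The subclass is written out exactly like a plain dict; `default` is
# never consulted, so the type information is silently lost.
encoded = json.dumps(Tagged(a=1), default=default)
decoded = json.loads(encoded)
print(encoded, type(decoded).__name__)

# A self-referencing container is rejected outright.
cycle = []
cycle.append(cycle)
try:
    json.dumps(cycle)
except ValueError as exc:
    cycle_error = str(exc)
    print(cycle_error)
```

Because the hook is never invoked for `dict` subclasses, any type tagging has to happen in a preprocessing pass before the data reaches `json.dumps`.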

On 21 February 2013 08:47, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
On Thu, Feb 21, 2013 at 6:11 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I've been noticing a lot of security-related issues being discussed in the Python world since the Ruby YAML problem came out. Is it time to consider adding an alternative to pickle that is safe(r) by default?
There's already json. Is something else needed?
json can't handle cyclic references, and can't handle arbitrary python types. Even if you pass in a custom default and object_pairs_hook to json.dump and json.load respectively, it is impossible to serialize a subclass of (e.g.) dict as anything except the way dict is serialized, which will generally be incorrect.
Even if this is changed, creating custom hooks in default and object_pairs_hook is a lot of work compared to using pickle (or, indeed, cerealizer), which handles this automatically.
In some circumstances using pickle is clearly the wrong choice (e.g. storing data in cookies), but at the same time it is easier to do the wrong thing than the right thing.
Do you think a couple of helper functions in json could help? Functions that would translate a complex Python object into a dictionary containing all type information and object metadata, yet still yield a simple dictionary. (Instance attributes of the root object would be under that dictionary's "__dict__" key, for example.) Cyclic references would require something more complex, but just these could allow one to json-serialize arbitrary objects. Maybe these helper functions could be in the json module itself.
-- Devin _______________________________________________ Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas
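A rough sketch of what such helpers might look like follows. The names `to_jsonable` and `from_jsonable`, the "__class__" key, and the `allowed` mapping are all assumptions made for illustration; nothing like them exists in the json module. The sketch does not handle cycles, objects without a `__dict__`, or user dictionaries that happen to contain a "__class__" key, and tuples come back as lists.

```python
import json


# Hypothetical helper names; not part of the json module.
def to_jsonable(obj):
    """Recursively turn a cycle-free object graph into plain JSON types."""
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(item) for item in obj]
    if isinstance(obj, dict):
        return {str(key): to_jsonable(value) for key, value in obj.items()}
    cls = type(obj)
    return {
        "__class__": f"{cls.__module__}.{cls.__qualname__}",
        "__dict__": to_jsonable(vars(obj)),
    }


def from_jsonable(data, allowed):
    """Rebuild objects, instantiating only classes listed in `allowed`."""
    if isinstance(data, list):
        return [from_jsonable(item, allowed) for item in data]
    if isinstance(data, dict):
        if "__class__" in data:
            cls = allowed[data["__class__"]]  # KeyError for unknown classes
            obj = cls.__new__(cls)  # skip __init__, as pickle does by default
            obj.__dict__.update(from_jsonable(data["__dict__"], allowed))
            return obj
        return {key: from_jsonable(value, allowed) for key, value in data.items()}
    return data


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


text = json.dumps(to_jsonable({"origin": Point(0, 0), "path": [Point(1, 2)]}))
allowed = {f"{Point.__module__}.{Point.__qualname__}": Point}
restored = from_jsonable(json.loads(text), allowed)
```

Passing an explicit `allowed` mapping on the loading side means the data can only name classes the application has already chosen to accept, rather than anything importable.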

On 2/21/2013 6:11 AM, Antoine Pitrou wrote:
On Thu, 21 Feb 2013 06:01:19 -0500, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
I've been noticing a lot of security-related issues being discussed in the Python world since the Ruby YAML problem came out. Is it time to consider adding an alternative to pickle that is safe(r) by default?
There's already json. Is something else needed?
As stated elsewhere, it's cycles and especially arbitrary python objects that are the big draw for pickle. I've always wanted a version of pickle.loads() that takes a list of classes that are allowed to be instantiated. Often, when using pickle to serialize over say AMQP or some other transport, I know what classes I want to allow. Anything else is either a (not infrequent) logic error or an attack of some sort. I realize this isn't perfect, but it would certainly reduce the attack surface for many of my use cases. I'm already authenticating the sender, and when I'm really paranoid I also sign the pickles. Just a thought. -- Eric.

On Thursday 21 Feb 2013, Eric V. Smith wrote:
On 2/21/2013 6:11 AM, Antoine Pitrou wrote:
On Thu, 21 Feb 2013 06:01:19 -0500, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
I've been noticing a lot of security-related issues being discussed in the Python world since the Ruby YAML problem came out. Is it time to consider adding an alternative to pickle that is safe(r) by default?
There's already json. Is something else needed?
As stated elsewhere, it's cycles and especially arbitrary python objects that are the big draw for pickle.
I've always wanted a version of pickle.loads() that takes a list of classes that are allowed to be instantiated. Often, when using pickle to serialize over say AMQP or some other transport, I know what classes I want to allow. Anything else is either a (not infrequent) logic error or an attack of some sort.
I realize this isn't perfect, but it would certainly reduce the attack surface for many of my use cases. I'm already authenticating the sender, and when I'm really paranoid I also sign the pickles.
Just a thought.
Is this not better solved by other methods? I.e. wasteful, but effective would be to send it all by XML.

On 2/21/2013 8:39 AM, Mark Hackett wrote:
On Thursday 21 Feb 2013, Eric V. Smith wrote:
On 2/21/2013 6:11 AM, Antoine Pitrou wrote:
On Thu, 21 Feb 2013 06:01:19 -0500, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
I've been noticing a lot of security-related issues being discussed in the Python world since the Ruby YAML problem came out. Is it time to consider adding an alternative to pickle that is safe(r) by default?
There's already json. Is something else needed?
As stated elsewhere, it's cycles and especially arbitrary python objects that are the big draw for pickle.
I've always wanted a version of pickle.loads() that takes a list of classes that are allowed to be instantiated. Often, when using pickle to serialize over say AMQP or some other transport, I know what classes I want to allow. Anything else is either a (not infrequent) logic error or an attack of some sort.
I realize this isn't perfect, but it would certainly reduce the attack surface for many of my use cases. I'm already authenticating the sender, and when I'm really paranoid I also sign the pickles.
Just a thought.
Is this not better solved by other methods? I.e. wasteful, but effective would be to send it all by XML.
Sure. I could write a serializer (to XML or whatever) that handles graphs of arbitrary python objects, but then I'm duplicating most of what pickle does. I'd rather leverage all of the work that pickle represents. Maybe I'll write a patch just to see what's involved. -- Eric.

On Thu, 21 Feb 2013 08:32:47 -0500, "Eric V. Smith" <eric@trueblade.com> wrote:
On 2/21/2013 6:11 AM, Antoine Pitrou wrote:
On Thu, 21 Feb 2013 06:01:19 -0500, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
I've been noticing a lot of security-related issues being discussed in the Python world since the Ruby YAML problem came out. Is it time to consider adding an alternative to pickle that is safe(r) by default?
There's already json. Is something else needed?
As stated elsewhere, it's cycles and especially arbitrary python objects that are the big draw for pickle.
Of course, but its being powerful is also what makes pickle dangerous.
I've always wanted a version of pickle.loads() that takes a list of classes that are allowed to be instantiated.
Is the following enough for you: http://docs.python.org/3.4/library/pickle.html#restricting-globals ? Regards Antoine.
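For readers who have not followed the link, the technique in that section of the docs is to subclass `pickle.Unpickler` and override `find_class`. A minimal sketch along those lines is below; the `ALLOWED` set and the `restricted_loads` helper are illustrative choices, not a stdlib API.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical whitelist: (module, name) pairs this application accepts.
ALLOWED = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks for a global by name.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but refuses globals outside ALLOWED."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


ok = restricted_loads(pickle.dumps(OrderedDict(a=1)))

try:
    # Functions are pickled by reference, here as builtins.print.
    restricted_loads(pickle.dumps(print))
except pickle.UnpicklingError as exc:
    rejected = str(exc)
```

This narrows what a pickle can name, but every class placed in the whitelist still has to be safe to construct from attacker-chosen state.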

On 2/21/2013 9:00 AM, Antoine Pitrou wrote:
On Thu, 21 Feb 2013 08:32:47 -0500, "Eric V. Smith" <eric@trueblade.com> wrote:
On 2/21/2013 6:11 AM, Antoine Pitrou wrote:
On Thu, 21 Feb 2013 06:01:19 -0500, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
I've been noticing a lot of security-related issues being discussed in the Python world since the Ruby YAML problem came out. Is it time to consider adding an alternative to pickle that is safe(r) by default?
There's already json. Is something else needed?
As stated elsewhere, it's cycles and especially arbitrary python objects that are the big draw for pickle.
Of course, but its being powerful is also what makes pickle dangerous.
I've always wanted a version of pickle.loads() that takes a list of classes that are allowed to be instantiated.
Is the following enough for you: http://docs.python.org/3.4/library/pickle.html#restricting-globals ?
Indeed, it is. Thanks for pointing it out! I've never gotten past the module interface part of the docs. Maybe the warning at the top of the page could also mention that there are ways to mitigate the safety concerns, and point to #restricting-globals? -- Eric.

On Thu, 21 Feb 2013 09:11:20 -0500, "Eric V. Smith" <eric@trueblade.com> wrote:
On 2/21/2013 9:00 AM, Antoine Pitrou wrote:
On Thu, 21 Feb 2013 08:32:47 -0500, "Eric V. Smith" <eric@trueblade.com> wrote:
On 2/21/2013 6:11 AM, Antoine Pitrou wrote:
On Thu, 21 Feb 2013 06:01:19 -0500, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
I've been noticing a lot of security-related issues being discussed in the Python world since the Ruby YAML problem came out. Is it time to consider adding an alternative to pickle that is safe(r) by default?
There's already json. Is something else needed?
As stated elsewhere, it's cycles and especially arbitrary python objects that are the big draw for pickle.
Of course, but its being powerful is also what makes pickle dangerous.
I've always wanted a version of pickle.loads() that takes a list of classes that are allowed to be instantiated.
Is the following enough for you: http://docs.python.org/3.4/library/pickle.html#restricting-globals ?
Indeed, it is. Thanks for pointing it out! I've never gotten past the module interface part of the docs. Maybe the warning at the top of the page could also mention that there are ways to mitigate the safety concerns, and point to #restricting-globals?
Yes, that would be a good idea :-) Regards Antoine.

This conversation worries me. The security community has shown that safety isn't something you can add to a powerful tool. With great power comes great expressivity, and correspondingly more difficulty reasoning about it. Not to mention reasoning about the implementation. JSON is probably secure against code-execution exploits, but only probably. When you put something in the stdlib and call it "safe", even with caveats, people will make even more brazen mistakes than with a documented-unsafe tool like pickle. Dustin

On 2013-02-21, at 16:50 , Dustin J. Mitchell wrote:
This conversation worries me. The security community has shown that safety isn't something you can add to a powerful tool. With great power comes great expressivity, and correspondingly more difficulty reasoning about it. Not to mention reasoning about the implementation. JSON is probably secure against code-execution exploits, but only probably.
Considering there's no provision whatsoever in JSON itself for directing any kind of execution or programmatic-ish behavior (as opposed to YAML and, from what I understand, XML), why "only probably"? I could see JSON implementations having vulnerabilities and applications using JSON to do unsafe things (e.g. eval'ing JSON-sourced strings), but JSON itself?

On 2/21/2013 11:19 AM, Masklinn wrote:
On 2013-02-21, at 16:50 , Dustin J. Mitchell wrote:
This conversation worries me. The security community has shown that safety isn't something you can add to a powerful tool. With great power comes great expressivity, and correspondingly more difficulty reasoning about it. Not to mention reasoning about the implementation. JSON is probably secure against code-execution exploits, but only probably.
Considering there's no provision whatsoever in JSON itself for directing any kind of execution or programmatic-ish behavior (as opposed to YAML and, from what I understand, XML), why "only probably"?
I was going to say, "YAML the format does not include execution," but then I went to read the YAML spec about the !! notation, and I honestly have no idea what it means. YAML is scary... --Ned.
I could see JSON implementations having vulnerabilities and applications using JSON to do unsafe things (e.g. eval'ing JSON-sourced strings), but JSON itself?

On Thu, Feb 21, 2013 at 10:50 AM, Dustin J. Mitchell <dustin@v.igoro.us> wrote:
When you put something in the stdlib and call it "safe", even with caveats, people will make even more brazen mistakes than with a documented-unsafe tool like pickle.
Then how do we improve on the status quo? The best situation can't possibly be one in which the standard serialization tool allows for code injection exploits out of the box, by default, and where there is no reasonable alternative in the stdlib without such problems. To my ears, this objection is like objecting to the inclusion of raw_input. Surely people will make even more brazen mistakes with a so-called "safe" input method like raw_input, than with a documented-unsafe tool like input()? -- Devin

On Thursday 21 Feb 2013, Devin Jeanpierre wrote:
On Thu, Feb 21, 2013 at 10:50 AM, Dustin J. Mitchell <dustin@v.igoro.us> wrote:
When you put something in the stdlib and call it "safe", even with caveats, people will make even more brazen mistakes than with a documented-unsafe tool like pickle.
Then how do we improve on the status quo? The best situation can't possibly be one in which the standard serialization tool allows for code injection exploits out of the box, by default, and where there is no reasonable alternative in the stdlib without such problems.
By writing your application for its needs, not the needs of 10000 programs yet to be written and making the wrong assumption and putting it in a stdlib. If every problem could be solved with a stdlib call, there'd only have to be one programmer in the world...

On Thu, 21 Feb 2013 17:22:47 +0000, Mark Hackett <mark.hackett@metoffice.gov.uk> wrote:
On Thursday 21 Feb 2013, Devin Jeanpierre wrote:
On Thu, Feb 21, 2013 at 10:50 AM, Dustin J. Mitchell <dustin@v.igoro.us> wrote:
When you put something in the stdlib and call it "safe", even with caveats, people will make even more brazen mistakes than with a documented-unsafe tool like pickle.
Then how do we improve on the status quo? The best situation can't possibly be one in which the standard serialization tool allows for code injection exploits out of the box, by default, and where there is no reasonable alternative in the stdlib without such problems.
By writing your application for its needs, not the needs of 10000 programs yet to be written and making the wrong assumption and putting it in a stdlib.
If every problem could be solved with a stdlib call, there'd only have to be one programmer in the world...
You're forgetting the millions of stdlib programmers :-) Regards Antoine.

On Thursday 21 Feb 2013, Antoine Pitrou wrote:
Le Thu, 21 Feb 2013 17:22:47 +0000, Mark Hackett <mark.hackett@metoffice.gov.uk> a
If every problem could be solved with a stdlib call, there'd only have to be one programmer in the world...
You're forgetting the millions of stdlib programmers :-)
Regards
Antoine.
But only because 10,000 different stdlib calls are needed! (I should shush now, I don't want to give away the Programmer's Secret. Even if it IS very similar to the Lawyer's Secret)... Being serious, though, if your code requires a serious amount of security, you're better off writing your own parsing. It does mean the poor guy taking over has to use this "stdlib", but that's more work for programmers. Oh no, I said it...! AaaarrrgghhCARRIER LOST

On 22/02/13 04:33, Mark Hackett wrote:
Being serious, though, if your code requires a serious amount of security, you're better off writing your own parsing.
If you're serious about security, you don't want amateurs trying to build security from scratch. And that includes yourself, if you are not a security expert. A programmer ought to be aware of their own limitations. I am not a security expert, and I don't have the time or inclination to become one. I want, no, I *need*, solutions for common problems to be safe by default, or at least for their vulnerabilities to be documented clearly and obviously in language I can understand, so I can write code with reasonable levels of security instead of inventing my own insecure, unsafe solutions. I know enough not to call eval() on data retrieved from untrusted sources. Not everyone even knows that much. I've seen code that literally downloaded content from a website, then eval'ed it without even a token attempt to sanitize it. Do you expect this person to write his own secure data serialiser? Anyone can write code with no security vulnerabilities that *they* can see. And frequently do. -- Steven

On Friday 22 Feb 2013, Steven D'Aprano wrote:
On 22/02/13 04:33, Mark Hackett wrote:
Being serious, though, if your code requires a serious amount of security, you're better off writing your own parsing.
If you're serious about security, you don't want amateurs trying to build security from scratch.
And if your code needs to be secure and you aren't capable of doing so, then pay someone to do it who is capable. You know, get a security expert.

Steven D'Aprano writes:
A programmer ought to be aware of their own limitations. I am not a security expert, and I don't have the time or inclination to become one. I want, no, I *need*, solutions for common problems to be safe by default, or at least for their vulnerabilities to be documented clearly and obviously in language I can understand, so I can write code with reasonable levels of security instead of inventing my own insecure, unsafe solutions.
Sure. So just use JSON where it will do, and avoid pickle. No? Sure, you can make a case that "restricted pickle" would give you a trivial upgrade path if you find you really need it later. But it seems to me that if you think you need a protocol that executes serialized code automatically, you've got a heck of a lot of security work to do, beside which the effort to port from JSON API to pickle API is tiny.

On Feb 21, 2013, at 9:24, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Thu, 21 Feb 2013 17:22:47 +0000, Mark Hackett <mark.hackett@metoffice.gov.uk> wrote:
On Thursday 21 Feb 2013, Devin Jeanpierre wrote:
On Thu, Feb 21, 2013 at 10:50 AM, Dustin J. Mitchell <dustin@v.igoro.us> wrote:
When you put something in the stdlib and call it "safe", even with caveats, people will make even more brazen mistakes than with a documented-unsafe tool like pickle.
Then how do we improve on the status quo? The best situation can't possibly be one in which the standard serialization tool allows for code injection exploits out of the box, by default, and where there is no reasonable alternative in the stdlib without such problems.
By writing your application for its needs, not the needs of 10000 programs yet to be written and making the wrong assumption and putting it in a stdlib.
If every problem could be solved with a stdlib call, there'd only have to be one programmer in the world...
You're forgetting the millions of stdlib programmers :-)
This is one of those "any sufficiently powerful language becomes lisp" things, isn't it. :)
Regards
Antoine.

How often have you needed either cyclic references or the ability to dynamically store arbitrary classes in something like a cookie or a cache file?
It seems to me that when you're trying to do something that would be difficult to do with json (with the new helpers proposed by someone earlier in the thread), you're usually doing something too dangerous to store in a cookie anyway. If someone can construct instances of arbitrary classes, they can run code. If they can't, why did you need pickle?
I realize the overlap is just small, not zero. But I think catering to it could be an attractive nuisance that would lead to more unsafe code rather than less. It's always easier to preserve safety while adding features to something simple, than to add safety while restricting features in something powerful.
Sorry for the top post.
Sent from a random iPhone
On Feb 21, 2013, at 9:18, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
On Thu, Feb 21, 2013 at 10:50 AM, Dustin J. Mitchell <dustin@v.igoro.us> wrote:
When you put something in the stdlib and call it "safe", even with caveats, people will make even more brazen mistakes than with a documented-unsafe tool like pickle.
Then how do we improve on the status quo? The best situation can't possibly be one in which the standard serialization tool allows for code injection exploits out of the box, by default, and where there is no reasonable alternative in the stdlib without such problems.
To my ears, this objection is like objecting to the inclusion of raw_input. Surely people will make even more brazen mistakes with a so-called "safe" input method like raw_input, than with a documented-unsafe tool like input()?
-- Devin

From: Andrew Barnert
How often have you needed either cyclic references or the ability to dynamically store arbitrary classes in something like a cookie or a cache file?
In a past life I used pickle regularly to snapshot long-running (evolutionary) algorithms that used user-provided classes and all sorts of highly improper circular references. And there are plenty of researchers out there using Python for much crazier things than I ever did. There is a lot more to Python than web apps... Steve

Steve Dower writes:
In a past life I used pickle regularly to snapshot long-running (evolutionary) algorithms that used user-provided classes
And how do you propose to prevent user-provided exploits, then? Nobody wants to take away the power of pickle if it imposes only risks you're happy to bear. The question here is "is it possible to be *safer* than pickle without giving up any of the power?"

On Thu, Feb 21, 2013 at 1:29 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Steve Dower writes:
In a past life I used pickle regularly to snapshot long-running (evolutionary) algorithms that used user-provided classes
And how do you propose to prevent user-provided exploits, then?
Just because an application has one place where someone can inject new code, doesn't mean it should have another. You might trust the people that write these evolutionary algorithm classes, but not trust people that give you snapshots of the algorithms running.
Nobody wants to take away the power of pickle if it imposes only risks you're happy to bear. The question here is "is it possible to be *safer* than pickle without giving up any of the power?"
I hope nobody is asking that question, because the answer is a strong no. Pickle's ability to call arbitrary objects accessible in any module, anywhere, is part of how powerful it is, but it is also a fundamental source of unsafety. That does not mean that we should not write or use safer alternatives. We have written and do use safer alternatives, like the json module. But it means we can't expect them to be usable exactly everywhere pickle is. I would've said the question is how far in that direction we should bother to go. How many features do you add before you're increasing risk from faulty code, rather than decreasing it by making it easier to use a secure-by-design library? -- Devin
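The point about pickle calling arbitrary objects can be seen with a deliberately harmless sketch. The `Innocent` class is made up for illustration; `__reduce__` is the real hook pickle consults, and `os.getcwd` stands in for whatever callable an attacker would choose.

```python
import os
import pickle


class Innocent:
    def __reduce__(self):
        # Tells the unpickler: "to rebuild me, call os.getcwd()".
        # A hostile pickle could name os.system with arguments instead.
        return (os.getcwd, ())


payload = pickle.dumps(Innocent())

# Loading calls os.getcwd(); no Innocent instance is ever constructed.
result = pickle.loads(payload)
print(type(result).__name__)
```

Nothing in the payload needs the `Innocent` class to exist on the receiving side, which is why simply not defining dangerous classes is no protection.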

Devin Jeanpierre writes:
That does not mean that we should not write or use safer alternatives. We have written and do use safer alternatives, like the json module.
Then why do we need a "safe alternative to pickle" when json is already in the standard library?
But it means we can't expect them to be usable exactly everywhere pickle is. I would've said the question is how far in that direction we should bother to go.
OK, that is a better way to put what I have in mind. Well, we've already gone as far as json, which is pretty powerful (but still subject to attacks using "relatively secure" json to transport "insecure" data!) Why do we need an alternative *between* pickle and json? Maybe we should advocate that users think seriously about securing channels, and validating the pickles before doing anything with them, if they think they need more features than json offers?

On Fri, Feb 22, 2013 at 7:29 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
That does not mean that we should not write or use safer alternatives. We have written and do use safer alternatives, like the json module.
Then why do we need a "safe alternative to pickle" when json is already in the standard library?
json can't handle cycles or new types -- at least, not without a rather significant amount of preprocessing and postprocessing that is little less work than writing your own serialization library from scratch. I do believe that support for these things is important. People do want to store frozensets of tuples, and objects which reference themselves, and so on. The easy way to do this is with pickle, and that encourages people to use pickle even where it is a bad idea. The easier it is to do something safer, the more likely it is people will do something safer. I've seen people use pickle even when advised that it was a security risk, because they didn't want to go through the effort of using the json module. To me that signals that something is wrong.
Well, we've already gone as far as json, which is pretty powerful (but still subject to attacks using "relatively secure" json to transport "insecure" data!)
Of course a serialization library can't protect against eval(deserialize(foo)) running arbitrary code. That doesn't mean we shouldn't bother with security.
Why do we need an alternative *between* pickle and json? Maybe we should advocate that users think seriously about securing channels, and validating the pickles before doing anything with them, if they think they need more features than json offers?
Signed pickles and secured channels and so on don't solve the problem of trying to get data from an untrusted user. Moreover I think it isn't an uncommon opinion that, even in that case, it'd be better to use json or some other nominally secure library instead of pickle, since it increases the number of mistakes necessary to compromise your security. -- Devin

On Feb 22, 2013, at 5:26, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
Well, we've already gone as far as json, which is pretty powerful (but still subject to attacks using "relatively secure" json to transport "insecure" data!)
Of course a serialization library can't protect against eval(deserialize(foo)) running arbitrary code. That doesn't mean we shouldn't bother with security.
The difference is that json.loads is just deserialize(foo), while pickle.loads inherently has some eval mixed in. That's why I think for most use cases, the answer is making json easier to extend, not making pickle easier to secure. The biggest problem people have with the json library isn't that you have to do the extending explicitly and externally, but that it's a huge pain to do so. There was a suggestion earlier in this thread (I forget the author) that would go a long way toward relieving that pain. Some people also want it to be implicitly extensible, to have some way to create an instance of a new empty class named Foo with given attributes (but not an existing builtin or user-defined class named Foo). I'm not sure what their use case is, and I'm not sure it's a good idea, but if it is, there was also a suggestion for that.

On Fri, Feb 22, 2013 at 12:41 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
The difference is that json.loads is just deserialize(foo), while pickle.loads inherently has some eval mixed in.
That's why I think for most use cases, the answer is making json easier to extend, not making pickle easier to secure.
My original suggestion was to add a third thing, such as cerealizer, not to restrict pickle or extend json. Some others have talked about restricting pickle, but I don't know how one could do that and still be confident in the safety of the end product. You usually build things to be safe from the ground up, not as some afterthought with a few restrictions.
The biggest problem people have with the json library isn't that you have to do the extending explicitly and externally, but that it's a huge pain to do so. There was a suggestion earlier in this thread (I forget the author) that would go a long way toward relieving that pain.
I feel that'd be very helpful, yes. Obviously not as helpful as something that can handle cyclic references, but those aren't really as important. Besides which, a yaml module could synthesize something more complete out of these pieces (YAML is like JSON, but with support for cyclic references and some extra syntax). My issue is making safe serialization easier, so that not using pickle is a viable option. As you say, we can go a long way towards this using the json module. -- Devin

To take this back to the ideas stage, one idea might be to integrate hmac into pickle. At a minimum, provide some sample code showing how to wrap an hmac around a pickled object. Related to this I note that it would be helpful if the hmac docs gave some advice on key generation (e.g., suggested length). Or to be a little more convenient, add new methods like these: pickle.set_hmac_key([key]) Sets the key used when pickle hmacs are generated. The key is as expected by hmac.new. If key is not provided or this function is not called before using an hmac, a random key is generated that will vary each time the program is run. This is useful if you do not want pickles to be reusable between different runs of the program. pickle.dump_hmac(obj, file[, protocol]) Same as pickle.dump except that it attaches an hmac to the pickled data. pickle.dumps_hmac(obj[, protocol]) Same as pickle.dumps except that it attaches an hmac to the pickled data. pickle.load_hmac(file) Same as pickle.load except that it verifies and removes an hmac as attached by dump_hmac. Raises UnpicklingHmacError if the hmac cannot be verified. pickle.loads_hmac(string) Same as pickle.loads except that it (1) verifies and removes an hmac as attached by dump_hmac and (2) it does not ignore extra characters. Raises UnpicklingHmacError if the hmac cannot be verified. The reason I suggest setting the hmac key at a global level rather than in each call is that this eliminates the need for either passing around the key or generating keys at multiple points in the code. If a key were passed in each call, it would have the benefit that a program could use multiple keys to ensure that pickles from one part of the program were not unpickled in other parts. This seems like a heavy-handed use of the feature. 
The reason I suggest using new methods rather than adding a keyword arg to the current methods is that this facilitates wholesale replacement of pickle.dump with pickle.dump_hmac and I don't envision an explosion of variations. Usually I'm an advocate of doing it the other way around. :-) --- Bruce Latest blog post: Alice's Puzzle Page http://www.vroospeak.com
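For readers who want to see the shape of this proposal, here is a minimal sketch of the string-based pair (dumps_hmac / loads_hmac) written as ordinary module-level functions rather than as additions to pickle. None of these names exist in the standard library; the SHA-256 digest, the 32-byte random default key, and the "MAC first, then payload" layout are all assumptions made for illustration.

```python
import hashlib
import hmac
import pickle
import secrets

# Hypothetical module-level key; a fresh random key per run means pickles
# from one run cannot be loaded in another unless set_hmac_key() is called.
_key = secrets.token_bytes(32)
_DIGEST = hashlib.sha256
_MAC_LEN = _DIGEST().digest_size


class UnpicklingHmacError(pickle.UnpicklingError):
    """Raised when the attached HMAC does not match the pickled payload."""


def set_hmac_key(key=None):
    """Set the signing key; with no argument, generate a new random key."""
    global _key
    _key = secrets.token_bytes(32) if key is None else key


def dumps_hmac(obj, protocol=None):
    """Pickle obj and prepend an HMAC computed over the pickled bytes."""
    payload = pickle.dumps(obj, protocol)
    mac = hmac.new(_key, payload, _DIGEST).digest()
    return mac + payload


def loads_hmac(data):
    """Verify the HMAC before unpickling; refuse to unpickle on mismatch."""
    mac, payload = data[:_MAC_LEN], data[_MAC_LEN:]
    expected = hmac.new(_key, payload, _DIGEST).digest()
    # compare_digest avoids leaking how many leading bytes matched.
    if not hmac.compare_digest(mac, expected):
        raise UnpicklingHmacError("HMAC verification failed")
    return pickle.loads(payload)
```

Because the MAC covers every byte after the prefix, trailing garbage changes the computed digest and is rejected, which gives the "does not ignore extra characters" behaviour for free. This only protects against tampering by someone who lacks the key; it does nothing for data that genuinely originates from an untrusted party.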

On Fri, Feb 22, 2013 at 11:08 AM, Bruce Leban <bruce@leapyear.org> wrote:
To take this back to the ideas stage, one idea might be to integrate hmac into pickle. At a minimum, provide some sample code showing how to wrap an hmac around a pickled object.
This sounds very much like the `itsdangerous` library (which uses JSON by default, but the serializer backend is pluggable): http://pythonhosted.org/itsdangerous/ Cheers, Chris

On Friday 22 Feb 2013, Devin Jeanpierre wrote:
On Fri, Feb 22, 2013 at 12:41 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
The difference is that json.loads is just deserialize(foo), while pickle.loads inherently has some eval mixed in.
That's why I think for most use cases, the answer is making json easier to extend, not making pickle easier to secure.
My original suggestion was to add a third thing, such as cerealizer,
Maybe we can ask Wil WHEATon... Apologies...

Devin Jeanpierre writes:
Well, we've already gone as far as json, which is pretty powerful (but still subject to attacks using "relatively secure" json to transport "insecure" data!)
Of course a serialization library can't protect against eval(deserialize(foo)) running arbitrary code. That doesn't mean we shouldn't bother with security.
Nobody's saying we shouldn't bother with security. Any answer needs to be informed by the recognition that nothing we can design is proof against the Sufficiently Stupid/Lazy User, that's all I'm trying to say. But security probably does have a cost in terms of inconvenience and restriction on capabilities. My question is "given that people can and will do stupid things with relatively safe libraries like json, what is the point of providing something intermediate between json and pickle?" In more detail, what features can we provide that don't involve the known risks of pickle that would be sufficiently attractive to users that they don't go to pickle anyway? You mention handling cycles, which adds minimal risk (unprepared code could infloop on the unpacked data, but that's not the serializer's fault), but also "new" types which isn't clear to me. If you mean new built-in types, can't the json module be extended? (That would apply to cycles as well, since we know it's possible it should be automatable.) If you mean user-defined types, we're back where we started, with merely unpacking data running code whose provenance we don't know.
Why do we need an alternative *between* pickle and json? Maybe we should advocate that users think seriously about securing channels, and validating the pickles before doing anything with them, if they think they need more features than json offers?
Signed pickles and secured channels and so on don't solve the problem of trying to get data from an untrusted user.
Yeah yeah, sorry about the red herring. It's the first question that matters.

On Sat, Feb 23, 2013 at 7:37 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Devin Jeanpierre writes: Nobody's saying we shouldn't bother with security. Any answer needs to be informed by the recognition that nothing we can design is proof against the Sufficiently Stupid/Lazy User, that's all I'm trying to say.
Sorry. Fair enough.
But security probably does have a cost in terms of inconvenience and restriction on capabilities. My question is "given that people can and will do stupid things with relatively safe libraries like json, what is the point of providing something intermediate between json and pickle?" In more detail, what features can we provide that don't involve the known risks of pickle that would be sufficiently attractive to users that they don't go to pickle anyway?
I believe that the features I'm suggesting meet that criterion (but see below for discussion of risk). Nothing will ever be sufficient to drive away all unwarranted use of pickle, but I feel like these two features are really big ones that would go a long way towards making the secure thing almost as easy in almost every circumstance. As long as I've ever personally wanted, although I can't speak for others.
You mention handling cycles, which adds minimal risk (unprepared code could infloop on the unpacked data, but that's not the serializer's fault), but also "new" types which isn't clear to me. If you mean new built-in types, can't the json module be extended? (That would apply to cycles as well, since we know it's possible it should be automatable.)
It can. This brings up an interesting point. YAML already extends JSON with cycle support (via aliases) and support for a notation for marking up nonstandard types (via tagging). For example:

    >>> yaml.load('&mydict {"a": !!python/tuple ["b", *mydict]}')
    {'a': ('b', {...})}

PyYAML is useless security-wise, but if we're going to extend the json module, this would probably be the direction to go.
If you mean user-defined types, we're back where we started, with merely unpacking data running code whose provenance we don't know.
That actually isn't where we started. We started with a serialization format that includes such data as "c__builtin__\neval\n(c__builtin__\nraw_input\n(S'py> '\ntRtR." (try running pickle.loads on that in Python 2). What I had in mind from the start was something where only whitelisted constructors are used to reconstitute python values from the serialized code. Then we're moved from trusting the input, to trusting the competence of authors of our objects in modules that we imported. In cerealizer there is a global registry of classes that profess to handle input securely. Obviously, they might be wrong, and maybe a user of a serialization library would want to provide a much smaller whitelist. Maybe even the bigger whitelist should be disabled by default, if we really want to be careful, and there should be a security warning in the docs if you try to use the global registry. So for example, there's the following things:

    # nominally safe; module authors only register if they believe
    # their deserialization code is safe even with untrusted input
    my_unserializer.loads("...", whitelist=my_unserializer.PSEUDOSAFE_GLOBAL_REGISTRY)

    # nominally safe; if not, then a security bug in python
    my_unserializer.loads("...", whitelist=set())

-- Devin
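To make the whitelisted-constructor idea concrete, here is a rough sketch built on top of json rather than a new format. It is not cerealizer's API and not the my_unserializer module named above; the "__type__"/"value" tagging convention, the loads signature, and the use of a name-to-constructor mapping as the whitelist are all assumptions for illustration.

```python
import json
from fractions import Fraction


def loads(text, whitelist):
    """Decode JSON, rebuilding tagged objects only via whitelisted constructors.

    `whitelist` maps a tag name to a callable that receives the already
    decoded JSON value. An empty mapping accepts plain JSON types only.
    """
    def hook(obj):
        tag = obj.get("__type__")
        if tag is None:
            return obj  # an ordinary JSON object, left as a dict
        if tag not in whitelist:
            raise ValueError("type %r is not whitelisted" % (tag,))
        return whitelist[tag](obj["value"])

    return json.loads(text, object_hook=hook)


# Hypothetical registry: each entry validates its input by converting the
# pieces to the expected built-in types before constructing anything.
EXAMPLE_REGISTRY = {
    "complex": lambda v: complex(float(v[0]), float(v[1])),
    "Fraction": lambda v: Fraction(int(v[0]), int(v[1])),
}
```

With this shape, the serialized text can only select among constructors the caller listed; it cannot name an arbitrary callable the way a pickle GLOBAL opcode can. A real design would also need an escape for ordinary dicts that happen to contain a "__type__" key, which this sketch ignores.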

Devin Jeanpierre writes:
I believe that the features I'm suggesting meet that criterion (but see below for discussion of risk).
OK, thanks for the discussion. I need to get back to $DAYJOB (at 1am on a Saturday :-P ), but I'll chew over that for a bit before interjecting again. ;-)

From: Stephen J. Turnbull [mailto:stephen@xemacs.org] Sent: Thursday, February 21, 2013 1030 To: Steve Dower Cc: Andrew Barnert; Devin Jeanpierre; Dustin J. Mitchell; Eric V. Smith; Python-Ideas Subject: Re: [Python-ideas] Adding a safe alternative to pickle in the standard library
Steve Dower writes:
In a past life I used pickle regularly to snapshot long-running (evolutionary) algorithms that used user-provided classes
And how do you propose to prevent user-provided exploits, then?
Nobody wants to take away the power of pickle if it imposes only risks you're happy to bear. The question here is "is it possible to be *safer* than pickle without giving up any of the power?"
I was only aiming to provide perspective, rather than a proposal. The existing ability to customise the unpickler suited my needs perfectly, not that I actually used it. To make it safer I would have a restricted unpickler as the default (for load/loads) and force people to override it if they want to loosen restrictions. To be really explicit, I would make load/loads only work with built-in types. For compatibility when reading earlier protocols we could add a type representing a class instance and its members that doesn't actually construct anything. (Maybe we override __getattr__ to make it lazily construct the instance when the program actually wants to use it?) For convenience, I'd add a parameter to Unpickler to let the user provide a set of types that are allowed to be constructed (or a mapping from names to callables as used in find_class()). Finally, I'd give greater exposure to overriding find_class(), as has already been suggested. People who want to unpickle any arbitrary type have to do this to opt-in. (Maybe that's a little aggressive?) I'd expect this would come under a new protocol version, but I wouldn't be opposed to old data being unpickled under these rules, especially if the non-constructing type is actually a lazy-constructing type. Cheers, Steve
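A sketch of the "set of allowed types" parameter, using the existing find_class() hook that pickle.Unpickler already exposes. The class name RestrictedUnpickler, the `allowed` parameter, and the restricted_loads helper are hypothetical names for this example; only the find_class() override itself is existing pickle behaviour.

```python
import io
import pickle


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that resolves only explicitly allowed (module, name) pairs."""

    def __init__(self, file, allowed=()):
        super().__init__(file)
        self._allowed = frozenset(allowed)

    def find_class(self, module, name):
        # Called whenever the pickle stream asks for a global by name.
        if (module, name) in self._allowed:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            "global %s.%s is not allowed" % (module, name)
        )


def restricted_loads(data, allowed=()):
    """Like pickle.loads, but with an opt-in list of constructible globals."""
    return RestrictedUnpickler(io.BytesIO(data), allowed).load()
```

With the default empty `allowed`, containers and scalars that have their own pickle opcodes (int, str, list, dict, tuple and so on) still load, while anything that has to be looked up by name is refused. Whether this is actually sufficient against a hostile stream is exactly the auditing question raised elsewhere in this thread.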

Steve Dower writes:
I was only aiming to provide perspective, rather than a proposal.
Sure, but I didn't think there was a need for more "general" perspective. Pickle is a well-established protocol with certain risks and certain benefits, of which the python-dev community is basically well-aware. This is coming up *now* because recent events (Ruby/YAML has been mentioned) are causing some people to reevaluate the risks. This is a "quantitative" issue, and needs concrete proposals.
To be really explicit, I would make load/loads only work with built-in types. For compatibility when reading earlier protocols we could add a type representing a class instance and its members that doesn't actually construct anything. (Maybe we override __getattr__ to make it lazily construct the instance when the program actually wants to use it?)
I am not a security expert, but it seems to me that's going in the wrong direction. Unpickler would *still* run constructor code automatically under some circumstances -- but those circumstances become murkier.
For convenience, I'd add a parameter to Unpickler to let the user provide a set of types that are allowed to be constructed (or a mapping from names to callables as used in find_class()).
And this is secure, why? There's no way to decorate the allowed types to add nasty stuff to the pickled class definitions (including built-in types), right? There are no bugs that allow a back door, right? Is the API sufficiently well-designed that users will easily figure out how to do what they need, and *only* what they need, and therefore won't be tempted to simply turn on permission to do *everything*? And they won't give up, and write their own? Isn't it better just to give users the advice to use JSON where it will do? Perhaps the difference in APIs will give them pause to think again if they're starting to think about unpickling classes? Granted, I don't have answers to those questions (except for myself!) But I think some thought should be given to them before trying to create a restricted pickle protocol and make it default. Restricted modes/protocols/sublanguages are hard to get right.

From: Stephen J. Turnbull [mailto:stephen@xemacs.org] Steve Dower writes:
To be really explicit, I would make load/loads only work with built-in types. For compatibility when reading earlier protocols we could add a type representing a class instance and its members that doesn't actually construct anything. (Maybe we override __getattr__ to make it lazily construct the instance when the program actually wants to use it?)
I am not a security expert, but it seems to me that's going in the wrong direction. Unpickler would *still* run constructor code automatically under some circumstances -- but those circumstances become murkier.
Agreed on the bit in parentheses, that's probably questionable enough to ignore. However, if it only works with built-in types then there is no user code that will run. IIUC we already have a C implementation of pickle that is immune to users redefining builtins (if not, we should do this too). Pickled objects would be unpickled as (effectively) a tuple of the members - ('system', ("echo Hello World")) does not execute any code. (And yeah, wrap that tuple up in a type that can be tested.)
For convenience, I'd add a parameter to Unpickler to let the user provide a set of types that are allowed to be constructed (or a mapping from names to callables as used in find_class()).
And this is secure, why? There's no way to decorate the allowed types to add nasty stuff to the pickled class definitions (including built-in types), right?
Code is only pickled by name. Unpickler resolves the names and returns the class or function reference in the current environment. If it can't find the module or name in its current environment, it raises an error.
There are no bugs that allow a back door, right?
Of course not. That's why we never see security patches or updates for operating systems or platforms. This is a silly argument.
Is the API sufficiently well-designed that users will easily figure out how to do what they need, and *only* what they need, and therefore won't be tempted to simply turn on permission to do *everything*?
All we can ever do is provide instructions to keep the developer safe and make it clear that ignoring those rules will reduce the security of their program. It's up to the developer to make the right decisions.
And they won't give up, and write their own?
In my experience, people more often write their own out of ignorance rather than frustration (same for unnecessarily using XML). Or they'll switch to an earlier version of Python that doesn't have this change in it. Again, we can encourage, but not dictate.
Isn't it better just to give users the advice to use JSON where it will do? Perhaps the difference in APIs will give them pause to think again if they're starting to think about unpickling classes?
Maybe, though since pickle is literally the Python equivalent (should it have been called Python Object Notation (PON)? Probably not...) we should be ensuring that it is the best it can be.
Granted, I don't have answers to those questions (except for myself!) But I think some thought should be given to them before trying to create a restricted pickle protocol and make it default. Restricted modes/protocols/sublanguages are hard to get right.
Agreed. I don't think we need a new protocol though, just a less permissive default implementation of Unpickler.find_class().
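The other idea in this message, unpickling objects as inert records instead of constructing them, can also be approximated with a find_class() override. The sketch below is an assumption-laden illustration, not an audited design: Unconstructed and InertUnpickler are made-up names, and every global in the stream (including harmless ones) is replaced by a placeholder class.

```python
import io
import pickle


class Unconstructed:
    """Placeholder standing in for an object that was never really built."""

    origin = None  # "module.name" of the global the pickle asked for

    def __init__(self, *args):
        # Arguments the pickle wanted to pass to the real callable.
        self.args = args


class InertUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Never import or return the real global. Hand back a fresh
        # placeholder class so that "calling" it only records arguments.
        return type(name, (Unconstructed,), {"origin": module + "." + name})


def inert_loads(data):
    """Unpickle data without importing or calling any named global."""
    return InertUnpickler(io.BytesIO(data)).load()
```

A stream that asks for os.system("echo Hello World") then yields an Unconstructed instance whose `origin` ends in ".system" and whose `args` is ("echo Hello World",), and no command runs. Instance state delivered through the BUILD opcode ends up in the placeholder's `__dict__`, so the members can still be inspected.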

Steve Dower writes:
There are no bugs that allow a back door, right?
Of course not. That's why we never see security patches or updates for operating systems or platforms. This is a silly argument.
"Of course not." The right answer is that "it's been audited and we're as sure as we ever are." The problem is that (as Devin Jeanpierre wrote IIRC) pickle was not designed for security from the ground up. Removing execution of untrusted code from it may not be as easy as just saying so. Maybe it is.
Is the API sufficiently well-designed that users will easily figure out how to do what they need, and *only* what they need, and therefore won't be tempted to simply turn on permission to do *everything*?
All we can ever do is provide instructions to keep the developer safe and make it clear that ignoring those rules will reduce the security of their program. It's up to the developer to make the right decisions.
But that's not enough. What others have said here is that json doesn't get used when it would be perfectly suitable (and as secure as serialization gets!) because the API doesn't provide convenient access to the features they need.
Agreed. I don't think we need a new protocol though, just a less permissive default implementation of Unpickler.find_class().
The proof of the pudding is in the auditing, I guess.

On Feb 21, 2013, at 9:55, Steve Dower <Steve.Dower@microsoft.com> wrote:
From: Andrew Barnert How often have you needed either cyclic references or the ability to dynamically store arbitrary classes in something like a cookie or a cache file?
In a past life I used pickle regularly to snapshot long-running (evolutionary) algorithms that used user-provided classes and all sorts of highly improper circular references. And there are plenty of researchers out there using Python for much crazier things than I ever did.
There is a lot more to Python than web apps...
But you're not storing those pickles in a cookie, which is exactly my point. There are many cases where you need the power of pickle. There are also many cases where you need safe serialization. But there's not much overlap. There are plenty of cases where you need safety, and also need a little more power than JSON--but you still don't usually need the full power of pickle for those cases, and making it easier to extend the json lib is a much cleaner way forward than making it easier to restrict pickle. It's true that "not much overlap" != "no overlap". But you can't cover everything. If you're building, say, an online interactive python interpreter that saves and restores its state between sessions, you're going to have to think through the security implications. That doesn't mean someone who wants to just store scientific data and doesn't have untrusted sources, or someone building a web app who doesn't care about storing arbitrary dynamically defined types, should have the same burden.

On Thu, Feb 21, 2013 at 3:01 AM, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
I've been noticing a lot of security-related issues being discussed in the Python world since the Ruby YAML problem came out. Is it time to consider adding an alternative to pickle that is safe(r) by default?
Pickle is usable in situations few other things are, because it can handle cyclic references and virtually any python object. The only stdlib alternative I'm aware of is json, which can do neither of those things. (Or at least, not without significant extra serialization code.) I would imagine that any alternative supplied should be easy enough to use that pickle users would seriously consider switching, and include at least those features.
Pickle is unsafe if you give it untrusted input. It's safe if you pickle something yourself and then unpickle it. If the problem is that you want to pickle something and store it in some unsafe place (like a cookie or a db under user control) and then read it back in later and unpickle it, then you can mitigate the risk by using an HMAC or some other mechanism to prevent tampering and may want to consider encrypting it too.

That said, there is one risk in pickling something yourself and unpickling it later that you need to watch out for. If your objects change, then unpickling might produce unexpected and even potentially unsafe results. You can mitigate this by adding object versions to your objects (as long as you don't forget to update that when the object changes).

There's another problem - pickling is not guaranteed to work across Python versions. So you may find yourself having to read pickles that are no longer readable in a future python version. Not a problem for cookies, but a potential headache with long-lived pickles.

All of this leads me to suggest using a better format for this problem. Json is a reasonable choice (I've used it myself) although I would still use an HMAC. If you encrypt it then that makes attacking the object that much harder. I'd advise against using your own format. I wrote a tutorial on hacking web sites called Gruyere <http://j.mp/gruyere-security>. I suggest reading the section on cookies http://j.mp/learn-state-manipulation (although to be honest, I recommend reading the whole thing :-)

Aside from security, using a format like json encourages you to think about what belongs in the persisted object and what doesn't. Suppose your object includes a url. If you pickle it, you may end up persisting the parsed url with a dictionary of parameters and other unnecessary overhead. When you convert to json, you're going to just copy the url.

On Thu, Feb 21, 2013 at 7:50 AM, Dustin J. Mitchell <dustin@v.igoro.us> wrote:
This conversation worries me. The security community has shown that safety isn't something you can add to a powerful tool. With great power comes great expressivity, and correspondingly more difficulty reasoning about it. Not to mention reasoning about the implementation. JSON is probably secure against code-execution exploits, but only probably.
When you put something in the stdlib and call it "safe", even with caveats, people will make even more brazen mistakes than with a documented-unsafe tool like pickle.
Yes indeed. --- Bruce Latest blog post: Alice's Puzzle Page http://www.vroospeak.com

On 22/02/13 10:39, Bruce Leban wrote:
There's another problem - pickling is not guaranteed to work across Python versions.
I think you are confusing pickle with marshal. Obviously you cannot guarantee unpickling in older versions of Python, e.g. I can't unpickle a Python 2.7 set in Python 2.1, since sets didn't exist back then. Or a pickle created using version N of the protocol requires a Python version recent enough to understand version N, e.g. protocol 3 exists only in Python 3.x. But going forward should not be a problem, and the pickle documentation promises backwards-compatibility. Actually I believe that *forward* compatibility is a better description, but either way, the compatibility is guaranteed: Python version X will be able to unpickle anything pickled by an older version of Python; furthermore, anything pickled by version X will be able to be unpickled by future versions. (Modulo removal of the pickled classes, etc.) -- Steven
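A quick way to see the protocol point in practice: pickle.loads detects the protocol from the data itself, so a current interpreter reads pickles written with any older protocol number. The helper name below is made up for this example, and it only demonstrates reading across protocol numbers within one interpreter, not across interpreter versions.

```python
import pickle


def protocols_that_roundtrip(obj):
    """Return the protocol numbers under which obj survives dumps/loads."""
    ok = []
    for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
        blob = pickle.dumps(obj, protocol=protocol)
        # No protocol argument on load: it is read from the stream.
        if pickle.loads(blob) == obj:
            ok.append(protocol)
    return ok


sample = {"numbers": {1, 2, 3}, "nested": [("a", 1.5), None]}
print(protocols_that_roundtrip(sample))
```

The caveat in the message still applies: this says nothing about classes that have been renamed or removed between the time of pickling and the time of unpickling.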

On Thu, Feb 21, 2013 at 10:55 PM, Steven D'Aprano <steve@pearwood.info> wrote:
On 22/02/13 10:39, Bruce Leban wrote:
There's another problem - pickling is not guaranteed to work across Python versions.
I think you are confusing pickle with marshal.
My bad. Should have rtfm. Thanks. --- Bruce Follow me: http://www.twitter.com/Vroo http://www.vroospeak.com

Devin Jeanpierre <jeanpierreda@...> writes:
Pickle is usable in situations few other things are, because it can handle cyclic references and virtually any python object. The only stdlib alternative I'm aware of is json, which can do neither of those things. (Or at least, not without significant extra serialization code.) I would imagine that any alternative supplied should be easy enough to use that pickle users would seriously consider switching, and include at least those features.
What I'm about to mention may or may not meet your use case, but there is a way in which JSON can support cyclic references and virtually any Python object. Not without *some* work and *some* limitations, obviously, but then, TANSTAAFL :-) In logging.dictConfig, we use a dict to configure logging, but the underlying mechanism is more general. It allows you to construct objects outside the dict using a syntax such as 'ext://sys.stderr', and objects inside the dict using syntax such as 'cfg://xyz.abc[1]'. To illustrate, consider the following JSON:

    {
        "v1": "ext://sys.stderr",
        "v2": "cfg://v3.some_list[1]",
        "v3": {
            "some_list": [1, 2, 3],
            "some_value": "cfg://v4.key"
        },
        "v4": {
            "key": "value",
            "some_value": "cfg://v3.some_value"
        }
    }

You can see that sys.stderr is the external object, and that there are various cyclic references. Now, using Python 2.7 (where dictConfig was introduced), you can do this:

    >>> import json
    >>> with open("example.json") as f:
    ...     d = json.load(f)
    ...
    >>> d
    {u'v1': u'ext://sys.stderr', u'v2': u'cfg://v3.some_list[1]', u'v3': {u'some_list': [1, 2, 3], u'some_value': u'cfg://v4.key'}, u'v4': {u'some_value': u'cfg://v3.some_value', u'key': u'value'}}
    >>> import logging.config
    >>> bc = logging.config.BaseConfigurator(d)
    >>> bc.config['v1']
    <open file '<stderr>', mode 'w' at 0x004750D0>
    >>> bc.config['v2']
    2
    >>> bc.config['v3']
    {u'some_list': [1, 2, 3], u'some_value': u'cfg://v4.key'}
    >>> bc.config['v3']['some_value']
    u'value'
    >>> bc.config['v4']['some_value']
    u'value'
The values are lazily evaluated (as you can see from the above example for bc.config['v3']), but that isn't necessarily a problem. Of course, one can build on this using keys to define classes, kwargs, etc., which is how logging's DictConfigurator (derived from BaseConfigurator) works. The handlers for ext://, cfg:// are customisable, and additional schemes can be added; it allows for fairly flexible configuration, but within bounds you can set. I probably should write a post about the BaseConfigurator, it is useful outside the context of logging and I have used it for other things, but it's hidden away under the covers a bit. Regards, Vinay Sajip
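As a sketch of how an additional scheme might be added, the subclass below registers a made-up env:// prefix that reads an environment variable. It leans on the value_converters mapping and the "<prefix>_convert" method naming used inside CPython's logging.config, which are implementation details rather than a documented extension point; EnvConfigurator, env_convert and DEMO_LOG_DIR are hypothetical names for this example (written for Python 3).

```python
import logging.config
import os


class EnvConfigurator(logging.config.BaseConfigurator):
    # Copy the stock ext:// and cfg:// handlers and register env:// too.
    value_converters = dict(
        logging.config.BaseConfigurator.value_converters,
        env="env_convert",
    )

    def env_convert(self, value):
        """Resolve env://NAME to the value of environment variable NAME."""
        return os.environ[value]


os.environ["DEMO_LOG_DIR"] = "/tmp/logs"
bc = EnvConfigurator({
    "log_dir": "env://DEMO_LOG_DIR",
    "stream": "ext://sys.stderr",
    "same_dir": "cfg://log_dir",
})
print(bc.config["log_dir"])   # looked up lazily, on access
print(bc.config["same_dir"])  # cfg:// reference to the entry above
```

The bound on what the data can reach is whatever the registered converters allow. Note that ext:// will import and resolve any dotted name, so for untrusted input one would presumably want to drop or narrow that converter.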
participants (15)
- Andrew Barnert
- Antoine Pitrou
- Bruce Leban
- Chris Rebert
- Devin Jeanpierre
- Dustin J. Mitchell
- Eric V. Smith
- Joao S. O. Bueno
- Mark Hackett
- Masklinn
- Ned Batchelder
- Stephen J. Turnbull
- Steve Dower
- Steven D'Aprano
- Vinay Sajip