On Wed, Jul 15, 2020 at 11:00 AM Steven D'Aprano email@example.com wrote:
On Wed, Jul 15, 2020 at 09:55:03AM +1000, Chris Angelico wrote:
At that point, you are NOT running it with the "exact same access permissions", are you? :)
Indeed, and I did acknowledge that you were probably thinking about a different scenario. But I was challenging your assertion that anyone who can write a malicious pickle could just as easily inject malicious code into my source code. That's not always correct.
It's correct far more often than you might think. There's a LOT of code out there where the Python source code has the exact same external access permissions as its config files - often because there's no access to either.
But a large amount of code is indeed run with the same access permissions as its temporary files (which may be incredibly restrictive or incredibly generous, either way).
Again, this is true. But we don't counter risks by pointing at the times that it's not a risk:
"Seat belts in cars? Ludicrous, most of the time the car is sitting still, not even moving, with nobody inside it! Why does it need seat belts?"
And if it's not moving, you don't have to wear them. I see this as a perfect parallel. When you are in a risky situation, you take care. When you have other reasons for not worrying about the risk (a seatbelt won't save you from a meteor strike), you don't need to.
But if I'm distributing my code to others, the responsible thing to do is to think about the potential security risks of using pickle in my app, or library. What if they use it in ways that I didn't foresee, ways which *ought to be* safe except for my choice to use pickle?
I'm not demanding that developers be omniscient, but I do think that they should not willfully ignore known security risks.
"All care, no responsibility" is only meaningful if we do actually take care.
So if you're distributing your code, then maybe you don't use pickle.
And that's why we have JSON and various others,
How do I use JSON to serialise an arbitrary instance of some class?
Instances are just data. (Well, usually.) I should be able to serialise instances (well, most of them) and safely read them back again. Of course the gap between *should* and *can* is quite large, and Python really doesn't make it easy. I'm not saying this is an easy problem to solve.
One very VERY good option is to keep your code and data separate. In a lot of my projects, I do this very consciously and deliberately, ensuring that all my persistent data is JSON-safe. It makes things a lot easier to reason about when you don't have to concern yourself with refactoring breaking your saved data, which can certainly happen with pickle.
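To make that concrete, here is a minimal sketch of what keeping persistent data JSON-safe can look like. The Player class, its fields, and the to_data/from_data names are made up for illustration; the point is only that what gets saved is plain dicts, strings and numbers, and the class decides how to rebuild itself from them.

```python
import json
from dataclasses import dataclass


@dataclass
class Player:
    # Hypothetical example class; only plain, JSON-safe fields are persisted.
    name: str
    score: int

    def to_data(self):
        # Export plain data only: no class names, no code references.
        return {"name": self.name, "score": self.score}

    @classmethod
    def from_data(cls, data):
        # Validate and coerce on the way in; the saved file is just data.
        return cls(name=str(data["name"]), score=int(data["score"]))


saved = json.dumps(Player("alice", 42).to_data())
restored = Player.from_data(json.loads(saved))
```

Renaming or moving the class does not invalidate the saved text, because nothing in it refers to the class by name.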
which are not pickle and are not vulnerable the way that pickle is. I don't think we need a "safe pickle".
So they're vulnerable in other ways? :-)
Well, sure. Go ahead and point out JSON's vulnerabilities. :)
What we need is to not use pickle when it's not the right tool.
How do I know when it's not the right tool?
How do I know which other serialisation format is right?
What about those -- and they are a significant minority -- who are restricted to only what's in the stdlib?
Very good questions, and those are part of why we have multiple options.
If you're restricted to the stdlib and distributing your code, I would generally recommend defaulting to JSON, because it's a well-known format that anyone can parse. If you need more functionality than JSON offers, but you're still restricted to the stdlib, you'll probably end up having to roll your own JSONEncoder subclass that handles what you need, or doing what I say above and keeping data separate from code. Neither is terribly difficult, but either way, you have to think slightly differently about what gets persisted. That's not a 100% ideal situation by any means, but it isn't as terrible as you might think.
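For the "roll your own JSONEncoder subclass" route, something along these lines is usually enough. This is a sketch, not a recommendation of specific types: the ExtendedEncoder name and the choice to handle dates, Decimals and sets are my assumptions for the example.

```python
import json
from datetime import date
from decimal import Decimal


class ExtendedEncoder(json.JSONEncoder):
    """Hypothetical encoder handling a few extra types (stdlib only)."""

    def default(self, o):
        # default() is only called for objects json cannot encode natively.
        if isinstance(o, date):  # also covers datetime, a subclass of date
            return o.isoformat()
        if isinstance(o, Decimal):
            return str(o)
        if isinstance(o, (set, frozenset)):
            return sorted(o)
        # Anything else still raises TypeError, as with the base class.
        return super().default(o)


record = {"when": date(2020, 7, 15), "tags": {"b", "a"}, "price": Decimal("1.50")}
text = json.dumps(record, cls=ExtendedEncoder, sort_keys=True)
```

Note that this is one-way: reading the text back gives you strings and lists, so the loading side has to know which fields to convert. That is the "think slightly differently about what gets persisted" part.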
I'm highly sympathetic to the requests for "JSON but able to encode more types", but not so sympathetic to "pickle but magically able to be safe".
Okay, let's say that somebody else did the work. Some awfully clever chappy found a way to add a magical "pickle.safeload()" function that did everything needed, safely. Would you oppose it?
(The old unsafe one would presumably have to remain for backwards compatibility, or for the cases which are inherently unsafe.)
I would ask them which laws of physics they violated, since pickle inherently has to be able to execute arbitrary code in order to be able to do everything it needs to.
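A deliberately harmless illustration of what I mean, using only documented pickle behaviour: an object's __reduce__ can name any importable callable and its arguments, and the unpickler will call it. The Sneaky class is made up for the example, and eval on a constant expression stands in for whatever an attacker would actually run.

```python
import pickle


class Sneaky:
    # Hypothetical class: __reduce__ tells pickle "to rebuild me,
    # call this callable with these arguments".
    def __reduce__(self):
        return (eval, ("6 * 7",))


payload = pickle.dumps(Sneaky())

# Loading the bytes calls eval("6 * 7"); no Sneaky instance comes back.
result = pickle.loads(payload)
```

The loader never needs the Sneaky class to exist. The payload only refers to builtins.eval, so the same bytes would run in any Python process that unpickles them.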
If someone claims they've created a way to allow untrusted users to insert code into your Python programs and have it execute, but they've made it safe, would you oppose its inclusion in the stdlib? How much security hardening would it take before you can confidently say that it really is safe?
If not, then it seems to me you don't really care about this issue and could sit this one out :-)
If you do *actively oppose* adding a safe version of pickle, perhaps you should explain why.
I actively oppose it because it isn't possible. Anything that is safe will not have all of pickle's functionality. A nerfed version of pickle that can only unpickle a tiny handful of core data types is no better than other options that already exist. The entire point of pickling arbitrary objects is that you can unpickle arbitrary objects. That's inherently unsafe if there is any possibility that the pickle file came from an untrusted user, and I do indeed oppose plans to try to make pickle what it isn't.
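For reference, the "nerfed" version already exists as a documented pattern: subclass pickle.Unpickler and override find_class with an allow-list. The sketch below follows that pattern; the particular allow-list and the restricted_loads helper name are choices made for this example, and the Sneaky class is again hypothetical.

```python
import builtins
import io
import pickle

# Assumed allow-list for the example: a handful of harmless builtins.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks for a global by name.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Sneaky:
    def __reduce__(self):
        return (eval, ("6 * 7",))


# Core data types still round-trip.
plain = restricted_loads(pickle.dumps([1, "a", range(3)]))

# Anything that needs a global outside the allow-list is refused.
try:
    restricted_loads(pickle.dumps(Sneaky()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

It works, but look at what is left: lists, strings, numbers and a few builtins. Every class you add to the allow-list is one more thing you have to audit, which is exactly why this ends up no better than the formats that were designed to carry only data.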
You want "JSON but with a tagging system so it can unpickle dates and times"? No problem. You want "an encoder that can save int/str/list/dict and dataclasses"? There'd be no end of bikeshedding on how it handles mismatched classes, but that seems pretty doable. You want "pickle but magically able to know what's safe and what's not"? No.
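The first of those really is no problem with what the stdlib has today. Here is a rough sketch using json's default= and object_hook= parameters; the "__type__" tag key and the encode_tagged/decode_tagged names are conventions invented for this example, not any standard.

```python
import json
from datetime import date, datetime

TAG = "__type__"  # assumed tag key; any reserved name would do


def encode_tagged(o):
    # Passed as default=, so only called for types json cannot handle itself.
    if isinstance(o, datetime):  # check datetime first: it subclasses date
        return {TAG: "datetime", "value": o.isoformat()}
    if isinstance(o, date):
        return {TAG: "date", "value": o.isoformat()}
    raise TypeError(f"cannot serialise {type(o).__name__}")


# The decoder can only ever call what is listed here, never arbitrary code.
DECODERS = {"datetime": datetime.fromisoformat, "date": date.fromisoformat}


def decode_tagged(d):
    # Passed as object_hook=, so called for every decoded JSON object.
    tag = d.get(TAG)
    if tag is None:
        return d
    if tag not in DECODERS:
        raise ValueError(f"unknown type tag: {tag!r}")
    return DECODERS[tag](d["value"])


original = {"created": datetime(2020, 7, 15, 11, 0), "due": date(2020, 7, 16)}
text = json.dumps(original, default=encode_tagged)
restored = json.loads(text, object_hook=decode_tagged)
```

The difference from pickle is that the file can only select from a fixed table of decoders that the reading program chose in advance. An unknown tag is an error, not an import.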