[Python-ideas] Re: Pickle security improvements

14 Jul 2020

On Wed, Jul 15, 2020 at 11:00 AM Steven D'Aprano <steve@pearwood.info> wrote:
...
On Wed, Jul 15, 2020 at 09:55:03AM +1000, Chris Angelico wrote:
...
At that point, you are NOT running it with the "exact same access
permissions", are you? :)
Indeed, and I did acknowledge that you were probably thinking about a
different scenario. But I was challenging your assertion that anyone who
can write a malicious pickle could just as easily inject malicious code
into my source code. That's not always correct.
It's correct far more often than you might think. There's a LOT of
code out there where the Python source code has the exact same
external access permissions as its config files - often because
there's no access to either.
...
...
But a large amount of code is indeed run
with the same access permissions as its temporary files (which may be
incredibly restrictive or incredibly generous, either way).
Again, this is true. But we don't counter risks by pointing at the times
that it's not a risk:
"Seat belts in cars? Ludicrous, most of the time the car is sitting
still, not even moving, with nobody inside it! Why does it need seat
belts?"
And if it's not moving, you don't have to wear them. I see this as a
perfect parallel. When you are in a risky situation, you take care.
When you have other reasons for not worrying about the risk (a
seatbelt won't save you from meteor strike), you don't need to.
...
But if I'm distributing my code to others, the responsible thing to do
is to think about the potential security risks of using pickle in my
app, or library. What if they use it in ways that I didn't foresee, ways
which *ought to be* safe except for my choice to use pickle?
I'm not demanding that developers be omniscient, but I do think that
they should not willfully ignore known security risks.
"All care, no responsibility" is only meaningful if we do actually take
care.
So if you're distributing your code, then maybe you don't use pickle.
...
...
And that's why we have JSON and various others,
How do I use JSON to serialise an arbitrary instance of some class?
Instances are just data. (Well, usually.) I should be able to serialise
instances (well, most of them) and safely read them back again. Of
course the gap between *should* and *can* is quite large, and Python
really doesn't make it easy. I'm not saying this is an easy problem to
solve.
One very VERY good option is to keep your code and data separate. In a
lot of my projects, I do this very consciously and deliberately,
ensuring that all my persistent data is JSON-safe. It makes things a
lot easier to reason about when you don't have to concern yourself
with refactoring breaking your saved data, which can certainly happen
with pickle.
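
As a rough illustration of that separation (a minimal sketch only; the
Player class and its to_data/from_data methods are hypothetical names
invented for this example), the object exports plain builtins and is
rebuilt through its normal constructor:

```python
import json


class Player:
    """Hypothetical application class; only plain data gets persisted."""

    def __init__(self, name, score=0, inventory=None):
        self.name = name
        self.score = score
        self.inventory = list(inventory or [])

    def to_data(self):
        # Export only JSON-safe builtins (str, int, list), never the object itself.
        return {"name": self.name, "score": self.score, "inventory": self.inventory}

    @classmethod
    def from_data(cls, data):
        # Rebuild through the normal constructor, so renaming or moving the
        # class does not invalidate files that were saved earlier.
        return cls(data["name"], data.get("score", 0), data.get("inventory"))


saved = json.dumps(Player("alice", 42, ["sword"]).to_data())
restored = Player.from_data(json.loads(saved))
```

The saved text names no class or module, so the file format does not
change when the code around it is refactored.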
...
...
which are not pickle
and are not vulnerable the way that pickle is. I don't think we need a
"safe pickle".
So they're vulnerable in other ways? :-)
Well, sure. Go ahead and point out JSON's vulnerabilities. :)
...
...
What we need is to not use pickle when it's not the
right tool.
How do I know when it's not the right tool?
How do I know which other serialisation format is right?
What about those -- and they are a significant minority -- who are
restricted to only what's in the stdlib?
Very good questions, and those are part of why we have multiple options.

If you're restricted to the stdlib and distributing your code, I would
generally recommend defaulting to JSON, because it's a well-known
format that anyone can parse. If you need more functionality than JSON
offers, but you're still restricted to the stdlib, you'll probably end
up having to roll your own JSONEncoder subclass that handles what you
need, or doing what I say above and keeping data separate from code.
Neither is terribly difficult, but either way, you have to think
slightly differently about what gets persisted. That's not a 100%
ideal situation by any means, but it isn't as terrible as you might
think.
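
A stdlib-only JSONEncoder subclass might look something like the
following. This is a sketch under assumptions: ExtendedEncoder and
Release are made-up names, and the choice of extra types (dates, sets,
dataclasses) is only one possibility.

```python
import dataclasses
import json
from datetime import date


class ExtendedEncoder(json.JSONEncoder):
    """Hypothetical encoder: adds dates, sets and dataclasses to plain JSON."""

    def default(self, o):
        # default() is only called for objects json cannot already encode.
        if isinstance(o, date):
            return o.isoformat()
        if isinstance(o, (set, frozenset)):
            return sorted(o)
        if dataclasses.is_dataclass(o) and not isinstance(o, type):
            return dataclasses.asdict(o)
        # Anything else still raises TypeError, as in the base class.
        return super().default(o)


@dataclasses.dataclass
class Release:
    version: str
    published: date
    tags: set


release = Release("3.9", date(2020, 10, 5), {"stable", "cpython"})
text = json.dumps(release, cls=ExtendedEncoder)
```

Note that this direction is one-way: decoding the result gives back
plain dicts, strings and lists, so the reader has to decide what to
rebuild from them.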
...
...
I'm highly sympathetic to the requests for "JSON but able to encode
more types", but not so sympathetic to "pickle but magically able to
be safe".
Okay, let's say that somebody else did the work. Some awfully clever
chappy found a way to add a magical "pickle.safeload()" function that
did everything needed, safely. Would you oppose it?
(The old unsafe one would presumably have to remain for backwards
compatibility, or for the cases which are inherently unsafe.)
I would ask them which laws of physics they violated, since pickle
inherently has to be able to execute arbitrary code in order to be
able to do everything it needs to.
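
To make that concrete, here is a deliberately harmless sketch of the
mechanism (Payload is a hypothetical class, and eval of a bit of
arithmetic stands in for whatever an attacker would actually choose).
The __reduce__ hook tells pickle which callable to invoke at load time:

```python
import pickle


class Payload:
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        # Any importable callable with any arguments could go here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading runs the callable named in the byte stream. The reader never
# gets a Payload back, only whatever that call returned.
result = pickle.loads(blob)
```

Nothing on the loading side asked for eval to be called; the byte
stream alone decided that.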

If someone claims they've created a way to allow untrusted users to
insert code into your Python programs and have it execute, but they've
made it safe, would you oppose its inclusion in the stdlib? How much
security hardening would it take before you can confidently say that
it really is safe?
...
If not, then it seems to me you don't really care about this issue and
could sit out of it :-)
If you do *actively oppose* adding a safe version of pickle, perhaps you
should explain why.
I actively oppose it because it isn't possible. Anything that is safe
will not have all of pickle's functionality. A nerfed version of
pickle that can only unpickle a tiny handful of core data types is no
better than other options that already exist. The entire point of
pickling arbitrary objects is that you can unpickle arbitrary objects.
That's inherently unsafe if there is any possibility that the pickle
file came from an untrusted user, and I do indeed oppose plans to try
to make pickle what it isn't.
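
For reference, the stdlib does let you restrict which globals an
unpickler may look up, by overriding Unpickler.find_class. The sketch
below assumes an arbitrary whitelist (SAFE_BUILTINS and restricted_loads
are names picked for this example) and shows how quickly the result
stops being able to load ordinary objects:

```python
import builtins
import io
import pickle
from datetime import date

# Assumed whitelist for this sketch; a real one needs careful review.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks for a global by name.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads(), but refuses globals outside the whitelist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Core containers and scalars need no globals, so they load fine.
plain = restricted_loads(pickle.dumps([1, "two", {"three": 3.0}]))

# A date is an ordinary stdlib object, but loading it needs datetime.date.
try:
    restricted_loads(pickle.dumps(date(2020, 7, 15)))
    date_loaded = True
except pickle.UnpicklingError:
    date_loaded = False
```

Every class added to the whitelist widens what a hostile file can make
the program call, which is the trade-off described above.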

You want "JSON but with a tagging system so it can unpickle dates and
times"? No problem. You want "an encoder that can save
int/str/list/dict and dataclasses"? There'd be no end of bikeshedding
on how it handles mismatched classes, but that seems pretty doable.
You want "pickle but magically able to know what's safe and what's
not"? No.
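
The first of those could be sketched with nothing but the json module.
The tag name and the helper functions below are assumptions made up for
this example, not an existing convention:

```python
import json
from datetime import datetime

TAG = "__datetime__"  # hypothetical tag name, chosen for this sketch


def encode_tagged(obj):
    # Passed as default=, so it only sees objects json cannot encode itself.
    if isinstance(obj, datetime):
        return {TAG: obj.isoformat()}
    raise TypeError(f"Cannot serialise {type(obj).__name__}")


def decode_tagged(mapping):
    # Passed as object_hook=, so it sees every decoded JSON object.
    if set(mapping) == {TAG}:
        return datetime.fromisoformat(mapping[TAG])
    return mapping


original = {"event": "post", "when": datetime(2020, 7, 15, 11, 0)}
text = json.dumps(original, default=encode_tagged)
restored = json.loads(text, object_hook=decode_tagged)
```

The decoder can only ever build a datetime from a string; the file has
no way to name some other callable, which is the difference from
pickle.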

ChrisA