[Distutils] Surviving a Compromise of PyPI - PEP 458 and 480

Nick Coghlan ncoghlan at gmail.com
Fri Jan 2 06:57:22 CET 2015

On 1 January 2015 at 05:51, Donald Stufft <donald at stufft.io> wrote:

> So here is my problem. I’m completely on board with the developer signing
> for the distribution files. I think that makes total sense. However I worry
> that requiring the developer to sign for what is essentially the
> “installer” API (aka how pip discovers things to install) is going to put
> us in a situation where we cannot evolve the API easily. If we modified
> this PEP so that an online key signed for /simple/ what security properties
> would we lose?
> It *appears* to me that the problem then would be that a compromise of
> PyPI can present whatever information they want to pip as to what is
> available for pip to download and install. This would mean freeze attacks,
> mix and match attacks. It would also mean that they could, in a future
> world where pip can use metadata on PyPI to do dependency resolution, tell
> pip that it needs to download a valid but malicious project as a dependency
> of a popular project like virtualenv.
> However I don’t think they’d be able to actually cause pip to install a
> malicious copy of a good project and I believe that we can protect against
> an attacker who possesses that key from tricking pip into installing a
> malicious but valid project as a fake dependency by having pip only use the
> theoretical future PyPI metadata that lists dependencies as an optimization
> hint for what it should download and then once it’s actually downloaded a
> project like virtualenv (which has been validated to be from the real
> author) peek inside that file and ensure that the metadata inside that
> matches what PyPI told pip.
> Is my assessment correct? Is keeping the “API” under control of PyPI a
> reasonable thing to do while keeping the actual distribution files
> themselves under control of the distribution authors? The reason this
> worries me is that, unlike a Linux distribution or an application like
> Firefox, we don’t have much of a relationship with the people who are
> uploading things to PyPI. So if we need to evolve the API we are not going
> to be able to compel our authors to go back and re-generate new signed
> metadata.

I think this is a good entry point for an idea I've had kicking around in
my brain for the past couple of days: what if we change the end goal of PEP
480 slightly, from "prevent attackers from compromising published PyPI
metadata" to "allow developers & administrators to rapidly detect and
recover from compromised PyPI metadata"?

My reasoning is that when it comes to PyPI security, there are actually two
major dials we can twiddle:

* raising the cost of an attack (e.g. making compromise harder by
distributing signing authority to developers)
* reducing the benefit of an attack (e.g. making the expected duration, and
hence reach, of a compromise lower, or downgrading an artifact substitution
attack to a denial of service attack)

To raise the cost of a compromise through distributed signing authority, we
have to solve the trust management problem - getting developer keys out to
end users in a way that doesn't involve trusting the central PyPI service.
That's actually a really difficult problem to solve, which is why we have
situations like TLS still relying on the CA system, despite the known
problems with the latter.

However, the second objective (reducing the benefit of an attack) is
potentially more tractable: we wouldn't need to distribute trust
management out to arbitrary end users, we'd "just" need a federated group
of entities that are in a position to detect that PyPI has potentially
been compromised, and request a service shutdown until such time as the
compromise has been investigated and resolved.

This notion isn't fully evolved yet (that's why this email is so long), but
it feels like a far more viable direction to me than the idea of pushing
the enhanced security management problem back on to end users.

Suppose, for example, there were additional independently managed
validation services hosting TUF metadata for various subsets of PyPI. The
enhanced security model would then involve developers opting in to
uploading their package metadata to one or more of the validation servers,
rather than just to the main PyPI server. pip itself wouldn't worry about
checking the validation services - it would just check against the main
server as it does today, so we wouldn't need to worry about how we get the
root keys for the validation servers out to arbitrary client end points.

That is, rather than "sign your own packages", the enhanced security model
becomes "get multiple entities to sign your packages, so compromise of any
one entity (including PyPI itself) can be detected and investigated".

The *validation* services would then be responsible for checking that their
own registered metadata matched the metadata being published on PyPI. If
they detect a discrepancy between their own metadata and PyPI's, then we'd
have a human-in-the-loop process for reporting the problem, and the most
likely response would be to disable PyPI downloads while the situation was
being investigated.
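
As a rough illustration of that check, a validator's comparison step might
look something like the sketch below. This is only a sketch: the function
names and the plain-dict metadata layout are made up for the example, and a
real validation service would verify TUF signatures rather than just
compare hashes.

```python
import hashlib
import json


def digest(metadata):
    """Hash a canonical JSON encoding so key order does not matter."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_discrepancies(registered, published):
    """Return the registered package names whose published metadata
    is missing or differs from what the validator has on record."""
    mismatched = []
    for name, expected in sorted(registered.items()):
        actual = published.get(name)
        if actual is None or digest(actual) != digest(expected):
            mismatched.append(name)
    return mismatched


# Hypothetical data: what the validator recorded vs. what PyPI serves.
registered = {"demo": {"version": "1.0", "sha256": "aaa"}}
published = {"demo": {"version": "1.0", "sha256": "bbb"}}
print(find_discrepancies(registered, published))  # ['demo']
```

Anything in the returned list would go to the human-in-the-loop reporting
process rather than triggering an automatic response.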

I believe something like that would change the threat landscape in a
positive way, and has three very attractive features over distributed
signing authority:

* It's completely transparent at the point of installation - it transforms
PEP 480 into a back end data integrity validation project, rather than
something that affects the end user experience of the PyPI ecosystem. The
changes to the installation experience would be completely covered by PEP
* Uploading metadata to additional servers for signing is relatively low
impact on developers (if they have an automated release process, it's
likely just another line in a script somewhere), significantly lowering
barriers to adoption relative to asking developers to sign their own
packages
* Folks that decide to run or use a validation server are likely going to
be more closely engaged with the PyPI community, and hence easier to reach
as the metadata requirements evolve

In terms of how I believe such a change would mitigate the threat of a PyPI
compromise:

* it provides a cryptographically validated way to detect a compromise of
any packages registered with one or more validation services, significantly
reducing the likelihood of a meaningful PyPI compromise going undetected
* in any subsequent investigation, we'd have multiple sets of
cryptographically validated metadata to compare to identify exactly what
was compromised, and how it was compromised
* the new attack vectors introduced (by compromising the validation
services rather than PyPI itself) are *denial of service* attacks (due to
PyPI downloads being disabled while the discrepancy is investigated),
rather than the artifact substitution that is possible by attacking PyPI
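
To illustrate the second point, here is a minimal sketch of lining up
several metadata sets during an investigation. The source names, digests
and the group_sources_by_digest helper are all hypothetical, and the sketch
assumes each source's signatures have already been verified, so it only
shows the comparison step.

```python
def group_sources_by_digest(views):
    """views maps a source name to {package: digest}.

    For every package where the sources disagree, return a mapping of
    digest -> list of sources reporting that digest. Packages on which
    all sources agree are left out of the report.
    """
    report = {}
    packages = sorted({pkg for view in views.values() for pkg in view})
    for package in packages:
        groups = {}
        for source, view in sorted(views.items()):
            if package in view:
                groups.setdefault(view[package], []).append(source)
        if len(groups) > 1:
            report[package] = groups
    return report


# Hypothetical views from PyPI and two independent validation servers.
views = {
    "pypi": {"demo": "bad", "other": "x"},
    "validator-a": {"demo": "good"},
    "validator-b": {"demo": "good", "other": "x"},
}
print(group_sources_by_digest(views))
```

The sketch deliberately doesn't pick a "winner": it just shows who agrees
with whom, and leaves the judgement about which side was compromised to the
people doing the investigation.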

That means we would move from the status quo, where a full PyPI compromise
may permit silent substitution of artifacts, to one where an illicit online
package substitution would likely be detected in minutes or hours for high
profile projects, so the likely pay-off for an attack on the central
infrastructure is a denial of service against organisations not using their
own local PyPI mirrors, rather than arbitrary software installation on a
wide range of systems.

Another nice benefit of this approach is that it also protects against
attacks on developer PyPI *accounts*, so long as they use different
authentication mechanisms on the validation server than on the main PyPI
server. For example, larger organisations could run their *own* validation
server for the packages they publish, and manage it using offline keys as
recommended by TUF - that's a lot easier to do when you don't need to allow
arbitrary uploads.

Specific *projects* could still be attacked (by compromising developer
systems), but that's not a new threat, and outside the scope of PEP 458/480
- we're aiming to mitigate the threat of *systemic* compromise that
currently makes PyPI a relatively attractive target.

As far as the pragmatic aspects go, we could either go with a model where
projects are encouraged to run their *own* validation services on something
like OpenShift (or even a static hosting site if they generate their
validation metadata locally), or else we could look for willing partners to
host public PyPI metadata validation servers (e.g. the OpenStack
Foundation, Fedora/Red Hat, perhaps someone from the
Debian/Ubuntu/Canonical ecosystem, perhaps some of the other commercial
Python redistributors).


[1] Via Leigh Alexander, I was recently introduced to this excellent paper
on understanding and working with the mental threat models that users
actually have, rather than attempting to educate the users:
https://cups.cs.cmu.edu/soups/2010/proceedings/a11_Walsh.pdf. While the
paper is specifically written in the context of home PC security, I think
that's good advice in general: adjusting software systems to accommodate
the reality of human behaviour is usually going to be far more effective
than attempting to teach humans to conform to the current needs of the
software.

Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia