On 1 January 2015 at 05:51, Donald Stufft <donald@stufft.io> wrote:

So here is my problem. I’m completely on board with the developer signing for the distribution files. I think that makes total sense. However I worry that requiring the developer to sign for what is essentially the “installer” API (aka how pip discovers things to install) is going to put us in a situation where we cannot evolve the API easily. If we modified this PEP so that an online key signed for /simple/ what security properties would we lose?

It *appears* to me that the problem then would be that an attacker who compromises PyPI can present whatever information they want to pip as to what is available for pip to download and install. This would mean freeze attacks and mix-and-match attacks. It would also mean that they could, in a future world where pip can use metadata on PyPI to do dependency resolution, tell pip that it needs to download a valid but malicious project as a dependency of a popular project like virtualenv.

However, I don’t think they’d be able to actually cause pip to install a malicious copy of a good project. I also believe that we can protect against an attacker who possesses that key tricking pip into installing a malicious but valid project as a fake dependency: pip would use the theoretical future PyPI metadata that lists dependencies only as an optimization hint for what it should download. Once it has actually downloaded a project like virtualenv (which has been validated to be from the real author), it would peek inside that file and ensure that the metadata inside matches what PyPI told pip.
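
A minimal sketch of that "hint, then verify" check, assuming the dependency lists are already available as plain requirement strings. The function names and the normalisation rule are hypothetical illustrations, not part of pip or PyPI:

```python
def normalize(requirements):
    """Return a comparable set of requirement strings.

    Assumption: ignoring case and whitespace is enough for this
    illustration; a real tool would parse requirements properly.
    """
    return {"".join(req.split()).lower() for req in requirements}


def check_dependency_hint(index_hint, embedded_metadata):
    """Compare the index's dependency hint with the verified file's metadata.

    index_hint: dependencies advertised by the (online-key signed) index.
    embedded_metadata: dependencies read from inside the downloaded,
        developer-signed distribution file.
    Returns (ok, unexpected, missing).
    """
    hinted = normalize(index_hint)
    actual = normalize(embedded_metadata)
    unexpected = sorted(hinted - actual)  # advertised but not really required
    missing = sorted(actual - hinted)     # required but not advertised
    return (not unexpected and not missing, unexpected, missing)


# Hypothetical example: the index advertises an extra dependency.
ok, unexpected, missing = check_dependency_hint(
    ["six >= 1.0", "evil-pkg"],  # what the index claimed
    ["Six>=1.0"],                # what the signed file actually declares
)
```

In this sketch a mismatch would be treated as a reason to refuse the hinted dependencies, since only the metadata inside the validated file is trusted.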

Is my assessment correct? Is keeping the “API” under control of PyPI a reasonable thing to do while keeping the actual distribution files themselves under control of the distribution authors? The reason this worries me is that, unlike a Linux distribution or an application like Firefox, we don’t have much of a relationship with the people who are uploading things to PyPI. So if we need to evolve the API we are not going to be able to compel our authors to go back and re-generate new signed metadata.

I think this is a good entry point for an idea I've had kicking around in my brain for the past couple of days: what if we change the end goal of PEP 480 slightly, from "prevent attackers from compromising published PyPI metadata" to "allow developers & administrators to rapidly detect and recover from compromised PyPI metadata"?

My reasoning is that when it comes to PyPI security, there are actually two major dials we can twiddle:

* raising the cost of an attack (e.g. making compromise harder by distributing signing authority to developers)
* reducing the benefit of an attack (e.g. making the expected duration, and hence reach, of a compromise lower, or downgrading an artifact substitution attack to a denial of service attack)

To raise the cost of a compromise through distributed signing authority, we have to solve the trust management problem - getting developer keys out to end users in a way that doesn't involve trusting the central PyPI service. That's actually a really difficult problem to solve, which is why we have situations like TLS still relying on the CA system, despite the known problems with the latter.

However, the latter objective is potentially more tractable: we wouldn't need to distribute trust management out to arbitrary end users, we'd "just" need a federated group of entities that are in a position to detect that PyPI has potentially been compromised, and request a service shutdown until such time as the compromise has been investigated and resolved.

This notion isn't fully evolved yet (that's why this email is so long), but it feels like a far more viable direction to me than the idea of pushing the enhanced security management problem back on to end users [1].

Suppose, for example, there were additional independently managed validation services hosting TUF metadata for various subsets of PyPI. The enhanced security model would then involve developers opting in to uploading their package metadata to one or more of the validation servers, rather than just to the main PyPI server. pip itself wouldn't worry about checking the validation services - it would just check against the main server as it does today, so we wouldn't need to worry about how we get the root keys for the validation servers out to arbitrary client end points.

That is, rather than "sign your own packages", the enhanced security model becomes "get multiple entities to sign your packages, so compromise of any one entity (including PyPI itself) can be detected and investigated appropriately".

The *validation* services would then be responsible for checking that their own registered metadata matched the metadata being published on PyPI. If they detect a discrepancy between their own metadata and PyPI's, then we'd have a human-in-the-loop process for reporting the problem, and the most likely response would be to disable PyPI downloads while the situation was resolved.
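
As a rough sketch of the comparison step only, assuming each side's metadata has already been reduced to a mapping of filename to SHA-256 digest. The function names and data shapes here are hypothetical; a real service would compare TUF metadata and verify its signatures first:

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def find_discrepancies(registered, published):
    """Compare a validator's registered digests with those PyPI publishes.

    registered: dict of filename -> digest uploaded by the developer
        to this validation service.
    published: dict of filename -> digest currently served by PyPI.
    Only files registered with this validator are checked.
    Returns a dict of filename -> short problem description.
    """
    problems = {}
    for filename, expected in registered.items():
        actual = published.get(filename)
        if actual is None:
            problems[filename] = "missing from PyPI"
        elif actual != expected:
            problems[filename] = "digest mismatch"
    return problems


# Hypothetical data: PyPI serves different bytes for a registered file.
registered = {"example-1.0.tar.gz": sha256_hex(b"original release")}
published = {"example-1.0.tar.gz": sha256_hex(b"tampered release")}
report = find_discrepancies(registered, published)
```

A non-empty report would not trigger anything automatically in this model; it would be the input to the human-in-the-loop reporting process.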

I believe something like that would change the threat landscape in a positive way, and has three very attractive features over distributed signing authority:

* It's completely transparent at the point of installation - it transforms PEP 480 into a back end data integrity validation project, rather than something that affects the end user experience of the PyPI ecosystem. The changes to the installation experience would be completely covered by PEP 458.
* Uploading metadata to additional servers for signing is relatively low impact on developers (if they have an automated release process, it's likely just another line in a script somewhere), significantly lowering barriers to adoption relative to asking developers to sign their own packages.
* Folks that decide to run or use a validation server are likely going to be more closely engaged with the PyPI community, and hence easier to reach as the metadata requirements evolve.

In terms of how I believe such a change would mitigate the threat of a PyPI compromise:

* it provides a cryptographically validated way to detect a compromise of any packages registered with one or more validation services, significantly reducing the likelihood of a meaningful PyPI compromise going undetected
* in any subsequent investigation, we'd have multiple sets of cryptographically validated metadata to compare to identify exactly what was compromised, and how it was compromised
* the new attack vectors introduced (by compromising the validation services rather than PyPI itself) are *denial of service* attacks (due to PyPI downloads being disabled while the discrepancy is investigated), rather than the artifact substitution that is possible by attacking PyPI directly
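
To illustrate how multiple independent sets of metadata could narrow down an investigation, here is a small sketch that groups sources by the digest they report for one file and flags the ones that disagree with the majority. The source names and the majority rule are assumptions made for illustration; the actual decision would still rest with the people investigating:

```python
from collections import defaultdict


def group_sources_by_digest(views):
    """views: dict of source name -> digest reported for one file.

    Returns a dict of digest -> sorted list of sources reporting it.
    """
    groups = defaultdict(list)
    for source, digest in views.items():
        groups[digest].append(source)
    return {digest: sorted(sources) for digest, sources in groups.items()}


def outliers(views):
    """Return the sources that disagree with the majority digest.

    If every source agrees, nothing is flagged. If there is no clear
    majority, every source is returned, since this data alone cannot
    say which side was compromised.
    """
    groups = group_sources_by_digest(views)
    if len(groups) <= 1:
        return []
    sizes = sorted((len(s) for s in groups.values()), reverse=True)
    if sizes[0] == sizes[1]:
        return sorted(views)
    majority = max(groups.values(), key=len)
    return sorted(
        source
        for sources in groups.values()
        if sources is not majority
        for source in sources
    )


# Hypothetical example: two validators agree, PyPI differs.
flagged = outliers({"pypi": "bbb", "validator-a": "aaa", "validator-b": "aaa"})
```

With only one validator, a mismatch shows that something is wrong but not which side; each additional independent validator makes the odd one out easier to spot.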

That means we would move from the status quo, where a full PyPI compromise may permit silent substitution of artifacts, to one where an illicit online package substitution would likely be detected in minutes or hours for high profile projects. The likely pay-off for an attack on the central infrastructure then becomes a denial of service against organisations not using their own local PyPI mirrors, rather than arbitrary software installation on a wide range of systems.

Another nice benefit of this approach is that it also protects against attacks on developer PyPI *accounts*, so long as they use different authentication mechanisms on the validation server than on the main PyPI server. For example, larger organisations could run their *own* validation server for the packages they publish, and manage it using offline keys as recommended by TUF - that's a lot easier to do when you don't need to allow arbitrary uploads.

Specific *projects* could still be attacked (by compromising developer systems), but that's not a new threat, and outside the scope of PEP 458/480 - we're aiming to mitigate the threat of *systemic* compromise that currently makes PyPI a relatively attractive target.

As far as the pragmatic aspects go, we could either go with a model where projects are encouraged to run their *own* validation services on something like OpenShift (or even a static hosting site if they generate their validation metadata locally), or else we could look for willing partners to host public PyPI metadata validation servers (e.g. the OpenStack Foundation, Fedora/Red Hat, perhaps someone from the Debian/Ubuntu/Canonical ecosystem, perhaps some of the other commercial Python redistributors).


[1] Via Leigh Alexander, I was recently introduced to this excellent paper on understanding and working with the mental threat models that users actually have, rather than attempting to educate the users: https://cups.cs.cmu.edu/soups/2010/proceedings/a11_Walsh.pdf. While the paper is specifically written in the context of home PC security, I think that's good advice in general: adjusting software systems to accommodate the reality of human behaviour is usually going to be far more effective than attempting to teach humans to conform to the current needs of the software.

Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia