[Distutils] PEP 438, pip and --allow-external (was: "pip: cdecimal an externally hosted file and may be unreliable" from python-dev)

Nick Coghlan ncoghlan at gmail.com
Mon May 12 06:50:02 CEST 2014


On 12 May 2014 12:27, Donald Stufft <donald at stufft.io> wrote:
>
> On May 11, 2014, at 7:35 PM, Donald Stufft <donald at stufft.io> wrote:
>
> However before I go further on that I want to dig more into the impact of
> these things. It dawned on me earlier today that the way I was categorizing
> things in my earlier number crunching was making it unreasonably hard to
> actually divine any sort of meaning out of those numbers. I'm currently in
> the process of crawling all of PyPI again*, after I have those new numbers
> I'll have a better sense of things and I think a better forward plan can be
> made.
>
>
> I've completed the crawl. I've made the scripts and the data available at
> https://github.com/dstufft/pypi-external-stats.

Thanks for that.

> Here's the general statistics from that:
>
> Hosted on PyPI: 37779
> Hosted Externally (<50%): 18
> Hosted Externally (>50%): 47
> Hosted Externally: 65
> Hosted Unsafely (<50%): 725
> Hosted Unsafely (>50%): 2249
> Hosted Unsafely: 2974

From counting the number of "external1" packages in the JSON data you
linked, I take it "external1" & "external2" correspond to < 50% and >
50% (and ditto for "unsafe1" and "unsafe2")?

"pyOpenSSL" is the main one that catches my eye in the externally
hosted category, but closer investigation shows that is being thrown
off by an older external link for 0.11. All other releases, including
the newer 0.12, 0.13 and 0.14 releases are PyPI hosted. (If it's
practical, a "latest" release vs "any" release split would be even
more useful than the current more or less than 50% split - if the
latest release is externally hosted, silently receiving an older
version can actually be more problematic than not receiving a version
at all, and cases like pyOpenSSL show that even this new
categorisation may be overstating the number of packages relying on
external hosting).
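
As a minimal sketch of what a "latest" vs "any" split could look like:
the input layout below (release version mapped to whether that release
has files hosted on PyPI) is an assumption for illustration, not the
format of the crawl data, and the version ordering is a naive numeric
sort rather than real version parsing.

```python
def version_key(version):
    # Naive ordering: compare dotted numeric components only.
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def classify(releases):
    """Classify a project by where its *latest* release is hosted.

    `releases` maps version string -> True if that release is hosted on
    PyPI (hypothetical layout, not the crawl's JSON format).
    """
    if all(releases.values()):
        return "pypi"
    latest = max(releases, key=version_key)
    if releases[latest]:
        # Only older releases rely on external links.
        return "external-older-only"
    return "external-latest"


# Illustrative data based on the pyOpenSSL observation above.
pyopenssl = {"0.11": False, "0.12": True, "0.13": True, "0.14": True}
print(classify(pyopenssl))
```

With a split like this, a project in the pyOpenSSL situation would no
longer be counted alongside projects whose current release is only
available externally.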

There are some more notable names in the "unsafe" lists, but a few
spot checks on projects like PyGObject, PyGTK, biopython, dbus-python,
django-piston, ipaddr, matplotlib, and mayavi showed that a number of
them *have* switched to PyPI hosting for recent releases, but have
left older releases as externally hosted. A few notable names, like
wxPython and Spyder, *did* show up as genuinely externally hosted.
Something that would be nice to be able to do, but isn't really
practical without a server side dependency graph, is to figure out how
many packages have an externally hosted dependency *somewhere in their
dependency chain*, and *how many* other projects depend on particular
externally hosted projects transitively.
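
If a dependency graph were available, the transitive question could be
answered with a plain graph walk. The sketch below assumes a
hypothetical `deps` mapping (project -> direct dependencies) and a set
of externally hosted project names; neither exists on the server side
today, and all the names are made up.

```python
def external_in_chain(package, deps, external):
    """Return the externally hosted projects reachable from `package`.

    `deps` maps project name -> list of direct dependencies and
    `external` is a set of externally hosted project names (both are
    hypothetical inputs for illustration).
    """
    found, seen = set(), set()
    stack = list(deps.get(package, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        if current in external:
            found.add(current)
        stack.extend(deps.get(current, ()))
    return found


# Made-up graph: "gui" is the only externally hosted project.
deps = {"app": ["lib"], "lib": ["gui"], "gui": [], "tool": ["lib"], "solo": []}
external = {"gui"}
affected = sorted(p for p in deps if external_in_chain(p, deps, external))
print(affected)
```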

Regardless, even with those caveats, the numbers are already solid
enough to back up the notion that the only plausible reasons to keep
verified external hosting available independently of unverified
external hosting are policy and relationship management ones.
Relationship management would just mean providing a deprecation period
before removing the capability, but I want to spend some time
exploring a possible concrete *policy* related rationale for keeping
it.

The main legitimate reason I am aware of for wanting to avoid PyPI
hosting is for non-US based individuals and organisations to avoid
having to sign up to the "Any uploads of packages must comply with
United States export controls under the Export Administration
Regulations." requirement that the PSF is obliged to place on uploads
to the PSF controlled US hosted PyPI servers. That rationale certainly
applies in MAL's case, since eGenix is a German company, and I believe
they mostly do business outside the US (for example, their case study
in the Python brochure is for a government project in Ghana).

In relation to that, I double checked the egenix-mx-base package, and
(as noted earlier in the thread) that is one that *could* be
transitively verified, since a hash is provided on PyPI for the linked
index pages, which could be used to ensure that the hashes of the
download links are correct. That transitive verification could either
be done by pip on the fly, or else implemented as a tool that scanned
the linked page for URLs once, checked the hash and then POSTed the
specific external URLs to PyPI - the latter approach would have the
advantage of also speeding up downloads of affected packages by
allowing the project to be set to the "pypi-explicit" hosting mode.
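
A rough sketch of the transitive check, under stated assumptions: the
hash is taken to be carried as a "#<algo>=<hexdigest>" URL fragment,
the page is passed in as bytes rather than fetched, and the helper
names and example URLs are hypothetical. This is not pip's
implementation.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urldefrag

ALGOS = ("md5", "sha1", "sha256")


def matches_fragment(data, url):
    """Check `data` against a '#<algo>=<hexdigest>' fragment on `url`."""
    _, fragment = urldefrag(url)
    algo, _, expected = fragment.partition("=")
    if algo not in ALGOS or not expected:
        return False
    return hashlib.new(algo, data).hexdigest() == expected.lower()


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def verified_links(page_url, page_bytes):
    """Return hashed download links, but only if the page itself verifies."""
    if not matches_fragment(page_bytes, page_url):
        return []  # page was altered, so none of its links can be trusted
    collector = LinkCollector()
    collector.feed(page_bytes.decode("utf-8"))
    return [
        link for link in collector.links
        if urldefrag(link)[1].partition("=")[0] in ALGOS
    ]


# Hypothetical page and URLs, for illustration only.
page = b'<a href="https://downloads.example.org/pkg-1.0.tar.gz#sha256=abc">pkg</a>'
page_url = ("https://downloads.example.org/index.html#sha256="
            + hashlib.sha256(page).hexdigest())
print(verified_links(page_url, page))
```

Each returned link still carries its own hash, so the downloaded file
can be checked the same way with matches_fragment().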

That means the long term fate of a global
"--allow-all-verifiable-external" flag really hinges on a policy
decision: do we want to ensure it remains possible for non-US software
distributors to avoid subjecting their software to US export law,
without opening up their users to MITM attacks on other downloads?

Note that the occasionally recommended alternative to external link
support, adding a new index URL client side, is in itself a greater
risk than allowing verifiable external downloads linked from PyPI,
since dependency resolution and package lookups in general aren't
scoped by index URL - you're trusting the provider of a custom index
to not publish a "new" version of other PyPI packages that overrides
the PyPI version (even Linux distros haven't systematically solved
that problem, although tools like the yum priorities plugin address
most of the issues).
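
To illustrate the override risk: the simplified model below picks the
highest version of a name across every configured index without
scoping by index URL. It is a toy model of the behaviour described
above, not pip's actual resolver, and the index URLs and package names
are invented.

```python
def version_key(version):
    # Naive ordering on dotted numeric versions.
    return tuple(int(part) for part in version.split("."))


def pick(name, indexes):
    """Pick the highest version of `name` across all indexes.

    `indexes` maps index URL -> {package name: [versions]} (hypothetical
    layout). Nothing restricts a package to the index it came from.
    """
    candidates = [
        (version_key(version), index_url)
        for index_url, packages in indexes.items()
        for version in packages.get(name, ())
    ]
    return max(candidates) if candidates else None


indexes = {
    "https://pypi.example.org/simple": {"somepkg": ["2.2.1", "2.3.0"]},
    # The custom index was added for "theirpkg", but can also publish
    # a "newer" somepkg that wins the version comparison.
    "https://custom.example.org/simple": {"theirpkg": ["1.0"], "somepkg": ["99.0"]},
}
print(pick("somepkg", indexes))
```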

After considering the policy implications, and the deficiencies of the
"just run your own index server" approach, I think it makes sense to
preserve the "--allow-all-verifiable-external" option indefinitely,
even if it's confusing: it means we're leaving the option open for
individual projects and organisations to decide to accept a slightly
degraded user experience in order to remain free of entanglement with
US export restrictions, as well as allowing end users the option to
globally enable packages that may not comply with US export
restrictions (because they may be hosted outside the US), without
opening themselves up to additional security vulnerabilities.

By contrast, dropping this feature entirely would mean saying to
non-US users "you must agree to US export restrictions in order to
participate in PyPI at all", and I don't think we want to go down that
path.

Under that approach, per-package "--allow-external" settings would
still become the recommended solution for installation issues (since
it always works, regardless of whether or not the project is set up to
do it safely), the "--allow-all-external" option would be deprecated
in 1.6 and removed in 1.7, and "--allow-all-verifiable-external" would
be added as a non-deprecated spelling for the
not-necessarily-subject-to-US-export-laws external hosting support.

At-least-we're-not-dealing-with-ITAR-ly yours,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

