[Distutils] PEP 438, pip and --allow-external (was: "pip: cdecimal an externally hosted file and may be unreliable" from python-dev)

Donald Stufft donald at stufft.io
Mon May 12 07:39:32 CEST 2014


On May 12, 2014, at 12:50 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On 12 May 2014 12:27, Donald Stufft <donald at stufft.io> wrote:
>> 
>> On May 11, 2014, at 7:35 PM, Donald Stufft <donald at stufft.io> wrote:
>> 
>> However before I go further on that I want to dig more into the impact of
>> these
>> things. It dawned on me earlier today that the way I was categorizing things
>> in my earlier number crunching was making it unreasonably hard to actually
>> divine any sort of meaning out of those numbers. I'm currently in the
>> process
>> of crawling all of PyPI again*, after I have those new numbers I'll have a
>> better sense of things and I think a better forward plan can be made.
>> 
>> 
>> I've completed the crawl. I've made the scripts and the data available at
>> https://github.com/dstufft/pypi-external-stats.
> 
> Thanks for that.
> 
>> Here's the general statistics from that:
>> 
>> Hosted on PyPI: 37779
>> Hosted Externally (<50%): 18
>> Hosted Externally (>50%): 47
>> Hosted Externally: 65
>> Hosted Unsafely (<50%): 725
>> Hosted Unsafely (>50%): 2249
>> Hosted Unsafely: 2974
> 
> From counting the number of "external1" packages in the JSON data you
> linked, I take it "external1" & "external2" correspond to < 50% and >
> 50% (and ditto for "unsafe1" and "unsafe2")?

That’s correct.

> 
> "pyOpenSSL" is the main one that catches my eye in the externally
> hosted category, but closer investigation shows that is being thrown
> off by an older external link for 0.11. All other releases, including
> the newer 0.12, 0.13 and 0.14 releases are PyPI hosted. (If it's
> practical, a "latest" release vs "any" release split would be even
> more useful than the current more or less than 50% split - if the
> latest release is externally hosted, silently receiving an older
> version can actually be more problematic than not receiving a version
> at all, and cases like pyOpenSSL show that even this new
> categorisation may be overstating the number of packages relying on
> external hosting).

That's not a bad idea. It'll require a little more logic, since I'll have to
parse the versions out of the filenames, but it shouldn't be terrible to do,
and I can do it with the existing data.json instead of needing to recrawl.
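
Something along these lines (a rough, untested sketch; the exact data.json
layout and filename parsing assumed here are illustrative rather than
necessarily what the crawler produces):

    # Rough sketch: is the *latest* release of each project only hosted
    # off-PyPI? Assumes data.json maps project names to
    # {"pypi": [...], "external": [...], "unsafe": [...]} lists of filenames.
    import json
    import re
    from pkg_resources import parse_version

    EXTS = (".tar.gz", ".tar.bz2", ".tgz", ".zip", ".egg", ".exe", ".whl")

    def version_from_filename(project, filename):
        for ext in EXTS:
            if filename.endswith(ext):
                filename = filename[:-len(ext)]
                break
        # Assumes "<project>-<version>..." style filenames.
        match = re.match(re.escape(project) + r"-([^-]+)", filename, re.I)
        return parse_version(match.group(1)) if match else None

    def best_version(project, filenames):
        versions = [version_from_filename(project, f) for f in filenames]
        versions = [v for v in versions if v is not None]
        return max(versions) if versions else None

    with open("data.json") as fp:
        data = json.load(fp)

    latest_is_external = []
    for project, files in data.items():
        on_pypi = best_version(project, files.get("pypi", []))
        off_pypi = best_version(
            project, files.get("external", []) + files.get("unsafe", []))
        if off_pypi is not None and (on_pypi is None or off_pypi > on_pypi):
            latest_is_external.append(project)

    print("%d projects whose newest release is only hosted off-PyPI"
          % len(latest_is_external))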

The 50% thing I just kinda tossed in at the last minute. I had tried not to
put my own spin on the numbers as much as I could, since I think by now
it's quite obvious what I think should happen, and I think the numbers support
that even without my spin.

> 
> There are some more notable names in the "unsafe" lists, but a few
> spot checks on projects like PyGObject, PyGTK, biopython, dbus-python,
> django-piston, ipaddr, matplotlib, and mayavi showed that a number of
> them *have* switched to PyPI hosting for recent releases, but have
> left older releases as externally hosted. (A few notable names, like
> wxPython and Spyder, *did* show up as genuinely externally hosted.
> Something that would be nice to be able to do, but isn't really
> practical without a server side dependency graph, is to be able to
> figure out how many packages have an externally hosted dependency
> *somewhere in their dependency chain*, and *how many* other projects
> are depending on particular externally hosted projects transitively).

I could maybe do it with a mirror and a throwaway VM, but I think it'd
be a decent chunk of effort.

> 
> Regardless, even with those caveats, the numbers are already solid
> enough to back up the notion that the only possible reasons to support
> enabling verified external hosting support independently of unverified
> external hosting are policy and relationship management ones.
> Relationship management would just mean providing a deprecation period
> before removing the capability, but I want to spend some time
> exploring a possible concrete *policy* related rationale for keeping
> it.
> 
> The main legitimate reason I am aware of for wanting to avoid PyPI
> hosting is for non-US based individuals and organisations to avoid
> having to sign up to the "Any uploads of packages must comply with
> United States export controls under the Export Administration
> Regulations." requirement that the PSF is obliged to place on uploads
> to the PSF controlled US hosted PyPI servers. That rationale certainly
> applies in MAL's case, since eGenix is a German company, and I believe
> they mostly do business outside the US (for example, their case study
> in the Python brochure is for a government project in Ghana).

Yes, that is the main reason I can distill from the various threads that
have occurred over time.

> 
> In relation to that, I double checked the egenix-mx-base package, and
> (as noted earlier in the thread) that is one that *could* be
> transitively verified, since a hash is provided on PyPI for the linked
> index pages, which could be used to ensure that the hashes of the
> download links are correct. That transitive verification could either
> be done by pip on the fly, or else implemented as a tool that scanned
> the linked page for URLs once, checked the hash and then POSTed the
> specific external URLs to PyPI - the latter approach would have the
> advantage of also speeding up downloads of affected packages by
> allowing the project to be set to the "pypi-explicit" hosting mode.

So it can kind of be verified. It'll work most of the time, but corporate
proxies and the like can break it pretty easily, since one of the things some
of them do is rewrite the HTML in responses. There are headers you can add to
tell them not to do that, but non-compliant proxies will do it anyway.

This sort of thing has been a headache for pip lately because of the .tar.gz
extension and servers/proxies trying to be smart about the headers.
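
For reference, the kind of transitive check Nick describes would look roughly
like this (a minimal sketch, not pip's actual code; it assumes the external
link carries the usual "#md5=<hexdigest>" fragment and that the page comes
back byte-for-byte unmodified, which is exactly the part a rewriting proxy
breaks):

    import hashlib
    import urllib2   # Python 2, still pip's primary target
    import urlparse

    def fetch_verified_page(link_with_hash):
        # link_with_hash is the URL as it appears on PyPI, e.g.
        # "https://downloads.example.org/index.html#md5=<hexdigest>"
        url, fragment = urlparse.urldefrag(link_with_hash)
        if not fragment.startswith("md5="):
            return None  # nothing to verify against, treat as unsafe
        expected = fragment[len("md5="):]
        body = urllib2.urlopen(url).read()
        # A proxy that rewrites the HTML in transit changes these bytes,
        # so the check fails even though nothing malicious happened.
        if hashlib.md5(body).hexdigest() != expected:
            raise ValueError("hash mismatch for %s" % url)
        return body  # safe to scrape download links out of this page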

> 
> That means the long term fate of a global
> "--allow-all-verifiable-external" flag really hinges on a policy
> decision: do we want to ensure it remains possible for non-US software
> distributors to avoid subjecting their software to US export law,
> without opening up their users to MITM attacks on other downloads?
> 
> Note that the occasionally recommended alternative to external link
> support, adding a new index URL client side, is in itself a greater
> risk than allowing verifiable external downloads linked from PyPI,
> since dependency resolution and package lookups in general aren't
> scoped by index URL - you're trusting the provider of a custom index
> to not publish a "new" version of other PyPI packages that overrides
> the PyPI version (even Linux distros haven't systematically solved
> that problem, although tools like the yum priorities plugin address
> most of the issues).

I'm not sure the distinction makes much sense for PyPI/pip. You basically
have to trust the authors of the packages you're installing. If a package
author is willing to hijack another package with a custom index, they could
just as easily do something malicious in a setup.py. Even if we get rid of
setup.py there are still endless ways of attacking someone who is installing
your package; they are basically impossible to prevent and are just as bad
as, or worse than, that.
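
To make that concrete: a setup.py is just Python that runs with the
installing user's privileges, so a hostile author doesn't need index tricks
at all (a trivial illustration, obviously not a real package):

    # setup.py -- executed by pip at install/build time
    import os
    from setuptools import setup

    # Anything at all can go here and will run on the installer's machine.
    os.system("echo 'arbitrary code running during install'")

    setup(name="totally-legit-package", version="1.0")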

Ultimately I think that providing a custom index for your packages, which
people then pass on the CLI, put in their settings file, or add to their
requirements.txt, is the correct solution for that case.
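
All three of those already work today, for example (using a made-up index
URL):

    # On the command line:
    pip install --extra-index-url https://example.com/my-cool-packages/ foo

    # In requirements.txt:
    --extra-index-url https://example.com/my-cool-packages/
    foo

    # In ~/.pip/pip.conf:
    [global]
    extra-index-url = https://example.com/my-cool-packages/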

> 
> After considering the policy implications, and the deficiencies of the
> "just run your own index server" approach, I think it makes sense to
> preserve the "--allow-all-verifiable-external" option indefinitely,
> even if it's confusing: it means we're leaving the option open for
> individual projects and organisations to decide to accept a slightly
> degraded user experience in order to remain free of entanglement with
> US export restrictions, as well as allowing end users the option to
> globally enable packages that may not comply with US export
> restrictions (because they may be hosted outside the US), without
> opening themselves up to additional security vulnerabilities.
> 
> By contrast, dropping this feature entirely would mean saying to
> non-US users "you must agree to US export restrictions in order to
> participate in PyPI at all", and I don't think we want to go down that
> path.
> 
> Under that approach, per-package "--allow-external" settings would
> still become the recommended solution for installation issues (since
> it always works, regardless of whether or not the project is set up to
> do it safely), the "--allow-all-external" option would be deprecated
> in 1.6 and removed in 1.7, and "--allow-all-verifiable-external" would
> be added as a non-deprecated spelling for the
> not-necessarily-subject-to-US-export-laws external hosting support.

Like I said above, I think this is ultimately the wrong long term solution. I
personally feel that the saner long term thing to do is to drop the notion of
externally hosted packages altogether and use the multiple index support
instead.

My reasons are:

* It's only somewhat nicer up front than providing a custom index; however, it
  represents an additional command line flag that users have to learn.


* It's not really any safer than providing a custom index, except in a way
  that doesn't really matter.


* The existence of external fetching in pip complicates the code base and makes
  it hard to provide guidance to users.

  We essentially have to assume that a URL won't work, so instead of providing
  clear error messages we just ignore a failing URL. If that URL is temporarily
  down, then instead of a clear and obvious "this URL is failing" message the
  real error is silent, and users will either get a lower version (if they're
  lucky) or an error saying that no versions could be found for foo.

  If instead we only supported the indexes/find-links we've been given, then we
  can assume those URLs are supposed to exist and work, and we can provide
  clear, up-front guidance at the time of failure [1] (see the sketch after
  this list).


* I hate the idea of a long term --allow-all-verified-external (or any variant
  of it). It feels way too much to me like an "unbreak my pip please" flag, and
  I think that is how the users who need it will perceive it. This will create
  more animosity and hostility towards the packaging toolchain.

  I went into this on the pip PR, but essentially I see this becoming a turd
  that people chuck into their ~/.pip/pip.conf, requirements.txt, environment,
  or build scripts. They'll run into a problem where they need it, shove it
  into their config and then forget about it until they try to deploy to a
  new machine, or service, or whatever and run into that problem again.


* I don't agree that it says to non-US users that they must agree to the US
  export rules in order to participate in PyPI at all. They'll still be able
  to register their projects with PyPI and provide docs there. They just won't
  get as streamlined an install experience, and they'll have to provide some
  installation instructions.

  There is possibly even something we can do to make this more streamlined.
  Perhaps projects could register their custom index with PyPI, PyPI could
  advise pip of it, and if pip finds that advisory it could report it to the
  user: "foo bar is hosted on a separate repository; in order to install it
  you'll need to add https://example.com/my-cool-packages/ to your index
  URLs."


* We constantly tell people that if you depend on PyPI you need to run a
  mirror; however, if a file isn't uploaded to PyPI then the user doesn't get
  the assurance that hosting on PyPI provides, namely that they have the right
  to mirror and distribute it. This means that we force people who want to
  isolate themselves from external dependencies to manually resolve every
  externally hosted dependency. Most of them are not lawyers and may not have
  any idea what all of that means, or a good sense of whether they can do it
  or not.

  It's true that this problem still exists with an external index; however, by
  moving to a "stand up your own index" solution it becomes easier for people
  to reason about which dependencies they need to sort this out for, since
  there will be a clear separation between things that came from PyPI and
  things that came from another index.


* Long term I think that both PyPI and pip should disallow external hosting and
  require the use of an additional index, though that will require a new PEP.
  I'm still thinking it through, but the more I think about it, dig into pip's
  code base, and talk to people, the more convinced I become that it is the
  right long term decision.

  That does not mean people will need to upload to PyPI to participate on
  PyPI, since a large part of what PyPI provides is discoverability and a
  central naming authority.
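
To illustrate the error-handling point above, here's the shape of the
assumption change (a deliberately simplified illustration, not pip's real
code):

    import logging
    import urllib2

    log = logging.getLogger("fetcher")

    def fetch(url, user_supplied):
        try:
            return urllib2.urlopen(url).read()
        except Exception as exc:
            if user_supplied:
                # The user explicitly told us this index/find-links URL
                # exists, so a failure can be reported loudly and right away.
                raise RuntimeError("Could not reach %s: %s" % (url, exc))
            # A link scraped off a /simple/ page might be dead, blocked, or
            # behind a broken proxy, so today pip has to shrug, log it at
            # DEBUG, and keep going, which is why the eventual failure is so
            # confusing to users.
            log.debug("Skipping %s: %s", url, exc)
            return None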


[1] This has been a recurring problem for people with old OpenSSL installs:
    they'll be unable to access PyPI at all, but we silently ignore the failure
    (and actually log it at DEBUG level) because the assumption in pip is that
    we should keep trucking, since we don't know whether any given URL is
    supposed to work or not. It's been a constant source of confusion. If I had
    to guess I'd say there are at least one or two people a month who come into
    our channels or talk to me personally whose underlying confusion stemmed
    from that.



-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
