[Distutils] PEP 470 discussion, part 3

Donald Stufft donald at stufft.io
Wed Jul 23 20:18:58 CEST 2014

On July 23, 2014 at 1:09:00 PM, Richard Jones (r1chardj0n3s at gmail.com) wrote:
I have been mulling over PEP 470 for some time without having the time to truly dedicate to addressing it. I believe I'm up to date with its contents and the (quite significant, and detailed) discussion around it.

To summarise my understanding, PEP 470 proposes to remove the current link spidering (pypi-scrape, pypi-scrape-crawl) while retaining explicit hosting (pypi-explicit). I believe it retains the explicit links to external hosting provided by pypi-explicit.
No, it removes pypi-explicit as well, leaving only files hosted on PyPI. On top of that it adds new functionality where project authors can indicate that their files are hosted on a non-PyPI index. This allows tooling to tell users that they need to add additional indexes to their install commands in order to install something, and it allows PyPI to still act as a central authority for naming without forcing people to upload to PyPI.

The reason given for this change is the current bad user experience around the --allow-external and --allow-unverified options to pip install. That is, that users currently attempt to install a non-pypi-explicit package and the result is an obscure error message.
That’s part of the bad UX. The other part is that users are not particularly aware of the difference between an external and an unverified link (in fact many people involved in packaging were not aware until I explained it to them; the difference is subtle). Part of the problem is that while it’s easy for *tooling* to determine the difference between external and unverified, a human has to inspect the actual HTML of the /simple/ page.

I believe the current PEP addresses the significant usability issues around this by swapping them for other usability issues. In fact, I believe it will make matters worse with potential confusion about which index hosts what, potential masking of release files or even, in the worst scenario, potential spoofing of release files by indexes out of the control of project owners.
So that’s a potential problem with any multi-index setup, yes. However, I do not believe these are serious problems. It is a model that is in use by every Linux vendor, and anyone who has ever used Linux (or most of the various BSDs) is already familiar with it. On top of that, it is something that end users would need to be aware of if they want to use a private index, or they want to install commercial software that has a restricted index, or any number of other situations. In other words, multiple indexes don’t go away; they will always be there. The effect of PEP 438 is that users need to be aware of *two* different ways of installing things not hosted on PyPI instead of just one.

Having two concepts instead of one is another part of the bad UX inflicted by PEP 438. The Zen of Python states that there should be one obvious way to do something, and I think that is a good thing to strive for.

I would like us to consider instead working on the usability of the existing workflow so that, rather than throwing an error, we start a dialog with the user:

$ pip install PIL
Downloading/unpacking PIL
  PIL is hosted externally to PyPI. Do you still wish to download it? [Y/n] y
  PIL has no checksum. Are you sure you wish to download it? [Y/n] y
Downloading/unpacking PIL
  Downloading PIL-1.1.7.tar.gz (506kB): 506kB downloaded

Obviously this would require scraping the site, but given that this interaction would only happen for a very small fraction of projects (those for which no download is located), the overall performance hit is negligible. The PEP currently states that this would be a "massive performance hit" for reasons I don't understand.
It’s a big performance hit because we can’t just assume that if there is a download located on PyPI then there is not a better download hosted externally. So in order to do this accurately we must scan every URL we locate to build up the entire list of potential files, and only then ask whether the person wants to download it.
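To illustrate why the scan cannot stop at PyPI, here is a minimal sketch. It is not pip's actual code: the function name, the (version, url) tuples, and the example URLs are all made up for illustration.

```python
def best_candidate(pypi_files, external_pages, fetch_links):
    """Return the highest-versioned (version, url) pair across all sources."""
    candidates = list(pypi_files)
    # A file on PyPI does not rule out a newer one elsewhere, so every
    # external page has to be fetched before a choice can be made.
    for page in external_pages:
        candidates.extend(fetch_links(page))
    return max(candidates, key=lambda pair: pair[0])


# Hypothetical data: PyPI hosts 1.0, an external page lists 1.1.
pypi_files = [((1, 0), "https://pypi.python.org/packages/demo-1.0.tar.gz")]
external = {
    "https://example.com/downloads/": [
        ((1, 1), "https://example.com/downloads/demo-1.1.tar.gz"),
    ],
}
best = best_candidate(pypi_files, list(external), external.get)
print(best[1])
```

In this toy case the externally hosted 1.1 wins, and the only way to find that out is to fetch the external page; in practice every such fetch is a network round trip to a host that may be slow or dead.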

As a rough indication of the difference, I can scan all of PyPI looking for potential release files in about 20 minutes if I restrict myself to only things hosted directly on PyPI. If I include the additional scanning then that time jumps up to 3-4 hours. That’s roughly 9 to 12 times slower, and that’s with an incredibly aggressive timeout and a blacklist so that bad hosts are only tried once.
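For anyone who wants to check the ratio, the arithmetic on those approximate figures (20 minutes versus 3 to 4 hours) can be sketched like this; the helper name is just for illustration.

```python
def slowdown(baseline_minutes, scan_hours):
    """Ratio of the full scan time to the PyPI-only scan time."""
    return scan_hours * 60 / baseline_minutes


low = slowdown(20, 3)   # 3 hours against a 20 minute baseline
high = slowdown(20, 4)  # 4 hours against a 20 minute baseline
print(low, high)
```

With these inputs the ratio lands between 9 and 12, so about an order of magnitude.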

The two prompts could be made automatic "y" responses for tools using the existing --allow-external and --allow-unverified flags.

I also note that PEP 470 says "PEP 438 proposed a system of classifying file links as either internal, external, or unsafe", whereas PEP 438 has no mention of "unsafe". This leads "unsafe" to never actually be defined anywhere that I can see.
I can define them in the PEP, but basically:

* internal - Things hosted by PyPI itself.

* external - Things hosted off of PyPI, but linked directly from the /simple/ page with an acceptable hash

* unsafe - Things hosted off of PyPI, either linked directly from the /simple/ page *without* an acceptable hash, or things hosted on a page which is linked from a rel="homepage" or rel="download" link.
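The three categories above could be sketched roughly as follows. This is an illustration only and not what pip does internally: the host set, the list of acceptable hash names, and the function name are assumptions.

```python
from urllib.parse import urlparse

# Assumption: hosts treated as "PyPI itself" for this illustration.
PYPI_HOSTS = {"pypi.python.org"}
# Assumption: hash names accepted in a URL fragment such as "#md5=...".
ACCEPTED_HASHES = {"md5", "sha256"}


def classify_link(url, rel=None):
    """Classify a link from a /simple/ page as internal, external or unsafe."""
    # Anything reached through a homepage or download link is unsafe,
    # because it was found by scraping a page PyPI does not control.
    if rel in ("homepage", "download"):
        return "unsafe"
    parsed = urlparse(url)
    if parsed.hostname in PYPI_HOSTS:
        return "internal"
    # Off-PyPI links count as external only when they carry a usable hash.
    name, sep, value = parsed.fragment.partition("=")
    if sep and name in ACCEPTED_HASHES and value:
        return "external"
    return "unsafe"


print(classify_link("https://example.com/PIL-1.1.7.tar.gz#md5=abc123"))
```

The point of the sketch is that tooling can make this decision mechanically from the HTML, while a human looking at a project page has no easy way to tell which case applies.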

Finally, the Rejected Proposals section of the PEP appears to have a couple of justifications for rejection which have nothing whatsoever to do with the Rationale ("PyPI is fronted by a globally distributed CDN...", "PyPI supports mirroring...") As Holger has already indicated, that second one is going to have a heck of a time dealing with PEP 470 changes at least in the devpi case.
PEP 381 mirroring will require zero changes to deal with the proposed change since it explicitly requires that the mirror client download the HTML of the /simple/ page and serve it unmodified. If devpi requires changes that is because it does not follow the documented protocol.

Those additional justifications are why we need a much clearer line between what is available on the PyPI repository, and what is available elsewhere. They are why we can’t just eliminate the ``--allow-external`` case (which is safe, but has availability and speed concerns).

 "PyPI has monitoring and an on-call rotation of sysadmins..." would be solved through improving the failure message reported to the user as discussed above.
We can’t have better failure messages because we have no idea whether a particular URL is expected to be up or whether it has bit-rotted to death and is thus an expected failure. Because of this, pip has to more or less silently ignore failing URLs and ends up presenting very confusing error messages.

Forgive me if these don’t make sense, I’m real sick today.

Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA