[Distutils] option #1 plus download_url scraping

Donald Stufft donald at stufft.io
Wed Jun 5 01:15:16 CEST 2013


On Jun 4, 2013, at 6:16 PM, Barry Warsaw <barry at python.org> wrote:

> Like many of you, I got Donald's message about the changes to URLs for
> Cheeseshop packages.  My question is about the three options; I think I want a
> middle ground, but I'm interested to see why you will discourage me from that
> <wink>.
> 
> IIUC, option #1 is fine for packages hosted on PyPI.  But what if our packages
> are *also* hosted elsewhere, say for redundancy purposes, and that external
> location needs to be scraped?
> 
> Specifically, say I have a download_url in my setup.py.  I *want* that url to
> be essentially a wildcard or index page because I don't want to have to change
> setup.py every time I make a release (unless of course `setup.py sdist` did it
> for me).  I also can't add this url to the "Additional File URLs" page for my
> package because again I'd have to change it every time I do a release.
> 
> So the middle ground I think I want is: option #1 plus scraping from
> download_url, but only download_url.
> 
> Am I a horrible person for wanting this?  Is there a better way?
> 
> Cheers,
> -Barry
> _______________________________________________
> Distutils-SIG maillist  -  Distutils-SIG at python.org
> http://mail.python.org/mailman/listinfo/distutils-sig

I was originally on my phone but am back at my computer now, so I can give a longer reply.

So my first question is about your actual use case: what are you attempting to achieve by hosting externally instead of on PyPI? It's likely there's a better way, but what problem are you actually trying to solve? :)

You mention reliability, but as far as I can tell it's basically impossible to add more reliability to the system via external URLs. The only way an installation client can discover your external URLs is through PyPI, and if PyPI is up so that clients can discover those URLs, then it is also up so that they can download directly from PyPI.

Additionally, except in one specific circumstance, it's also a major security issue. Installers can download the /simple/ pages over verified TLS and then use the hashes on those pages to verify the downloaded files. When you're scraping an external page, the only time that is *safe* is if that page is a) served over verified TLS and b) has a supported hash fragment for every single file an installer might attempt to download.
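Just to illustrate the mechanism: here's a minimal sketch of that hash-fragment check. The function name and the exact fragment handling are mine for illustration, not pip's actual code; the idea is simply that a link like foo-1.0.tar.gz#md5=abc123 carries the expected digest in its URL fragment.

    # Minimal sketch (not pip's implementation) of verifying a downloaded
    # file against the hash fragment found on a /simple/ page link.
    import hashlib

    def verify_download(data, fragment):
        # fragment looks like "md5=abc123..." or "sha256=deadbeef..."
        algo, _, expected = fragment.partition("=")
        if algo not in hashlib.algorithms_available or not expected:
            return False
        return hashlib.new(algo, data).hexdigest() == expected

Without a fragment like that on a TLS-verified page, the installer has nothing trustworthy to compare the download against.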

Furthermore, the scraping adds an extreme amount of time to installation. I recently did basically what pip does, minus downloading the actual packages, across all of PyPI: I processed every /simple/ page, looked on it for other pages to scrape, and then downloaded and scraped those as well. That process took about 3 days to complete. If I run the same process but simulate a world where everyone is using option #1, it takes about 10 minutes.
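For anyone curious what that crawl roughly looks like, here's a sketch using only the standard library. The class and function names are mine, and the link handling is simplified compared to what pip actually does, but the shape is the same: one request to PyPI per project, plus one request per external page, each of which may be slow or unreachable.

    # Rough sketch of the crawl described above: fetch a project's /simple/
    # page, collect direct file links, and note rel="download"/"homepage"
    # links that would have to be fetched and scraped as well.
    import urllib.request
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.files = []      # direct file links found on the page
            self.external = []   # external pages that also need scraping

        def handle_starttag(self, tag, attrs):
            if tag != "a":
                return
            attrs = dict(attrs)
            href = attrs.get("href", "")
            if attrs.get("rel") in ("download", "homepage"):
                self.external.append(href)
            elif href:
                self.files.append(href)

    def scrape(url):
        with urllib.request.urlopen(url, timeout=30) as resp:
            parser = LinkCollector()
            parser.feed(resp.read().decode("utf-8", errors="replace"))
        return parser

Every entry in "external" is another round trip to an arbitrary host before the installer even knows what files exist, which is where the 3 days versus 10 minutes comes from.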

The PEP concludes that there are valid reasons to host externally, but I'm of the opinion that if there is a valid reason, it is an extreme edge case and would likely be better solved another way.

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
