option #1 plus download_url scraping
Like many of you, I got Donald's message about the changes to URLs for Cheeseshop packages. My question is about the three options; I think I want a middle ground, but I'm interested to see why you will discourage me from that <wink>.

IIUC, option #1 is fine for packages hosted on PyPI. But what if our packages are *also* hosted elsewhere, say for redundancy purposes, and that external location needs to be scraped?

Specifically, say I have a download_url in my setup.py. I *want* that url to be essentially a wildcard or index page, because I don't want to have to change setup.py every time I make a release (unless of course `setup.py sdist` did it for me). I also can't add this url to the "Additional File URLs" page for my package, because again I'd have to change it every time I do a release.

So the middle ground I think I want is: option #1 plus scraping from download_url, but only download_url.

Am I a horrible person for wanting this? Is there a better way?

Cheers,
-Barry
On Jun 4, 2013, at 3:16 PM, Barry Warsaw wrote:
Like many of you, I got Donald's message about the changes to URLs for Cheeseshop packages. My question is about the three options; I think I want a middle ground, but I'm interested to see why you will discourage me from that <wink>.
IIUC, option #1 is fine for packages hosted on PyPI. But what if our packages are *also* hosted elsewhere, say for redundancy purposes, and that external location needs to be scraped?
Specifically, say I have a download_url in my setup.py. I *want* that url to be essentially a wildcard or index page because I don't want to have to change setup.py every time I make a release (unless of course `setup.py sdist` did it for me). I also can't add this url to the "Additional File URLs" page for my package because again I'd have to change it every time I do a release.
So the middle ground I think I want is: option #1 plus scraping from download_url, but only download_url.
Am I a horrible person for wanting this? Is there a better way?
Do you mean you just don't want to update the version number in setup.py before you release? I'm a bit unsure of the reason for this. The goal is very specifically that hosting outside of PyPI is no longer encouraged. The reliability and performance of PyPI have enough of a track record now that "I want it on my own site just in case" no longer holds enough water to be worth the substantial downsides.

--Noah
On Tue, Jun 04, 2013 at 15:23 -0700, Noah Kantrowitz wrote:
On Jun 4, 2013, at 3:16 PM, Barry Warsaw wrote:
Like many of you, I got Donald's message about the changes to URLs for Cheeseshop packages. My question is about the three options; I think I want a middle ground, but I'm interested to see why you will discourage me from that <wink>.
IIUC, option #1 is fine for packages hosted on PyPI. But what if our packages are *also* hosted elsewhere, say for redundancy purposes, and that external location needs to be scraped?
Specifically, say I have a download_url in my setup.py. I *want* that url to be essentially a wildcard or index page because I don't want to have to change setup.py every time I make a release (unless of course `setup.py sdist` did it for me). I also can't add this url to the "Additional File URLs" page for my package because again I'd have to change it every time I do a release.
So the middle ground I think I want is: option #1 plus scraping from download_url, but only download_url.
Am I a horrible person for wanting this? Is there a better way?
Do you mean you just don't want to update the version number in setup.py before you release? I'm a bit unsure of the reason for this. The goal is very specifically that hosting outside of PyPI is no longer encouraged.
I agree with not encouraging "hosting outside", but I'd like to sidenote that PEP 438 has a section discussing the history/reasons for external hosting [1] and concludes with this:

Irrespective of the present-day validity of these reasons, there clearly is a history why people choose to host files externally, and it even was for some time the only way you could do things. This PEP takes the position that there remain some valid reasons for external hosting even today.

As to Barry's need, what might be missing is a tool that helps to register links with a checksum. This would not require a change to setup.py, but it would require an extra (scriptable) action when releasing a package. OTOH, just automating a bit of setup.py changes might be easier and would allow hosting on PyPI, which is usually more reliable for install-time users.

best,
holger
The reliability and performance of PyPI have enough of a track record now that "I want it on my own site just in case" no longer holds enough water to be worth the substantial downsides.
--Noah
_______________________________________________ Distutils-SIG maillist - Distutils-SIG@python.org http://mail.python.org/mailman/listinfo/distutils-sig
On Jun 04, 2013, at 03:23 PM, Noah Kantrowitz wrote:
Do you mean you just don't want to update the version number in setup.py before you release?
Correct, although on further reflection, if the alternative download site has predictable URLs, then it wouldn't be too difficult to calculate the new url for setup.py. I just don't want to have to change the version number in more than one place (which for my packages, is *not* the setup.py).
I'm a bit unsure of the reason for this. The goal is very specifically that hosting outside of PyPI is no longer encouraged. The reliability and performance of PyPI have enough of a track record now that "I want it on my own site just in case" no longer holds enough water to be worth the substantial downsides.
Fair enough. -Barry
On Jun 5, 2013, at 10:25 AM, Barry Warsaw <barry@python.org> wrote:
On Jun 04, 2013, at 03:23 PM, Noah Kantrowitz wrote:
Do you mean you just don't want to update the version number in setup.py before you release?
Correct, although on further reflection, if the alternative download site has predictable URLs, then it wouldn't be too difficult to calculate the new url for setup.py. I just don't want to have to change the version number in more than one place (which for my packages, is *not* the setup.py).
I'm a bit unsure of the reason for this. The goal is very specifically that hosting outside of PyPI is no longer encouraged. The reliability and performance of PyPI have enough of a track record now that "I want it on my own site just in case" no longer holds enough water to be worth the substantial downsides.
Fair enough.
-Barry
Where are you updating the version information at? And how are you generating a tarball so that its name has the correct version in it?

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
On Jun 05, 2013, at 12:16 PM, Donald Stufft wrote:
Where are you updating the version information at? And how are you generating a tarball so that its name has the correct version in it?
It depends on the package, but let's say it's in a version.txt file. Your implication is correct though - if setup.py is parsing that file to calculate the version key, it can also do the same and calculate the download_url value. -Barry
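Barry's single-source-of-truth idea can be sketched roughly like this. The function names and the Launchpad URL pattern below are illustrative assumptions, not his actual setup; the point is only that version.txt is the one file that changes per release:

```python
# Hypothetical helpers for a setup.py that reads the version from one
# file (version.txt) and derives download_url from it, so a release
# only ever touches version.txt. The URL pattern is an assumption.
from pathlib import Path


def read_version(path="version.txt"):
    # The single file that owns the version number.
    return Path(path).read_text().strip()


def download_url_for(version, project="flufl.enum"):
    # Predictable per-release URL, computed rather than hand-edited.
    return (f"https://launchpad.net/{project}/+download/"
            f"{project}-{version}.tar.gz")
```

setup.py would then pass version=read_version() and download_url=download_url_for(version) to its setup() call.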
On Jun 5, 2013, at 1:49 PM, Barry Warsaw <barry@python.org> wrote:
On Jun 05, 2013, at 12:16 PM, Donald Stufft wrote:
Where are you updating the version information at? And how are you generating a tarball so that its name has the correct version in it?
It depends on the package, but let's say it's in a version.txt file. Your implication is correct though - if setup.py is parsing that file to calculate the version key, it can also do the same and calculate the download_url value.
-Barry
I'm really just trying to get a sense of your workflow to see if I can make any changes to improve the process for it.

One of the big problems with download_url is that the data in setup.py is used in (and influences the content of) the final dist file. This means that inside of a setup.py you won't know what the hash of the final file is. So it's difficult for a setup.py-based workflow with external urls to provide md5 sums for the files, which means that pip and friends can't verify that nobody modified the download in transit.
On Jun 05, 2013, at 02:47 PM, Donald Stufft wrote:
I'm really just trying to get a sense of your workflow to see if I can make any changes to improve the process for it.
One of the big problems with download_url is that the data in setup.py is used in (and influences the content of) the final dist file. This means that inside of a setup.py you won't know what the hash of the final file is. So it's difficult for a setup.py-based workflow with external urls to provide md5 sums for the files, which means that pip and friends can't verify that nobody modified the download in transit.
Let me explain what I (used to) do, and I'll let you decide whether anything needs to change. ;)

When I've finally got my vcs into a releasable state, I'll generally do:

$ python setup.py sdist upload -s

As you know, this will create the tarball and signature file in dist, and upload everything nicely to the Cheeseshop. At this point, I go to my project's Launchpad page and push the big "I made a release" button. This fiddles some state on my project page, and it allows me to upload files attached to that particular release. The nice thing is that I can just upload the dist/*.tar.gz and dist/*.asc to add the tarball and signature to the Launchpad download page. E.g.

https://launchpad.net/flufl.enum

and

https://launchpad.net/flufl.enum/+download

The url is predictable (which is good because it also has to play nicely with Debian watch files), so with option #3, I just added the index page to download_url and let clients scrape it. You'll see that it contains links to the md5 checksum and the locally generated signature.

There must be some value to also allowing folks to download from Launchpad, as shown by the 1055 downloads of flufl.enum. Where are the PyPI download stats?

-Barry
On Jun 5, 2013, at 3:11 PM, Barry Warsaw <barry@python.org> wrote:
On Jun 05, 2013, at 02:47 PM, Donald Stufft wrote:
I'm really just trying to get a sense of your workflow to see if I can make any changes to improve the process for it.
One of the big problems with download_url is that the data in setup.py is used in (and influences the content of) the final dist file. This means that inside of a setup.py you won't know what the hash of the final file is. So it's difficult for a setup.py-based workflow with external urls to provide md5 sums for the files, which means that pip and friends can't verify that nobody modified the download in transit.
Let me explain what I (used to) do, and I'll let you decide whether anything needs to change. ;)
When I've finally got my vcs into a releasable state, I'll generally do:
$ python setup.py sdist upload -s
As you know, this will create the tarball and signature file in dist, and upload everything nicely to the Cheeseshop. At this point, I go to my project's Launchpad page and push the big "I made a release" button. This fiddles some state on my project page, and it allows me to upload files attached to that particular release. The nice thing is that I can just upload the dist/*.tar.gz and dist/*.asc to add the tarball and signature to the Launchpad download page. E.g.
https://launchpad.net/flufl.enum
and
https://launchpad.net/flufl.enum/+download
The url is predictable (which is good because it also has to play nicely with Debian watch files), so with option #3, I just added the index page to download_url and let clients scrape it. You'll see that it contains links to the md5 checksum and the locally generated signature.
There must be some value to also allowing folks to download from Launchpad, as shown by the 1055 downloads of flufl.enum. Where are the PyPI download stats?
-Barry
Ah ok! I understand what you're trying to do now, thanks :)

Right now download counts are disabled on PyPI due to some issues with the script that integrates them pegging the CPU, and then the CDN. But prior to that, flufl.enum had 28196 downloads from PyPI (total across all versions).

So Launchpad doesn't provide the md5 sums in a way that the tools will be able to process them; however, you actually got lucky in that both your download url and the files themselves are available via verifiable SSL, so they aren't insecure if someone is using pip 1.3+ (and maybe newer easy_install? not sure of the state of SSL outside of pip). I think the downloads you see are either people manually downloading it, or tools that don't prefer PyPI-hosted urls that just happened to pick the Launchpad url.

I think for this the best option is to just continue uploading everything to PyPI and switch to #1 (which I think I saw you did). While Launchpad is verifiable via SSL and is unlikely to have bad uptime, I don't think it provides any benefit for the folks installing your package, so there's not much of a reason to keep it around on your /simple/ page.
On 6 Jun 2013 04:49, "Donald Stufft" <donald@stufft.io> wrote:
On Jun 5, 2013, at 1:49 PM, Barry Warsaw <barry@python.org> wrote:
On Jun 05, 2013, at 12:16 PM, Donald Stufft wrote:
Where are you updating the version information at? And how are you generating a tarball so that its name has the correct version in it?
It depends on the package, but let's say it's in a version.txt file. Your implication is correct though - if setup.py is parsing that file to calculate the version key, it can also do the same and calculate the download_url value.
-Barry
I'm really just trying to get a sense of your workflow to see if I can make any changes to improve the process for it.
One of the big problems with download_url is that the data in setup.py is used in (and influences the content of) the final dist file. This means that inside of a setup.py you won't know what the hash of the final file is. So it's difficult for a setup.py-based workflow with external urls to provide md5 sums for the files, which means that pip and friends can't verify that nobody modified the download in transit.

Hmm, I should mention this problem in PEP 426, and explicitly limit source_url to tarballs and VCS references. This self-referencing problem means it can't easily refer to a built sdist anyway, and the original source is preferred for distro packaging purposes.

Cheers,
Nick.
On Wed, Jun 5, 2013 at 2:47 PM, Donald Stufft <donald@stufft.io> wrote:
One of the big problems with download_url is that the data in setup.py is used in (and influences the content of) the final dist file. This means that inside of a setup.py you won't know what the hash of the final file is. So it's difficult for a setup.py based workflow with external urls to provide md5 sums for the files which means that pip and friends can't verify that no body modified the download in transit.
Not if it's done in a setup.py command that runs after the distributions are built, akin to the way the upload command works now. If there were, say, an "uplink" command based on a modified version of upload, it could call the PyPI API to pass along hashed URLs.

At some point I intend to write such a command so that my current snapshot scripts (which run on the server the downloads are hosted from) can update PyPI with properly hashed URLs. (But I'm not sure when "some point" will be, exactly, so if someone else writes it first I'll be a happy camper.)
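The core of a post-build "uplink" command along the lines PJ sketches would just be hashing the already-built files. A minimal sketch of that step, where the function name and base URL are hypothetical and the actual PyPI API call is omitted:

```python
# After `setup.py sdist` has run, every file in dist/ is final, so its
# hash can be computed and attached to the external download URL as a
# "#md5=..." fragment that installers know how to check.
import hashlib
from pathlib import Path


def hashed_urls(dist_dir="dist", base_url="https://downloads.example.org/"):
    # Yield one hashed download URL per built sdist in dist_dir.
    for dist in sorted(Path(dist_dir).glob("*.tar.gz")):
        digest = hashlib.md5(dist.read_bytes()).hexdigest()
        yield f"{base_url}{dist.name}#md5={digest}"
```

An uplink-style command would then submit these URLs to the "Additional File URLs" list for the release, which sidesteps the self-reference problem Donald describes because the hashes are computed after the dists exist.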
On Jun 5, 2013, at 6:52 PM, PJ Eby <pje@telecommunity.com> wrote:
On Wed, Jun 5, 2013 at 2:47 PM, Donald Stufft <donald@stufft.io> wrote:
One of the big problems with download_url is that the data in setup.py is used in (and influences the content of) the final dist file. This means that inside of a setup.py you won't know what the hash of the final file is. So it's difficult for a setup.py-based workflow with external urls to provide md5 sums for the files, which means that pip and friends can't verify that nobody modified the download in transit.
Not if it's done in a setup.py command that runs after the distributions are built, akin to the way the upload command works now. If there were, say, an "uplink" command based on a modified version of upload, it could call the PyPI API to pass along hashed URLs.
At some point I intend to write such a command so that my current snapshot scripts (which run on the server the downloads are hosted from) can update PyPI with properly hashed URLs. (But I'm not sure when "some point" will be, exactly, so if someone else writes it first I'll be a happy camper.)
With static metadata, ideally PyPI will be reading metadata from inside of the uploaded file, and all that will be required is for publishing tools to push the file up. However, something like your uplink command would (assuming I understand it correctly) work fine, because those "additional urls to list on the /simple/ page" are not part of the package metadata.
I'm interested in the use case. How are you generating a release without running setup.py sdist?

On Jun 4, 2013, at 6:16 PM, Barry Warsaw <barry@python.org> wrote:
Like many of you, I got Donald's message about the changes to URLs for Cheeseshop packages. My question is about the three options; I think I want a middle ground, but I'm interested to see why you will discourage me from that <wink>.
IIUC, option #1 is fine for packages hosted on PyPI. But what if our packages are *also* hosted elsewhere, say for redundancy purposes, and that external location needs to be scraped?
Specifically, say I have a download_url in my setup.py. I *want* that url to be essentially a wildcard or index page because I don't want to have to change setup.py every time I make a release (unless of course `setup.py sdist` did it for me). I also can't add this url to the "Additional File URLs" page for my package because again I'd have to change it every time I do a release.
So the middle ground I think I want is: option #1 plus scraping from download_url, but only download_url.
Am I a horrible person for wanting this? Is there a better way?
Cheers,
-Barry
On 06/04/2013 04:30 PM, Donald Stufft wrote:
I'm interested in the use case. How are you generating a release without running setup.py sdist?
I think you misunderstood (the way Barry described it wasn't completely clear). If I'm reading it correctly, "I don't want to have to change setup.py every time I make a release" means "I don't want to have to change the Download-URL metadata in setup.py to point to a specific new sdist tarball URL every time I make a release; I want it to always point to the same index URL and have that index page scraped for new release tarball URLs." Carl
On Jun 04, 2013, at 04:46 PM, Carl Meyer wrote:
On 06/04/2013 04:30 PM, Donald Stufft wrote:
I'm interested in the use case. How are you generating a release without running setup.py sdist?
I think you misunderstood (the way Barry described it wasn't completely clear). If I'm reading it correctly, "I don't want to have to change setup.py every time I make a release" means "I don't want to have to change the Download-URL metadata in setup.py to point to a specific new sdist tarball URL every time I make a release; I want it to always point to the same index URL and have that index page scraped for new release tarball URLs."
Correct, sorry for not being clear. -Barry
Hi Barry, On 06/04/2013 04:16 PM, Barry Warsaw wrote:
Like many of you, I got Donald's message about the changes to URLs for Cheeseshop packages. My question is about the three options; I think I want a middle ground, but I'm interested to see why you will discourage me from that <wink>.
IIUC, option #1 is fine for packages hosted on PyPI. But what if our packages are *also* hosted elsewhere, say for redundancy purposes, and that external location needs to be scraped?
Specifically, say I have a download_url in my setup.py. I *want* that url to be essentially a wildcard or index page because I don't want to have to change setup.py every time I make a release (unless of course `setup.py sdist` did it for me). I also can't add this url to the "Additional File URLs" page for my package because again I'd have to change it every time I do a release.
So the middle ground I think I want is: option #1 plus scraping from download_url, but only download_url.
Am I a horrible person for wanting this? Is there a better way?
The first question, of course, is "why not just host on PyPI"? If "redundancy" is the real reason, you might think about whether that reason still applies with the new PyPI infrastructure, CDN, etc. But let's presume that whatever your reason for hosting off-PyPI, it's a good one. (To be clear, PEP 438 takes the position that there are and will continue to be some good reasons, and the option of off-PyPI hosting - in some form - should be supported indefinitely.)

The problem with the current system is that client installer tools do the scraping whenever they search for installation candidates for your package, which means that you are asking every user of your package to accept an unnecessary slowdown every single time they install. But the information on that download_url page should only change when you make a release, so the scraping should really be done just once, at release time, and the resulting sdist URL(s) stored on PyPI so that installers can take them into account without fetching or scraping any additional pages.

So the idea is that to satisfy your use-case, there should be a tool that you can use at release time to scrape your downloads page and automatically add sdist URLs found there to the explicit URLs list on PyPI. That tool, of course, doesn't exist yet :-) Until someone builds it, you'll have to stay with option #3 (and accept that you are slowing down installations for your users) to satisfy your use case.

Carl
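The release-time tool Carl describes would scrape the downloads page once and collect the sdist links found there for registration on PyPI. A rough sketch of just the link-extraction step, using only the standard library; fetching the page and the PyPI registration call are omitted, and all names are hypothetical:

```python
# Scrape an index/downloads page for sdist links, resolving relative
# hrefs against the page's URL. A real tool would run this once per
# release and push the results to the "Additional File URLs" list.
from html.parser import HTMLParser
from urllib.parse import urljoin


class SdistLinkParser(HTMLParser):
    """Collect hrefs that look like sdist archives."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        if href.endswith((".tar.gz", ".zip")):
            self.links.append(urljoin(self.base_url, href))


def find_sdists(html, base_url):
    parser = SdistLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Because this runs at release time rather than at install time, only the package author pays the scraping cost, which is exactly the trade-off Carl's message argues for.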
On Jun 4, 2013, at 6:16 PM, Barry Warsaw <barry@python.org> wrote:
Like many of you, I got Donald's message about the changes to URLs for Cheeseshop packages. My question is about the three options; I think I want a middle ground, but I'm interested to see why you will discourage me from that <wink>.
IIUC, option #1 is fine for packages hosted on PyPI. But what if our packages are *also* hosted elsewhere, say for redundancy purposes, and that external location needs to be scraped?
Specifically, say I have a download_url in my setup.py. I *want* that url to be essentially a wildcard or index page because I don't want to have to change setup.py every time I make a release (unless of course `setup.py sdist` did it for me). I also can't add this url to the "Additional File URLs" page for my package because again I'd have to change it every time I do a release.
So the middle ground I think I want is: option #1 plus scraping from download_url, but only download_url.
Am I a horrible person for wanting this? Is there a better way?
Cheers,
-Barry
I was originally on my phone and am now back at the computer, so I can give a longer reply now.

So my first question is as to your actual use case: what are you attempting to achieve by hosting externally instead of on PyPI? It's likely there's a better way, but what problem are you actually attempting to solve? :)

You mention reliability, but as far as I can tell it's basically impossible to add more reliability to the system via external urls. The only method an installation client has to discover your external urls is PyPI, and if PyPI is up to enable them to discover them, then it should also be up to enable them to download directly from PyPI.

Additionally, except in one specific circumstance, it's also a major security issue. Installers can download the /simple/ pages via verified TLS, and then use the hashes on those pages to verify the downloaded files. When you're scraping an external page, the only time that is *safe* to do is if that page is a) served via verified TLS and b) has a supported hash fragment for every single file an installer might attempt to download.

Furthermore, the scraping adds an extreme amount of time to the installation. I recently did basically what pip does, sans downloading the actual packages, across all of PyPI. So I processed every /simple/ page, looked on it for other pages to scrape, and downloaded and scraped those. That process took about 3 days to complete. If I run the same process but simulating a world where everyone was using #1, the process takes about 10 minutes to complete.

The PEP concludes that there are valid reasons to host externally, but I'm of the opinion that if there is a valid reason, it is an extreme edge case and likely would be better solved another way.
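The hash-fragment check Donald refers to can be sketched as follows. This is a simplified stand-in for what installers like pip actually do, not their real implementation, and the function name is made up:

```python
# Verify downloaded bytes against the "#md5=..." fragment carried in a
# link. A link without such a fragment gives the installer nothing to
# check, which is the unsafe case Donald describes for scraped pages.
import hashlib
from urllib.parse import urlparse


def verify_fragment(url, data):
    # Return True only if the URL carries an md5 fragment that matches.
    fragment = urlparse(url).fragment
    if not fragment.startswith("md5="):
        return False  # nothing to verify against
    return hashlib.md5(data).hexdigest() == fragment[len("md5="):]
```

Combined with serving both the page and the files over verified TLS, this is what makes the one "safe" external-hosting circumstance safe.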
On Wed, Jun 5, 2013 at 12:16 AM, Barry Warsaw <barry@python.org> wrote:
But what if our packages are *also* hosted elsewhere, say for redundancy purposes, and that external location needs to be scraped?
If PyPI is down, you won't find that link, so it won't be scraped anyhow.

//Lennart
participants (8)
- Barry Warsaw
- Carl Meyer
- Donald Stufft
- holger krekel
- Lennart Regebro
- Nick Coghlan
- Noah Kantrowitz
- PJ Eby