
Hi,
To get information about available packages, the PyCharm IDE currently parses the PyPI index page (https://pypi.python.org/pypi?%3Aaction=index). As it is going to be deprecated soon, we are looking for a workaround.
What we need is to get the name and the version of all PyPI packages in a single request. We then cache this information in the IDE ( https://github.com/JetBrains/intellij-community/blob/7e16c042a19767d5f548c84... ).
What official API could you advise us to look at?
Any hint is appreciated.
Best regards, Dmitry

On Jul 13, 2016, at 2:43 PM, Dmitry Trofimov dmitry.trofimov@jetbrains.com wrote:
Hi,
To get information about available packages, the PyCharm IDE currently parses the PyPI index page (https://pypi.python.org/pypi?%3Aaction=index). As it is going to be deprecated soon, we are looking for a workaround.
What we need is to get the name and the version of all PyPI packages in a single request. We then cache this information in the IDE (https://github.com/JetBrains/intellij-community/blob/7e16c042a19767d5f548c84f88cc5edd5f9d1721/python/src/com/jetbrains/python/packaging/PyPIPackageUtil.java).
By name and version, do you mean the latest version?
— Donald Stufft

I'm sorry, I should have posted my commentary here, not in the separate thread.
We have some issues with the suggested "/simple" endpoint. Despite the need to scrape the web page, the old endpoint allowed us to quickly find the latest versions of the packages hosted on PyPI. We made a single request on IDE startup and later showed outdated installed packages in the settings. The "/simple" index, however, contains only package names and links to the dedicated pages with their artifacts (not for each of them, though). This means that we now have to make tons of individual requests to find the latest published version of each installed package. Isn't that going to load the service even more?
So, yes, we're most interested in the latest version of a package.
Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig

On Jul 13, 2016, at 3:12 PM, Михаил Голубев qsolo825@gmail.com wrote:
I'm sorry, I should have posted my commentary here, not in the separate thread.
We have some issues with the suggested "/simple" endpoint. Despite the need to scrape the web page, the old endpoint allowed us to quickly find the latest versions of the packages hosted on PyPI. We made a single request on IDE startup and later showed outdated installed packages in the settings. The "/simple" index, however, contains only package names and links to the dedicated pages with their artifacts (not for each of them, though). This means that we now have to make tons of individual requests to find the latest published version of each installed package. Isn't that going to load the service even more?
So, yes, we're most interested in the latest version of a package.
Ok, we don’t currently have an API like that (largely because nobody has come up with a use case that was pressing enough to need to devote resources to it). It was requested though, and is being tracked by https://github.com/pypa/warehouse/issues/347. This is likely enough to pull this issue onto my radar as a sooner-rather-than-later issue.
— Donald Stufft

Does that mean that the PyPI index page will stay available for a while, until the new API is implemented?

On Jul 13, 2016, at 3:40 PM, Dmitry Trofimov dmitry.trofimov@jetbrains.com wrote:
Does that mean that the PyPI index page will stay available for a while, until the new API is implemented?
Yes, though I’m looking at this right now.
I do have a question here though. If I understand the dialog, this is to provide a way for people to upgrade packages they have installed, and to tell them whether there is a newer version or not. So my question here is why do you need the latest version for *every* package instead of just the ones you have installed?
If you narrow it down to just the ones that are installed, then the number of HTTP requests needed with the current APIs goes down from ~80,000 to likely <100 or even <50 in most cases.
— Donald Stufft
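
A minimal sketch of the per-installed-package check described above, assuming the per-project JSON endpoint (`/pypi/<name>/json`) and its `info.version` field are used to read the latest version. The host in `BASE_URL`, the function names, and the injectable `fetch` parameter are illustrative assumptions, not code from PyCharm or PyPI.

```python
import json
import urllib.request

# Assumption: host serving the per-project JSON documents.
BASE_URL = "https://pypi.org/pypi"


def fetch_project_json(name):
    """Fetch the JSON document for one project (one HTTP request per project)."""
    with urllib.request.urlopen(f"{BASE_URL}/{name}/json") as response:
        return json.load(response)


def find_outdated(installed, fetch=fetch_project_json):
    """Return {name: (installed_version, latest_version)} for outdated packages.

    `installed` maps project names to installed version strings. Versions are
    compared as plain strings here; a real tool would parse them properly.
    """
    outdated = {}
    for name, current in installed.items():
        latest = fetch(name)["info"]["version"]
        if latest != current:
            outdated[name] = (current, latest)
    return outdated
```

The number of requests equals the number of installed packages, which is the "<100 or even <50" figure from the message above rather than one request per project on PyPI.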

Right, sorry, that initial question wasn't clear about that.
We need the latest versions only for installed packages. Nonetheless, as you noted, that's still several dozen consecutive requests to "/simple/<package_name>" for each PyCharm session of every user.
Can you handle that?

On Jul 13, 2016, at 4:21 PM, Михаил Голубев qsolo825@gmail.com wrote:
Right, sorry, that initial question wasn't clear about that.
We need the latest versions only for installed packages. Nonetheless, as you noted, that's still several dozen consecutive requests to "/simple/<package_name>" for each PyCharm session of every user.
Can you handle that?
The short answer is yes.
The longer answer is that we have Fastly acting as a CDN in front of PyPI, and serving an item out of the cache in Fastly is essentially free for us in terms of resources (obviously Fastly needs to handle that load, but they’re well equipped to handle much larger loads than we are). Thus, the more cacheable a particular URL on PyPI is (and the longer lived a particular cache item can be), the easier it is for us to scale it.
The URL you’re currently using has a few downsides that prevent it from being cached effectively:
* The URL is a “UI” URL, so it includes information like the currently logged-in user, and thus we need to Vary: Cookie. That makes it less likely to be cached at all, since each unique cookie header adds another response to be cached for that URL, and Fastly will only save ~200 responses per URL before it starts to evict some.
* Similarly, since it’s a “UI” URL, people expect it to update fairly quickly. Because legacy PyPI wasn’t implemented with long-lived caching and purging on updates in mind, it was easier to simply give the cached object a short (5 minute IIRC) TTL rather than a long-lived TTL with purging (as we do for the “API” URLs).
* Responses that act as collections of projects need to be invalidated any time something changes that may invalidate that collection. In an API that lists every project and its latest version, that means it needs to be invalidated any time any project releases a new version.
Compare that to looking at /simple/ and then either accessing /simple/<foo>/ or /pypi/<foo>/json (all of which are cached for long periods of time and purged on demand).
* None of those are “UI” URLs, so they have long cache times and they do not Vary on Cookie.
* For /simple/ we don’t list any versions; we only list the projects themselves. This means that we only need to invalidate this page whenever a brand new project is added to PyPI or an existing project is completely deleted. This occurs far less often than someone releasing a new version of an existing project.
* For /simple/ we don’t need to do any particularly heavy duty querying, it’s a simple select on an ~80k length table (versus a select on an 80k length table, with a join to a 500k length table) and is fairly quick to render.
* For /simple/<foo>/ and /pypi/<foo>/json these are scoped to an individual project, so they can be cached for a very long time and only invalidated when that particular project releases, not when _any_ project releases. This means that the likelihood we can serve one of these out of cache is VERY high.
* For /simple/<foo>/ and /pypi/<foo>/json our SQL queries are relatively quick because they don’t need to operate over the entire table, but only over the records for one single project.
Given all of the above, and the fact that listing every project and its latest version is *slow* and resource intensive, yes, it’s very likely that making the per-project requests instead will be far better for our ability to serve your requests, because the extra requests will almost certainly be served straight from the Fastly caches and never hit our origin servers at all.
— Donald Stufft
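
The invalidation argument above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions (one release per round, rotating through the projects, followed by one read of each page); it does not model Fastly's actual behaviour, and the function and parameter names are made up for illustration.

```python
def simulate(num_projects=10, rounds=20):
    """Compare cache hit rates for one collection page vs. per-project pages.

    Each round, one project publishes a release (round-robin), then a client
    reads the collection page once and every per-project page once.
    """
    collection_cached = False
    project_cached = [False] * num_projects
    hits = {"collection": 0, "per_project": 0}
    reads = {"collection": 0, "per_project": 0}

    for round_no in range(rounds):
        released = round_no % num_projects
        # A release purges the collection page, but only that project's own page.
        collection_cached = False
        project_cached[released] = False

        # Read the "every project and its latest version" page.
        reads["collection"] += 1
        hits["collection"] += collection_cached
        collection_cached = True

        # Read each per-project page.
        for project in range(num_projects):
            reads["per_project"] += 1
            hits["per_project"] += project_cached[project]
            project_cached[project] = True

    return {kind: hits[kind] / reads[kind] for kind in hits}


if __name__ == "__main__":
    print(simulate())
```

In this model the collection page is purged before every read, so it never produces a hit, while each per-project page misses only on the first read after that project's own release.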

On Jul 13, 2016, at 4:21 PM, Михаил Голубев qsolo825@gmail.com wrote:
Can you handle that?
Oh, and just to put things in scale in the past 30 days:
* PyPI has served > 3 billion HTTP requests.
* PyPI has served > 327TB of bandwidth.
* The 95%tile for cache hit vs cache miss is 92%.
* We regularly serve >1,000 concurrent requests - https://s.caremad.io/QDTlK0mRj7/
— Donald Stufft

Ok, you convinced me that these extra requests from PyCharm won't cause you any problems. Impressive stats, by the way :)
We will focus on migrating our packaging-related features to these new endpoints; hopefully, it won't take long. Note, however, that we need to prepare updates for already released versions of PyCharm. We'll let you know as soon as everything is ready.
Ernest W. Durbin III suggested changing the User-Agent, so that it's clear which requests come from PyCharm. To me it seems a fair point.
A batch API, as mentioned by Steve Dower, would be very welcome anyway. Also, the "/simple" index is still an HTML page. Honestly, it's a bit cumbersome that this information can be obtained only by scraping HTML, while for everything else there are the JSON REST API and XML-RPC.
Is anyone from PyPA attending EuroPython next week? We could discuss these matters further there.
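
As a rough sketch of the two points above (identifying the client via User-Agent, and scraping the HTML "/simple" index), something like the following could work with only the Python standard library. The User-Agent string, class name, and function names are hypothetical and not what PyCharm actually sends or uses; the parser assumes project names appear as the text of `<a>` elements.

```python
import urllib.request
from html.parser import HTMLParser

# Hypothetical identifier; the real value would be chosen by the client vendor.
USER_AGENT = "ExampleIDE/1.0 (package-manager)"


class SimpleIndexParser(HTMLParser):
    """Collect the text of every <a> element in a /simple/ index page."""

    def __init__(self):
        super().__init__()
        self.projects = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Only text inside an anchor is treated as a project name.
        if self._in_anchor and data.strip():
            self.projects.append(data.strip())


def parse_simple_index(html):
    """Return the project names listed in the given index HTML."""
    parser = SimpleIndexParser()
    parser.feed(html)
    parser.close()
    return parser.projects


def build_request(url):
    """Build a request that identifies the client through its User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

Passing the result of `build_request(...)` to `urllib.request.urlopen` and decoding the body would give the HTML to feed into `parse_simple_index`.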

On Jul 14, 2016, at 5:30 AM, Михаил Голубев qsolo825@gmail.com wrote:
Ok, you convinced me that these extra requests from PyCharm won't cause you any problems. Impressive stats, by the way :)
We will focus on migrating our packaging-related features to these new endpoints; hopefully, it won't take long. Note, however, that we need to prepare updates for already released versions of PyCharm. We'll let you know as soon as everything is ready.
Ernest W. Durbin III suggested changing the User-Agent, so that it's clear which requests come from PyCharm. To me it seems a fair point.
A batch API, as mentioned by Steve Dower, would be very welcome anyway. Also, the "/simple" index is still an HTML page. Honestly, it's a bit cumbersome that this information can be obtained only by scraping HTML, while for everything else there are the JSON REST API and XML-RPC.
Yea, I plan on a new “next gen” API in Warehouse at some point that will be much cleaner overall and not require multiple different formats to use :). For the record, XML-RPC should be avoided where possible as well, we also can’t cache that in the CDN (because it’s a POST request to the same URL for all routes, and the CDN can’t inspect the body of a POST request to determine cache key).
Is anyone from PyPA attending EuroPython next week? We could discuss these matters further there.
I’m not. I’m not sure if anyone else is.
— Donald Stufft
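
The point about XML-RPC being uncacheable can be sketched with a simplified cache-key function. This is a generic model of a shared HTTP cache, not Fastly's implementation; the function name and the example URLs in the usage note are illustrative assumptions.

```python
def cache_key(method, url, request_headers, vary=()):
    """Return a cache key for a request, or None if it is not cacheable.

    Simplified model: only GET and HEAD are cacheable, and the key is the URL
    plus the values of any request headers named in the response's Vary list.
    The request body is never inspected.
    """
    if method.upper() not in ("GET", "HEAD"):
        # XML-RPC lands here: every call is a POST to the same URL, and the
        # method name and arguments live in the body.
        return None
    varied = tuple((name.lower(), request_headers.get(name, "")) for name in vary)
    return (url, varied)
```

In this model, `cache_key("POST", "/pypi", {})` is `None`, two GETs for a per-project URL share one key regardless of cookies, and adding `vary=("Cookie",)` gives each distinct cookie its own key, which is the fragmentation described earlier in the thread for "UI" URLs.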

I'm also interested (for the same support in Visual Studio) though we're unaffected by this change.
A batch API to get info for many packages would be great. Currently we scrape simple and then post JSON queries for individual packages.
Cheers, Steve
Top-posted from my Windows Phone
-----Original Message----- From: "Михаил Голубев" qsolo825@gmail.com Sent: 7/13/2016 13:04 To: "Donald Stufft" donald@stufft.io Cc: "distutils-sig@python.org" distutils-sig@python.org Subject: Re: [Distutils] PyPI index workaround
Participants (4): Dmitry Trofimov, Donald Stufft, Steve Dower, Михаил Голубев