[Distutils] PyPI index workaround

Donald Stufft donald at stufft.io
Wed Jul 13 16:46:57 EDT 2016


> On Jul 13, 2016, at 4:21 PM, Михаил Голубев <qsolo825 at gmail.com> wrote:
> 
> Right, sorry, that initial question wasn't clear about that. 
> 
> We need the latest versions only for installed packages. Nonetheless, as you noted, it's still several dozen consecutive requests to "/simple/<package_name>" for each PyCharm session of every user.
> 
> Can you handle that?


The short answer is yes.

The longer answer is that we have Fastly acting as a CDN in front of PyPI, and serving an item out of Fastly’s cache is essentially free for us in terms of resources (obviously Fastly needs to handle that load, but they’re well equipped to handle much larger loads than we are). Thus, the more cacheable a particular URL on PyPI is, and the longer a cached item for it can live, the easier it is for us to scale that URL.

The URL you’re currently using has a few downsides that prevent it from being cached effectively:

* The URL is a “UI” URL, so it includes information like the currently logged-in user, and thus we need to send Vary: Cookie. That makes it less likely to be cached at all, since each unique Cookie header adds another response to be cached for that URL, and Fastly will only save ~200 responses per URL before it starts to evict some.

* Similarly, since it’s a “UI” URL, people expect it to update fairly quickly. Because legacy PyPI wasn’t implemented with long-lived caching and purging on updates in mind, it was easier to give the cached object a short TTL (5 minutes, IIRC) than to use a long-lived TTL with purging (as we do for the “API” URLs).

* Responses that act as collections of projects need to be invalidated any time something changes that may invalidate that collection. For an API that lists every project and its latest version, that means invalidating it any time any project releases a new version.
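
To make the Vary: Cookie point concrete, here is a toy model of a cache that stores one variant per distinct cookie, capped at roughly 200 variants per URL as mentioned above. It is only a sketch: the class, the URL "/some-ui-page", and the least-recently-used eviction are all assumptions made for illustration, not a description of how Fastly actually works.

```python
from collections import OrderedDict

MAX_VARIANTS = 200  # approximate per-URL limit quoted in the message


class VaryingCache:
    """Toy cache that keeps one stored response per (URL, cookie) pair."""

    def __init__(self, max_variants=MAX_VARIANTS):
        self.max_variants = max_variants
        self.variants = {}  # url -> OrderedDict mapping cookie -> body
        self.hits = 0
        self.misses = 0

    def get(self, url, cookie, vary_on_cookie):
        # Without "Vary: Cookie", every client shares a single variant.
        key = cookie if vary_on_cookie else None
        per_url = self.variants.setdefault(url, OrderedDict())
        if key in per_url:
            per_url.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return per_url[key]
        self.misses += 1  # this request would go to the origin
        body = f"response for {url}"
        per_url[key] = body
        if len(per_url) > self.max_variants:
            per_url.popitem(last=False)  # assumed policy: evict the oldest
        return body


def simulate(vary_on_cookie, users=1000, rounds=2):
    """Have every user request the same URL once per round."""
    cache = VaryingCache()
    for _ in range(rounds):
        for user in range(users):
            cache.get("/some-ui-page", f"session={user}", vary_on_cookie)
    return cache.hits, cache.misses
```

In this model, simulate(True) returns (0, 2000), so every request reaches the origin, while simulate(False) returns (1999, 1), so only the very first one does.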

Compare that to looking at /simple/ and then accessing either /simple/<foo>/ or /pypi/<foo>/json (all of which are cached for long periods of time and purged on demand):

* None of those are “UI” URLs, so they have long cache times and they do not Vary on Cookie.

* For /simple/ we don’t list any versions; we only list the projects themselves. This means that we only need to invalidate this page when a brand new project is added to PyPI or an existing project is completely deleted. This occurs far less often than an existing project making a release.

* For /simple/ we don’t need to do any particularly heavy-duty querying. It’s a simple select on an ~80k-row table (versus a select on an 80k-row table with a join to a 500k-row table) and is fairly quick to render.

* /simple/<foo>/ and /pypi/<foo>/json are scoped to an individual project, so they can be cached for a very long time and only invalidated when that particular project releases, not when _any_ project releases. This means that the likelihood we can serve one of these out of cache is VERY high.

* For /simple/<foo>/ and /pypi/<foo>/json, our SQL queries are relatively quick because they don’t need to operate over the entire table, only over the records for a single project.

Given all of the above, and the fact that listing every project and its latest version is *slow* and resource intensive, yes, it’s very likely that making the per-project requests instead will be far better for our ability to serve you, because the extra requests will almost certainly be served straight from the Fastly caches and never hit our origin servers at all.
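
For illustration, the per-project approach could look roughly like the sketch below, which makes one /pypi/<foo>/json request per installed project and reads the latest version from the "info" section of the response. The host in BASE_URL and the helper names are assumptions of mine (the paths above don’t include a host), so treat this as a starting point rather than a recommended client.

```python
import json
from urllib.request import urlopen

# Assumption: the host is not given in this message; adjust as needed.
BASE_URL = "https://pypi.org"


def json_url(project):
    """Build the per-project JSON URL, /pypi/<project>/json."""
    return f"{BASE_URL}/pypi/{project}/json"


def fetch_json(url):
    """Fetch and decode a JSON document (performs a real network request)."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def latest_versions(projects, fetch=fetch_json):
    """Map each project name to the latest version reported for it.

    One request is made per project. `fetch` is injectable so the
    function can be exercised without network access.
    """
    result = {}
    for name in projects:
        data = fetch(json_url(name))
        result[name] = data["info"]["version"]
    return result
```

Calling latest_versions(["requests", "pip"]) issues two small requests, each of which is scoped to a single project and so is very likely to be answered from cache.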

—
Donald Stufft


