On May 9, 2014, at 4:20 PM, Terry Reedy email@example.com wrote:
On 5/9/2014 2:12 PM, Donald Stufft wrote:
On May 9, 2014, at 1:28 PM, R. David Murray firstname.lastname@example.org wrote:
I don't understand this. Why it is our responsibility to provide a free service for a large project to repeatedly download a set of files they need? Why does it not make more sense for them to download them once, and only update their local copies when they change? That's almost completely orthogonal to making the service we do provide reliable.
Well, here’s the thing. The large projects repeatedly downloading the same set of files act as a canary. If any particular project becomes uninstallable on PyPI (or if PyPI itself goes down) then nobody can install it: neither the people installing things over and over every day, nor the people who just happened to be installing it during that downtime. However, intermittent failures and general instability are going to be noticed sooner and more easily by the projects that install things over and over again, so it becomes a lot easier to use them as a general gauge for what the average “uptime” is.
I have had the same question as David, so I also appreciate your answer.
IOW, if PyPI goes unavailable for 10 minutes 5 times a day, you might get a handful of “small” installers (i.e. not the big projects) in each downtime, but a different set each time, who are likely to shrug it off and just treat it as the norm even though it’s very disruptive to what they’re doing. However, the big project is highly likely to hit every single one of those downtimes and be able to say “wow, PyPI is flaky as hell”.
To expand further on that: if we assume that we want to be reliable, and not work sometimes and fail at other times, then we’re aiming for as high an uptime as possible. PyPI gets enough traffic that any single large project isn’t a noticeable drop in the bucket, and due to the way our caching works it actually helps us to be faster and more reliable to have people constantly hitting packages, because they’ll be in cache and able to be served without hitting the origin servers.
Just for the record, PyPI gets roughly 350 req/s basically 24/7, in the month of April we served 71.4TB of data with 877.4 million requests of which 80.5% never made it to the actual servers that run PyPI and were served directly out of the geo distributed CDN that sits in front of PyPI. We are vastly better positioned to maintain a reliable infrastructure than ask that every large project that uses Python to do the same.
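As a quick sanity check on those figures, here’s the arithmetic they imply (the month-length assumption of 30 days is mine, not from the original message):

```python
# Back-of-the-envelope check on the April numbers quoted above:
# 877.4 million requests, 80.5% served straight from the CDN.
total_requests = 877_400_000
cdn_hit_rate = 0.805

served_by_cdn = total_requests * cdn_hit_rate      # requests absorbed by the CDN
reached_origin = total_requests - served_by_cdn    # requests that hit PyPI itself

seconds_in_april = 30 * 86_400                     # assuming a 30-day month
avg_req_per_s = total_requests / seconds_in_april  # ~338, consistent with "roughly 350 req/s"
```

The average works out to about 338 req/s, which lines up with the “roughly 350 req/s basically 24/7” figure, and roughly 171 million requests per month actually reaching the origin servers.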
The reason that it’s our responsibility for providing it is because we chose to provide it. There isn’t a moral imperative to run PyPI, but running PyPI badly seems like a crummy thing to do.
For perspective, Gentoo requests that people only do an emerge sync at most once a day, and if they have multiple machines to update, that they only do one pull, and they update the rest of their infrastructure from their local copy.
To be clear, there are other reasons to run a local mirror but I don’t think that it’s reasonable to expect anyone who wants a reliable install using pip to stand up their own infrastructure.
Ok, you are not saying that caching is bad, but that having everyone reinvent caching, and possibly doing it badly, or at least not in the best way, is bad.
Yea, caching isn’t in general a bad thing, and actually PyPI uses it heavily. All access to /simple/ and /packages/ is cached for 24 hours by our CDN unless someone uploads or deletes a file, in which case we selectively purge those URLs from the CDN cache so that they (nearly) instantly get the updated results.
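The selective-purge idea can be sketched as a small helper that maps an upload or deletion to the exact CDN URLs that need invalidating. The URL layout mirrors PyPI’s /simple/ and /packages/ paths, but the `purge_urls` function itself is a hypothetical illustration, not PyPI’s actual purge code:

```python
def purge_urls(project: str, filename: str, packages_path: str) -> list:
    """Return the CDN URLs to purge when `filename` for `project` is
    uploaded or deleted, so clients (nearly) instantly see fresh results.
    Every other cached URL stays cached for its full 24 hours."""
    return [
        # the index page that lists the project's files
        "/simple/%s/" % project.lower(),
        # the package file itself
        "/packages/%s/%s" % (packages_path, filename),
    ]

# Example: uploading requests-2.0.tar.gz invalidates only its index
# page and its download URL.
urls = purge_urls("Requests", "requests-2.0.tar.gz", "source/r/requests")
```

The point of purging by URL rather than flushing the whole cache is that a single upload invalidates two entries while the other hundreds of thousands of cached pages keep serving from the CDN.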
Warehouse (aka PyPI 2.0) is designed to utilize our CDN cache even further and I’m hoping to get our cache rate even higher using it.
Further to this point here I’m currently working on adding caching by default for pip so that we minimize how often different people hit PyPI and we do it automatically and in a way that doesn’t generally require people to think about it and that also doesn’t require them to stand up their own infra.
This seems like the right solution. It would sort of make each machine a micro-CDN node.
Yes, it’s bog standard HTTP stuff just like a browser does it. The major difference is we’re limiting the maximum lifetime of a cache item in the client (pip) for the index pages but we are not doing that for the package files themselves. This is done to prevent a misconfigured server from causing pip to not see new versions for hours/days/weeks/years/whatever.
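The client-side cap on index-page lifetime can be sketched in a few lines. This is an illustration of the idea rather than pip’s actual code, and the 600-second cap is an assumed value, not one stated above:

```python
INDEX_MAX_AGE = 600  # hypothetical client-side cap for index pages, in seconds

def effective_max_age(server_max_age: int, is_index_page: bool) -> int:
    """Honor the server's Cache-Control max-age, but cap it for index
    pages so a misconfigured server can't hide new releases for
    hours/days/weeks/years."""
    if is_index_page:
        return min(server_max_age, INDEX_MAX_AGE)
    # Package files never change once published, so the server's
    # lifetime can be trusted as-is.
    return server_max_age
```

So a server claiming `max-age=31536000` (one year) on /simple/ pages would still only be cached for ten minutes, while the same header on a package file is honored in full.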
Additionally this change also makes pip smarter about HTTP requests: if we have a stale item in the cache which has a Last-Modified or an ETag header, we’ll do a conditional GET, which will hopefully be answered with an HTTP 304 Not Modified so that we can simply refresh the stale item in the cache and use it again instead of needing to download an entire response body again.
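The conditional-GET flow described above looks roughly like this. The `CachedResponse`, `make_conditional_headers`, and `handle_response` names are hypothetical stand-ins, not pip’s real internals; only the HTTP mechanics (If-None-Match / If-Modified-Since and 304 handling) are standard:

```python
from dataclasses import dataclass

@dataclass
class CachedResponse:
    body: bytes
    etag: str = None
    last_modified: str = None

def make_conditional_headers(cached: CachedResponse) -> dict:
    """Build validator headers from a stale cache entry."""
    headers = {}
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers

def handle_response(cached: CachedResponse, status: int, body: bytes) -> bytes:
    """On 304 Not Modified, reuse the cached body without downloading
    the response body again; otherwise store and use the new body."""
    if status == 304:
        return cached.body  # validators matched: cache entry is still good
    cached.body = body
    return body
```

A stale entry with an ETag thus costs one small round trip to revalidate, rather than a full re-download of the index page or package file.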
Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA