On May 9, 2014, at 1:28 PM, R. David Murray firstname.lastname@example.org wrote:
On Fri, 09 May 2014 11:39:02 -0400, Donald Stufft email@example.com wrote:
On May 9, 2014, at 9:58 AM, M.-A. Lemburg firstname.lastname@example.org wrote:
On 09.05.2014 13:44, Donald Stufft wrote:
On May 9, 2014, at 4:12 AM, M.-A. Lemburg email@example.com wrote: I snipped the rest of the discussion about reliability, using unmaintained packages, and projects using their own mirrors (which should really be the standard, not an exceptional case), because it's not really leading anywhere:
Using your own mirror shouldn’t be the standard if all you’re doing is automatically updating that mirror. It’s a hack to get around unreliability, and it should be seen as a sign of failure to provide a service that people can rely on; that’s how I see it. People depend on this service, and it’s irresponsible not to treat it as a critical piece of infrastructure.
I don't understand this. Why is it our responsibility to provide a free service for a large project to repeatedly download a set of files they need? Why does it not make more sense for them to download those files once, and only update their local copies when they change? That's almost completely orthogonal to making the service we do provide reliable.
Well, here’s the thing. The large projects repeatedly downloading the same set of files are a canary. If any particular project becomes uninstallable on PyPI (or if PyPI itself goes down), then nobody can install it: neither the people installing things over and over every day, nor the people who just happened to be installing it during that downtime. However, intermittent failures and general instability are going to be noticed sooner and more easily by the projects that install things over and over again, which makes them a much better gauge of what the average “uptime” really is.
IOW, if PyPI goes unavailable for 10 minutes 5 times a day, you might get a handful of “small” installers (i.e. not the big projects) in each downtime, but a different set each time, and each of them is likely to shrug it off and just treat it as the norm even though it’s very disruptive to what they’re doing. The big project, however, is highly likely to hit every single one of those downtimes and be able to say “wow, PyPI is flaky as hell”.
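Putting rough numbers on that hypothetical makes the canary effect concrete. The figures below are taken purely from the 10-minutes-5-times-a-day example above, not from real PyPI data:

```python
# Illustrative arithmetic for the hypothetical above: five 10-minute
# outages per day. Numbers come from the example, not real PyPI data.

MINUTES_PER_DAY = 24 * 60
downtime_per_day = 5 * 10  # five outages of 10 minutes each

# A "small" installer running one short install at a random time has
# roughly this chance of landing inside an outage window:
p_single_install_hits = downtime_per_day / MINUTES_PER_DAY

# A big project whose CI installs continuously hits every outage,
# so over a week it observes all of them:
outages_seen_by_big_project = 5 * 7

print(f"daily downtime: {downtime_per_day} min "
      f"(~{100 * (1 - downtime_per_day / MINUTES_PER_DAY):.1f}% uptime)")
print(f"chance a single random install is affected: "
      f"{p_single_install_hits:.1%}")
print(f"outages a continuously-installing project sees per week: "
      f"{outages_seen_by_big_project}")
```

So each individual small installer sees roughly a 3.5% chance of being bitten and writes it off, while the continuously-installing project witnesses all 35 weekly outages, which is exactly why it makes the better gauge.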
To expand further on that: if we want PyPI to be reliable, rather than working sometimes and not working at others, then we’re aiming for as high an uptime as possible. PyPI gets enough traffic that any single large project isn’t a noticeable drop in our bucket, and due to the way our caching works, it actually helps us be faster and more reliable to have people constantly hitting packages, because those packages stay in cache and can be served without hitting the origin servers.
Just for the record, PyPI gets roughly 350 req/s basically 24/7. In the month of April we served 71.4 TB of data across 877.4 million requests, of which 80.5% never made it to the actual servers that run PyPI and were served directly out of the geo-distributed CDN that sits in front of PyPI. We are vastly better positioned to maintain a reliable infrastructure than to ask every large project that uses Python to do the same.
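Those figures are internally consistent; a quick back-of-the-envelope check using only the numbers quoted above:

```python
# Sanity-check the April figures quoted above: 877.4 million requests,
# 80.5% of them served by the CDN without reaching the origin servers.

total_requests = 877.4e6
cdn_hit_ratio = 0.805
seconds_in_april = 30 * 24 * 3600

avg_req_per_sec = total_requests / seconds_in_april
served_by_cdn = total_requests * cdn_hit_ratio
reached_origin = total_requests - served_by_cdn

# ~339 req/s on average, consistent with "roughly 350 req/s" at peak
print(f"average request rate: {avg_req_per_sec:.0f} req/s")
print(f"served from CDN cache: {served_by_cdn / 1e6:.1f}M requests")
print(f"reached origin servers: {reached_origin / 1e6:.1f}M requests")
```

In other words, only about 171 million of the month's 877 million requests ever touched the origin servers; the CDN absorbed the other ~706 million.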
The reason it’s our responsibility to provide it is that we chose to provide it. There isn’t a moral imperative to run PyPI, but running PyPI badly seems like a crummy thing to do.
For perspective, Gentoo requests that people only do an emerge sync at most once a day, and if they have multiple machines to update, that they only do one pull, and they update the rest of their infrastructure from their local copy.
To be clear, there are other reasons to run a local mirror, but I don’t think it’s reasonable to expect anyone who wants reliable installs with pip to stand up their own infrastructure.
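For teams that do have one of those other reasons, pip already supports pointing at a local mirror via its configuration file; a minimal sketch (the mirror hostname is a placeholder, not a real service):

```ini
; ~/.pip/pip.conf (per-user) -- "mirror.example.internal" is a
; placeholder for your own mirror's hostname.
[global]
index-url = https://mirror.example.internal/simple/
```

The same effect is available per-invocation with `pip install --index-url https://mirror.example.internal/simple/ somepackage`. The point is that this should be an opt-in convenience, not a prerequisite for reliability.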
Further to this point, I’m currently working on adding caching by default to pip, so that we minimize how often different people hit PyPI. It happens automatically, in a way that generally doesn’t require people to think about it, and it doesn’t require them to stand up their own infrastructure.
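The mechanism behind that kind of client-side caching is ordinary HTTP conditional requests (ETag / If-None-Match). A minimal sketch of the idea, with a simulated server standing in for PyPI; none of these names are pip's actual internals:

```python
# Sketch of HTTP-style conditional caching. The "server" is simulated;
# function and variable names here are illustrative, not pip's internals.

cache = {}  # url -> (etag, body)

def fake_server(url, if_none_match=None):
    """Stand-in for PyPI: returns (status, etag, body)."""
    etag, body = '"v1"', b"<simple index page>"
    if if_none_match == etag:
        return 304, etag, None          # Not Modified: no body resent
    return 200, etag, body

def cached_get(url):
    etag = cache[url][0] if url in cache else None
    status, new_etag, body = fake_server(url, if_none_match=etag)
    if status == 304:                   # revalidated: reuse cached body
        return cache[url][1]
    cache[url] = (new_etag, body)       # fresh response: store it
    return body

first = cached_get("https://pypi.example/simple/requests/")   # full download
second = cached_get("https://pypi.example/simple/requests/")  # served from cache
assert first == second == b"<simple index page>"
```

A revalidation costs the server a few headers instead of a full response body, which is why widespread client caching lightens the load on PyPI rather than requiring every consumer to run a mirror.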
As another point of information for comparison: Gentoo downloads files from wherever they are hosted first, and only if that fails falls back to a Gentoo-provided mirror (if I remember correctly... I think the Gentoo mirror copy doesn't always exist?).
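That fallback scheme is easy to picture in code; a hypothetical sketch of the pattern (the fetch functions are stand-ins, not Gentoo's actual tooling):

```python
# Hypothetical sketch of Gentoo-style fetching: try the upstream host
# first, and fall back to a project-provided mirror only on failure.
# Both fetchers are stand-ins, not real tooling.

def fetch_upstream(filename):
    raise IOError("upstream host unreachable")  # simulate a dead upstream

def fetch_mirror(filename):
    # A mirror copy may not always exist; here it does.
    return b"contents of " + filename.encode()

def fetch_with_fallback(filename):
    for source in (fetch_upstream, fetch_mirror):
        try:
            return source(filename)
        except IOError:
            continue  # this source failed; try the next one
    raise IOError(f"all sources failed for {filename}")

data = fetch_with_fallback("foo-1.0.tar.gz")  # falls back to the mirror
```

The design keeps load off the project's own mirrors in the common case while still giving users a second chance when an upstream host disappears.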
Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev
Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA