[Python-Dev] pip: cdecimal an externally hosted file and may be unreliable [sic]

Donald Stufft donald at stufft.io
Fri May 9 22:38:27 CEST 2014


On May 9, 2014, at 4:20 PM, Terry Reedy <tjreedy at udel.edu> wrote:

> On 5/9/2014 2:12 PM, Donald Stufft wrote:
>> 
>> On May 9, 2014, at 1:28 PM, R. David Murray <rdmurray at bitdance.com> wrote:
> 
>>> I don't understand this.  Why is it our responsibility to provide a
>>> free service for a large project to repeatedly download a set of files
>>> they need?  Why does it not make more sense for them to download them
>>> once, and only update their local copies when they change?  That's almost
>>> completely orthogonal to making the service we do provide reliable.
>> 
>> Well here’s the thing, right: the large projects repeatedly downloading the
>> same set of files is a canary. If any particular project goes uninstallable on
>> PyPI (or if PyPI itself goes down) then nobody can install it, neither the people
>> installing things over and over every day nor the people who just happened
>> to be installing it during that downtime. However, intermittent failures and
>> general instability are going to be noticed much sooner by the projects who
>> install things over and over again, and thus it becomes a lot easier
>> to use them as a general gauge for what the average “uptime” is.
> 
> I have had the same question as David, so I also appreciate your answer.
> 
>> IOW if PyPI goes unavailable for 10 minutes 5 times a day, you might get
>> a handful of “small” installers (i.e. not the big projects) in each downtime,
>> but a different set each time, who are likely to shrug it off and just treat it as the
>> norm even though it’s very disruptive to what they’re doing. However the
>> big project is highly likely to hit every single one of those downtimes and
>> be able to say “wow, PyPI is flaky as hell”.
>> 
>> To expand further on that, if we assume that we want ``pip install <foo>``
>> to be reliable, rather than working sometimes and failing at other times, then
>> we’re aiming for as high an uptime as possible. PyPI gets enough traffic that
>> any single large project isn’t a noticeable drop in our bucket, and due to the
>> way our caching works it actually helps us to be faster and more reliable
>> to have people constantly hitting packages, because they’ll be in cache
>> and able to be served without hitting the origin servers.
>> 
>> Just for the record, PyPI gets roughly 350 req/s basically 24/7. In the
>> month of April we served 71.4TB of data across 877.4 million requests, of
>> which 80.5% never made it to the actual servers that run PyPI and were
>> served directly out of the geo-distributed CDN that sits in front of PyPI. We
>> are vastly better positioned to maintain a reliable infrastructure than to ask
>> every large project that uses Python to do the same.
> 
>> The reason it’s our responsibility to provide it is that we chose
>> to provide it. There isn’t a moral imperative to run PyPI, but running PyPI
>> badly seems like a crummy thing to do.
> 
> Agreed.
> 
>>> For perspective, Gentoo requests that people only do an emerge sync at
>>> most once a day, and if they have multiple machines to update, that they
>>> only do one pull, and they update the rest of their infrastructure from
>>> their local copy.
>> 
>> To be clear, there are other reasons to run a local mirror, but I don’t think
>> it’s reasonable to expect anyone who wants a reliable install with pip to
>> stand up their own infrastructure.
> 
> Ok, you are not saying that caching is bad, but that having everyone reinvent caching, and possibly doing it badly, or at least not in the best way, is bad.

Yeah, caching isn’t a bad thing in general, and in fact PyPI uses it heavily. All
access to /simple/ and /packages/ is cached for 24 hours by our CDN unless
someone uploads or deletes a file, in which case we selectively purge those
URLs from the CDN cache so that users (nearly) instantly get the updated
results.

Warehouse (aka PyPI 2.0) is designed to utilize our CDN cache even further,
and I’m hoping to get our cache hit rate even higher with it.
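
For illustration, the selective purge is conceptually something like this (a
minimal sketch, not the actual PyPI code; it assumes a CDN that accepts HTTP
PURGE requests on individual URLs, and the base URL and helper name are made
up):

    # Minimal sketch (not the actual PyPI code): purge the affected URLs from
    # a CDN that accepts HTTP PURGE requests on individual URLs.
    import requests

    def purge_project_urls(project_name, filenames,
                           cdn_base="https://pypi.python.org"):
        """Purge the simple index page and the package files for a project."""
        urls = ["%s/simple/%s/" % (cdn_base, project_name)]
        urls += ["%s/packages/%s" % (cdn_base, fn) for fn in filenames]
        for url in urls:
            # PURGE is a non-standard HTTP method; requests will send it as-is.
            resp = requests.request("PURGE", url)
            resp.raise_for_status()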

> 
>> Further to this point, I’m currently working on adding caching by default
>> to pip so that we minimize how often different people hit PyPI, and we do it
>> automatically, in a way that doesn’t generally require people to think about
>> it and also doesn’t require them to stand up their own infra.
> 
> This seems like the right solution. It would sort of make each machine a micro-CDN node.

Yes, it’s bog-standard HTTP stuff, just like a browser does it. The major difference is that we’re
limiting the maximum lifetime of a cache item in the client (pip) for the index pages, but
we are not doing that for the package files themselves. This is done to prevent a misconfigured
server from causing pip to not see new versions for hours/days/weeks/years/whatever.
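
The freshness check is conceptually something like the following (a sketch of
the idea only, not the code going into pip; the 10-minute cap is a made-up
number):

    # Sketch of the idea only (not pip's actual code); INDEX_MAX_AGE is a
    # made-up value.
    import time

    INDEX_MAX_AGE = 600  # hypothetical client-side cap for /simple/ pages

    def is_fresh(entry, is_index_page):
        """entry: dict with 'stored_at' (epoch seconds) and the
        server-declared 'max_age' from its Cache-Control header."""
        allowed = entry["max_age"]
        if is_index_page:
            # A misconfigured server can't pin an index page in our cache for
            # hours/days/weeks; package files never change once published, so
            # no cap is needed for them.
            allowed = min(allowed, INDEX_MAX_AGE)
        return (time.time() - entry["stored_at"]) < allowed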

Additionally, this change makes pip smarter about HTTP requests: if
we have a stale item in the cache which has a Last-Modified or an ETag header, we’ll do
a conditional GET, which will hopefully be answered with an HTTP 304 Not Modified, so that
we can simply refresh the stale item in the cache and use it again instead of needing to
download the entire response body again.
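
That revalidation is conceptually along these lines (a minimal sketch with a
hypothetical cache-entry format, not the code actually going into pip):

    # Minimal sketch of the conditional GET (hypothetical cache-entry format,
    # not the code actually going into pip).
    import requests

    def refresh(url, cached):
        """cached: dict that may hold 'etag', 'last_modified', and 'body'."""
        headers = {}
        if cached.get("etag"):
            headers["If-None-Match"] = cached["etag"]
        if cached.get("last_modified"):
            headers["If-Modified-Since"] = cached["last_modified"]
        resp = requests.get(url, headers=headers)
        if resp.status_code == 304:
            return cached["body"]  # not modified: reuse the cached body
        cached.update(body=resp.content,
                      etag=resp.headers.get("ETag"),
                      last_modified=resp.headers.get("Last-Modified"))
        return resp.content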

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
