[Distutils] Fwd: The state of PyPI

Tarek Ziadé ziade.tarek at gmail.com
Tue Sep 27 19:25:11 CEST 2011

On Tue, Sep 27, 2011 at 5:35 PM, Jim Fulton <jim at zope.com> wrote:

>>> I understand where you're coming from but, ..

>> Sorry, I don't understand what you imply here.

> I understand why you don't want to rely on a proprietary solution.

But it's true that I don't want to rely on a proprietary solution.
That's based on a good reason I think, mentioned at the end of this mail.

>> If you're saying that CloudFront is proven technology and that we
>> should not worry about relying on them, then I think we can do better
>> for the community than to get locked in to this, and continue to work
>> on an open protocol where everyone can participate by providing a
>> spare server.  But maybe that's just me ?
> It's nice to have a hobby. :)

I think you've missed what we, a bunch of hobbyists, did in the past two years:

+ 5 community mirrors are up and running, collecting download stats
that get merged
+ pip does work with the mirrors, and offers fallback options

It's too bad you were not there to tell us we were wasting our time
and how awesome CloudFront was ;)

But at this point, the shortest road to a better PyPI is to add the
mirroring support to other clients; pip led the way. And if
zc.buildout uses Distribute, it should get this feature at some point.

But having a CloudFront-based PyPI could also be interesting in
parallel, I am not saying it's not. But the project is stalled, and
has the drawbacks I've mentioned.

> But I don't want to have to update buildout *just* because of an itch
> to have a custom protocol.

I kind of wonder how hard it would be to have a standalone pypi
download client, ripped off from python 3.3's packaging, so you would
not have to worry about this.

And, well, you do not sound like you want to spend time in these
matters in any case, so if someone brings a patch I hope you will not
refuse it.

>> But the use case is usually: PyPI is down, we fallback to a mirror. I
>> don't think it's more complicated than this.
> I don't agree.  On multiple levels.  PYPI is often up but slow.

That's an orthogonal issue: any server can be slow.

A better way to drastically speed up buildout is to download /
build stuff in parallel imo.
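
To illustrate the parallel download idea, here is a minimal sketch using
the standard library's thread pool. It is not how buildout works; the
fetch_all name and the fetch callable (one URL in, its content out) are
assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Run the hypothetical fetch(url) callable for every URL concurrently.

    Returns a dict mapping each URL to whatever fetch returned for it.
    An exception raised by fetch propagates to the caller.
    """
    urls = list(urls)  # we iterate twice: once to submit, once to pair up
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        results = list(pool.map(fetch, urls))
    return dict(zip(urls, results))
```

Threads are enough here because downloads are I/O bound; building
distributions in parallel would more likely need separate processes.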

> It's also in the wrong place.  A CDN should provide better performance,
> reliability and locality.

Locality is indeed important, and picking up the nearest server is great.
Reliability is also solved by the mirrors.

> A client has to:
> - try pypi
> - fallback to "last"
> - If that's down, decide what other indexes to check
> I don't see how having timestamps help unless you know
> what the current timestamp is, unless you say that you'll reject
> a mirror with a timestamp more than some period in the past.

How hard is it to make those decisions ?

Do you really think getting the current timestamp is that hard ?

And the mirror timestamp,


In all you've said I fail to see what is so complicated about it, or so long to do.

The ordering I see is:

normal behavior:
- if the cache is too old: get the list of mirrors  (-> the list of
mirrors and their timestamps get cached)
- pick the closest one
- use it

the server times out:
- try the "next closest"
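
The ordering above can be sketched roughly as follows. This is only an
illustration, not the code in packaging's mirrors.py: the MirrorSelector
class, the fetch_mirrors and get callables, and the one-hour cache age
are all assumptions made for the example.

```python
import time

CACHE_MAX_AGE = 3600  # seconds; hypothetical refresh period for the list


class MirrorSelector:
    """Cache the mirror list and walk it from closest to farthest."""

    def __init__(self, fetch_mirrors, max_age=CACHE_MAX_AGE, clock=time.time):
        # fetch_mirrors() is a hypothetical callable returning
        # [(hostname, latency_in_seconds), ...]
        self._fetch = fetch_mirrors
        self._max_age = max_age
        self._clock = clock
        self._mirrors = []
        self._fetched_at = None

    def mirrors(self):
        """Return hostnames, closest first, refreshing a stale cache."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at > self._max_age:
            self._mirrors = sorted(self._fetch(), key=lambda m: m[1])
            self._fetched_at = now
        return [host for host, _ in self._mirrors]

    def download(self, path, get):
        """Try each mirror in order; move on to the next one on a timeout.

        get(host, path) is a hypothetical callable that raises
        TimeoutError when the server does not answer in time.
        """
        last_error = None
        for host in self.mirrors():
            try:
                return get(host, path)
            except TimeoutError as exc:
                last_error = exc
        raise RuntimeError("all mirrors timed out") from last_error
```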

> It's not clear what this time delta should be and, in any case,
> the client needs to first validate a mirror by checking its timestamp.

This is the job of the client, yes. An option that says: discard
mirrors that are more than 1 day old, or 5 hours, etc.

Keeping a local cache that gets updated eventually is sufficient.
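
Such an option could look like the sketch below. It assumes the client
already knows each mirror's last synchronization time; the fresh_mirrors
name and the shape of its input are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def fresh_mirrors(last_sync, max_age=timedelta(days=1), now=None):
    """Keep the mirrors synced within max_age, most recent first.

    last_sync is a hypothetical mapping of hostname to a timezone-aware
    datetime of the mirror's last synchronization.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    kept = [(ts, host) for host, ts in last_sync.items() if now - ts <= max_age]
    return [host for ts, host in sorted(kept, reverse=True)]
```

The "current timestamp" here is just the client's own clock, so the
check needs no extra request beyond reading the mirror's timestamp.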

> I think this protocol is going to be hard to get right.

Maybe ? But if a v1 allows us to switch from server 1 being down to
server 2, it's already a success, no ?

Servers that *we*, the community, manage.

>>> - It either requires extra dns calls or relies too heavily on the last
>>> mirror, which is probably likely to be the least reliable.
>> Once you have the list, I don't think you require extra call.
>> see http://hg.python.org/cpython/file/84280fac98b9/Lib/packaging/pypi/mirrors.py
> It has to make extra dns calls to resolve the other mirror names to ips.

Yeah, once per session. But in any case, this is not a decision you're
making on every download. It's something you do when you start to
download stuff, and/or when a server times out.

You stick with a server once it's working.
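
A rough sketch of that per-session behavior, assuming one DNS lookup per
mirror name and a switch only after a timeout. The MirrorSession class
and its method names are invented for this example; only
socket.gethostbyname is a real standard library call.

```python
import socket


class MirrorSession:
    """Resolve each mirror name once and stick with the mirror that works."""

    def __init__(self, resolve=socket.gethostbyname):
        self._resolve = resolve  # injectable so the class can be tested offline
        self._ips = {}           # hostname -> IP address, filled lazily
        self._down = set()       # mirrors that timed out during this session
        self.current = None      # mirror in use for this session

    def ip_for(self, host):
        if host not in self._ips:
            self._ips[host] = self._resolve(host)  # one DNS call per host
        return self._ips[host]

    def pick(self, candidates):
        # Keep the current mirror as long as it has not been marked down.
        if self.current is not None:
            return self.current
        for host in candidates:
            if host in self._down:
                continue
            try:
                self.ip_for(host)
            except OSError:  # socket.gaierror is a subclass of OSError
                continue
            self.current = host
            return host
        raise LookupError("no mirror could be resolved")

    def mark_down(self, host):
        # Called on a timeout so that the next pick() moves to another mirror.
        self._down.add(host)
        if self.current == host:
            self.current = None
```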

>>> Life is short. We don't have to invent this ourselves.
>> Ah well, yeah -- Not sure what you are proposing right now.
>> If you imply that everything should be solved on server-side, and that
>> we should not have mirroring
> I think we should pick a good CDN and use it.

I won't object, because this is orthogonal to the mirroring stuff, but
I am not going to scrap the mirroring efforts to move PyPI to a
single shop.

Every service on the planet, even Amazon, can be down.

oh, my:

- https://forums.aws.amazon.com/message.jspa?messageID=244986
- http://money.cnn.com/2011/04/22/technology/amazon_ec2_cloud_outage/index.htm.
- http://www.labnol.org/internet/amazon-s3-cloudfront-down/5667/
- https://forums.aws.amazon.com/message.jspa?messageID=134012
- https://forums.aws.amazon.com/message.jspa?messageID=177654

Do we really want Amazon to handle PyPI ?

I prefer a bunch of community mirrors. Heck, I have one at Mozilla,
and might make it public one day  :)

Or maybe the optimal solution is our own CDN proxy so we don't deal
with this on client side.

<music in the background with trumpets, a flag with the Python logo
raises, slowly>

But in any case, I'd rather have a Pythoneer from our community behind
every mirror server, someone I can trust.



Tarek Ziadé | http://ziade.org
