[Distutils] Fwd: The state of PyPI

Tue Sep 27 20:31:49 CEST 2011

On Tue, Sep 27, 2011 at 1:25 PM, Tarek Ziadé <ziade.tarek at gmail.com> wrote:
> On Tue, Sep 27, 2011 at 5:35 PM, Jim Fulton <jim at zope.com> wrote:

...

>> But I don't want to have to update buildout *just* because of an itch
>> to have a custom protocol.
>
> I kind of wonder how hard it would be to have a standalone pypi
> download client, ripped off from python 3.3's packaging, so you would
> not have to worry about this.

I doubt I'm going to be able to avoid worrying about it.

Still a reference client implementation would be useful.

> And, well, you do not sound like you want to spend time in these
> matters in any case,

I don't know what you mean.  Not sure I care. :)

> so if someone brings a patch I hope you will not
> refuse it.

No.  I'll eventually implement it if no one else does.

>>> But the use case is usually: PyPI is down, we fallback to a mirror. I
>>> don't think it's more complicated than this.
>>
>> I don't agree.  On multiple levels.  PyPI is often up but slow.
>
> That's an orthogonal issue :  any server can be slow.

A service can be fast even if an individual server is slow.
Also, CDNs can make lots of horsepower available that
is shared among multiple customers.  I really doubt that
anything we build will be faster.

> One better way to drastically speed up buildout is to  download /
> build stuff in parallel imo.

That's true and something I'd like to do at some point. That's one of
the reasons I expect I'll have to worry about the protocol.

>
>> It's also in the wrong place.  A CDN should provide better performance,
>> reliability and locality.
>
> Locality is indeed important, and picking up the nearest server is great.
> Reliability is also solved by the mirrors.

At the expense of increased complexity on the client.

>>
>> A client has to:
>>
>> - try pypi
>> - fallback to "last"
>> - If that's down, decide what other indexes to check
>>
>> I don't see how having timestamps help unless you know
>> what the current timestamp is, unless you say that you'll reject
>> a mirror with a timestamp more than some period in the past.
>
> How hard is it to make those decisions ?

It's not "hard" conceptually, but it's still a lot of
implementation complexity and a lot of extra network
requests.
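
A minimal sketch of the fallback logic being discussed, to make the
per-run cost concrete. The function and parameter names are hypothetical,
and `is_up` and `is_fresh` stand in for checks that would each cost at
least one network request; this is an illustration, not buildout's or
packaging's implementation.

```python
def choose_index(primary, mirrors, is_up, is_fresh):
    """Return the first usable index URL, or None if nothing qualifies.

    primary  -- URL of the main index (tried first)
    mirrors  -- mirror URLs, in the order the client prefers them
    is_up    -- callable(url) -> bool, hypothetical reachability check
    is_fresh -- callable(url) -> bool, hypothetical staleness check
    """
    if is_up(primary):
        return primary
    for mirror in mirrors:
        # Two checks per mirror: reachable, and not too far out of date.
        if is_up(mirror) and is_fresh(mirror):
            return mirror
    return None
```

In the worst case this costs one request for the main index plus up to
two per mirror before a single package is downloaded.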

> Do you really think getting the current timestamp is that hard ?
>
> And the mirror timestamp,
>
>  http://b.pypi.python.org/last-modified
>
> In all you've said I fail to see how complicated it is, or long to do.

That's an extra HTTP request I need to make when I'm
considering use of a mirror.  If the first mirror I check seems to
be out of date, I may need to check all the mirrors.  It's an open
question what should be considered potentially out of date, a
timestamp older than an hour? a day?
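
The staleness check itself could look roughly like the sketch below. The
timestamp format is an assumption (adjust `TIMESTAMP_FORMAT` to whatever
the mirror actually serves), and `max_age` is exactly the open question
above: the code cannot pick that value for you.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the /last-modified body is a UTC timestamp in this format.
TIMESTAMP_FORMAT = "%Y%m%dT%H:%M:%S"


def is_stale(last_modified_text, max_age, now=None):
    """Return True if the mirror's timestamp is older than max_age.

    last_modified_text -- raw body fetched from <mirror>/last-modified
    max_age            -- datetime.timedelta; the cutoff is a policy choice
    now                -- current UTC time (injectable for testing)
    """
    synced = datetime.strptime(last_modified_text.strip(), TIMESTAMP_FORMAT)
    synced = synced.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now - synced > max_age
```

The same mirror is rejected with `max_age=timedelta(hours=1)` and accepted
with `max_age=timedelta(days=1)` if it last synced six hours ago.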

> The ordering I see is:
>
> normal behavior:
> - if the cache is too old:

How old is too old?

> get the list of mirrors  (-> the list of
> mirrors and their timestamps get cached)

They'll only get cached for the program invocation.
This means I have to potentially check lots of mirrors
every time someone runs buildout.  I can reduce latency
by doing this in parallel, but that's still a lot of requests.
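
A rough sketch of the parallel variant, using only the standard library.
`fetch_last_modified` and `check_mirrors` are hypothetical names, and the
`/last-modified` path is taken from the URL quoted above. It hides the
latency, not the number of requests: there is still one per mirror on
every invocation.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_last_modified(mirror_url, timeout=5):
    """Fetch the raw /last-modified body from one mirror (one HTTP request)."""
    url = mirror_url.rstrip("/") + "/last-modified"
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


def check_mirrors(mirror_urls, fetch=fetch_last_modified, max_workers=8):
    """Return {mirror_url: timestamp text, or None if the request failed}."""
    mirror_urls = list(mirror_urls)

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            # Treat any failure (timeout, DNS, HTTP error) as "mirror unusable".
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_fetch, mirror_urls))
    return dict(zip(mirror_urls, results))
```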

> - pick the closest one

How do I decide what's closest? Did you mean closest,
or most up to date?

> - use it
>
> the server times out:
> - try the "next closest"
>
>
>> It's not clear what this time delta should be and, in any case,
>> the client needs to first validate a mirror by checking its timestamp.
>
> This is the job of the client yes. An option that says, discard
> mirrors that are > 1 day, or 5 hours etc.

"etc" is just waving hands.  Selecting the right value is hard, possibly
application dependent. Is this a configuration variable?  Now the
user has something to deal with.

> Keeping a local cache that gets updated eventually is sufficient.

In process, or on disk?  This just gets better and better. :)

>> I think this protocol is going to be hard to get right.
>
> Maybe ? but if a v1 allows us to switch from server 1 being down to
> server 2, it's already a success, no ?
>
> servers that *we* the community, manage.

I fail to see why this is inherently a good thing.  I don't like
"managing" things.  Less work is good.

...

> Do we really want Amazon to handle PyPI ?

Yes, or Rackspace, or Google, or AOL, or, whatever.  Just not us.

(I suspect some of these might even do it for free.)

> I prefer a bunch of community mirrors. Heck, I have one at Mozilla,
> and might make it public one day  :)
>
> Or maybe the optimal solution is our own CDN proxy so we don't deal
> with this on client side.
>
> <music in the background with trumpets, a flag with the Python logo
> raises, slowly>

Uh, yeah, sure.

FWIW, it hadn't occurred to me to use a CDN until a conversation a few
days ago. Doh.

Jim

-- 
Jim Fulton
http://www.linkedin.com/in/jimfulton