[Distutils] Fwd: The state of PyPI
Jim Fulton
jim at zope.com
Tue Sep 27 20:31:49 CEST 2011
On Tue, Sep 27, 2011 at 1:25 PM, Tarek Ziadé <ziade.tarek at gmail.com> wrote:
> On Tue, Sep 27, 2011 at 5:35 PM, Jim Fulton <jim at zope.com> wrote:
...
>> But I don't want to have to update buildout *just* because of an itch
>> to have a custom protocol.
>
> I kind of wonder how hard it would be to have a standalone pypi
> download client, ripped off from python 3.3's packaging, so you would
> not have to worry about this.
I doubt I'm going to be able to avoid worrying about it.
Still, a reference client implementation would be useful.
> And, well, you do not sound like you want to spend time in these
> matters in any case,
I don't know what you mean. Not sure I care. :)
> so if someone brings a patch I hope you will not
> refuse it.
No. I'll eventually implement it if no one else does.
>>> But the use case is usually: PyPI is down, we fallback to a mirror. I
>>> don't think it's more complicated than this.
>>
>> I don't agree. On multiple levels. PyPI is often up but slow.
>
> That's an orthogonal issue : any server can be slow.
A service can be fast even if an individual server is slow.
Also, CDNs can make lots of horsepower available that
is shared among multiple customers. I really doubt that
anything we build will be faster.
> One better way to drastically speed up buildout is to download /
> build stuff in parallel imo.
That's true and something I'd like to do at some point. That's one of
the reasons I expect I'll have to worry about the protocol.
>
>> It's also in the wrong place. A CDN should provide better performance,
>> reliability and locality.
>
> Locality is indeed important, and picking up the nearest server is great.
> Reliability is also solved by the mirrors.
At the expense of increased complexity on the client.
>>
>> A client has to:
>>
>> - try pypi
>> - fallback to "last"
>> - If that's down, decide what other indexes to check
>>
>> I don't see how having timestamps help unless you know
>> what the current timestamp is, unless you say that you'll reject
>> a mirror with a timestamp more than some period in the past.
>
> How hard is it to make those decisions?
It's not "hard" conceptually, but it's still a lot of
implementation complexity and a lot of extra network
requests.
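
To make the fallback sequence quoted above concrete, here is a minimal
sketch of the client-side logic. The function name pick_index and the
is_up callback are hypothetical, introduced only for illustration; a
real client would replace is_up with an HTTP probe and a timeout. The
probe counter shows how many extra requests the fallback can cost.

```python
from typing import Callable, Iterable, Optional, Tuple


def pick_index(
    main: str,
    last: str,
    others: Iterable[str],
    is_up: Callable[[str], bool],
) -> Tuple[Optional[str], int]:
    """Try main, then "last", then every other index, in that order.

    Returns (chosen index or None, number of probes made).
    is_up is an assumed callback standing in for a network check.
    """
    probes = 0
    for url in [main, last, *others]:
        probes += 1  # each candidate costs one probe
        if is_up(url):
            return url, probes
    return None, probes
```

With both the main index and "last" down, the client has already made
two failed probes before it reaches the first usable mirror.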
> Do you really think getting the current timestamp is that hard ?
>
> And the mirror timestamp,
>
> http://b.pypi.python.org/last-modified
>
> In all you've said I fail to see how complicated it is, or long to do.
That's an extra HTTP request I need to make when I'm
considering use of a mirror. If the first mirror I check seems to
be out of date, I may need to check all the mirrors. It's an open
question what should be considered potentially out of date, a
timestamp older than an hour? a day?
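
A rough sketch of the staleness check, assuming the timestamps have
already been fetched and parsed into datetime objects (the parsing of
/last-modified is deliberately left out). The name fresh_mirrors and the
max_age parameter are hypothetical. The point is that the result depends
entirely on the threshold chosen.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def fresh_mirrors(
    timestamps: Dict[str, datetime],
    now: datetime,
    max_age: timedelta,
) -> List[str]:
    """Return the mirrors whose last-modified time is within max_age of now.

    timestamps maps a mirror name to its already-parsed last-modified time.
    """
    return [name for name, ts in timestamps.items() if now - ts <= max_age]
```

A mirror that is five hours behind passes with a one-day threshold and
is rejected with a one-hour threshold, so someone has to pick the value.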
> The ordering I see is:
>
> normal behavior:
> - if the cache is too old:
How old is too old?
> get the list of mirrors (-> the list of
> mirrors and their timestamps get cached)
They'll only get cached for the program invocation.
This means I have to potentially check lots of mirrors
every time someone runs buildout. I can reduce latency
by doing this in parallel, but that's still a lot of requests.
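
For illustration, a minimal sketch of checking the mirrors in parallel
with a thread pool. check_all and fetch_timestamp are hypothetical
names; fetch_timestamp stands in for whatever HTTP call retrieves a
mirror's timestamp. Parallelism hides the latency, but it is still one
request per mirror on every run.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, Optional


def check_all(
    mirrors: Iterable[str],
    fetch_timestamp: Callable[[str], Any],
    max_workers: int = 8,
) -> Dict[str, Optional[Any]]:
    """Call fetch_timestamp(mirror) for every mirror concurrently.

    Returns {mirror: timestamp}, with None for mirrors whose fetch failed.
    fetch_timestamp is an assumed callback, not part of any real client.
    """
    mirror_list = list(mirrors)

    def safe_fetch(mirror: str) -> Optional[Any]:
        try:
            return fetch_timestamp(mirror)
        except Exception:
            return None  # treat any failure as "no usable timestamp"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_fetch, mirror_list))
    return dict(zip(mirror_list, results))
```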
> - pick the closest one
How do I decide what's closest? Did you mean closest, or most
up to date?
> - use it
>
> the server times out:
> - try the "next closest"
>
>
>> It's not clear what this time delta should be and, in any case,
>> the client needs to first validate a mirror by checking its timestamp.
>
> This is the job of the client yes. An option that says, discard
> mirrors that are > 1 day, or 5 hours etc.
"etc" is just waving hands. Selecting the right value is hard, possibly
application dependent. Is this a configuration variable? Now the
user has something to deal with.
> Keeping a local cache that gets updated eventually is sufficient.
In process, or on disk? This just gets better and better. :)
>> I think this protocol is going to be hard to get right.
>
> Maybe ? but if a v1 allows us to switch from server 1 being down to
> server 2, it's already a success, no ?
>
> servers that *we* the community, manage.
I fail to see why this is inherently a good thing. I don't like
"managing" things. Less work is good.
...
> Do we really want Amazon to handle PyPI ?
Yes, or Rackspace, or Google, or AOL, or whatever. Just not us.
(I suspect some of these might even do it for free.)
> I prefer a bunch of community mirrors. Heck, I have one at Mozilla,
> and might make it public one day :)
>
> Or maybe the optimal solution is our own CDN proxy so we don't deal
> with this on client side.
>
> <music in the background with trumpets, a flag with the Python logo
> raises, slowly>
Uh, yeah, sure.
FWIW, it hadn't occurred to me to use a CDN until a conversation a few
days ago. Doh.
Jim
--
Jim Fulton
http://www.linkedin.com/in/jimfulton