[Catalog-sig] distribute D.C. sprint tasks

"Martin v. Löwis" martin at v.loewis.de
Mon Oct 13 00:18:35 CEST 2008


>>> Our z3c.pypimirror already performs an incremental update  based on
>>> the information available from the index.html page of the simple
>>> index and the available md5 hashes. Works like a charm...
>>>      
>>
>> So how does it find out when a release gets made?
>>    
> 
> What do you mean by that?

If you only look at

http://pypi.python.org/simple/

then you have no way of finding out what changed. So "the information
available from the index.html page of the simple index" is not actually
suitable for building incremental mirroring. What you describe is not
possible.

I just looked at the z3c.pypimirror source, and found that it isn't
really incremental: Whenever it mirrors, it looks at *all* index.html
pages, of each and every package (all 4900 of them, except when you
restrict the mirror). It then only processes the files that have been
added or deleted, so it *is* incremental wrt. files. IIUC,
it is *not* incremental wrt. the package index itself.

Please correct me if I'm wrong (and please correct z3c.pypimirror
if I'm not :-)

Can you please set a specific User-Agent header, so that we can find
out what amount of traffic pypimirror produces? Currently, urllib
accounts for 17% of the requests, excluding requests made through
urllib by setuptools (which is a separate 18%). Probably not all of
them come from pypimirror, but of the 64626 requests made through
urllib yesterday, 41671 originated from zopyx.com.
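
As a rough illustration of the request above, here is a minimal sketch of
sending a distinctive User-Agent with the standard library. It assumes
Python 3's urllib.request; the identifier string and the helper names
build_request and fetch are made up for this example and are not taken
from z3c.pypimirror.

```python
import urllib.request

# Hypothetical identifier; any string that names the tool would do.
USER_AGENT = "z3c.pypimirror (example identifier)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch a URL using the identifying header instead of the urllib default."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

With a header like this, the server logs can separate the mirror's
requests from other urllib clients.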

For real incremental mirroring, you should retrieve the changelog,
and access only those package pages that have actually changed since
the last time you ran the mirror (successfully).
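
The approach could look roughly like the sketch below. It is written
under assumptions: that the index exposes an XML-RPC method named
changelog taking a timestamp in seconds, and that each returned row
starts with the package name, as in (name, version, timestamp, action).
The server URL is left as a parameter, and the function names are
hypothetical.

```python
import xmlrpc.client


def changed_packages(changelog_rows):
    """Return the sorted, de-duplicated package names found in the rows.

    Each row is assumed to start with the package name.
    """
    return sorted({row[0] for row in changelog_rows})


def fetch_changelog(server_url, since):
    """Ask the index for changes since a timestamp (assumed XML-RPC call)."""
    proxy = xmlrpc.client.ServerProxy(server_url)
    return proxy.changelog(int(since))


def packages_to_refresh(server_url, last_successful_run):
    """Names of the packages whose pages need to be fetched again."""
    return changed_packages(fetch_changelog(server_url, last_successful_run))
```

The mirror would then visit only the pages of the returned packages, and
record the new timestamp only after the run has completed successfully.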

Regards,
Martin

