[Catalog-sig] PyPI mirrors are all up to date

Tarek Ziadé tarek at ziade.org
Tue Apr 17 12:50:59 CEST 2012


On 4/17/12 11:57 AM, martin at v.loewis.de wrote:
> > by calculating the grand hash of each file hash.
>
> In this case, the checksum would not be a reliable indication that the
> files are actually up-to-date. For example, a mirror may keep updating
> files into the wrong location (not the location that is then used to
> serve the files), so that the files being served are from a stale copy.
> This is not theoretical - it actually happened in my mirror setup at one
> time.
>
So you were updating one directory but serving another?

But the last-modified page people were seeing was still being updated?

In that case, updating the checksum would have revealed that you were 
serving the wrong set of files.

Unless your script was updating everything on a stale copy that was 
never published?


>>> That could take a few hours per change.
>> why that? You don't calculate the checksum of a file you already
>> have twice.
>>
>> Even if you do, it's very fast to call md5.
>>
>> try it:
>>
>> $ find mirror | xargs md5
>>
>> this takes a few seconds at most on the whole mirror
>
> I tried it, and on my mirror, it took 27 minutes and 7 seconds.
> So not exactly hours, but not "a few seconds" either.
Oops, sorry, I ran it on the wrong directory; it's true that it takes 
more time!

So on my CentOS 5 VM - which is quite slow and busy with other things, 
like running Jenkins jobs - I ran the "md5deep" program like this: 
http://tarek.pastebin.mozilla.org/1574557

It took 15 minutes and 1 second. It can be optimized, of course, since 
most directories are done quickly and almost everything is under 
/source. That time could be cut at least in half with proper load 
balancing across a few md5 runners.
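
For reference (pastebin links rot), the run is roughly equivalent to 
this Python sketch -- untested as-is, with 'mirror' standing in for 
the mirror root:

import hashlib
import os

def md5_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of one file, read in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def hash_tree(root):
    """Yield (relative path, md5) for every file under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            yield os.path.relpath(path, root), md5_file(path)

if __name__ == '__main__':
    for relpath, digest in hash_tree('mirror'):  # 'mirror' is a placeholder
        print(digest, relpath)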

But that only needs to be run *once*. You would not recompute 
everything on every mirror update; you would keep all the md5 values 
somewhere.

So recalculating the grand hash on every mirror update should take a 
few seconds, because it would just consist of calculating the hashes 
for the new files and then recalculating the grand hash -- a loop that 
updates an md5 hash with 20k hashes takes less than a second if I 
don't count the file reading.

(see http://tarek.pastebin.mozilla.org/1574574)
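
The combining loop itself is just something like this (a sketch; 
sorting the digests first is my assumption, so the result doesn't 
depend on traversal order):

import hashlib
import time

def grand_hash(file_hashes):
    """Fold per-file md5 hex digests into a single "grand" md5."""
    grand = hashlib.md5()
    for digest in sorted(file_hashes):
        grand.update(digest.encode('ascii'))
    return grand.hexdigest()

# 20k fake digests, no file reading involved -- this runs in well
# under a second.
fake = [hashlib.md5(str(i).encode()).hexdigest() for i in range(20000)]
start = time.time()
print(grand_hash(fake), '(%.3fs)' % (time.time() - start))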

I am not sure why we're having this discussion, since it's all 
implementation details, but it's fun :)

If there's interest, I can write a multiprocessing-based script that 
keeps an md5 database up-to-date.
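
Something along these lines (a rough, untested sketch -- the JSON file 
location and the mtime-based change check are just assumptions for 
illustration):

import hashlib
import json
import os
from multiprocessing import Pool

DB_PATH = 'md5-db.json'  # hypothetical location for the digest database

def md5_file(path, chunk_size=1 << 20):
    """Hash one file; returns (path, hex digest) so results map back."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return path, digest.hexdigest()

def update_db(root, processes=4):
    """Re-hash only files that are new or changed since the last run."""
    db = {}
    if os.path.exists(DB_PATH):
        with open(DB_PATH) as f:
            db = json.load(f)
    todo = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            entry = db.get(path)
            if entry is None or entry['mtime'] != mtime:
                todo.append(path)
    # Spread the hashing over a few worker processes.
    pool = Pool(processes)
    try:
        for path, digest in pool.imap_unordered(md5_file, todo):
            db[path] = {'md5': digest, 'mtime': os.path.getmtime(path)}
    finally:
        pool.close()
        pool.join()
    with open(DB_PATH, 'w') as f:
        json.dump(db, f)
    return db

if __name__ == '__main__':
    update_db('mirror')  # 'mirror' is a placeholder for the mirror root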

Cheers
Tarek

>
> Regards,
> Martin
>
>


