[Catalog-sig] PyPI mirrors are all up to date
Tarek Ziadé
tarek at ziade.org
Tue Apr 17 12:50:59 CEST 2012
On 4/17/12 11:57 AM, martin at v.loewis.de wrote:
> > by calculating the grand hash of each file hash.
>
> In this case, the checksum would not be a reliable indication that the
> files are actually up-to-date. For example, a mirror may keep updating
> files into the wrong location (not the location that is then used to
> serve the files), so that the files being served are from a stale copy.
> This is not theoretical - it actually happened in my mirror setup at one
> time.
>
So you were updating one directory but serving another?
But still updating the last-modified page that people were seeing?
In that case, updating the checksum would have revealed that you were on
the wrong set of files.
Unless your script was updating everything on a stale copy that was not
published?
>>> That could take a few hours per change.
>> why that? You don't calculate the checksum of a file you already
>> have twice.
>>
>> Even if you do, it's very fast to call md5.
>>
>> try it:
>>
>> $ find mirror | xargs md5
>>
>> this takes a few seconds at most on the whole mirror
>
> I tried it, and on my mirror, it took 27 minutes and 7 seconds.
> So not exactly hours, but not "a few seconds" either.
Oops, sorry, I ran it on the wrong directory; it's true that it takes
more time!
So on my CentOS 5 VM - which is quite slow and busy with many other
things, like running Jenkins jobs - running the "md5deep" program like
this:
http://tarek.pastebin.mozilla.org/1574557
took 15 minutes and 1 second. That can be optimized, of course, since
most directories are done quickly and everything is in /source. The
time could be cut at least in half with proper load balancing across a
few md5 runners.
But that only has to be run *once*. You would not recompute it on every
mirror update; you would keep all the md5 values somewhere instead.
So, recalculating the grand hash on every mirror update should take a
few seconds, because it would just consist of calculating the hashes
for the new files, then calculating the grand hash: a loop that
updates an md5 hash with 20k hashes takes less than a second, not
counting the file reading.
(see http://tarek.pastebin.mozilla.org/1574574)
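The pastebin is no longer reachable, but the grand-hash loop described
above can be sketched roughly like this (my own reconstruction, not
Tarek's actual script; the per-file digests are assumed to be already
computed and stored, and paths are sorted so the result does not depend
on iteration order):

```python
import hashlib

def grand_hash(file_hashes):
    """Combine per-file md5 hex digests into one 'grand' md5.

    file_hashes: mapping of relative path -> hex md5 digest.
    Sorting by path makes the result order-independent.
    """
    grand = hashlib.md5()
    for path in sorted(file_hashes):
        grand.update(file_hashes[path].encode('ascii'))
    return grand.hexdigest()

# 20k precomputed hashes: the combining loop itself is near-instant,
# since it only touches 32-byte digests, not the files themselves.
hashes = {'pkg-%d.tar.gz' % i: hashlib.md5(str(i).encode()).hexdigest()
          for i in range(20000)}
print(grand_hash(hashes))
```

Any single file hash changing (or a file appearing/disappearing)
changes the grand hash, which is the point of the check.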
I am not sure why we're having this discussion, since these are
implementation details, but it's fun :)
If there's interest, I can write a multiprocessing-based script that
keeps an md5 database up to date.
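A minimal sketch of such a script could look like this (the names,
the JSON on-disk format, and the mtime/size staleness check are my own
assumptions, not an existing tool; it re-hashes only new or changed
files, fanning the hashing out over a process pool):

```python
import hashlib
import json
import os
from multiprocessing import Pool

def md5_file(path):
    # Hash one file in 1 MiB chunks so large archives don't fill memory.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return path, h.hexdigest()

def update_db(root, db_path='md5db.json'):
    """Refresh a path -> [mtime, size, md5] database under root."""
    try:
        with open(db_path) as f:
            db = json.load(f)
    except (IOError, ValueError):
        db = {}
    stale, seen = [], set()
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            seen.add(path)
            st = os.stat(path)
            key = [st.st_mtime, st.st_size]
            entry = db.get(path)
            if entry is None or entry[:2] != key:
                stale.append(path)
                db[path] = key + [None]
    # Only the stale files get hashed, in parallel.
    with Pool() as pool:
        for path, digest in pool.imap_unordered(md5_file, stale):
            db[path][2] = digest
    # Drop files that disappeared from the mirror.
    for path in list(db):
        if path not in seen:
            del db[path]
    with open(db_path, 'w') as f:
        json.dump(db, f)
    return db
```

On a second run over an unchanged tree, the stale list is empty, so
the update takes roughly the time of the os.walk, which matches the
"a few seconds per mirror update" estimate above.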
Cheers
Tarek
>
> Regards,
> Martin
>
>
More information about the Catalog-SIG
mailing list