comparing multiple copies of terabytes of data?

Eddie Corns eddie at holyrood.ed.ac.uk
Tue Oct 26 07:26:04 EDT 2004


Dan Stromberg <strombrg at dcs.nac.uci.edu> writes:


>We will soon have 3 copies, for testing purposes, of what should be about
>4.5 terabytes of data.

>Rather than cmp'ing twice to verify data integrity, I was thinking we could
>speed up the comparison a bit by using a Python script that does 3 reads,
>instead of 4, per disk block - with a sufficiently large blocksize, of
>course.

>My question then is: does Python have a high-level API that would
>facilitate this sort of thing, or should I just code something up based on
>open and read?

>Thanks!
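
Something along these lines should work for the straight three-way compare
you describe, built on plain open/read - just a sketch, with the blocksize,
function name and command-line handling picked arbitrarily:

import sys

def compare_three(path_a, path_b, path_c, blocksize=16 * 1024 * 1024):
    # Read all three copies block by block: one pass, three reads per block.
    with open(path_a, 'rb') as fa, \
         open(path_b, 'rb') as fb, \
         open(path_c, 'rb') as fc:
        offset = 0
        while True:
            a = fa.read(blocksize)
            b = fb.read(blocksize)
            c = fc.read(blocksize)
            if not (a == b == c):
                return offset        # copies disagree at or after this offset
            if not a:                # all three hit EOF together: all matched
                return None
            offset += len(a)

if __name__ == '__main__':
    bad = compare_three(*sys.argv[1:4])
    print('all three copies match' if bad is None else
          'mismatch at or after byte offset %d' % bad)

That said: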

Taking a checksum of each file and comparing the checksums would probably be
much faster.  A quick test with md5 versus cmp gave me about a 10x speedup.
Unexpectedly, though, running 2 md5 processes in parallel was slower than
running them in sequence - that could just be the cheap'n'nasty HD in my
desktop; normally I'd expect a gain there, since one process can use the CPU
while the other is doing I/O.
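
For the checksum route, a rough sketch (hashlib here, or the md5 module on
older Pythons; the file paths are just placeholders for your three copies):

import hashlib

def file_md5(path, blocksize=16 * 1024 * 1024):
    # Hash the file in fixed-size chunks so memory use stays bounded.
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            digest.update(block)
    return digest.hexdigest()

# The three copies match only if all three digests agree.
sums = [file_md5(p) for p in ('/copy1/file', '/copy2/file', '/copy3/file')]
print('match' if len(set(sums)) == 1 else 'MISMATCH', sums)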

Eddie


