comparing multiple copies of terabytes of data?
eddie at holyrood.ed.ac.uk
Tue Oct 26 13:26:04 CEST 2004
Dan Stromberg <strombrg at dcs.nac.uci.edu> writes:
>We will soon have 3 copies, for testing purposes, of what should be about
>4.5 terabytes of data.
>Rather than cmp'ing twice, to verify data integrity, I was thinking we
>could speed up the comparison a bit, by using a python script that does 3
>reads, instead of 4 reads, per disk block - with a sufficiently large
>blocksize, of course.
>My question then is, does python have a high-level API that would
>facilitate this sort of thing, or should I just code something up based on
>open and read?
Taking a checksum of each file and comparing the checksums would probably be much
faster. A quick test with md5 versus cmp gave me a 10 times speedup. Though,
unexpectedly, running 2 md5 processes in parallel was slower than 2 in
sequence - could be the cheap'n'nasty HD in my desktop; normally I'd expect a
gain here, as one process uses the CPU while the other does I/O.
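In Python itself the checksum approach is a few lines; a minimal sketch using the standard-library hashlib module (the helper names are illustrative, not from the post):

```python
import hashlib


def md5sum(path, blocksize=1024 * 1024):
    """Stream a file through MD5 in fixed-size blocks,
    so memory use stays constant regardless of file size."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel calls f.read(blocksize) until it
        # returns b'' at EOF.
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.hexdigest()


def all_match(paths):
    """True if every file in paths has the same MD5 digest."""
    return len(set(md5sum(p) for p in paths)) == 1
```

Note that comparing digests proves the copies agree with overwhelming probability but, unlike a byte-by-byte cmp, not with certainty; for verifying copies against accidental corruption that trade-off is usually acceptable.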
More information about the Python-list mailing list