comparing multiple copies of terabytes of data?

Dan Stromberg strombrg at
Fri Oct 29 03:27:47 CEST 2004

Folks, I'm a little overwhelmed by how helpful this thread has been to me
- thanks a bunch.

I'd like to point out that I have some code written up
that I hope to use correctly the first time, in verifying that our three
copies of a 3+ terabyte collection of data...  well, that the first copy
is a (not-necessarily-proper) subset of the other two copies.  Once we
sign off, saying "the data copies look correct", then IBM's going to say
"OK, for better or worse, it's your storage system now, problems and all."

This storage system is the first of its kind in the known universe (as
far as the combination of the filesystem software with very high storage
density Linux boxes goes), and I'm guessing that there's around $500K sunk
into it, just on hardware.  -But-, it's far cheaper than almost any other
solution of its size.

I'll add that the system we are transitioning from is open source from
Red Hat, and is apparently pure-play GPL, while the system we are
transitioning to is from ClusterFS, which has a dual-license GPL/closed
source thing going on (where the closed source stuff is transitioned to
GPL eventually, kind of like Aladdin Ghostscript).

Rather than letting the specifics of the situation creep into the code too
much, I've tried to make it into a fairly general, reusable tool that
others might be able to benefit from as well.

You basically give the program a sampling frequency (number of files to
skip before checking one), and a list of directories.  It uses md5
hashes to keep resource utilization down.
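For anyone who'd rather see the idea than read my actual script first: a
minimal sketch of the sampled-verification approach described above.  The
function names (sampled_subset_check, md5_of) and the exact sampling rule
(check every Nth file) are my own assumptions for illustration, not
necessarily how my code does it:

```python
# Hypothetical sketch: walk the primary tree, md5-hash every
# `frequency`-th regular file, and confirm that a file with the same
# relative path and digest exists in each other copy.
import hashlib
import os

def md5_of(path, blocksize=1 << 20):
    """Return the hex md5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(blocksize), b''):
            h.update(chunk)
    return h.hexdigest()

def sampled_subset_check(primary, copies, frequency):
    """Check every `frequency`-th file in `primary` against `copies`.

    Returns a list of (relative_path, reason) tuples for any file that
    is missing from, or differs in, one of the other copies.
    """
    problems = []
    count = 0
    for dirpath, _dirnames, filenames in os.walk(primary):
        for name in sorted(filenames):
            count += 1
            if count % frequency != 0:
                continue  # skip files between samples
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, primary)
            want = md5_of(src)
            for copy in copies:
                dst = os.path.join(copy, rel)
                if not os.path.isfile(dst):
                    problems.append((rel, '%s: missing' % copy))
                elif md5_of(dst) != want:
                    problems.append((rel, '%s: md5 mismatch' % copy))
    return problems
```

With frequency=1 this degenerates to a full comparison; a larger value
trades confidence for I/O, which is the whole point at terabyte scale.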

I'm basically soliciting additional eyeballs for what I hope I've
persuaded you is a good cause for more than one reason, if you find
yourself with some time and curiosity.  It's 108 lines of code (including
the usage() function), and 15 lines of comments.

The URL is

Thanks for even considering looking at this with me!
