comparing multiple copies of terabytes of data?
jcarlson at uci.edu
Mon Oct 25 22:01:20 CEST 2004
Istvan Albert <ialbert at mailblocks.com> wrote:
> Dan Stromberg wrote:
> > Rather than cmp'ing twice, to verify data integrity, I was thinking we
> > could speed up the comparison a bit, by using a python script that does 3
> Use the cmp. So what if you must run it twice ... by the way I
> really doubt that you could speed up the process in python
> ... you'll probably end up with a much slower version
In this case you would be wrong. Comparing data on a processor is
trivial (and is done in Python's C internals anyway when strict string
equality is all that matters), but I/O is expensive. Reading terabytes
of data is going to be the bottleneck, so reducing I/O is /the/
optimization that can and should be done.
The code to do so is simple:
def compare_3(fn1, fn2, fn3):
    f1, f2, f3 = [open(i, 'rb') for i in (fn1, fn2, fn3)]
    b = 2**20  # buffer size; tune this as necessary
    p = -1
    good = 1
    while f1.tell() > p:  # stop once f1 makes no progress (EOF)
        p = f1.tell()
        if not (f1.read(b) == f2.read(b) == f3.read(b)):
            print "files differ"
            good = 0
            break
    # f1 is exhausted; the others must also be at EOF for a match
    if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
        print "files are identical"
    for f in (f1, f2, f3):  # I prefer to explicitly close my file handles
        f.close()
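For what it's worth, the same single-pass idea generalizes to any number of copies; here is a sketch that returns a boolean instead of printing (the function name and default buffer size are my own choices, written in modern Python 3 syntax):

```python
def files_identical(*filenames, bufsize=2**20):
    # Open every file once and stream through them in parallel,
    # so each byte is read from disk exactly once.
    files = [open(fn, 'rb') for fn in filenames]
    try:
        while True:
            blocks = [f.read(bufsize) for f in files]
            if any(blk != blocks[0] for blk in blocks[1:]):
                return False   # a copy differs (or is a different length)
            if not blocks[0]:
                return True    # all copies hit EOF together
    finally:
        for f in files:
            f.close()
```

Since `read()` on a regular file returns a short block only at EOF, a length mismatch shows up as a block mismatch and is caught by the same comparison.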
Note that it /may/ be faster to first convert the data into arrays
(the array module) so that comparisons run over 2-, 4-, or 8-byte words.
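If you want to experiment with that, one way the conversion could look is sketched below (the helper name is my own, and it uses `frombytes`, so it assumes Python 3; whether it beats a plain bytes comparison is something you would have to measure):

```python
import array

def blocks_equal_as_arrays(b1, b2):
    # Hypothetical helper: reinterpret two byte blocks as arrays of
    # 8-byte unsigned integers ('Q') before comparing.  Tail bytes
    # that don't fill a whole word are compared as raw bytes.
    if len(b1) != len(b2):
        return False
    word = 8                          # itemsize of typecode 'Q'
    cut = len(b1) - (len(b1) % word)  # largest whole-word prefix
    a1, a2 = array.array('Q'), array.array('Q')
    a1.frombytes(b1[:cut])
    a2.frombytes(b2[:cut])
    return a1 == a2 and b1[cut:] == b2[cut:]
```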