Efficient MD5 (or similar) hashes

Michael T. Babcock mbabcock at fibrespeed.net
Mon Dec 8 17:54:14 CET 2003


> I have large files I'm dealing with.  Some 600MB -1.2 GB in size, over 
> a slow network.  Transfer of one of these files can take 40 minutes or 
> an hour.
>
> I want to check the integrity of the files after transfer.  I can 
> check the obvious - date, file size - quickly, but what if I want an 
> MD5 hash?
>
> From reading the python docs, md5 reads the entire file as a string. 
> That's not practical on a 1 GB file that's network mounted. 


The *obvious* answer is that the MD5 hashes have to be available as 
files on the network mount, made by whatever / whoever put the files 
there in the first place, much like those published alongside large / 
secure downloads on many sites.

That is to say, you use an MD5 tool (md5sum / Python / etc.) to 
calculate the MD5 hashes on the machine that holds the files, then 
download both the file and its MD5 hash to your machine, re-calculate 
the hash there and compare the two.

I'm not sure if I'm missing something in what you intend to do here ... 
if you've got an NFS mount of ISO files for RedHat 9, you'd have:

/mnt/remote:
    redhat9-disc1.iso
    redhat9-disc1.iso.md5
    redhat9-disc2.iso
    redhat9-disc2.iso.md5
    redhat9-disc3.iso
    redhat9-disc3.iso.md5

You could do (with $i set to the disc number):
    $ (cat /mnt/remote/redhat9-disc$i.iso.md5; md5sum /mnt/remote/redhat9-disc$i.iso)
... and compare the two output lines (or use Python to compare, to 
keep this on-topic).  But you're complaining this would be slow.  If you 
could explain what you'd *like* to have happen, that would be great -- 
are you going to copy the files locally before comparing the MD5 
hashes?  That makes the above quite simple. 
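
To keep the comparison in Python, something along these lines might do. 
It is only a sketch: it assumes the .md5 file is in the usual md5sum 
output format (hex digest, whitespace, file name) and that a current 
Python with hashlib is available.  The function names md5_of_file, 
read_expected_md5 and matches_md5_file are made up for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_expected_md5(md5_path):
    """Read the hex digest from the first line of an md5sum-style file."""
    with open(md5_path, "r") as handle:
        fields = handle.readline().split()
    if not fields:
        raise ValueError("no digest found in %s" % md5_path)
    # Assumed layout: "<hexdigest>  <filename>"; keep only the digest.
    return fields[0].lower()


def matches_md5_file(data_path, md5_path):
    """True if the file's computed MD5 equals the digest stored beside it."""
    return md5_of_file(data_path) == read_expected_md5(md5_path)
```

Called as matches_md5_file("redhat9-disc1.iso", "redhat9-disc1.iso.md5") 
on the local copies, it returns True only when the two digests agree.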

PS: You need to realize that calculating a hash on a file *requires* 
reading every byte of data in the file; otherwise it would be a 
pretty useless hash function and wouldn't actually detect errors.
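
Reading every byte does not mean holding the whole file in memory at 
once, though: a hash object can be fed piece by piece through update(). 
A minimal sketch, assuming hashlib from a current Python; the name 
md5_in_chunks and the 64 KB chunk size are arbitrary choices for 
illustration.

```python
import hashlib


def md5_in_chunks(fileobj, chunk_size=64 * 1024):
    """Hash a binary file object without loading it all into memory."""
    digest = hashlib.md5()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            # An empty read means end of file.
            break
        # Each update() extends the running hash with the next chunk.
        digest.update(chunk)
    return digest.hexdigest()
```

The result is the same as hashing the whole content in one call, and 
memory use stays at roughly one chunk regardless of file size.  The 
time is still dominated by pulling every byte across the slow link.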

-- 
Michael T. Babcock
C.T.O., FibreSpeed Ltd.
http://www.fibrespeed.net/~mbabcock






