[Tutor] Stupid bug

Thu Nov 11 00:16:19 CET 2010

This isn't a question; I'm just offering it as a cautionary tale and an 
opportunity to laugh at my own stupidity.

I have a small function to calculate the MD5 checksum for a file.  It's 
nothing fancy:

###################################
import hashlib
def md5(filename, bufsize=65536):
     """
     Compute md5 hash of the named file
     bufsize is 64K by default
     """
     m = hashlib.md5()
     with open(filename,"rb") as fd:
         content = fd.read(bufsize)
             while content:   # empty read means end of file (str or bytes)
             m.update(content)
             content = fd.read(bufsize)
     return m.hexdigest()
###################################

I've discovered a need to calculate the checksum on the first 10K or so 
bytes of the file (faster when processing a whole CDROM or DVDROM full of 
large files; and also allows me to find when one file is a truncated copy 
of another).

This seemed like an easy enough variation, and I came up with something 
like this:

###################################
def md5_partial(filename, bufsize=65536, numbytes=10240):
     """
     Compute md5 hash of the first numbytes (10K by default) of named file
     bufsize is 64K by default
     """
     m = hashlib.md5()
     with open(filename,"rb") as fd:
         bytes_left = numbytes
         bytes_to_read = min(bytes_left, bufsize)
         content = fd.read(bytes_to_read)
         while content:
             # hash the chunk just read, then work out how much is left
             m.update(content)
             bytes_left = bytes_left - len(content)
             bytes_to_read = min(bytes_left, bufsize)
             content = fd.read(bytes_to_read)
     return m.hexdigest()
###################################

Okay, not elegant, and violates DRY a little bit, but what the heck.
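
For what it's worth, the duplication can be avoided by doing the read 
inside a single loop.  The following is only a sketch, not the code I was 
actually debugging: the name md5_prefix is made up for illustration, and it 
counts down by the number of bytes actually read rather than the number 
requested.  It is written to run under Python 3 as well as Python 2.

```python
import hashlib

def md5_prefix(filename, numbytes=10240, bufsize=65536):
    """
    Compute md5 hash of at most the first numbytes of the named file.
    (md5_prefix is a hypothetical name used only for this sketch.)
    """
    m = hashlib.md5()
    bytes_left = numbytes
    with open(filename, "rb") as fd:
        while bytes_left > 0:
            # never ask for more than one buffer, or more than is left
            content = fd.read(min(bytes_left, bufsize))
            if not content:
                # file is shorter than numbytes; nothing more to hash
                break
            m.update(content)
            bytes_left -= len(content)
    return m.hexdigest()
```

Calling md5_prefix("filename.txt", numbytes=200) should give the same 
digest as hashlib.md5 applied directly to the first 200 bytes of the file.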

I set up a small file (a few hundred bytes) and confirmed that md5 and 
md5_partial both returned the same value (where the number of bytes I was 
sampling exceeded the size of the file).  Great, working as desired.

But then when I tried a larger file, I was still getting the same checksum 
for both.  It was clearly processing the entire file.

I started messing with it; putting in counters and print statements, 
using the Gettysburg Address as sample data and iterating over 
20 bytes at a time, printing out each one, making sure it stopped 
appropriately.  Still no luck.

I spent 90 minutes over two sessions before I finally found my error.

My invocation of the first checksum was:

###################################
checksumvalue = my.hashing.md5("filename.txt")
# (Not an error: I keep my own modules in Lib/site-packages/my/ )
print checksumvalue
#
# [several lines of code that among other things, define my new
# function being tested]
#
checksumvalue2 = md5_partial("filename.txt", numbytes=200)
print checksumvalue
###################################

Turns out my function was working correctly all along, but because I typed 
checksumvalue instead of checksumvalue2 in the second print, I was printing 
out the value from the first checksum each time.  Doh!

Well, no harm done, other than wasted time, and I did turn up a silly but 
harmless off-by-one error in the process.
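
In hindsight, letting the code do the comparison would probably have caught 
this sooner than eyeballing printed values.  Here is a minimal sketch of 
such a check, assuming the function under test accepts a filename and a 
numbytes keyword the way md5_partial does.  The name check_partial and the 
sample data are hypothetical, not part of my module.

```python
import hashlib
import os
import tempfile

def check_partial(partial_func, numbytes=200):
    """
    Compare partial_func(path, numbytes=...) against an md5 computed
    directly from the same bytes in memory.  Returns True on a match.
    """
    data = b"Four score and seven years ago " * 50   # sample data, 1550 bytes
    fd, path = tempfile.mkstemp()
    try:
        # write the sample data to a temporary file
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        expected = hashlib.md5(data[:numbytes]).hexdigest()
        actual = partial_func(path, numbytes=numbytes)
        return actual == expected
    finally:
        os.remove(path)
```

With this, check_partial(md5_partial) should return True only if the 
function really hashes just the first 200 bytes, and there is no second 
variable name to mistype in a print.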