[Tutor] Stupid bug
Terry Carroll
carroll at tjc.com
Thu Nov 11 00:16:19 CET 2010
This isn't a question, I'm just offering it as a cautionary tale and an
opportunity to laugh at my own stupidity.
I have a small function to calculate the MD5 checksum for a file. It's
nothing fancy:
###################################
import hashlib

def md5(filename, bufsize=65536):
    """
    Compute md5 hash of the named file
    bufsize is 64K by default
    """
    m = hashlib.md5()
    with open(filename, "rb") as fd:
        content = fd.read(bufsize)
        while content != "":
            m.update(content)
            content = fd.read(bufsize)
    return m.hexdigest()
###################################
I've discovered a need to calculate the checksum on just the first 10K or so
bytes of a file (it's faster when processing a whole CD-ROM or DVD-ROM full of
large files, and it also lets me find when one file is a truncated copy
of another).
This seemed like an easy enough variation, and I came up with something
like this:
###################################
def md5_partial(filename, bufsize=65536, numbytes=10240):
    """
    Compute md5 hash of the first numbytes (10K by default) of named file
    bufsize is 64K by default
    """
    m = hashlib.md5()
    with open(filename, "rb") as fd:
        bytes_left = numbytes
        bytes_to_read = min(bytes_left, bufsize)
        content = fd.read(bytes_to_read)
        bytes_left = bytes_left - bytes_to_read
        while content != "" and bytes_left > 0:
            m.update(content)
            bytes_to_read = min(bytes_left, bufsize)
            content = fd.read(bytes_to_read)
            bytes_left = bytes_left - bytes_to_read
    return m.hexdigest()
###################################
Okay, not elegant, and violates DRY a little bit, but what the heck.
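For comparison, here is a minimal sketch of the same idea with a single read inside the loop, which avoids the repeated read-and-decrement lines. The name md5_prefix is hypothetical and not from the original code. Note that in the version as posted above, bytes_left is decremented before the loop condition is tested, so the last chunk read is never passed to m.update(); the sketch hashes each chunk before deciding whether to continue. It also tests "if not content" rather than comparing against "", so it behaves the same whether read() returns a str or a bytes object.

```python
import hashlib

def md5_prefix(filename, numbytes=10240, bufsize=65536):
    """
    Hypothetical variant of md5_partial: hash at most the first numbytes
    bytes of the named file, reading at most bufsize bytes at a time.
    """
    m = hashlib.md5()
    bytes_left = numbytes
    with open(filename, "rb") as fd:
        while bytes_left > 0:
            content = fd.read(min(bytes_left, bufsize))
            if not content:        # end of file reached before numbytes
                break
            m.update(content)      # hash every chunk, including the last
            bytes_left -= len(content)
    return m.hexdigest()
```

If the file is shorter than numbytes, this returns the same digest as hashing the whole file, which matches the small-file check described next.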
I set up a small file (a few hundred bytes) and confirmed that md5 and
md5_partial both returned the same value (where the number of bytes I was
sampling exceeded the size of the file). Great, working as desired.
But then when I tried a larger file, I was still getting the same checksum
for both. It was clearly processing the entire file.
I started messing with it; putting in counters and print statements,
using the Gettysburg Address as sample data and iterating over
20 bytes at a time, printing out each one, making sure it stopped
appropriately. Still no luck.
I had spent 90 minutes over two sessions before I finally found my error.
My invocation of the first checksum was:
###################################
checksumvalue = my.hashing.md5("filename.txt")
# (Not an error: I keep my own modules in Lib/site-packages/my/ )
print checksumvalue
#
# [several lines of code that among other things, define my new
# function being tested]
#
checksumvalue2 = md5_partial("filename.txt", numbytes=200)
print checksumvalue
###################################
Turns out my function was working correctly all along; but with my typo, I
was printing out the value from the first checksum each time. Doh!
Well, no harm done, other than wasted time, and I did turn up a silly but
harmless off-by-one error in the process.
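One way to take the "compare two printed values by eye" step out of the loop is to let the code do the comparison against a digest computed directly from known data. The following is only a sketch under assumptions: check_partial_hash and md5_of_first_bytes are hypothetical names, and md5_of_first_bytes is a simple stand-in for whatever partial-hash function is being tested.

```python
import hashlib
import os
import tempfile

def md5_of_first_bytes(filename, numbytes):
    """Hypothetical stand-in for the function under test."""
    with open(filename, "rb") as fd:
        return hashlib.md5(fd.read(numbytes)).hexdigest()

def check_partial_hash(hash_func, numbytes=200):
    """
    Write known sample data to a temporary file and return True if
    hash_func(filename, numbytes) equals the md5 of the first numbytes
    bytes of that data, computed directly in memory.
    """
    data = b"Four score and seven years ago " * 100   # longer than numbytes
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        expected = hashlib.md5(data[:numbytes]).hexdigest()
        return hash_func(path, numbytes) == expected
    finally:
        os.remove(path)

# The comparison is made in code, so a mistyped variable name in a
# print statement cannot hide the result.
assert check_partial_hash(md5_of_first_bytes)
```

Because the expected value comes from the sample data itself rather than from an earlier function call, a function that hashes the whole file instead of the first numbytes bytes would fail the check.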