[Tutor] trying to get md5sums of a list of files

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Thu Jul 17 14:17:20 2003


On Thu, 17 Jul 2003, Sean 'Shaleh' Perry wrote:

> > So I coded up the following:
> >
> > #!/usr/local/bin/python
> >
> > import os, sys, md5
> >
> > for path in open('filelist2'):
> >         myline = path.strip()
> >         f = open(myline, 'r')
> >         m = md5.new()
> >         for line in f.readlines():
> >                 m.update(line)
> >         f.close()
> >         md5sum = m.digest()
> >         print m


Hi everyone,


One other potential bug: readlines() sucks the whole file into memory
at once, and treats it as a text file.  For large files, this can eat up
a lot of memory, so a safer approach is to use a "chunked" read():

###
import md5

def md5file(f):
    """Returns an md5sum hex string of a file."""
    m = md5.new()
    while 1:
        chunk = f.read(1024)        ## read at most one kilobyte
        if not chunk: break         ## empty string means end of file
        m.update(chunk)
    return m.hexdigest()
###


I read a kilobyte at a time arbitrarily: dunno why, I guess it's a nice
round number. *grin* The important thing is to avoid reading the whole
file at once, and to read it progressively instead.  md5's update() method
is designed to be called many times precisely because it works on one
chunk at a time.  With this
md5file() approach, we can now deal with files of any size without running
into a memory problem.
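
To see that feeding update() one chunk at a time really gives the same
answer as hashing everything in one go, here is a minimal sketch.  It
assumes a newer Python where the md5 module has been replaced by hashlib,
and it uses an in-memory io.BytesIO in place of a real file.  The names
md5_in_chunks and chunk_size are made up for the illustration.

```python
import hashlib
import io


def md5_in_chunks(f, chunk_size=1024):
    """Return the md5 hex digest of a binary file object, read chunk by chunk."""
    m = hashlib.md5()
    while True:
        chunk = f.read(chunk_size)
        if not chunk:              # empty bytes object means end of file
            break
        m.update(chunk)
    return m.hexdigest()


# Assumption: 5000 bytes of made-up data standing in for a file on disk.
data = b"spam and eggs\n" * 357 + b"xy"

all_at_once = hashlib.md5(data).hexdigest()
one_kb = md5_in_chunks(io.BytesIO(data))
seven_bytes = md5_in_chunks(io.BytesIO(data), chunk_size=7)

# All three digests should agree, whatever the chunk size.
print(all_at_once == one_kb == seven_bytes)
```

The chunk size only changes how much is held in memory at each step, not
the digest that comes out at the end.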



Once we code up this md5file() utility function, John's original question:

"""
I am going to compute md5sums for each file and sort them; check for
duplicate files.
This is relatively easy in bash:
for i in `cat filelist` do; md5sum $i; done
"""


has a cute translation in Python:

###
for f in open('filelist'):
    f = f.strip()                       ## drop the trailing newline
    print md5file(open(f, "rb"))        ## Question: why "rb"?
###
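
John's end goal was to spot duplicate files, so here is one rough way the
md5 sums could be put to work.  This is only a sketch: it assumes a newer
Python with hashlib, find_duplicates is a hypothetical helper name, and
the three sample files are invented and written to a temporary directory
so that the example runs by itself.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def md5file(f):
    """Return the md5 hex digest of a binary file object, read in 1 KB chunks."""
    m = hashlib.md5()
    while True:
        chunk = f.read(1024)
        if not chunk:
            break
        m.update(chunk)
    return m.hexdigest()


def find_duplicates(paths):
    """Group paths by md5 sum and keep only the sums shared by several files."""
    by_sum = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            by_sum[md5file(f)].append(path)
    return {s: ps for s, ps in by_sum.items() if len(ps) > 1}


# Assumption: made-up sample files; a.txt and c.txt have identical content.
tmpdir = tempfile.mkdtemp()
contents = {"a.txt": b"hello\n", "b.txt": b"world\n", "c.txt": b"hello\n"}
paths = []
for name in sorted(contents):
    path = os.path.join(tmpdir, name)
    with open(path, "wb") as out:
        out.write(contents[name])
    paths.append(path)

duplicates = find_duplicates(paths)
for md5sum, same in sorted(duplicates.items()):
    print(md5sum, [os.path.basename(p) for p in same])
```

Equal md5 sums are a strong hint that two files are the same, not a proof,
so a byte-by-byte comparison of the candidates is a sensible last step.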



There's one other subtle point that deals with the inner file-open()ing
loop.  In particular, the "b" ("binary") part of the open() call in the
loop is important.  If we want to make sure we're getting the same results
as our shell's "md5sum" utility, we must not treat newlines as special
characters, since they too contribute to the md5 sum.  On platforms like
Windows, text mode translates line endings as it reads, which changes the
bytes that md5 sees.  Opening a file in binary mode makes sure that
newlines are treated just like any other byte in our file.
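
One way to watch the newline issue happen is the small experiment below.
It is a sketch under a couple of assumptions: it runs on a newer Python
(3.x, with hashlib), where text mode translates "\r\n" to "\n" on reading
by default, and the sample file with DOS-style line endings is made up.

```python
import hashlib
import os
import tempfile

# Assumption: a small made-up file with DOS-style "\r\n" line endings.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"first line\r\nsecond line\r\n")

# Binary mode: the bytes come back exactly as they are stored on disk.
with open(path, "rb") as f:
    raw = f.read()

# Text mode: the "\r\n" pairs are translated to "\n" while reading.
with open(path, "r", encoding="ascii") as f:
    text = f.read()

binary_sum = hashlib.md5(raw).hexdigest()
text_sum = hashlib.md5(text.encode("ascii")).hexdigest()

print(len(raw), len(text))       # text mode lost one byte per line
print(binary_sum == text_sum)    # so the two sums no longer match

os.remove(path)
```

Only the binary-mode sum is computed over the bytes that are actually on
disk, which is what a tool like md5sum looks at.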


Hope this helps!