Question regarding checksumming of a file

Paul Rubin http
Sun May 14 04:05:41 EDT 2006


"Ant" <antroy at gmail.com> writes:
> def getSum(self):
>         md5Sum = md5.new()
>         f = open(self.filename, 'rb')
>         for line in f:
>             md5Sum.update(line)
>         f.close()
>         return md5Sum.hexdigest()

This should work, but there is one hazard if the file is very large
and is not a text file.  You're reading it one line at a time, and a
line is a contiguous run of bytes up to a newline.  Depending on the
file contents, a single "line" could be gigabytes long, all of which
gets read into memory at once.  So it's best to read a fixed-size
block in each operation, e.g. (untested):

   def getblocks(f, blocksize=1024):
       while True:
           s = f.read(blocksize)
           if not s:
               return
           yield s

then change "for line in f" to "for line in getblocks(f)".
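
Putting the pieces together, here is a minimal sketch of the whole
checksum routine.  It assumes a Python that has the hashlib module
(with the older md5 module, md5.new() gives an object with the same
update/hexdigest interface).  The name file_md5 is made up for
illustration and is not from the original code.

```python
import hashlib


def getblocks(f, blocksize=1024):
    # Yield successive chunks of at most `blocksize` bytes.
    # read() returns an empty result at end of file, which ends the loop.
    while True:
        s = f.read(blocksize)
        if not s:
            return
        yield s


def file_md5(filename, blocksize=1024):
    # Hypothetical helper: MD5 hex digest of a file, read in fixed-size
    # blocks so memory use stays bounded whatever the file contains.
    md5sum = hashlib.md5()
    with open(filename, 'rb') as f:
        for block in getblocks(f, blocksize):
            md5sum.update(block)
    return md5sum.hexdigest()
```

The digest does not depend on the block size, since update() just
appends to the data hashed so far; a larger block size only means
fewer read calls.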

I actually think an iterator like the above should be added to the
stdlib, since the "for line in f" idiom is widely used and sometimes
inadvisable, much like the fixed-size buffers in those old C programs
that led to buffer overflow bugs.
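
In the meantime, the two-argument form of the built-in iter() gets
fairly close without a helper generator: iter(callable, sentinel)
keeps calling the callable until it returns the sentinel.  A minimal
sketch, assuming the file is opened in binary mode so that end of
file shows up as b''; read_blocks is a hypothetical name:

```python
from functools import partial


def read_blocks(f, blocksize=1024):
    # Call f.read(blocksize) repeatedly; stop when it returns b'',
    # which is what a binary-mode file gives at end of file.
    return iter(partial(f.read, blocksize), b'')
```

For a file opened in text mode the sentinel would have to be ''
instead, which is one reason a purpose-built iterator would still be
nicer.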


