Does hashlib support a file mode?

Wed Jul 6 02:37:47 EDT 2011

On 06.07.2011 07:54, Phlip wrote:
> Pythonistas:
>
> Consider this hashing code:
>
>    import hashlib
>    file = open(path)
>    m = hashlib.md5()
>    m.update(file.read())
>    digest = m.hexdigest()
>    file.close()
>
> If the file were huge, the file.read() would allocate a big string and
> thrash memory. (Yes, in 2011 that's still a problem, because these
> files could be movies and whatnot.)
>
> So if I do the stream trick - read one byte, update one byte, in a
> loop, then I'm essentially dragging that movie thru 8 bits of a 64 bit
> CPU. So that's the same problem; it would still be slow.

Yes. That is why you should read with a reasonable block size. Not too 
small and not too big.

def filechunks(f, size=8192):
     while True:
         s = f.read(size)
         if not s: break
         yield s
#    f.close() # maybe...

import hashlib
file = open(path, 'rb')  # binary mode, so the raw bytes get hashed
m = hashlib.md5()
fc = filechunks(file)
for chunk in fc:
     m.update(chunk)
digest = m.hexdigest()
file.close()

This reads the file in 8 KiB chunks. Feel free to modify this - maybe use 
os.stat(path).st_blksize instead (which is AFAIK the recommended 
minimum), or a value of about 1 MiB...
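
The same idea can be wrapped into one function. This is only a sketch, 
assuming Python 3; hash_file and its parameters are names made up for 
this example. It opens the file in binary mode and falls back to 8192 
bytes where st_blksize is not available (it is missing on some 
platforms).

```python
import hashlib
import os

def hash_file(path, algo="md5", size=None):
    """Return the hex digest of the file at path, read block by block."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        if size is None:
            # st_blksize is not present everywhere; fall back to 8 KiB.
            size = getattr(os.fstat(f.fileno()), "st_blksize", 0) or 8192
        # iter() with a sentinel keeps calling f.read(size) until it
        # returns b"" at end of file.
        for chunk in iter(lambda: f.read(size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Called as hash_file(path, "sha256"), it never holds more than one 
block of the file in memory at a time.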

> So now I try this:
>
>    sum = os.popen('sha256sum %r' % path).read()

This is not as nice as the above, especially not with a path containing 
strange characters. What about, at least,

def shellquote(*strs):
	return " ".join([
		"'"+st.replace("'","'\\''")+"'"
		for st in strs
	])

sum = os.popen('sha256sum %s' % shellquote(path)).read()

or, even better,

import subprocess
sp = subprocess.Popen(['sha256sum', path],
     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
sp.stdin.close() # generate EOF
sum = sp.stdout.read()
sp.wait()

?
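
If you do stay with a shell command string, the quoting does not have 
to be hand-written: shlex.quote in the standard library (Python 3.3 
and later) does the same job as the shellquote helper above. A small 
sketch, assuming Python 3; the sample path is invented for 
illustration.

```python
import shlex

path = "it's a movie.avi"  # made-up path with a quote and spaces

# quote() wraps the argument so a POSIX shell sees it as one word.
cmd = "sha256sum %s" % shlex.quote(path)
print(cmd)

# split() parses the string the way a POSIX shell would,
# so the original path comes back unchanged.
print(shlex.split(cmd))
```

The list form passed to subprocess.Popen avoids the shell entirely, 
so it still looks like the safer choice.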

> Does hashlib have a file-ready mode, to hide the streaming inside some
> clever DMA operations?

AFAIK not.

Thomas