[Tutor] re:md5sum

John Moylan john.moylan@rte.ie
Fri Jul 18 04:59:01 2003


Thanks, your code helped a lot.

I found that I still had to strip the trailing \n from each line of my
file list, though, with the following:

for path in open('filelist3'):
    f = path.strip()  # strip \n's, otherwise "file not found" type error
    print md5file(open(f, "rb"))
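
For anyone reading this in the archives later, here is a minimal sketch of
the whole duplicate check for current Python versions, where hashlib has
replaced the old md5 module. The names find_duplicates and chunk_size are
made up for illustration, and the list file is assumed to hold one path
per line.

```python
import hashlib
from collections import defaultdict


def md5file(f, chunk_size=1024):
    """Return the md5 hex digest of an open binary file, read in chunks."""
    m = hashlib.md5()
    while True:
        chunk = f.read(chunk_size)
        if not chunk:  # empty bytes object means end of file
            break
        m.update(chunk)
    return m.hexdigest()


def find_duplicates(listfile):
    """Map each md5 digest to the paths sharing it (hypothetical helper)."""
    by_digest = defaultdict(list)
    with open(listfile) as paths:
        for line in paths:
            path = line.strip()  # drop the trailing newline
            if not path:
                continue  # skip blank lines in the list
            with open(path, "rb") as f:
                by_digest[md5file(f)].append(path)
    # keep only digests that were seen for more than one path
    return {d: p for d, p in by_digest.items() if len(p) > 1}
```

Calling find_duplicates('filelist3') would return a dictionary whose values
are lists of paths with identical contents. Note that strip() also removes
leading and trailing spaces, so a file name that really ends in a space
would need rstrip("\n") instead.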



> Hi everyone,
> 
> 
> One other potential bug: readlines() reads the whole file into memory
> at once, and treats it as a text file.  For large files, this can use a
> lot of memory, so a safer approach is to use a "chunked" read():
> 
> ###
> import md5
>
> def md5file(f):
>     """Returns an md5sum hex string of a file."""
>     m = md5.new()
>     while 1:
>         chunk = f.read(1024)     # read at most one kilobyte at a time
>         if not chunk: break      # an empty string means end of file
>         m.update(chunk)
>     return m.hexdigest()
> ###
> 
> 
> I picked a kilobyte arbitrarily: dunno why, I guess it's a nice round
> number. *grin* The important thing is to avoid reading the whole file at
> once, and to read it progressively instead.  md5's update() method is
> designed to be called
> many times precisely because it works on chunks at a time.  With this
> md5file() approach, we can now deal with files of any size without running
> into a memory problem.
> 
> 
> 
> Once we code up this md5file() utility function, John's original question:
> 
> """
> I am going to compute md5sums for each file and sort them; check for
> duplicate files.
> This is relatively easy in bash:
> for i in `cat filelist`; do md5sum $i; done
> """
> 
> 
> has a cute translation in Python:
> 
> ###
> for f in open('filelist'):
>     print md5file(open(f, "rb"))        ## Question: why "rb"?
> ###
> 
> 
> 
> There's one other subtle point about the inner file-open()ing
> loop.  In particular, the "b" ("binary") part of the open() call is
> important.  If we want to make sure we're getting the same results as our
> shell's "md5sum" utility, we must not treat newlines as special
> characters, since they too contribute to the md5 sum.  Opening a file in
> binary mode makes sure newlines are treated like any other byte in
> our file.
> 
> 
> Hope this helps!
> 
> 
> 
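
The point about "rb" in the quoted message can be checked directly. Below
is a small sketch for current Python (using hashlib rather than the old md5
module). It writes a scratch file with Windows-style line endings and
hashes it twice; the file contents and variable names are invented for the
demonstration.

```python
import hashlib
import os
import tempfile

raw = b"line one\r\nline two\r\n"

# write a scratch file containing CRLF line endings
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# binary mode: the bytes come back exactly as stored on disk
with open(path, "rb") as f:
    binary_digest = hashlib.md5(f.read()).hexdigest()

# text mode: Python 3 translates "\r\n" to "\n" while reading
with open(path, "r", encoding="ascii") as f:
    text_digest = hashlib.md5(f.read().encode("ascii")).hexdigest()

os.remove(path)

print("binary:", binary_digest)
print("text:  ", text_digest)
```

The two digests differ, because text mode dropped the two carriage returns
before the data reached md5. Only the binary-mode digest corresponds to the
bytes actually on disk, which is what the shell's md5sum hashes.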


