[Tutor] First program after PyCamp

Dave Angel davea at davea.name
Tue Jun 11 00:13:26 CEST 2013


On 06/10/2013 04:03 PM, bjames at Jamesgang.dyndns.org wrote:
> Hello I just took a 3 day PyCamp and am working on my first program from
> start to finish at the moment and running into a problem.
>
> Not sure how it's supposed to be on this list

Please start by stating your version of Python.  I'm going to assume 
Python 2.7, since in Python 3 you'd have gotten an error from the 
hashing logic after opening that file in text mode.

so I'm going to first
> describe what my program is supposed to do and where I'm having the
> problems, then post the actual code.  Please don't simply come back with a
> working version of the code since I want to know what each step does, but
> also explain where I went wrong and why the new version is better, or even
> better suggest where I should look to fix it without actually fixing it
> for me.
>
> The program is supposed to be given a directory and then look through that
> directory and all sub-directories and find duplicate files and output the
> list of duplicates to a text file.  This is accomplished by generating a
> MD5 hash for the files then comparing that hash to a list of previous
> hashes that have been generated.
>
> My problem is the text file that is output seems to contain EVERY file
> that the program went through rather than just the duplicates.  I've tried
> to mentally step through and figure out where/why it's listing all of them
> rather than just duplicates and I seem to be failing to spot it.

First some general comments.  Please don't use so many columns for your 
source file.  That may work for you on your own screen, but it's 
problematic when someone else has to deal with the code.  In my case, 
several of those comments wrapped in the email, and I had to re-edit 
them before trying things.

Next, try to factor your code into reasonable pieces.  You have a single 
function that does at least three things; make it three functions.  One 
reason is that you can then figure out which of them are correct and 
which need work.  Another reason is that you can reuse the logic you 
spent time building and testing.

First function gathers filenames that meet a particular criterion.  In 
this case, it's simply all files in a subtree of a starting place. 
After you get the hang of things, you'll realize this could better be a 
generator, but no hurry yet.  For now, it should build a list, and 
RETURN that list.  Not just tuck things into some global.

Second function generates an md5 checksum for each of those files, uses 
that checksum as a key, and groups the files having the same key together. 
It should return the dict, rather than modifying some global one.

Third function analyzes the dict created by the second one, and prepares 
a report (to file, or to print).
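To illustrate, here's one way those three pieces might look, keeping the 
os.walk/hashlib approach from the code below.  The function names are my 
own invention, not anything from the original post:

```python
import hashlib
import os


def gather_files(rootdir):
    """Walk rootdir and return a list of full paths to every file in the subtree."""
    paths = []
    for path, dirs, files in os.walk(rootdir):
        for filename in files:
            paths.append(os.path.join(path, filename))
    return paths


def group_by_md5(paths):
    """Return a dict mapping md5 hexdigest -> list of files with that digest."""
    groups = {}
    for fullname in paths:
        md5 = hashlib.md5()
        with open(fullname, "rb") as f:      # binary mode: hash the raw bytes
            while True:
                chunk = f.read(4096)
                if not chunk:
                    break
                md5.update(chunk)
        groups.setdefault(md5.hexdigest(), []).append(fullname)
    return groups


def report_duplicates(groups):
    """Return only the (digest, files) groups containing more than one file."""
    return [(digest, files)
            for digest, files in groups.items()
            if len(files) > 1]
```

Each function returns its result instead of poking at a global, so each 
can be tested on its own.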


>
> here's the code:
> ---begin code paste---
> import os, hashlib
> #rootdir = 'c:\Python Test'

I know this is commented out, but notice that the backslash is a big 
risk here.  If a directory happened to start with t, for example, the 
filename would have a tab in it.  Either use forward slashes (which do 
work in Windows), or use a raw literal.  Or escape it with doubling the 
backslashes.
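A quick demonstration with a made-up path (note it starts with t, the 
dangerous case):

```python
# '\t' is an escape sequence, so the backslash silently becomes a TAB.
bad = 'c:\test'

# Three safe alternatives that keep the path intact:
good_raw = r'c:\test'    # raw literal: backslash kept as-is
good_esc = 'c:\\test'    # doubled backslash
good_fwd = 'c:/test'     # forward slashes work in Windows APIs too

assert '\t' in bad              # the TAB really is in there
assert good_raw == good_esc     # both spell the same 7-character path
```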

> hashlist = {}  # content signature -> list of filenames
> dups = []
>
> def get_hash(rootdir):
> #   """goes through directory tree, compares md5 hash of all files,
> #   combines files with same hash value into list in hashmap directory"""
>      for path, dirs, files in os.walk(rootdir):
>          #this section goes through the given directory, and all
>          #subdirectories/files below as part of a loop reading them in
>          for filename in files:
>              #steps through each file and starts the process of getting
>              #the MD5 hashes for the file comparing that hash to known
>              #hashes that have already been calculated and either merges
>              #it with the known hash (which indicates duplicates) or adds
>              #it so that it can be compared to future files
>              fullname = os.path.join(path, filename)
>              with open(fullname) as f:

You're defaulting to text mode, so on Windows the checksum will not in 
general match the standard md5sum utility, which reads the raw bytes. 
And in Python 3, this would try to decode the file as text, using some 
default encoding, and even if that didn't fail, you'd then get an error 
down below, since md5's update() works on bytes, not on characters. 
Open the file in binary mode instead:

                with open(fullname, "rb") as f:
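As a quick sanity check on the read-in-chunks loop that follows, feeding 
md5 incrementally gives exactly the same digest as hashing all the bytes 
at once (the payload here is made up):

```python
import hashlib

data = b"x" * 10000  # stand-in for a file's raw bytes

# One-shot digest of the whole payload.
whole = hashlib.md5(data).hexdigest()

# Incremental digest, feeding 4096-byte chunks as the loop below does.
md5 = hashlib.md5()
for i in range(0, len(data), 4096):
    md5.update(data[i:i + 4096])

assert md5.hexdigest() == whole  # chunking doesn't change the result
```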

>                  #does the actual hashing
>                  md5 = hashlib.md5()
>                  while True:
>                      d = f.read(4096)
>                      if not d:
>                          break
>                      md5.update(d)
>                  h = md5.hexdigest()
>                  filelist = hashlist.setdefault(h, [])
>                  filelist.append(fullname)
>
>      for x in hashlist:
>          currenthash = hashlist[x]

This is confusing, both for the names and for the inefficiency of the 
second lookup (hashlist[x]).  Try
        for hash, files in hashlist.items():

hash is what you called x, and files is what you called currenthash.
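A small sketch with made-up digests and filenames shows the shape of 
that loop:

```python
hashlist = {
    "digest-one": ["a.txt"],                    # hypothetical digest keys
    "digest-two": ["b.txt", "copy_of_b.txt"],
}

# items() yields key and value together -- no second dictionary lookup.
dups = [files for digest, files in hashlist.items() if len(files) > 1]
assert dups == [["b.txt", "copy_of_b.txt"]]
```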

>          #goes through and if hash has more than one file listed with it
>          #considers it a duplicate and adds it to the output list
>          if len(currenthash) > 1:
>              dups.append(currenthash)

Now with the renaming, this becomes:

            if len(files) > 1:
                 dups.append(files)

Note that if I were you, I'd make it
                 dups.append((hash, files))

(list.append() takes a single argument, so pair them up in a tuple); that 
way, the output would (by default) show the hash that these files 
supposedly shared.

>      output = open('duplicates.txt','w')
>      output.write(str(dups))
>      output.close()

Clearly this output could be made prettier, once you're confident the 
other parts are working.  But I imagine you just hadn't gotten to that yet.
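For instance, a sketch only, assuming dups holds (digest, list-of-files) 
pairs as suggested above:

```python
def format_report(dups):
    """Return one readable block per duplicate group: digest, then its files."""
    lines = []
    for digest, files in dups:
        lines.append(digest)
        for name in files:
            lines.append("    " + name)
    return "\n".join(lines)

# Hypothetical data: one group of two files sharing a digest.
dups = [("9e107d9d372bb6826bd81d3542a419d6", ["a.txt", "sub/a_copy.txt"])]
report = format_report(dups)
```

Then write the report string to duplicates.txt instead of str(dups).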

Now, the code worked fine for me, and I double-checked each of the 
duplicates it found with md5sum (standard on Linux, and available for 
Windows).

When you look at the output, are there any entries that do NOT have 
multiple files listed?  Have you done a dir /s /b of the same directory 
tree, redirected it to a file, and compared the size of that file to 
what this code finds?


-- 
DaveA

