[Tutor] First program after PyCamp

bjames at Jamesgang.dyndns.org
Tue Jun 11 00:21:39 CEST 2013


Thank you for your quick response, Dave.  I found out after I submitted
my code here that it does in fact work now.  I had spent several
interrupted hours trying to get it to work and must have changed
something and forgotten about the change before submitting it here.  I
appreciate your suggestions and plan to incorporate at least some of
them to make the code better.  (I understand it's not the Pythonic
method to keep changing code once it's working 'well enough', but I'm
going to for the learning experience.)

> On 06/10/2013 04:03 PM, bjames at Jamesgang.dyndns.org wrote:
>> Hello I just took a 3 day PyCamp and am working on my first program from
>> start to finish at the moment and running into a problem.
>>
>> Not sure how it's supposed to be on this list
>
> Please start by defining your version of Python.  I'm going to assume
> Python 2.7, since in Python 3 you'd have gotten an error from the
> hashing logic after opening that file in text mode.
>
>> so I'm going to first
>> describe what my program is supposed to do and where I'm having the
>> problems, then post the actual code.  Please don't simply come back
>> with a working version of the code since I want to know what each
>> step does, but also explain where I went wrong and why the new
>> version is better, or even better suggest where I should look to fix
>> it without actually fixing it for me.
>>
>> The program is supposed to be given a directory and then look through
>> that directory and all sub-directories and find duplicate files and
>> output the list of duplicates to a text file.  This is accomplished
>> by generating an MD5 hash for the files and then comparing that hash
>> to a list of previous hashes that have been generated.
>>
>> My problem is the text file that is output seems to contain EVERY
>> file that the program went through rather than just the duplicates.
>> I've tried to mentally step through and figure out where/why it's
>> listing all of them rather than just duplicates, and I seem to be
>> failing to spot it.
>
> First, some general comments.  Please don't use such long lines in
> your source file.  That may work for you on your own screen, but it's
> problematic when someone else has to deal with the code.  In my case,
> several of those comments wrapped in the email, and I had to re-edit
> them before trying things.
>
> Next, try to factor your code into reasonable pieces.  You have a
> single function that does at least three things; make it three
> functions.  One reason is that you can then frequently figure out
> which of them are correct, and which need work.  Another reason is
> that you can reuse the logic you spent time building and testing.
>
> The first function gathers filenames that meet a particular criterion.
> In this case, it's simply all files in a subtree of a starting place.
> After you get the hang of things, you'll realize this could better be
> a generator, but no hurry yet.  At present, it should build a list,
> and RETURN that list, not just tuck things into some global.
>
> The second function generates md5 checksums of all those files, uses
> each checksum as a key, and groups the files having the same key
> together.  It should return the dict, rather than modifying some
> global one.
>
> The third function analyzes the dict created by the second one, and
> prepares a report (to a file, or to print).
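>
> As a rough sketch of that three-function shape (the names
> collect_files, group_by_md5, and report_duplicates here are just
> placeholders I've invented, not the only reasonable choices):
>
>     import os, hashlib
>
>     def collect_files(rootdir):
>         """Return a list of all file paths under rootdir."""
>         result = []
>         for path, dirs, files in os.walk(rootdir):
>             for filename in files:
>                 result.append(os.path.join(path, filename))
>         return result
>
>     def group_by_md5(filenames):
>         """Return a dict mapping md5 hexdigest -> list of filenames."""
>         groups = {}
>         for fullname in filenames:
>             md5 = hashlib.md5()
>             # read in binary mode, in chunks, so large files are fine
>             with open(fullname, "rb") as f:
>                 while True:
>                     chunk = f.read(4096)
>                     if not chunk:
>                         break
>                     md5.update(chunk)
>             groups.setdefault(md5.hexdigest(), []).append(fullname)
>         return groups
>
>     def report_duplicates(groups, outname):
>         """Write only the groups with more than one file to outname."""
>         with open(outname, 'w') as output:
>             for checksum, files in groups.items():
>                 if len(files) > 1:
>                     output.write('%s: %s\n' % (checksum, files))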
>
>
>>
>> here's the code:
>> ---begin code paste---
>> import os, hashlib
>> #rootdir = 'c:\Python Test'
>
> I know this is commented out, but notice that the backslash is a big
> risk here.  If a directory happened to start with t, for example, the
> filename would have a tab in it.  Either use forward slashes (which do
> work on Windows), use a raw literal, or escape it by doubling the
> backslashes.
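>
> For example, any of these spellings avoids the problem:
>
>     rootdir = 'c:/Python Test'      # forward slashes work on Windows
>     rootdir = r'c:\Python Test'     # raw string literal
>     rootdir = 'c:\\Python Test'     # doubled backslashes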
>
>> hashlist = {}  # content signature -> list of filenames
>> dups = []
>>
>> def get_hash(rootdir):
>> #   """goes through directory tree, compares md5 hash of all files,
>> #   combines files with same hash value into list in hashmap directory"""
>>      for path, dirs, files in os.walk(rootdir):
>>          #this section goes through the given directory, and all
>>          #subdirectories/files below as part of a loop reading them in
>>          for filename in files:
>>              #steps through each file and starts the process of getting
>>              #the MD5 hashes for the file, comparing that hash to known
>>              #hashes that have already been calculated and either merges
>>              #it with the known hash (which indicates duplicates) or adds
>>              #it so that it can be compared to future files
>>              fullname = os.path.join(path, filename)
>>              with open(fullname) as f:
>
> You're defaulting to text mode, so the checksum will not in general
> match what the standard md5sum tool gets for the exact same file.  And
> in Python 3, this would try to decode the file as text, using some
> default encoding, and even if it didn't fail, would then get an error
> down below, since md5's update() doesn't work on characters, but on
> bytes.
>
>                 with open(fullname, "rb") as f:
>
>>                  #does the actual hashing
>>                  md5 = hashlib.md5()
>>                  while True:
>>                      d = f.read(4096)
>>                      if not d:
>>                          break
>>                      md5.update(d)
>>                  h = md5.hexdigest()
>>                  filelist = hashlist.setdefault(h, [])
>>                  filelist.append(fullname)
>>
>>      for x in hashlist:
>>          currenthash = hashlist[x]
>
> This is confusing both for the names and for the inefficiency (each
> hashlist[x] does a second dictionary lookup).  Try
>
>         for hash, files in hashlist.items():
>
> hash is what you called x, and files is what you called currenthash.
>
>>          #goes through and if a hash has more than one file listed with
>>          #it, considers it a duplicate and adds it to the output list
>>          if len(currenthash) > 1:
>>              dups.append(currenthash)
>
> Now with the renaming, this becomes:
>
>             if len(files) > 1:
>                 dups.append(files)
>
> Note that if I were you, I'd make it
>
>                 dups.append((hash, files))
>
> (list.append takes a single argument, so pack the hash and the files
> into a tuple).  That way, the output would (by default) show the hash
> that these files supposedly shared.
>
>>      output = open('duplicates.txt','w')
>>      output.write(str(dups))
>>      output.close()
>
> Clearly this output could be made prettier, once you're confident the
> other parts are working.  But I imagine you just hadn't gotten to that
> yet.
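>
> For instance, if dups holds (hash, files) tuples as suggested above,
> one simple format (just a sketch) would be:
>
>     with open('duplicates.txt', 'w') as output:
>         for checksum, files in dups:
>             output.write(checksum + '\n')
>             for name in files:
>                 output.write('    ' + name + '\n')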
>
> Now, the code worked fine for me, and I double-checked each of the
> duplicates it found with md5sum (standard on Linux, and available for
> Windows).
>
> When you look at the output, are there any entries that do NOT have
> multiple files listed?  Have you done a dir /s /b of the same directory
> tree, redirected it to a file, and compared the size of that file to
> what this code finds?
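>
> For example (the path here is just an illustration):
>
>     dir /s /b "c:\Python Test" > filelist.txt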
>
>
> --
> DaveA



