[Tutor] Cleaning up output
Prasad, Ramit
ramit.prasad at jpmorgan.com
Wed Jul 3 22:30:58 CEST 2013
bjames at Jamesgang.dyndns.org wrote:
> I've written my first program to take a given directory and look in all
> directories below it for duplicate files (duplicate being defined as
> having the same MD5 hash, which I know isn't a perfect solution, but for
> what I'm doing is good enough)
>
> My problem now is that my output file is a rather confusing jumble of
> paths and I'm not sure the best way to make it more user readable. My gut
> reaction would be to go through and list by first directory, but is there
> a logical way to do it so that all the groupings that have files in the
> same two directories would be grouped together?
>
> So I'm thinking I'd have:
> First File Dir /some/directory/
> Duplicate directories:
> some/other/directory/
> Original file 1, duplicate file 1
> Original file 2, duplicate file 2
> some/third directory/
> original file 3, duplicate file 3
>
> and so forth, where the original file would be the file name from the first
> directory, so that all the entries under that heading share the same originals.
>
> I fear I'm not explaining this well but I'm hoping someone can either ask
> questions to help get out of my head what I'm trying to do or can decipher
> this enough to help me.
>
> Here's a git repo of my code if it helps:
> https://github.com/CyberCowboy/FindDuplicates
>
Your file was not too big to paste straight into the email, so I have pasted it below:
import os, hashlib

hashdict = {}  # content signature -> list of filenames
dups = []

def dupe(rootdir):
    """goes through directory tree, compares md5 hash of all files,
    combines files with same hash value into list in hashmap directory"""
    for path, dirs, files in os.walk(unicode(rootdir)):
        # this section goes through the given directory and all
        # subdirectories/files below it as part of a loop reading them in
        for filename in files:
            # steps through each file, gets its MD5 hash, and compares that
            # hash to the known hashes already calculated: either merges it
            # with a known hash (which indicates a duplicate) or adds it so
            # it can be compared against future files
            fullname = os.path.join(path, filename)
            with open(fullname, 'rb') as f:  # binary mode so the hash covers the exact bytes
                # does the actual hashing
                md5 = hashlib.md5()
                while True:
                    d = f.read(4096)
                    if not d:
                        break
                    md5.update(d)
                h = md5.hexdigest()
            filelist = hashdict.setdefault(h, [])
            filelist.append(fullname)

    for currenthash in hashdict.itervalues():
        # if a hash has more than one file listed under it, consider them
        # duplicates and add the group to the output list
        if len(currenthash) > 1:
            dups.append(currenthash)

    output = open('duplicates.txt', 'w')
    for x in dups:
        output.write(str(x))
        output.write("\n")
    output.close()
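(Presumably everything is driven by a single call to dupe() on the top-level
directory; an assumed invocation, with a placeholder path, would be:)

if __name__ == '__main__':
    dupe('/some/directory')  # placeholder root directory, not from the original script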
Why do you only read 4096 bytes at a time? Guarding against large
files?
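(For reference, reading fixed-size chunks is the standard way to hash
arbitrarily large files without pulling them wholly into memory. A minimal
sketch in the same Python 2 style as the posted script; the 64 KiB block size
and the helper name are my own choices, not part of the original:)

import hashlib

def file_md5(fullname, blocksize=65536):
    # hash the file in fixed-size chunks so memory use stays bounded,
    # no matter how large the file is
    md5 = hashlib.md5()
    with open(fullname, 'rb') as f:
        for chunk in iter(lambda: f.read(blocksize), ''):  # '' sentinel: Python 2 str at EOF
            md5.update(chunk)
    return md5.hexdigest()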
I would probably sort on original file and print "original: dup1 dup2".
That way it should automatically put original files together by
directory.
for currenthash in hashdict.itervalues():
    # if a hash has more than one file listed under it, consider them
    # duplicates and add the group to the output list
    if len(currenthash) > 1:
        dups.append(currenthash)

dups.sort()  # should sort by original file, then first duplicate, etc.
# if you want to sort only by the original file, use
# dups.sort(key=lambda x: x[0])

with open('duplicates.txt', 'w') as output:
    for x in dups:
        output.write('Original {0} | {1}\n'.format(x[0], ' '.join(x[1:])))
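If you specifically want the directory-grouped report you sketched (first-file
directory, then each duplicate directory with its original/duplicate pairs),
one way to build it from the same dups list is to key on os.path.dirname() of
the original file. This is only a rough sketch of one possible layout, not the
script's existing behaviour:

import os
from collections import defaultdict

# group each duplicate set by the directory of its "original" (first) file,
# then by the directory of each duplicate; the report layout is approximate
bydir = defaultdict(lambda: defaultdict(list))
for group in dups:
    original = group[0]
    for dup in group[1:]:
        bydir[os.path.dirname(original)][os.path.dirname(dup)].append((original, dup))

with open('duplicates.txt', 'w') as output:
    for firstdir in sorted(bydir):
        output.write('First File Dir %s\n' % firstdir)
        output.write('Duplicate directories:\n')
        for dupdir in sorted(bydir[firstdir]):
            output.write('  %s\n' % dupdir)
            for original, dup in bydir[firstdir][dupdir]:
                output.write('    %s, %s\n' % (os.path.basename(original),
                                               os.path.basename(dup)))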
~Ramit