[Tutor] Cleaning up output

Dave Angel davea at davea.name
Wed Jul 3 23:10:52 CEST 2013


On 07/03/2013 03:51 PM, bjames at Jamesgang.dyndns.org wrote:
> I've written my first program to take a given directory and look in all
> directories below it for duplicate files (duplicate being defined as
> having the same MD5 hash, which I know isn't a perfect solution, but for
> what I'm doing is good enough)

This is a great first project for learning Python.  It's a utility that 
doesn't write any data to the disk (other than the result file), so 
bugs won't cause much havoc.  Trust me, you will have bugs; we all do. 
One of the things experience teaches you is how to limit the damage 
that bugs do before they're discovered.

>
> My problem now is that my output file is a rather confusing jumble of
> paths and I'm not sure the best way to make it more user readable.  My gut
> reaction would be to go through and list by first directory, but is there
> a logical way to do it so that all the groupings that have files in the
> same two directories would be grouped together?

I've run into the same "presentation problem" with my own similar 
utilities.  Be assured, there's no one "right answer."

The first question is: have you considered what you want when there are 
MORE than two copies of one of those files?  Once you know what you'd 
like to see for four identical files, you might have a better idea what 
you should do even for two.  Additionally, consider that two identical 
files may be in the same directory, under different names.

Anyway, if you can explain why you want a particular grouping, we might 
better understand how to accomplish it.

>
> So I'm thinking I'd have:
> First File Dir /some/directory/
> Duplicate directories:
> some/other/directory/
>     Original file 1, duplicate file 1
>     Original file 2, duplicate file 2
> some/third directory/
>     original file 3, duplicate file 3

At present, this First File Dir could be any of the directories 
involved; without some effort, os.walk doesn't promise you any order of 
processing.  But if you want them to appear in sorted order, you can 
sort at key points inside your os.walk loop, and they'll at least come 
out in a recognizable order.  (Some OSes may also sort what they feed 
to os.walk, but you'd do better not to count on it.)  You could also 
sort each list in itervalues of hashdict, after the dict is fully 
populated.
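
Both ideas in one minimal sketch (rootdir and hashdict are stand-ins 
for the names in your code):

    import os

    rootdir = "."    # stand-in; your code gets this from the caller
    hashdict = {}    # md5 digest -> list of paths, as in your code

    for dirpath, dirnames, filenames in os.walk(rootdir):
        dirnames.sort()      # in-place sort; os.walk descends in this order
        for name in sorted(filenames):
            fullname = os.path.join(dirpath, name)
            # ... hash fullname and append to hashdict[digest], as you do now

    # and/or, once the dict is fully populated:
    for paths in hashdict.itervalues():
        paths.sort()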

Even with sorting, you run into the problem that there may be duplicates 
between some/other/directory and some/third/directory that are not in 
/some/directory.  So in the sample you show above, they won't be listed 
with the ones that are in /some/directory.

>
> and so forth, where the Original file would be the file name in the First
> files so that all the ones are the same there.
>
> I fear I'm not explaining this well but I'm hoping someone can either ask
> questions to help get out of my head what I'm trying to do or can decipher
> this enough to help me.
>
> Here's a git repo of my code if it helps:
> https://github.com/CyberCowboy/FindDuplicates
>

At 40 lines, you should have just included it.  It's usually much better 
to include the code inline if you want any comments on it.  Think of 
what the archives will show in a year, when you've removed that repo or 
thoroughly updated it.  Somebody reading then won't be able to make 
sense of comments directed at the current version of the code.

BTW, thanks for posting as text, since that'll mean that when you do 
post code, it shouldn't get mangled.

So I'll comment on the code.

You never call the dupe() function, so presumably this is a module 
intended to be used from somewhere else.  But if that's the case, I 
would have expected it to be factored better, at least to separate the 
input processing from the output-file formatting.  That way you could 
re-use the duplicate-finding logic and supply a new output format 
without duplicating anything.  The first function could return the 
hashdict, and the second could analyze it to produce a particular 
formatted output.
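
Something like this shape, roughly (the function names are my own 
invention, and the hashing loop is abbreviated from your code):

    import os
    import hashlib

    def find_dupes(rootdir):
        """Walk rootdir; return a dict mapping md5 digest -> list of paths."""
        hashdict = {}                # local now, not a global
        for dirpath, dirnames, filenames in os.walk(rootdir):
            for name in filenames:
                fullname = os.path.join(dirpath, name)
                with open(fullname, "rb") as f:
                    digest = hashlib.md5(f.read()).hexdigest()
                hashdict.setdefault(digest, []).append(fullname)
        return hashdict

    def report_dupes(hashdict, outname):
        """Write one line per group of files sharing a digest."""
        with open(outname, "w") as out:
            for digest, paths in sorted(hashdict.items()):
                if len(paths) > 1:
                    out.write(", ".join(sorted(paths)) + "\n")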

The hashdict and dups variables should be initialized within the 
function, since they're not used anywhere outside it.  Avoid 
non-constant globals.  And of course, once you factor the code, dups 
will live in the second function only.

You do have an if __name__ == "__main__": line, but it's inside the 
function.  Presumably you meant it to be at the left margin.  And 
importing inside a conditional is seldom a good idea, though it doesn't 
matter here since you're not using the import; normally you want all 
your imports at the top, where they're easy to spot.  You also probably 
want a call to dupe() inside the conditional, and perhaps some parsing 
of argv to get rootdir.
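
With the module factored as sketched earlier, the bottom of the file 
might look roughly like this (the output filename is just an example):

    import sys    # at the top of the file, with your other imports

    if __name__ == "__main__":
        # take the root directory from the command line, defaulting to "."
        rootdir = sys.argv[1] if len(sys.argv) > 1 else "."
        report_dupes(find_dupes(rootdir), "dupes.txt")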

You don't mention your OS, but many OSes have symbolic links or the 
equivalent, and there's no code here to handle that possibility. 
Symlinks are a pain to get right.  You could simply state in your docs 
that no symlinks are allowed under the rootdir.
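
If you'd rather handle them than document around them, the cheapest 
policy is to skip them.  A sketch (rootdir is a stand-in, as before):

    import os

    rootdir = "."    # stand-in, as before

    # os.walk doesn't descend into symlinked directories unless you pass
    # followlinks=True, but symlinked *files* still appear in filenames,
    # so skip those explicitly:
    for dirpath, dirnames, filenames in os.walk(rootdir):
        for name in filenames:
            fullname = os.path.join(dirpath, name)
            if os.path.islink(fullname):
                continue         # don't hash the same data twice via a link
            # ... hash fullname as usual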

Your open() call has no mode argument.  If you want the md5 to be 
"correct", the file has to be opened in binary.  So you want something 
like:
     with open(fullname, "rb") as f:

It doesn't matter this time, since you never export the hashes.  But if 
you ever needed to debug it, it'd be nice if the hashes matched the 
standard values provided by md5sum.  Besides, if you're on Windows, 
treating a binary file as though it were text could increase the 
likelihood of getting md5 collisions.
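
For what it's worth, the usual way to get digests that match md5sum on 
files of any size, without reading the whole file into memory at once, 
is to feed the hash object in chunks.  A sketch (the function name is 
my own):

    import hashlib

    def md5_of_file(fullname, blocksize=65536):
        """Return the hex md5 digest of a file, read in binary chunks."""
        md5 = hashlib.md5()
        with open(fullname, "rb") as f:
            while True:
                block = f.read(blocksize)
                if not block:        # empty string means end of file
                    break
                md5.update(block)
        return md5.hexdigest()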




-- 
DaveA

