[Tutor] Cleaning up output
Dave Angel
davea at davea.name
Wed Jul 3 23:10:52 CEST 2013
On 07/03/2013 03:51 PM, bjames at Jamesgang.dyndns.org wrote:
> I've written my first program to take a given directory and look in all
> directories below it for duplicate files (duplicate being defined as
> having the same MD5 hash, which I know isn't a perfect solution, but for
> what I'm doing is good enough)
This is a great first project for learning Python. It's a utility which
doesn't write any data to the disk (other than the result file), and
therefore bugs won't cause much havoc. Trust me, you will have bugs, we
all do. One of the things experience teaches you is how to isolate the
damage that bugs do before they're discovered.
>
> My problem now is that my output file is a rather confusing jumble of
> paths and I'm not sure the best way to make it more user readable. My gut
> reaction would be to go through and list by first directory, but is there
> a logical way to do it so that all the groupings that have files in the
> same two directories would be grouped together?
I've come up with the same "presentation problem" with my own similar
utilities. Be assured, there's no one "right answer."
The first question is: have you considered what you want when there are MORE
than two copies of one of those files? When you know what you'd like to
see if there are four identical files, you might have a better idea what
you should do even for two. Additionally, consider that two identical
files may be in the same directory, with different names.
Anyway, if you can explain why you want a particular grouping, we might
better understand how to accomplish it.
>
> So I'm thinking I'd have:
> First File Dir /some/directory/
> Duplicate directories:
> some/other/directory/
> Original file 1, duplicate file 1
> Original file 2, duplicate file 2
> some/third directory/
> original file 3, duplicate file 3
At present, this First File Dir could be any of the directories
involved; without some effort, os.walk doesn't promise you any order of
processing. But if you want them to appear in sorted order, you can do
sorts at key points inside your os.walk code, and they'll at least come
out in an order that's recognizable. (Some OSes may also sort the entries
they feed to os.walk, but you'd do better not to count on it.) You could
also sort each list in hashdict.itervalues(), after the dict is fully
populated.
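A minimal sketch of that in-place sorting trick (walk_sorted is my name for
a hypothetical helper, not something in your repo):

```python
import os

def walk_sorted(rootdir):
    """Yield (dirpath, filename) pairs in a deterministic, sorted order."""
    for dirpath, dirnames, filenames in os.walk(rootdir):
        dirnames.sort()    # sorting in place controls which order os.walk descends
        filenames.sort()   # and this controls the order files come out
        for name in filenames:
            yield dirpath, name
```

Because os.walk re-reads the dirnames list after each yield, sorting it in
place is enough to make the whole traversal deterministic.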
Even with sorting, you run into the problem that there may be duplicates
between some/other/directory and some/third/directory that are not in
/some/directory. So in the sample you show above, they won't be listed
with the ones that are in /some/directory.
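One way around that is to key each duplicate group by the set of directories
it spans, so groups spanning the same directories come out together. A
hedged sketch, with hypothetical names (it assumes the digest-to-paths dict
your code builds):

```python
import os
from collections import defaultdict

def group_by_dirs(hashdict):
    """Group duplicate sets by the frozenset of directories they span."""
    by_dirs = defaultdict(list)
    for digest, paths in hashdict.items():
        if len(paths) > 1:                 # only groups with actual duplicates
            dirs = frozenset(os.path.dirname(p) for p in paths)
            by_dirs[dirs].append(sorted(paths))
    return by_dirs
```

You'd then print one heading per directory-set, followed by its file groups.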
>
> and so forth, where the Original file would be the file name in the First
> files so that all the ones are the same there.
>
> I fear I'm not explaining this well but I'm hoping someone can either ask
> questions to help get out of my head what I'm trying to do or can decipher
> this enough to help me.
>
> Here's a git repo of my code if it helps:
> https://github.com/CyberCowboy/FindDuplicates
>
At 40 lines, you should have just included it. It's usually much better
to include the code inline if you want any comments on it. Think of
what the archives are going to show in a year, when you've removed that
repo, or thoroughly updated it. Somebody at that time will not be able
to make sense of comments directed at the current version of the code.
BTW, thanks for posting as text, since that'll mean that when you do
post code, it shouldn't get mangled.
So I'll comment on the code.
You never call the dupe() function, so presumably this is a module
intended to be used from some place else. But if that's the case, I
would have expected it to be factored better, at least to separate the
input processing from the output file formatting. That way you could
re-use the dups logic and provide a new save formatting without
duplicating anything. The first function could return the hashdict, and
the second one could analyze it to produce a particular formatted output.
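For instance, the split might look something like this (the function names
are mine, not from your repo, and this reads each file whole for brevity):

```python
import hashlib
import os
from collections import defaultdict

def build_hashdict(rootdir):
    """Map each MD5 digest to the list of paths that hash to it."""
    hashdict = defaultdict(list)   # local to the function, not a module global
    for dirpath, dirnames, filenames in os.walk(rootdir):
        for name in filenames:
            fullname = os.path.join(dirpath, name)
            with open(fullname, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            hashdict[digest].append(fullname)
    return hashdict

def format_dups(hashdict):
    """Return one report line per group of identical files."""
    lines = []
    for digest, paths in sorted(hashdict.items()):
        if len(paths) > 1:
            lines.append("%s: %s" % (digest, ", ".join(sorted(paths))))
    return lines
```

With that split, a different report layout only means replacing format_dups;
the hashing pass stays untouched.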
The hashdict and dups variables should be initialized within the
function, since they are not going to be used outside. Avoid non-const
globals. And of course once you factor it, dups will be in the second
function only.
You do have an if __name__ == "__main__": line, but it's inside the
function. Probably you meant it to be at the left margin. And
importing inside a conditional is seldom a good idea, though it doesn't
matter here since you're not using the import. Normally you want all
your imports at the top, so they're easy to spot. You also probably
want a call to dupe() inside the conditional. And perhaps some parsing
of argv to get rootdir.
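Put together, the skeleton would be something like this (dupe here is just a
placeholder for your actual function):

```python
import sys

def dupe(rootdir):
    """Placeholder for the duplicate-finding logic."""
    print("scanning", rootdir)

if __name__ == "__main__":   # at module level, flush left
    # take the root directory from the command line, defaulting to "."
    rootdir = sys.argv[1] if len(sys.argv) > 1 else "."
    dupe(rootdir)
```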
You don't mention your OS, but many OSes have symbolic links or the
equivalent. There's no code here to handle that possibility. Symlinks
are a pain to handle correctly; you could simply document that no
symlinks are allowed under the rootdir.
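If you do want to tolerate them, the cheap defensive move is to skip them: a
sketch (regular_files is a hypothetical helper name):

```python
import os

def regular_files(rootdir):
    """Walk rootdir, skipping symlinked files to avoid double counting."""
    # followlinks=False is os.walk's default, so symlinked dirs aren't entered
    for dirpath, dirnames, filenames in os.walk(rootdir):
        for name in filenames:
            fullname = os.path.join(dirpath, name)
            if not os.path.islink(fullname):
                yield fullname
```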
Your open() call has no mode argument. If you want the md5 to be
"correct", the file has to be opened in binary. So you want something like:
with open(fullname, "rb") as f:
It doesn't matter this time, since you never export the hashes. But if
you ever needed to debug it, it'd be nice if the hashes matched the
standard values provided by md5sum. Besides, if you're on Windows,
treating a binary file as though it were text could increase the
likelihood of getting md5 collisions.
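Reading in binary, in fixed-size chunks, also keeps memory flat on big files
and produces digests that agree with md5sum. A sketch (md5_of is my name,
not yours):

```python
import hashlib

def md5_of(path, chunksize=65536):
    """Hash a file in binary mode, in chunks, so results match md5sum."""
    h = hashlib.md5()
    with open(path, "rb") as f:   # "rb": no newline translation on Windows
        for chunk in iter(lambda: f.read(chunksize), b""):
            h.update(chunk)
    return h.hexdigest()
```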
--
DaveA