how can I make this script shorter?

Christos TZOTZIOY Georgiou tzot at sil-tec.gr
Tue Feb 22 05:25:02 EST 2005


On Tue, 22 Feb 2005 00:34:39 -0800, rumours say that Lowell Kirsh
<lkirsh at cs.ubc.ca> might have written:

>I have a script which I use to find all duplicates of files within a 
>given directory and all its subdirectories. It seems like it's longer 
>than it needs to be but I can't figure out how to shorten it. Perhaps 
>there are some Python features or libraries I'm not taking advantage of.
>
>The way it works is that it puts references to all the files in a 
>dictionary with file size being the key. The dictionary can hold 
>multiple values per key. Then it looks at each key and all the 
>associated files (which are the same size). Then it uses filecmp to see 
>if they are actually byte-for-byte copies.
>
>It's not 100% complete but it's pretty close.
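
For what it's worth, the size-keyed dictionary plus filecmp approach you
describe fits in very few lines.  A rough sketch -- I haven't seen your
code, so find_dupes and the pairwise filecmp loop are my own guesses at
your structure:

import filecmp
import os
from collections import defaultdict
from itertools import combinations

def find_dupes(top):
    # rough reconstruction of the approach described above,
    # not Lowell's actual code
    # group files by size: only same-size files can be identical
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)
    # compare same-size files byte for byte
    for paths in by_size.values():
        for a, b in combinations(paths, 2):
            if filecmp.cmp(a, b, shallow=False):
                print(a, "==", b)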

That said, I can't advise much on code length; my own dupefind.py script
is 361 lines, but its algorithm is slightly more complex in order to
speed things up (and it also optionally hardlinks identical files on
POSIX and NTFS filesystems).  If in your case there are lots of files of
several MiB each that often match on size, you can avoid many full
byte-for-byte comparisons by matching on a hash (md5 or sha) instead.

You could also compare on a hash of just the first few kiB first (I
check 8 kiB), to see whether you need to read the whole file at all.
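
In code that check is only a few lines (8 kiB and md5 are simply what I
use; sha would do just as well):

import hashlib

def initial_hash(path, nbytes=8 * 1024):
    # hash only the first few kiB: cheap, and enough to rule out most
    # same-size files without reading them in full
    # (8 kiB is just the figure I use; any small size works)
    with open(path, "rb") as f:
        return hashlib.md5(f.read(nbytes)).hexdigest()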

So:

for every file:
  if other files exist with the same size:
    calculate hash of the first few kiB
    if the file has the same "initial" hash as other files:
      calculate full hash
      report all files with the same full hash

Something like that.
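
Fleshed out, that outline might look like the sketch below (duplicates,
full_hash and the 1 MiB read size are my choices, not anything from your
script):

import hashlib
import os
from collections import defaultdict

def initial_hash(path, nbytes=8 * 1024):
    # same helper as above, repeated so this sketch runs on its own
    with open(path, "rb") as f:
        return hashlib.md5(f.read(nbytes)).hexdigest()

def full_hash(path, chunk=1 << 20):
    # hash the whole file in chunks so big files need not fit in memory
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def duplicates(top):
    # stage 1: group files by size; a unique size means no duplicate
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # stage 2: group same-size files by hash of the first 8 kiB
        by_start = defaultdict(list)
        for path in paths:
            by_start[initial_hash(path)].append(path)
        for candidates in by_start.values():
            if len(candidates) < 2:
                continue
            # stage 3: group remaining candidates by full hash
            by_full = defaultdict(list)
            for path in candidates:
                by_full[full_hash(path)].append(path)
            for group in by_full.values():
                if len(group) > 1:
                    yield group

for group in duplicates("."):
    print(group)

Strictly speaking, two different files can share an md5 digest, so if
you want to be certain you can keep a final filecmp pass over each group
the hashing produces.
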
-- 
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...


