How to find identical files?
Dieter Deyke
deyke at cocreate.fc.hp.com
Sat Oct 28 18:19:53 EDT 2000
gregoire.favre at ima.unil.ch writes:
> Hello,
>
> Two friends told me I should try Python to solve my problem:
> I have fetched quite a lot of files and put them in /data (lots of
> patchxxx.{gz,bz2} of various things, lots of MIDI files grabbed from
> newsgroups with newsfetch, some mp3s, ...). Far too many files to sort
> by hand: to give an idea, find /data | wc -l gives me 128291.
>
> What I want to do is find the files that are identical. A good start
> would be files with the same name and size; better would be to find
> files of the same size (I have, for example, a lot of files named
> 1.mid), then run a kind of diff between them and, if they are the
> same, rm the copies.
>
> I have read half of the Python tutorial and I don't know how to
> begin...
>
> Would it be a good idea to create a file containing
> path,filename,size,md5sum and then work on that?
>
> Does anyone have another idea, or has someone already programmed this?
>
> Thank you very much,
>
> Greg
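[Archive note: the index-file idea in the question (path, filename, size, md5sum) is easy to sketch in modern Python. Everything below is illustrative rather than from the original thread; hashlib replaces the long-gone md5 module, and the column layout is just one reasonable choice:]

```python
# Sketch of the index-file approach from the question: walk a tree and
# record path, filename, size, and MD5 digest of every file in a CSV.
import csv
import hashlib
import os

def md5_of(path, chunk=64 * 1024):
    """MD5 hex digest of a file's contents, read in fixed-size chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                return h.hexdigest()
            h.update(buf)

def build_index(root, out_csv):
    """Write one CSV row (path, name, size, md5) per readable file under root."""
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "name", "size", "md5"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(path)
                    digest = md5_of(path)
                except OSError:
                    continue  # unreadable or vanished file: skip it
                writer.writerow([path, name, size, digest])
```

Sorting that CSV on the size and md5 columns then puts every group of identical files on adjacent rows.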
Here is something I wrote a long time ago...
#! /usr/bin/env python
# Find duplicate files
#------------------------------------------------------------------------------
def scan_files(files_by_size, dir):
    import os, stat
    for file in os.listdir(dir):
        file = os.path.join(dir, file)
        try:
            statbuf = os.stat(file)
        except:
            continue
        if stat.S_ISDIR(statbuf[stat.ST_MODE]):
            scan_files(files_by_size, file)
        else:
            size = statbuf[stat.ST_SIZE]
            if not files_by_size.has_key(size): files_by_size[size] = []
            files_by_size[size].append(file)
#------------------------------------------------------------------------------
def get_digest(file):
    import md5
    m = md5.new()
    try:
        f = open(file, "rb")
        while 1:
            buffer = f.read(64 * 1024)
            if not buffer: return m.digest()
            m.update(buffer)
    except:
        return ''
#------------------------------------------------------------------------------
import sys
files_by_size = {}
for dir in sys.argv[1:]:
    scan_files(files_by_size, dir)
for size in files_by_size.keys():
    if len(files_by_size[size]) > 1:
        files_by_digest = {}
        for file in files_by_size[size]:
            digest = get_digest(file)
            if digest:
                if not files_by_digest.has_key(digest): files_by_digest[digest] = []
                files_by_digest[digest].append(file)
        for digest in files_by_digest.keys():
            if len(files_by_digest[digest]) > 1:
                print "\nDuplicate files:"
                for file in files_by_digest[digest]:
                    print file
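[Archive note: the script above is Python 2 (dict.has_key(), the md5 module, and the print statement are all gone in Python 3). A hedged sketch of the same size-then-digest idea in modern Python, with function names of my own choosing:]

```python
import hashlib
import os
from collections import defaultdict

def get_digest(path, chunk=64 * 1024):
    """Return the MD5 digest of a file's contents, or None if unreadable."""
    h = hashlib.md5()
    try:
        with open(path, "rb") as f:
            while True:
                buf = f.read(chunk)
                if not buf:
                    return h.digest()
                h.update(buf)
    except OSError:
        return None

def find_duplicates(roots):
    """Yield lists of paths under the given root dirs with identical contents."""
    by_size = defaultdict(list)              # size -> [paths]: cheap first pass
    for root in roots:
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    by_size[os.path.getsize(path)].append(path)
                except OSError:
                    pass                     # unreadable or vanished: skip
    for paths in by_size.values():
        if len(paths) < 2:
            continue                         # unique size, no possible twin
        by_digest = defaultdict(list)        # digest -> [paths] of this size
        for path in paths:
            d = get_digest(path)
            if d is not None:
                by_digest[d].append(path)
        for group in by_digest.values():
            if len(group) > 1:
                yield group
```

Called as find_duplicates(sys.argv[1:]) and printed group by group, it behaves like the original script; swapping hashlib.md5 for hashlib.sha256 is a one-word change if hash collisions worry you.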
--
Dieter Deyke
mailto:deyke at cocreate.fc.hp.com mailto:deyke at crosswinds.net
Vs lbh pna ernq guvf, lbh unir jnl gbb zhpu gvzr.