Howto find same files?

Dieter Deyke deyke at cocreate.fc.hp.com
Sat Oct 28 18:19:53 EDT 2000


gregoire.favre at ima.unil.ch writes:

> Hello,
>
> Two friends told me that I should turn to Python to solve my problem:
> I have fetched quite a lot of files and put them under /data (lots of
> patchxxx.{gz,bz2} archives, lots of MIDI files grabbed from newsgroups
> with newsfetch, some MP3s, and so on). To give an idea of the scale,
> "find /data | wc -l" gives me 128291, which is far too many to sort
> out by hand.
>
> What I want to do is find the files that are duplicates. A good start
> would be files with the same name and size; better would be to find
> files of the same size (I have, for example, a lot of files named
> 1.mid), then do a kind of diff between them and, if they are
> identical, rm the copies.
>
> I have read half of the Python tutorial and I don't know how to
> begin...
>
> Would it be a good idea to create a file containing the path,
> filename, size, and md5sum of each file, and then work on that?
>
> Does someone have another idea, or has someone already programmed
> this?
>
> Thank you very much,
>
>       Greg
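The catalog Greg proposes (one line per file with its path, size, and md5sum) could be sketched roughly like this; the function name and the tab-separated output format are my own choices, not anything from the thread:

```python
import hashlib
import os

def write_catalog(root, out_path):
    # Walk the tree and write one tab-separated line per file:
    # path, size in bytes, hex MD5 digest of the contents.
    with open(out_path, "w") as out:
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(path)
                    with open(path, "rb") as f:
                        digest = hashlib.md5(f.read()).hexdigest()
                except OSError:
                    continue  # unreadable file: skip it
                out.write("%s\t%d\t%s\n" % (path, size, digest))
```

Sorting that file on the size and digest columns then puts duplicates on adjacent lines, which is easy to post-process.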

Here is something I wrote a long time ago...

#! /usr/bin/env python

# Find duplicate files

import hashlib
import os
import stat
import sys

#------------------------------------------------------------------------------

def scan_files(files_by_size, dir):
	# Recursively collect regular files under dir, grouped by size.
	for name in os.listdir(dir):
		path = os.path.join(dir, name)
		try:
			statbuf = os.lstat(path)  # lstat() so symlink loops cannot recurse forever
		except OSError:
			continue
		if stat.S_ISDIR(statbuf.st_mode):
			scan_files(files_by_size, path)
		elif stat.S_ISREG(statbuf.st_mode):
			files_by_size.setdefault(statbuf.st_size, []).append(path)

#------------------------------------------------------------------------------

def get_digest(file):
	# Return the MD5 digest of the file's contents, or '' if it cannot be read.
	m = hashlib.md5()
	try:
		with open(file, "rb") as f:
			while True:
				buffer = f.read(64 * 1024)
				if not buffer:
					return m.digest()
				m.update(buffer)
	except OSError:
		return ''

#------------------------------------------------------------------------------

files_by_size = {}
for dir in sys.argv[1:]:
	scan_files(files_by_size, dir)
for size, files in files_by_size.items():
	if len(files) > 1:
		# Only files sharing a size can be duplicates, so hash just those.
		files_by_digest = {}
		for file in files:
			digest = get_digest(file)
			if digest:
				files_by_digest.setdefault(digest, []).append(file)
		for digest, duplicates in files_by_digest.items():
			if len(duplicates) > 1:
				print("\nDuplicate files:")
				for file in duplicates:
					print(file)
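The script only prints each group of duplicates; Greg also asked about rm'ing the copies. A hypothetical helper like the one below (the name `remove_duplicates` is mine) could keep the first file of each group and delete the rest. Try it on a copy of the data first:

```python
import os

def remove_duplicates(duplicates):
    # Keep the first path in the group and remove the others.
    # Returns the list of paths actually removed.
    removed = []
    for path in duplicates[1:]:
        try:
            os.remove(path)
            removed.append(path)
        except OSError:
            pass  # already gone or not removable: leave it
    return removed
```

It could be called on each `duplicates` list in the final loop above, in place of (or after) the `print` calls.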

--
Dieter Deyke
mailto:deyke at cocreate.fc.hp.com mailto:deyke at crosswinds.net
Vs lbh pna ernq guvf, lbh unir jnl gbb zhpu gvzr.



More information about the Python-list mailing list