a program to delete duplicate files

John Machin sjmachin at lexicon.net
Sat Mar 12 18:40:19 EST 2005


Patrick Useldinger wrote:
> John Machin wrote:
>
> > Maybe I was wrong: lawyers are noted for irritating precision. You
> > meant to say in your own defence: "If there are *any* number (n >= 2)
> > of identical hashes, you'd still need to *RE*-read and *compare* ...".
>
> Right, that is what I meant.
>
> > 2. As others have explained, with a decent hash function, the
> > probability of a false positive is vanishingly small. Further, nobody
> > in their right mind [1] would contemplate automatically deleting n-1
> > out of a bunch of n reportedly duplicate files without further
> > investigation. Duplicate files are usually (in the same directory with
> > different names or in different-but-related directories with the same
> > names) and/or (have a plausible explanation for how they were
> > duplicated) -- the one-in-zillion-chance false-positive should stand
> > out as implausible.
>
> Still, if you can get it 100% right automatically, why would you bother
> checking manually?

A human in their right mind is required to decide what to do with the
duplicates. The proponents of hashing -- of which I'm not one -- would
point out that any false-positives would be picked up as part of the
human scrutiny.
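
For concreteness, the size-then-hash approach the hashing camp has in
mind can be sketched roughly as below. This is a minimal illustration in
present-day Python, not Patrick's fdups code; the function name and
block size are mine, and a sceptical caller can still byte-compare the
members of each reported cluster afterwards.

    import hashlib
    import os
    from collections import defaultdict

    def find_duplicates(paths, blocksize=64 * 1024):
        # Group candidate files by size first (cheap), then by SHA-1
        # digest of their contents.  Returns a list of clusters (lists
        # of paths) whose contents hashed identically.
        by_size = defaultdict(list)
        for path in paths:
            by_size[os.path.getsize(path)].append(path)
        clusters = []
        for group in by_size.values():
            if len(group) < 2:
                continue              # a unique size cannot have a duplicate
            by_digest = defaultdict(list)
            for path in group:
                h = hashlib.sha1()
                with open(path, 'rb') as f:
                    for block in iter(lambda: f.read(blocksize), b''):
                        h.update(block)
                by_digest[h.hexdigest()].append(path)
            clusters.extend(g for g in by_digest.values() if len(g) > 1)
        return clusters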

> Why get back to arguments like "impossible", "implausible", "can't be"
> if you can have a simple and correct answer - yes or no?

Oh yeah, "the computer said so, it must be correct". Even with your
algorithm, I would be investigating cases where files were duplicates
but there was nothing in the names or paths that suggested how that
might have come about.

>
> Anyway, fdups does not do anything else than report duplicates.
> Deleting, hardlinking or anything else might be an option depending on
> the context in which you use fdups, but then we'd have to discuss the
> context. I never assumed any context, in order to keep it as universal
> as possible.

That's very good, but it wasn't under contention.

>
> > Different subject: maximum number of files that can be open at once. I
> > raised this issue with you because I had painful memories of having to
> > work around max=20 years ago on MS-DOS and was aware that this magic
> > number was copied blindly from early Unix. I did tell you that
> > empirically I could get 509 successful opens on Win 2000 [add 3 for
> > stdin/out/err to get a plausible number] -- this seems high enough to
> > me compared to the likely number of files with the same size -- but you
> > might like to consider a fall-back detection method instead of just
> > quitting immediately if you ran out of handles.
>
> For the time being, the additional files will be ignored, and a warning
> is issued. fdups does not quit; why are you saying this?

I beg your pardon, I was wrong; bad memory. The case I was thinking of
is running out of the minuscule buffer pool that you allocate by
default, where fdups panics and pulls the sys.exit(1) rip-cord.

>
> A fallback solution would be to open the file before every _block_ read,
> and close it afterwards.

Ugh. Better use more memory, so fewer blocks!!
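
For the record, the reopen-per-block fallback Patrick describes amounts
to something like the sketch below (the name and block size are mine,
not fdups'). Every block costs an extra open/seek/close, which is why a
bigger block size, i.e. more memory, means fewer reopens.

    def read_block(path, offset, blocksize=64 * 1024):
        # Read one block without holding the file handle between calls:
        # at most one handle is open at any moment, at the price of an
        # open/seek/close round trip per block.
        with open(path, 'rb') as f:
            f.seek(offset)
            return f.read(blocksize)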

> In my mind, it would be a command-line option,
> because it's difficult to determine the number of available file handles
> in a multitasking environment.

The pythonic way is to press ahead optimistically and recover if you
get bad news.
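
Concretely, in that EAFP spirit, a hypothetical helper (the name and
structure are illustrative only, not fdups' code) would open as many of
the candidate files as the OS allows and hand back the rest to be dealt
with some other way:

    def open_as_many_as_possible(paths):
        # Try to open every path in the list; stop cleanly when file
        # handles run out.  Returns a pair:
        # (open_file_objects, leftover_paths_we_could_not_open).
        handles = []
        for i, path in enumerate(paths):
            try:
                handles.append(open(path, 'rb'))
            except (IOError, OSError):   # e.g. EMFILE: too many open files
                return handles, paths[i:]
        return handles, []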

>
> Not difficult to implement, but I first wanted to refactor the code so
> that it's a proper class that can be used in other Python programs, as
> you also asked.

I didn't "ask"; I suggested. I would never suggest a
class-for-classes-sake. You had already a singleton class; why
another". What I did suggest was that you provide a callable interface
that returned clusters of duplicates [so that people could do their own
thing instead of having to parse your file output which contains a
mixture of warning & info messages and data].
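
To make that concrete, the shape I have in mind is no more than a
function (or method) that hands back plain data. The sketch below is
illustrative only and none of its names are fdups' real API; it uses
byte-by-byte comparison via filecmp rather than hashing, which is closer
in spirit to fdups anyway.

    import os
    import filecmp
    from collections import defaultdict

    def duplicate_clusters(paths):
        # Return a list of lists; each inner list holds the paths of one
        # set of byte-identical files.  Plain data only: no warnings or
        # info messages for the caller to parse.
        by_size = defaultdict(list)
        for p in paths:
            by_size[os.path.getsize(p)].append(p)
        clusters = []
        for group in by_size.values():
            while len(group) > 1:
                first, rest = group[0], group[1:]
                same = [first] + [p for p in rest
                                  if filecmp.cmp(first, p, shallow=False)]
                if len(same) > 1:
                    clusters.append(same)
                group = [p for p in rest if p not in same]
        return clusters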

> That is what I have sent you tonight. It's not that I
> don't care about the file handle problem, it's just that I do changes by
> (my own) priority.
>
> > You wrote at some stage in this thread that (a) this caused problems on
> > Windows and (b) you hadn't had any such problems on Linux.
> >
> > Re (a): what evidence do you have?
>
> I've had the case myself on my girlfriend's XP box. It was certainly
> less than 500 files of the same length.

Interesting. Less on XP than on 2000? Maybe there's a machine-wide
limit, not a per-process limit, like the old DOS max=20. What else was
running at the time?

>
> > Re (b): famous last words! How long would it take you to do a test and
> > announce the margin of safety that you have?
>
> Sorry, I do not understand what you mean by this.

Test:

    for k in range(1000):
        open('foo' + str(k), 'w')

Announce:
"I can open A files at once on box B running OS C. The most files of
the same length that I have seen is D. The ratio A/D is large enough
not to worry."
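
A version of that test which reports A instead of dying with a
traceback could be as simple as the following; it is illustrative only,
and the scratch filenames are throwaway.

    import os

    handles = []
    try:
        while len(handles) < 10000:      # safety cap; the OS limit hits long before this
            handles.append(open('dupetest%d' % len(handles), 'w'))
    except (IOError, OSError):
        pass                             # ran out of file handles
    print('opened %d files simultaneously' % len(handles))
    for f in handles:                    # clean up the scratch files
        f.close()
        os.remove(f.name)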

Cheers,
John



