[Q:] hash table performance!

Alexander Williams thantos at chancel.org
Wed Jun 14 08:19:13 EDT 2000


On Wed, 14 Jun 2000 09:56:12 GMT, liwen_cao at my-deja.com
<liwen_cao at my-deja.com> wrote:

>P.S. Background of the duplicate-file check:
>My way of doing it is simple: walk the directories and files, compute
>the MD5 hash of every file, use the MD5 digest as the hash key, and
>insert the file name into the hash table. When two files have the same
>key, compare the contents byte by byte.

Well, odds are that the slowness is coming from MD5, not the actual
dictionary implementation.  Hashing in Python is what we in the
community call "damn fast" (this is a technical term, you are not
expected to understand it).

Were it me, and were I more interested in content than absolute
perfection, I'd use a less stringent content hash than MD5; as it
stands, you're reading every content-similar file twice on each run:
once to generate the MD5, and a second time, byte for byte, when the
hash finds a match.  I'm going to wager heavily that the MD5 alone is
sufficient to determine similarity in any set of files you're likely to
have.  Are you, in fact, seeing several files with the same MD5 and
different contents, or are you being paranoid above and beyond the call
of duty?
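
If the double read is really the worry, an even cheaper first pass than
a weaker hash is the file size from stat(): only files that share a
size can possibly be duplicates, so only those ever get hashed, and the
MD5 is then trusted as the final verdict.  A sketch of that idea (my
own arrangement, not anything from the post above; find_duplicates_cheap
is a made-up name):

import hashlib
import os
from collections import defaultdict

def find_duplicates_cheap(root):
    # First pass: group by file size via stat(); reads no file data at all.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Second pass: MD5 only the files whose size is shared, and trust
    # the digest as the final verdict -- no second byte-for-byte read.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(64 * 1024), b""):
                    h.update(chunk)
            by_digest[(size, h.digest())].append(path)

    return [group for group in by_digest.values() if len(group) > 1]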

-- 
Alexander Williams (thantos at gw.total-web.net)           | In the End,
  "Join the secret struggle for the soul of the world." | Oblivion
  Nobilis, a new Kind of RPG                            | Always
  http://www.chancel.org                                | Wins