[Q:] hash table performance!

Gordon McMillan gmcm at hypernet.com
Wed Jun 14 09:05:20 EDT 2000


liwen_cao at my-deja.com wrote:

>I'm doing a project on large-volume information processing. One of the
>tasks is to find the duplicate files under a directory. I believe
>Python would be a good, powerful tool for that (yes it is, I've
>implemented it in 40 lines of code). However, performance IS a
>problem! Since I'm using a dictionary as the hash table (hash_table={}...),
>I suspect the bottleneck is in the hash table: how can a generic hash
>table fit every case?

Some of us have benchmarked Python's hash implementation against a bunch of 
other implementations. Haven't found anything that comes close.
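
As a rough sketch (not the benchmark itself), you can time dict inserts
and lookups yourself with the timeit module from a current Python; the
absolute numbers are machine-dependent:

import timeit

# One million inserts into a fresh dict.
insert = timeit.timeit(
    "for i in range(1000000): d[i] = i",
    setup="d = {}",
    number=1,
)

# One million lookups against an already-populated dict.
lookup = timeit.timeit(
    "for i in range(1000000): d[i]",
    setup="d = dict.fromkeys(range(1000000))",
    number=1,
)

print("insert: %.3fs  lookup: %.3fs" % (insert, lookup))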

>Is there a way to customize the length or hash algorithm of the hash
>table in Python? Or can anyone describe how the Python hash table
>works?

Objects/dictobject.c
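
You can't change the table's sizing or probing from Python code, but you
can control how your own objects hash by defining __hash__ (with a
matching __eq__). A minimal sketch, with an illustrative Point class:

class Point:
    """Hashable point; equal coordinates mean equal dict keys."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        # Delegate to the tuple hash so equal points hash equal.
        return hash((self.x, self.y))

    def __eq__(self, other):
        return (isinstance(other, Point) and
                (self.x, self.y) == (other.x, other.y))

d = {Point(1, 2): "found"}
print(d[Point(1, 2)])   # lookup uses __hash__, then __eq__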

>My way of doing it is simple: walk the directories and files, compute
>an MD5 digest for every file, use the digest as the hash key, and
>insert the file name into the hash table. When two files have the same
>key, compare the contents byte by byte.

If you profiled the above (check out the profile module), you'd find that 
hashing is an insignificant part of the process.
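
For what it's worth, here's a minimal sketch of the approach you describe,
written for a modern Python (hashlib and os.walk post-date this thread,
and find_duplicates is just an illustrative name):

import hashlib
import os

def find_duplicates(root):
    """Group files under root by MD5 digest and return the groups
    with more than one member (candidate duplicates)."""
    by_digest = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.md5()
            with open(path, "rb") as f:
                # Read in chunks so big files don't exhaust memory.
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            by_digest.setdefault(h.hexdigest(), []).append(path)
    # A byte-by-byte comparison within each group would then rule
    # out the (unlikely) MD5 collisions, as you describe.
    return [paths for paths in by_digest.values() if len(paths) > 1]

Run your real script under the profiler (python -m profile yourscript.py)
and see where the time actually goes; the dict operations will be down in
the noise compared to the file I/O and MD5 work.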

- Gordon


