[Q:] hash table performance!
Gordon McMillan
gmcm at hypernet.com
Wed Jun 14 09:05:20 EDT 2000
liwen_cao at my-deja.com wrote:
>I'm doing a project on large-volume information processing. One of the
>tasks is to find the duplicated files under a directory. I believe
>Python would be a good, powerful tool for that (yes it is, I've
>implemented it in 40 lines of code). However, performance IS a
>problem! Since I'm using a dictionary as the hash table (hash_table={}...),
>I suspect the bottleneck is in the hash table: how can a generic hash
>table fit every case?
Some of us have benchmarked Python's hash implementation against a bunch of
other implementations. Haven't found anything that comes close.
>Is there a way to customize the size or hash algorithm of the hash
>table in Python? Or can anyone describe how the Python hash table
>works?
Objects/dictobject.c
>My way of doing it is simple: walk the directories and files, compute
>an MD5 digest for every file, use the MD5 digest as the hash key, and
>insert the file name into the hash table. When two files have the same
>key, compare the contents byte by byte.
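The approach quoted above can be sketched in modern Python (hashlib has since replaced the old md5 module; the function names are illustrative, not from the original 40-line script):

```python
# Sketch of the quoted approach: MD5-digest every file, bucket file
# names by digest in a dict, and treat shared digests as duplicate
# candidates. A final byte-for-byte comparison would confirm them.
import hashlib
import os

def md5_of(path, chunk=65536):
    """Hash a file in fixed-size chunks so large files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def find_duplicates(root):
    table = {}  # digest -> list of file names with that digest
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            table.setdefault(md5_of(path), []).append(path)
    # Groups with more than one name share a digest; a byte-by-byte
    # check (as the poster describes) would rule out MD5 collisions.
    return [paths for paths in table.values() if len(paths) > 1]
```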
If you profiled the above (check out the profile module), you'd find that
hashing is an insignificant part of the process.
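That suggestion is easy to try. A minimal sketch with a stand-in hashing workload (cProfile is the C-accelerated counterpart of the profile module mentioned above; the `scan` function is illustrative):

```python
# Profile a dict-plus-MD5 workload to see where time actually goes.
# In a real run over files, I/O dominates and dict hashing is noise.
import cProfile
import hashlib
import io
import pstats

def scan(blocks):
    """Stand-in for the MD5 pass: bucket block indices by digest."""
    table = {}
    for i, block in enumerate(blocks):
        table.setdefault(hashlib.md5(block).hexdigest(), []).append(i)
    return table

blocks = [bytes([i % 256]) * 4096 for i in range(1000)]
prof = cProfile.Profile()
prof.enable()
scan(blocks)
prof.disable()
out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())  # top 5 entries by cumulative time
```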
- Gordon