Best way to hash a dictionary

Miki Tebeka tebeka at cs.bgu.ac.il
Tue Mar 4 04:33:08 EST 2003


Hello Alex,

> Can you please clarify for us why your desired mapping -- a separate
> hash table where, e.g.,
> 
> {'POS':'Verb', 'Gender':'Male', 'Tense':'Past'}  -->  extrainfo
> 
> is necessarily more efficient than adding a non-otherwise-occurring
> key to the dict object and having, e.g.:
> 
> {'POS':'Verb', 'Gender':'Male', 'Tense':'Past', 'Xinfo':extrainfo}
I have a database of words, each word is assigned several possible
tags.
Some of the words have only one tag assigned to them. I'd like to know
the
number of these words and update it from time to time.
This is why I need tag->count hash.
You can see the code at
http://www.cs.bgu.ac.il/~tebeka/University/tpost.tar.bz2
(Bit outdated but enough for this discussion)

> ???  This is not entirely clear to me from your posts so far.  Similar
> doubts apply to many other possible structures -- e.g. why would it
> be unsuitable to have the tag be (('POS', 'Verb'), ('Gender', 'Male'), 
> ('Tense', 'Past')), a tuple you'd have no problem using as a key in 
> any other dict.
Performance counts here. I'll do many lookups for certain parts of
tags throghout the compuatation.

> Basically, the fact that you tag words with sets of (tagname,tagvalue)
The tag is the whole hash. I'll call the key-value pair a feature of
the tag.

> But, this is by the by -- just trying to get a better perspective
> on your issues.  
Looks like you'll know them better than me in nowtime :-)

> >> Finally: the reason you can't just use the dictionaries as keys, of
> >> course, is that dictionaries are mutable. Does you design preclude any
> >> change to the dicionaries after they become keys in the other dictionary?
> >> If so then you may just be able to subclass the standard dict type.
> > This has performance penalties, and since an average run of my thesis
> 
> I'm confused -- how do you *know* you'd have performance issues here?
It might not be the main performance bottleneck but everything that
will
slow me down is usually bad.

> Well of course, you haven't added any __hash__ method -- what did you
> expect?!  
Next time I'll read the help more. 

> Surely you wouldn't expect Python to add a __hash__ method
> automatically for you...?  
Nope. I have C++ for such things ;-)

> From such an observation as this I suspect
> we may be talking entirely at cross-purposes.
> 
> The obvious idea would be something like:
> 
> class HashableDict(dict):
>     def __init__(self, *args, **kwds):
>         dict.__init__(self, *args, **kwds)
>         my_hash = 0
>         for key in self: my_hash ^= hash(key)
>         self.my_hash = my_hash
>     def __hash__(self):
>         return self.my_hash
> 
> plus perhaps some precaution to ensure instances of HashableDict are
> never modified after their __init__, but that's basically just to
> ward against possible bugs, so let's skip it for now.  So where's
> the performance penalty here?  
When you use a user defined class instread of built in type you lose
some
performance (3% from my testing on hashes)

> What does a profile of your 5-days run tell you about where your
> code is spending its time currently?
I didn't profile the 5 day run since it'll make it even longer.
>From a shorter test the main bottleneck is in incontext function.
(You'll have to grab the code for that).

Thanks for all the help.
Miki




More information about the Python-list mailing list