dict.hash - optimized per module

Hi, My name is Jan and this is my first post on this group. So hello :) I'm very sorry if my idea is so naive as to be ridiculous but I believe it is worth to ask. I'm just watched "The Mighty Dictionary" video conference from Atlanta deliver by Brandon Craig Rhodes. After watching I made graph, using presented at conference library dictinfo, for __builtin__.__dict__. When I saw few collisions I think "Why this module doesn't have their own hashing function implementation which allow to avoid collision in this set of names?". My second think was "Why each Python module doesn't have their own internal hashing function which doesn't produce collisions in scope of his names". Maybe my thoughts was silly but is this doesn't speed Python a little? I'm aware that this doesn't work for locals or globals dict but may be an improvement in places where set of names is constant or predictable like builtins Python modules. What do You think? Greetings from Poland, --
<> Jan Koprowski

On Sun, 17 Oct 2010 06:27:37 pm Jan Koprowski wrote:
Python 2.6 has 143 builtin names, and zero collisions:
Firstly, the occasional collision doesn't matter much. Secondly, your idea would mean that every module would need it's own custom-made hash function. Writing good hash functions is hard. The Python hash function is very, very good. Expecting developers to produce *dozens* of hash functions equally as good is totally impractical. -- Steven D'Aprano

On Sun, Oct 17, 2010 at 2:41 AM, Steven D'Aprano <steve@pearwood.info> wrote:
Actually, there's already software to automatically generate such functions; e.g. http://www.gnu.org/software/gperf/ Not that this makes the suggestion any more tractable though. Cheers, Chris

On Sun, 17 Oct 2010 20:41:34 +1100 Steven D'Aprano <steve@pearwood.info> wrote:
It depends what you call collisions. Collisions during bucket lookup, or during hash value comparison (that is, after you selected a bucket)? For the former, here is the calculation assuming an overallocation factor of 4 (which, IIRC, is the one used in the dict implementation):
The real answer here is that Python needs hash values to be globally valid. Both for semantics (module dicts are regular dicts and should be usable as such), and for efficiency (having an unique hash function means the precalculated hash value can be stored for critical types such as str). Regards Antoine.

On 10/17/2010 3:27 AM, Jan Koprowski wrote:
Worth asking but not worth doing (or, in a sense, already done for function local namespaces). As Antoine said, strings have their hash computed just once. Recomputing a namespace-depending hash for each lookup would take far longer than the occational collision. For function local names, names are assigned a index at compile time so that runtime lookup is a super-quick index operation. If you want, call it perfect hashing with hashes computed once at compile time ;-). -- Terry Jan Reedy

On Sun, 17 Oct 2010 06:27:37 pm Jan Koprowski wrote:
Python 2.6 has 143 builtin names, and zero collisions:
Firstly, the occasional collision doesn't matter much. Secondly, your idea would mean that every module would need it's own custom-made hash function. Writing good hash functions is hard. The Python hash function is very, very good. Expecting developers to produce *dozens* of hash functions equally as good is totally impractical. -- Steven D'Aprano

On Sun, Oct 17, 2010 at 2:41 AM, Steven D'Aprano <steve@pearwood.info> wrote:
Actually, there's already software to automatically generate such functions; e.g. http://www.gnu.org/software/gperf/ Not that this makes the suggestion any more tractable though. Cheers, Chris

On Sun, 17 Oct 2010 20:41:34 +1100 Steven D'Aprano <steve@pearwood.info> wrote:
It depends what you call collisions. Collisions during bucket lookup, or during hash value comparison (that is, after you selected a bucket)? For the former, here is the calculation assuming an overallocation factor of 4 (which, IIRC, is the one used in the dict implementation):
The real answer here is that Python needs hash values to be globally valid. Both for semantics (module dicts are regular dicts and should be usable as such), and for efficiency (having an unique hash function means the precalculated hash value can be stored for critical types such as str). Regards Antoine.

On 10/17/2010 3:27 AM, Jan Koprowski wrote:
Worth asking but not worth doing (or, in a sense, already done for function local namespaces). As Antoine said, strings have their hash computed just once. Recomputing a namespace-depending hash for each lookup would take far longer than the occational collision. For function local names, names are assigned a index at compile time so that runtime lookup is a super-quick index operation. If you want, call it perfect hashing with hashes computed once at compile time ;-). -- Terry Jan Reedy
participants (5)
-
Antoine Pitrou
-
Chris Rebert
-
Jan Koprowski
-
Steven D'Aprano
-
Terry Reedy