Today's fun and educational Python recipe
Raymond Hettinger
python at rcn.com
Wed May 4 17:39:28 EDT 2011
On May 4, 12:42 pm, Terry Reedy <tjre... at udel.edu> wrote:
> On 5/4/2011 2:17 PM, Raymond Hettinger wrote:
>
> > Here's a 22-line beauty for a classic and amazing algorithm:
> >http://bit.ly/bloom_filter
>
> > The wiki article on the algorithm is brief and well-written:
> >http://en.wikipedia.org/wiki/Bloom_filter
>
> As I understand the article, the array of num_bits should have
> num_probes (more or less independent) bits set for each key. But as I
> understand the code
>
> for i in range(self.num_probes):
> h, array_index = divmod(h, num_words)
> h, bit_index = divmod(h, 32)
> yield array_index, 1 << bit_index
>
> the same bit is being set or tested num_probes times. The last three
> lines have no dependence on i that I can see, so they appear to do the
> same thing each time. This seems like a bug.
The 512 bits in h are progressively eaten-up between iterations. So
each pass yields a different (array index, bit_mask) pair.
It's easy to use the interactive prompt to show that different probes
are produced on each pass:
>>> bf = BloomFilter(num_bits=1000, num_probes=8)
>>> pprint(list(bf.get_probes('Alabama')))
[(19, 1073741824),
(11, 64),
(9, 134217728),
(25, 1024),
(24, 33554432),
(6, 16),
(7, 16777216),
(22, 1048576)]
The 512 bits are uncorrelated -- otherwise sha512 wouldn't be much of
a cryptographic hash ;)
The fifty state example in the recipe is a reasonable demonstration
that the recipe works as advertised. It successfully finds all fifty
states (the true positives) and it tries 100,000 negatives resulting
in only a handful of false negatives. That should be somewhat
convincing that it all works.
Raymond
-------
follow my other python tips and recipes on twitter: @raymondh
More information about the Python-list
mailing list