Bloom Filter in 22 lines of Python (updated)

Wed Jun 8 16:48:53 EDT 2011

On Wed, Jun 8, 2011 at 1:05 PM, Raymond Hettinger <python at rcn.com> wrote:
> On Jun 6, 10:47 am, geremy condra <debat... at gmail.com> wrote:
>> On Fri, Jun 3, 2011 at 1:17 PM, Raymond Hettinger <pyt... at rcn.com> wrote:
>> > Thanks for all the feedback on the earlier post.
>>
>> > I've updated the recipe to use a cleaner API, simpler code,
>> > more easily subclassable, and with optional optimizations
>> > for better cache utilization and speed:
>>
>> >  http://code.activestate.com/recipes/577684-bloom-filter/
>>
>> Any chance of this landing in collections?
>
> I would be happy to add something like this to collections
> once the API matures.
>
> Right now, the constructor expects the user to know the desired
> memory size and number of probes.  Those could be computed
> from the maximum number of unique keys and the tolerable
> error false positive rate.
>
> The most convenient constructor could be:
>    b = BloomFilter(myseq)
> where len(myseq) is presumed indicate the maximum number of
> unique entries and there is a default tolerable false positive
> rate of 1 in 10,000.
>
> The API already lets the user substitute different methods
> for get_probes().  It should also allow the bytearray to be
> easily replaced by other objects with indexable access
> (array objects, mmap objects, etc).
>
> We could also provide union, intersection, and difference
> operations but these would be a little awkward for general
> purpose use because the result is only meaningful when
> the sizes, probe functions, and input domain are all the same.
>
> Fortunately, there is a lot of prior art, so the API design
> can be informed by what has worked well for others.

Seems like the code that Dan Stromberg posted addresses a lot of this.
Thoughts on his code?

Also, if you're interested I'm about halfway through writing a C
implementation, although it would have to be somewhat less flexible to
get a real speed advantage. Right now there's a hefty enough speed
penalty to the filter (the states test takes about 150 times as long
with a bloom filter as with a normal set) that it may make sense to
have a base case that uses an optimized C implementation but that can
be overridden as needed.

Geremy Condra