Graham's spam filter

Oren Tirosh oren-py-l at hishome.net
Fri Aug 23 04:24:20 EDT 2002


On Fri, Aug 23, 2002 at 04:42:12AM +0000, Christopher Browne wrote:
> This model is in effect like a "database server" model.  You start up
> a DBMS process once, and it loads in a bunch of data.  Once in memory,
> access is quick, much more so than if you have to keep reading the data
> in over and over again.
> 
> Caching is not a meaningful objection to that; part of the cost of
> loading in data is in parsing what's on disk.  Not parsing the data a
> bunch of times is The Win.

My original proposal was to mmap a hash table into memory. Let's assume 
that the hash file looks like this:

hash table size = N
N * hash table entry - each entry contains a 32-bit hash and a file offset
M < N variable-size word entries, each holding:
   the full word
   counts, probabilities, expiration info, etc.

You just mmap the thing into memory and for each word encountered you
calculate its hash, look it up in the hash table, fetch the offset and
go to the entry. No parsing pass required. Blocks that are not accessed 
will not even be loaded into memory.  No daemons. No interprocess 
communication. 
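
A minimal sketch of that lookup in Python, to make the idea concrete. The
post does not pin down the details, so these are all assumptions:
little-endian 32-bit fields, CRC-32 as the word hash, linear probing on
collisions, offset 0 marking an empty slot, and a payload of just two
counts (spam, ham). The names build and lookup are hypothetical.

```python
import mmap
import struct
import zlib

HEADER = struct.Struct("<I")    # N, the number of hash table slots
SLOT = struct.Struct("<II")     # 32-bit hash, file offset (0 = empty slot)
COUNTS = struct.Struct("<II")   # assumed payload: spam count, ham count


def word_hash(word):
    """32-bit hash of a word given as bytes (CRC-32 is an assumption)."""
    return zlib.crc32(word) & 0xFFFFFFFF


def build(path, counts, n_slots):
    """Write {word: (spam, ham)} to path. Needs len(counts) < n_slots."""
    assert len(counts) < n_slots
    table = [(0, 0)] * n_slots
    entries = bytearray()
    base = HEADER.size + n_slots * SLOT.size   # offset of first word entry
    for word, (spam, ham) in counts.items():
        h = word_hash(word)
        i = h % n_slots
        while table[i][1] != 0:                # linear probing
            i = (i + 1) % n_slots
        table[i] = (h, base + len(entries))
        entries += struct.pack("<H", len(word)) + word + COUNTS.pack(spam, ham)
    with open(path, "wb") as f:
        f.write(HEADER.pack(n_slots))
        for slot in table:
            f.write(SLOT.pack(*slot))
        f.write(entries)


def lookup(m, word):
    """Return (spam, ham) for word, or None. m is an mmap of the file."""
    (n_slots,) = HEADER.unpack_from(m, 0)
    h = word_hash(word)
    i = h % n_slots
    while True:
        slot_hash, offset = SLOT.unpack_from(m, HEADER.size + i * SLOT.size)
        if offset == 0:                        # empty slot: word not present
            return None
        if slot_hash == h:
            (length,) = struct.unpack_from("<H", m, offset)
            start = offset + 2
            if m[start:start + length] == word:   # confirm the full word
                return COUNTS.unpack_from(m, start + length)
        i = (i + 1) % n_slots                  # collision: try next slot


def demo(path):
    """Build a tiny file, map it read-only, and look up two words."""
    build(path, {b"free": (120, 1), b"python": (0, 75)}, n_slots=8)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return lookup(m, b"free"), lookup(m, b"perl")
```

Only the pages touched by unpack_from and the slice are faulted in by the
OS, and because M < N there is always an empty slot to end the probe loop.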

Client-server lets you store the database on one machine and do the
filtering on another. When people start writing angry emails about not
being able to do that, and threatening to sell your address to every
spammer on earth, that's the time to start thinking about client-server
models, not before :-)

	Oren




