Graham's spam filter
Oren Tirosh
oren-py-l at hishome.net
Fri Aug 23 04:24:20 EDT 2002
On Fri, Aug 23, 2002 at 04:42:12AM +0000, Christopher Browne wrote:
> This model is in effect like a "database server" model. You start up
> a DBMS process once, and it loads in a bunch of data. Once in memory,
> access is quick, much more so than if you have to keep reading the data
> in over and over again.
>
> Caching is not a meaningful objection to that; part of the cost of
> loading in data is in parsing what's on disk. Not parsing the data a
> bunch of times is The Win.
My original proposal was to mmap a hash table into memory. Let's assume
that the hash file looks like this:
    hash table size = N
    N*hash table entry - each entry contains 32 bit hash and file offset
    M<N variable size word entries:
        full word
        counts, probabilities, expiration info, etc
You just mmap the thing into memory and for each word encountered you
calculate its hash, look it up in the hash table, fetch the offset and
go to the entry. No parsing pass required. Blocks that are not accessed
will not even be loaded into memory. No daemons. No interprocess
communication.
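A minimal sketch of that lookup in Python, under assumptions the layout
above does not spell out: little-endian fields, CRC-32 as the 32-bit
hash, linear probing on collisions, offset 0 marking an empty slot, and
an entry holding only a length-prefixed word plus two counts (no
probabilities or expiration info). The names build, lookup and demo are
made up for illustration.

```python
import mmap
import os
import struct
import tempfile
import zlib

# Assumed on-disk layout (illustrative, not a fixed format):
HEADER = struct.Struct("<I")      # N, the number of hash table slots
SLOT = struct.Struct("<II")       # 32-bit hash, file offset (0 = empty slot)
ENTRY_HEAD = struct.Struct("<H")  # length of the word in bytes
COUNTS = struct.Struct("<II")     # spam count, good count


def word_hash(word):
    # CRC-32 is an assumption; any 32-bit hash would do.
    return zlib.crc32(word) & 0xFFFFFFFF


def build(path, counts, nslots):
    """Write {word: (spam, good)} to path; needs nslots > len(counts)."""
    assert nslots > len(counts)
    slots = [(0, 0)] * nslots
    body = bytearray()
    base = HEADER.size + SLOT.size * nslots  # where word entries start
    for word, (spam, good) in counts.items():
        h = word_hash(word)
        i = h % nslots
        while slots[i][1] != 0:              # linear probing
            i = (i + 1) % nslots
        slots[i] = (h, base + len(body))
        body += ENTRY_HEAD.pack(len(word)) + word + COUNTS.pack(spam, good)
    with open(path, "wb") as f:
        f.write(HEADER.pack(nslots))
        for slot in slots:
            f.write(SLOT.pack(*slot))
        f.write(body)


def lookup(mm, word):
    """Return (spam, good) for word, or None if it is not in the file."""
    (nslots,) = HEADER.unpack_from(mm, 0)
    h = word_hash(word)
    i = h % nslots
    while True:
        stored, offset = SLOT.unpack_from(mm, HEADER.size + SLOT.size * i)
        if offset == 0:                      # empty slot: word is absent
            return None
        if stored == h:
            (length,) = ENTRY_HEAD.unpack_from(mm, offset)
            start = offset + ENTRY_HEAD.size
            if mm[start:start + length] == word:
                return COUNTS.unpack_from(mm, start + length)
        i = (i + 1) % nslots


def demo():
    path = os.path.join(tempfile.mkdtemp(), "words.db")
    build(path, {b"viagra": (50, 1), b"python": (0, 40)}, nslots=8)
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return lookup(mm, b"viagra"), lookup(mm, b"missing")
        finally:
            mm.close()
```

Each lookup reads only the probed slots and the one entry they point to,
so the operating system pages in just those blocks of the file.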
Client-server lets you store the database on one machine and do the
filtering on another. When people start writing angry emails about not
being able to do that, and threatening to sell your address to every
spammer on earth, that's the time to start thinking about client-server
models, not before :-)
Oren