Graham's spam filter
hewu5001 at stud.uni-saarland.de
Thu Aug 22 10:50:01 EDT 2002
On Thu, 2002-08-22 at 10:24, Oren Tirosh wrote:
> On Wed, Aug 21, 2002 at 10:48:59PM -0700, Erik Max Francis wrote:
> > I don't think that this is necessarily true; certainly and without a doubt,
> > reloading the _entire_ database each time is a non-starter. The
> > possibility of using a gdbm or similar database system might shorten
> > those times to very reasonable amounts, but this is something I haven't
> > researched yet.
> Reloading the entire database is not necessarily a non-starter. If the
> database is represented as some kind of hash table in a linear memory block
> without using any pointers it can be mmapped. The page cache will take
> care of the rest. I think this is easier to implement and manage than a
> client-server solution. I won't be surprised if it's faster, too.
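A minimal sketch of the pointer-free layout described above, assuming
fixed-width slots (a 24-byte token plus a float64 probability) and linear
probing. The names build_table and MappedTable, the slot format and the
CRC32 hash are illustrative choices, not taken from any existing filter:

```python
import mmap
import struct
import zlib

# One slot: 24-byte token (NUL padded) + float64 probability. No pointers,
# so the file can be mapped at any address. Layout is an assumption.
SLOT = struct.Struct("<24sd")
KEYLEN = 24
EMPTY = b"\0" * KEYLEN


def _key(token):
    raw = token.encode("utf-8")
    if not 0 < len(raw) <= KEYLEN:
        raise ValueError("token must be 1..24 bytes in this sketch")
    return raw.ljust(KEYLEN, b"\0")


def build_table(path, probs, nslots=None):
    """Write probs (token -> probability) as a flat open-addressing table."""
    nslots = nslots or max(8, 2 * len(probs))
    if nslots <= len(probs):
        raise ValueError("table needs at least one free slot")
    buf = bytearray(SLOT.size * nslots)
    for token, p in probs.items():
        key = _key(token)
        i = zlib.crc32(key) % nslots
        # Linear probing: step forward until a free slot is found.
        while buf[i * SLOT.size:i * SLOT.size + KEYLEN] != EMPTY:
            i = (i + 1) % nslots
        SLOT.pack_into(buf, i * SLOT.size, key, p)
    with open(path, "wb") as f:
        f.write(buf)


class MappedTable:
    """Read-only view of a table file; the page cache does the caching."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._nslots = len(self._map) // SLOT.size

    def get(self, token, default=None):
        try:
            key = _key(token)
        except ValueError:
            return default  # such a token cannot have been stored
        i = zlib.crc32(key) % self._nslots
        for _ in range(self._nslots):
            stored, p = SLOT.unpack_from(self._map, i * SLOT.size)
            if stored == key:
                return p
            if stored == EMPTY:
                return default
            i = (i + 1) % self._nslots
        return default

    def close(self):
        self._map.close()
        self._file.close()
```

Each filter process would open the file once and call get() per token;
pages that are never touched are never read from disk.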
Might be faster, using an mmapped gdbm database or the like. I tried
using gdbm directly, but under high load the program responds too slowly
for my taste and also burns far too much CPU time (we don't have
access to better hardware than a K62-400 as our main server doing
Another thing that made me consider a client/server based solution is
the fact that you can then build a central probabilities database; this
(I think) solves many concerns that people have raised about training
Of course a central database can only be useful in a closed unit; it
would be pointless to share my data (which mainly consists of German
spam) with someone who lives in the U.S., as German spam should be quite
unlikely for them.
> This assumes that updating the probabilities database is a batch operation
> done periodically that creates a new database and then does a rename and
That's how it's done in the current model server.
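For illustration, the "write a new database, then rename" step might look
roughly like this. It is a sketch, not the code of the server mentioned
above; publish_atomically and the temp-file naming are hypothetical:

```python
import os
import tempfile


def publish_atomically(path, data):
    """Write data to a temp file next to path, then rename it over path.

    On POSIX the rename is atomic, so readers see either the old file or
    the new one, never a half-written one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".probs-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk first
        os.replace(tmp, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

On POSIX systems, processes that still have the old file open or mapped
keep seeing the old contents until they reopen the path.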
Well, I'll see what comes out of my efforts. Maybe it'll actually prove
to be useful.
Universität 18 - Zimmer 2206 - Saarbrücken