Graham's spam filter (was Lisp to Python translation criticism?)
cbbrowne at acm.org
Sun Aug 18 00:04:09 CEST 2002
Paul Rubin <phr-n2002b at NOSPAMnightsong.com> wrote:
> Erik Max Francis <max at alcyone.com> writes:
>> One obvious and immediate issue is that for an industrial-strength
>> filter, the database gets _huge_ (Graham's basic setup involved 4000
>> messages each in the spam and nonspam corpora), and reading and writing
>> the database (even with cPickle) each time a spam message comes through
>> starts to become intensive.
> Why not use dbhash? I think there's also a Python cdb wrapper somewhere.
cdb should be _really_ good for it.
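To make the point concrete, here is a minimal sketch of keeping the token counts in an on-disk keyed database rather than re-pickling one big dictionary on every message. It assumes a current Python, where the standard `dbm` package stands in for the old dbhash module; the `label:token` key layout and the helper names `add_message` and `token_count` are made up for illustration and are not taken from Graham's filter.

```python
import dbm
import os
import tempfile


def add_message(db, tokens, label):
    """Increment the stored count of each token under the given label.

    Only the keys for this message's tokens are read and rewritten;
    the rest of the database is left untouched on disk.
    """
    for tok in tokens:
        key = f"{label}:{tok}".encode("utf-8")  # hypothetical key layout
        count = int(db[key]) if key in db else 0
        db[key] = str(count + 1).encode("ascii")


def token_count(db, token, label):
    """Return how often `token` has been seen under `label` (0 if never)."""
    key = f"{label}:{token}".encode("utf-8")
    return int(db[key]) if key in db else 0


# Throwaway location for the demo database.
path = os.path.join(tempfile.mkdtemp(), "tokens")

# "c" opens the database read-write, creating it if it does not exist.
with dbm.open(path, "c") as db:
    add_message(db, ["free", "money", "free"], "spam")
    add_message(db, ["meeting", "free"], "good")

# Reopen read-only, as a classifier would when scoring a new message.
with dbm.open(path, "r") as db:
    spam_free = token_count(db, "free", "spam")
    good_free = token_count(db, "free", "good")
```

Each incoming message then costs a handful of keyed reads and writes, independent of how large the database has grown.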
By the way, _my_ setup, with Ifile, involves a corpus of tens of
thousands of messages that probably exceeds 500MB in total.
Ifile distills that down to a "corpus file" about 7.5MB long.
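The reason that kind of reduction is possible at all is that a word-count summary grows with the vocabulary, not with the number of messages. The following is only a rough sketch of that idea under my own assumptions (a crude regex tokenizer, a pickled Counter as the "summary"); it is not Ifile's actual file format or algorithm, and `distill` is a hypothetical name.

```python
from collections import Counter
import pickle
import re


def distill(messages):
    """Collapse a list of message bodies into per-word counts."""
    counts = Counter()
    for text in messages:
        # Crude tokenizer (an assumption): lowercase runs of letters/apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


# A toy corpus: many messages, very little distinct vocabulary.
messages = ["Buy cheap pills now"] * 1000 + ["Lunch meeting moved to noon"] * 1000

raw_size = sum(len(m) for m in messages)   # characters in the raw corpus
summary = distill(messages)
summary_size = len(pickle.dumps(summary))  # bytes in the serialized summary
```

With real mail the vocabulary is far larger than in this toy corpus, so the ratio is less dramatic, but the summary still stays much smaller than the messages it was built from.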
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
"What's wrong with 3rd party tools? Especially if they are free? What
the **** do you think Unix is anyway? It's a big honkin' party of 3rd
party free tools." -- Bob Cassidy (rmcassid at uci.edu)