[spambayes-dev] Deprecated options
Graham Toal
gtoal at gtoal.com
Fri Aug 6 14:31:20 CEST 2004
Tim Peters <tim.peters at gmail.com> wrote:
> > I know Bill Y. (CRM114's creator) used to participate here, perhaps he
> > could offer some ideas. To me, using SBPH to generate tokens for
> > SpamBayes seems like it would be fairly straightforward. The rest of
> > SpamBayes would stay mostly the same.
>
> It's easy to experiment with, but for practical application it needs a
> different database approach, to exploit the nature of the keys.
I had a hack at a different DB approach and although I admit I did not take
it as far as a working spam filter, the proof of concept implementation
was at least enough to convince me that it was an avenue worth exploring.
I wrote it up here:
http://www.gtoal.com/mt/archives/2004_02.html
and there is some sample code here:
http://www.gtoal.com/spam/devel-temp/tokra3.c.html
Without any knowledge of the structure of text at all, it was able to
intuit sequences such as 'e' as being symptomatic of spam.
Two conclusions:
1) You can afford to classify much longer sequences than simple n-grams,
because if you use variable-length sequences, they're self-limiting.
2) The natural fit data structure for this is a 256-trie. (specifically
a DAWG but implemented as a trie rather than a DAG to allow easy additions)
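
To make the two conclusions concrete, here is a rough sketch in Python. It is
not the code from tokra3.c. The names Node, train and counts are made up for
illustration, and the growth rule (extend a sequence by at most one byte each
time its prefix is seen again) is only my assumption about one way to get
"self-limiting" variable-length sequences. Each node has 256 child slots, one
per byte value, plus spam and ham counters.

```python
class Node:
    """One trie node with a 256-way fan-out, one slot per byte value."""
    __slots__ = ("children", "spam", "ham")

    def __init__(self):
        self.children = [None] * 256
        self.spam = 0
        self.ham = 0


def train(root, data, is_spam):
    """Count byte sequences from every start offset of `data` (a bytes object).

    Assumed growth rule: each walk may create at most one new node, so a
    sequence only gets longer once its prefix has been seen before. Text
    that rarely repeats therefore never builds deep branches.
    """
    for start in range(len(data)):
        node = root
        for byte in data[start:]:
            child = node.children[byte]
            created = child is None
            if created:
                child = node.children[byte] = Node()
            if is_spam:
                child.spam += 1
            else:
                child.ham += 1
            if created:
                break  # stop here; a later sighting may extend this branch
            node = child


def counts(root, seq):
    """Return (spam, ham) counts for the byte sequence, or (0, 0) if unseen."""
    node = root
    for byte in seq:
        node = node.children[byte]
        if node is None:
            return (0, 0)
    return (node.spam, node.ham)
```

Additions are cheap because a plain trie never has to merge or split shared
suffixes the way a true DAG would. A full list of 256 slots per node is
wasteful in memory, so anything beyond an experiment would want a more compact
node layout.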
regards
Graham