[spambayes-dev] Deprecated options

Graham Toal gtoal at gtoal.com
Fri Aug 6 14:31:20 CEST 2004


Tim Peters <tim.peters at gmail.com> wrote:
> > I know Bill Y. (CRM-144's creator) used to participate here, perhaps he
> > could offer some ideas. To me, using SBPH to generate tokens for
> > SpamBayes seems like it would be fairly straightforward. The rest of
> > SpamBayes would stay mostly the same.
>
> It's easy to experiment with, but for practical application it needs a
> different database approach, to exploit the nature of the keys.

I had a hack at a different DB approach, and although I admit I did not take
it as far as a working spam filter, the proof-of-concept implementation
was at least enough to convince me that it was an avenue worth exploring.

I wrote it up here:

http://www.gtoal.com/mt/archives/2004_02.html

and there is some sample code here:

http://www.gtoal.com/spam/devel-temp/tokra3.c.html

Without any knowledge of the structure of text at all, it was able to
intuit sequences such as '&#101;' as being symptomatic of spam.


Two conclusions:

1) You can afford to classify much longer sequences than simple n-grams,
   because if you use variable-length sequences, they're self-limiting.

2) The natural data structure for this is a 256-way trie (specifically
   a DAWG, but implemented as a trie rather than a DAG so that additions
   are easy).
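
To make the two conclusions a bit more concrete, here is a rough sketch in
Python (not a port of tokra3.c). It assumes one plausible reading of
"self-limiting": a byte sequence is only extended by one more byte once its
prefix has already been seen a few times, so rare sequences stay short. The
names Node, train, lookup, grow_after and max_len are all made up for
illustration, and the growth rule itself is an assumption rather than what
the sample code necessarily does.

```python
class Node:
    """One node of a 256-way trie; each child slot is one byte value."""
    __slots__ = ("children", "spam", "ham")

    def __init__(self):
        self.children = None  # allocated lazily as a list of 256 slots
        self.spam = 0         # times this sequence was seen in spam
        self.ham = 0          # times this sequence was seen in ham


def train(root, data, is_spam, grow_after=2, max_len=32):
    """Count the byte sequences starting at every offset of data.

    Assumed growth rule: a sequence is followed one byte further only
    when it has been seen at least grow_after times in total, which is
    what keeps the variable-length sequences self-limiting.
    """
    for start in range(len(data)):
        node = root
        for byte in data[start:start + max_len]:
            if node.children is None:
                node.children = [None] * 256
            child = node.children[byte]
            if child is None:
                child = node.children[byte] = Node()
            if is_spam:
                child.spam += 1
            else:
                child.ham += 1
            if child.spam + child.ham < grow_after:
                break  # too rare so far: do not extend this sequence yet
            node = child


def lookup(root, seq):
    """Return (spam_count, ham_count) for seq, or (0, 0) if never stored."""
    node = root
    for byte in seq:
        if node.children is None or node.children[byte] is None:
            return (0, 0)
        node = node.children[byte]
    return (node.spam, node.ham)
```

Training this on raw message bytes needs no tokenizer at all: a sequence
that keeps recurring in spam simply grows longer in the trie each time it
is seen, while one-off sequences never get past their first byte or two. A
real DAWG would also merge common suffixes; the plain trie gives that up
in exchange for cheap insertions.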

regards

Graham

