[Mailman-Developers] Re: [Spambayes] spambayes fronting a mailing list?

Thu Jan 16 18:45:58 EST 2003

[I added mailman-developers to this list because I think people will
be interested in my prototype integration of Mailman and spambayes, a
statistical learning classifier, which I've targeted for spam fighting
on Mailman lists.  -BAW].

>>>>> "SM" == Skip Montanaro <skip at pobox.com> writes:

    SM> In my case I sidestepped training altogether because the
    SM> list's content is a subset of the stuff I'm interested in
    SM> anyway.  Most of the "spam" messages encountered by the list
    SM> at this point are really of the virus/worm variety, and since
    SM> it's set up for members only posting, little, if any garbage
    SM> actually gets through to the list, even without using
    SM> spambayes.

I suspect python.org will be similar, since we have many other spam
defenses in place.  I've just been playing with my prototype, and
yeah, it sure learns fast even with no a priori training.  I'm not
100% sure a train-on-the-fly approach will work, so it's worth some
real-world banging.

In my simplified approach, you start out holding all unsure and spam.
Legit messages will hit one of those first, likely unsure if your list
wasn't advertised on Usenet before real people started posting
<wink>.  There's one extra button on the admindb page called
"Train?".  Click this if you want to train a held message based on
your action.  If you approve the message, it gets trained as ham, and
if you reject or discard it, it gets trained as spam.
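
A minimal sketch of the hold-and-train rule described above, not the
prototype's actual code.  The cutoff values, the Action names, and the
function names are all assumptions made for illustration; the only
behavior taken from the description is "hold unsure and spam" and
"approve trains as ham, reject or discard trains as spam, and only
when Train? is checked".

```python
from enum import Enum

# Hypothetical cutoffs, chosen only for this sketch.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


class Action(Enum):
    """Moderator actions on a held message (names are illustrative)."""
    APPROVE = "approve"
    REJECT = "reject"
    DISCARD = "discard"
    DEFER = "defer"


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spam probability in [0, 1] to 'ham', 'unsure', or 'spam'."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def should_hold(score):
    """Hold everything that is not confidently ham."""
    return classify(score) != "ham"


def training_label(action, train_checked):
    """Return 'ham', 'spam', or None (no training) for a moderator action."""
    if not train_checked:
        return None
    if action is Action.APPROVE:
        return "ham"
    if action in (Action.REJECT, Action.DISCARD):
        return "spam"
    return None  # deferring a message teaches the classifier nothing
```

With an untrained classifier nearly every score lands in the unsure
band, which is why legit messages are held at first.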

Within about 10 messages (first a bunch of ham, then a random and
unscientific barrage <wink> of spam and ham) the classifier was doing
pretty well.  It was catching all the spam and letting through most of
the ham.  The ham recognition definitely went up as I approved more
messages.

False positives get caught on the admindb screen, so you approve and
train them in one action.  Although I never saw any false negatives, I
think the way to handle these will be to add a -spam address that
people can send messages to.  If the list admin sends it then it gets
spam trained.  If not, the list admin will have a chance to decide
whether to spam train it or not.
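
The proposed -spam address could be handled roughly as follows.  This
is only a sketch of the idea: handle_spam_report, the admins
collection, the train_spam callback, and the review queue are
hypothetical names, and nothing here is an existing Mailman interface.

```python
def handle_spam_report(sender, admins, message, train_spam, review_queue):
    """Process a message sent to a hypothetical listname-spam address.

    sender       -- address the report came from
    admins       -- iterable of list admin addresses
    message      -- the reported message (any object)
    train_spam   -- callable that trains the classifier on a spam message
    review_queue -- list collecting reports that need an admin decision
    """
    admin_addrs = {addr.lower() for addr in admins}
    if sender.lower() in admin_addrs:
        # The list admin is trusted: train as spam right away.
        train_spam(message)
        return "trained"
    # Anyone else: leave the decision to the list admin.
    review_queue.append(message)
    return "queued"
```

Routing non-admin reports through a queue keeps a single prankster
from spam-training legitimate posts.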

    SM> One reason I'm interested in separating pop3proxy into two
    SM> functions (POP retrieval/classifying and training/web UI) is
    SM> that the training/web component should be useful for other
    SM> spambayes users.  Right now in my current environment,
    SM> training is clunky enough that I only train on unsures and
    SM> mistakes.  While that works okay because my starting corpus
    SM> was so large (around 20,000 messages) the indications from
    SM> people who've experimented with that sort of training are that
    SM> the quality of classification does degrade over time.

That's an important point.  While I'm not sure that with my approach
the quality of classification will improve over time <wink>, I think a
training regimen integrated with the admindb stuff will be the most
natural for a Mailman list admin.

BTW, the hammie.py interface was all I needed for my prototype.  One
reason for going with hammie is that each mailing list needs its own
database, and I can just create a Hammie, associate it with a list,
and tie it easily into Mailman's load/save mechanism.
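
To illustrate the one-database-per-list idea, here is a small sketch
that uses a toy token counter in place of a real classifier.
ToyClassifier, db_path, load_classifier, and save_classifier are all
hypothetical; this is neither the hammie.py API nor Mailman's
load/save mechanism, just the shape of the arrangement.

```python
import json
from collections import Counter
from pathlib import Path


class ToyClassifier:
    """Stand-in that only counts tokens; NOT the spambayes Hammie class."""

    def __init__(self, ham=None, spam=None):
        self.ham = Counter(ham or {})
        self.spam = Counter(spam or {})

    def train(self, text, is_spam):
        """Add the message's tokens to the spam or ham counts."""
        target = self.spam if is_spam else self.ham
        target.update(text.lower().split())


def db_path(data_dir, listname):
    """Each list gets its own file, so training never leaks across lists."""
    return Path(data_dir) / f"{listname}-classifier.json"


def load_classifier(data_dir, listname):
    """Load the list's classifier, or start empty if none was saved."""
    path = db_path(data_dir, listname)
    if not path.exists():
        return ToyClassifier()
    data = json.loads(path.read_text())
    return ToyClassifier(data["ham"], data["spam"])


def save_classifier(data_dir, listname, classifier):
    """Persist the list's classifier alongside the list's other state."""
    payload = {"ham": dict(classifier.ham), "spam": dict(classifier.spam)}
    db_path(data_dir, listname).write_text(json.dumps(payload))
```

Calling load_classifier when a list is loaded and save_classifier when
it is saved keeps each list's training data tied to that list alone.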

-Barry