[Spambayes] spambayes fronting a mailing list?
skip at pobox.com
Thu Jan 16 06:31:43 EST 2003
BAW> This ought to work fairly well, I think, modulo the training issue.
BAW> My idea was to not train the list at all, before turning on
BAW> spambayes. So the first batch of messages will all get held as
BAW> unsure, and you'd use the admindb page to accept and reject
BAW> messages. Accepted messages would train as ham and rejected messages
BAW> would get trained as spam.
In my case I sidestepped training altogether because the list's content is a
subset of the stuff I'm interested in anyway. Most of the "spam" messages
encountered by the list at this point are really of the virus/worm variety,
and since it's set up for members-only posting, little, if any, garbage
actually gets through to the list, even without using spambayes.
BAW> The u/i for these options is undecided -- maybe you have an
BAW> additional "train as..." radio button. I don't think this matters
BAW> much right now.
One reason I'm interested in separating pop3proxy into two functions (POP
retrieval/classifying and training/web UI) is that the training/web
component should be useful for other spambayes users. Right now in my
current environment, training is clunky enough that I only train on unsures
and mistakes. While that works okay because my starting corpus was so large
(around 20,000 messages), the indications from people who've experimented
with that sort of training are that the quality of classification does
degrade over time.
Last night I ripped out the POP stuff from pop3proxy, renamed the result
proxytrainer and added one extra method, onUpload. Then I wrote a simple
proxytee.py script which passes stdin to stdout and uploads the message it
received to http://localhost:8880/upload as a file upload (in theory,
allowing upload of large mbox files). The mbox upload doesn't seem to be
quite working yet and there's still that pesky infinite loop in onReview,
but I have hope it will eventually work pretty well. At that point, anyone
should be able to use it as a training interface. All they will need is a
tee-type hook they can insert into their mail transport somewhere.
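To make the tee idea concrete, here is a minimal sketch of what such a hook
might look like. It is not the actual proxytee.py: the form field name, the
filename, and the decision to swallow upload errors are all assumptions made
for illustration. Only the upload URL comes from the description above.

```python
import sys
import urllib.request
import uuid

# URL taken from the post; everything else below is an illustrative guess.
UPLOAD_URL = "http://localhost:8880/upload"


def build_multipart(message, field="file", filename="message.txt"):
    """Wrap raw message bytes in a one-part multipart/form-data body.

    The field and filename defaults are hypothetical; the real trainer
    may expect different names.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + message + tail, f"multipart/form-data; boundary={boundary}"


def upload_to_trainer(message, url=UPLOAD_URL):
    """POST the message to the trainer as a file upload."""
    body, content_type = build_multipart(message)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()


def tee(instream, outstream, upload):
    """Copy instream to outstream unchanged, then hand a copy to upload."""
    message = instream.read()
    outstream.write(message)
    outstream.flush()
    try:
        upload(message)
    except OSError:
        # Assumed policy: a dead trainer must never hold up mail delivery.
        pass
    return message


if __name__ == "__main__":
    tee(sys.stdin.buffer, sys.stdout.buffer, upload_to_trainer)
```

The message is written to stdout before the upload is attempted, so the
mail pipeline sees its data even if the trainer is slow or unreachable.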
A bit further down the road, I will probably dump the asyncore stuff in
favor of something based on SimpleHTTPServer just to reduce the number of
lines of code. Without the POP stuff going on there's no great need for the
channel multiplexing. Even without threading, the amount of work the server
would have to do per click on the user interface is minimal.
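As a rough illustration of how little code a single-threaded server needs,
here is a sketch using http.server, the Python 3 home of what was
SimpleHTTPServer/BaseHTTPServer. It is not proxytrainer's code: the handler
class, the make_handler and serve helpers, and the on_upload callback are
hypothetical names, and the POST body is treated as raw message bytes rather
than decoded as a multipart file upload.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(on_upload):
    """Build a request handler that passes each uploaded body to on_upload.

    on_upload is a hypothetical callback standing in for whatever queues
    a message for training.
    """

    class TrainerHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/upload":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            # Simplification: the body is taken as-is, with no multipart
            # decoding of a file-upload form.
            on_upload(self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"queued\n")

        def log_message(self, format, *args):
            pass  # keep the sketch quiet on stderr

    return TrainerHandler


def serve(on_upload, host="localhost", port=8880):
    """Handle requests one at a time; no threads, no channel multiplexing."""
    HTTPServer((host, port), make_handler(on_upload)).serve_forever()
```

Calling serve(queue.append) with a plain list named queue would collect
uploads in memory; a real trainer would presumably store them somewhere
more durable.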
BAW> So as your list warms up, you'll be training the system. I wonder
BAW> how long it'll take before spambayes gets pretty good at detecting
BAW> what's appropriate and what's not for your list?
Like I indicated, I gave it a head start. ;-)