[spambayes-dev] Standalone SpamBayes classifier for websites
Tony Meyer
spambayes-dev at tangomu.com
Wed May 16 09:29:57 CEST 2007
> Alas, sb_server.py and sb_imapfilter.py don't seem to share a lot of
> code (save for using Dibbler to build the web user interface). Is that
> true?
Somewhat. Unfortunately, when I originally wrote sb_imapfilter I
didn't use IMAP myself, so when deciding whether the IMAP solution
should be a 'filter' (i.e. periodically connect to an IMAP server and
classify messages) or a proxy (i.e. intercept connections to the
server and classify on the fly), I went with the majority vote. In
many ways a proxy would be the simpler solution,
and would certainly resemble sb_server a lot more (I've written an
IMAP proxy for another project, based on the sb_server POP3 proxy,
and there is a lot of overlap). OTOH, there are advantages
(particularly training) to the filter method.
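
To make the filter/proxy distinction concrete, here is a rough sketch.
It is not SpamBayes code: toy_score, filter_pass and proxy_retrieve are
hypothetical names, and the "mailbox" is just a dict. The filter scans
messages that are already stored, while the proxy classifies each
message at the moment the client asks for it.

```python
# Hypothetical stand-ins: none of this is SpamBayes code.

def toy_score(text):
    """Stand-in for a real classifier: returns a made-up spam probability."""
    return 0.99 if "viagra" in text.lower() else 0.01


def filter_pass(folder, seen, score):
    """Filter style: scan a folder that already holds delivered messages.

    `folder` maps message ids to message text; `seen` is the set of ids
    classified on earlier passes. Returns {id: probability} for new ones.
    """
    results = {}
    for msg_id, text in folder.items():
        if msg_id not in seen:
            results[msg_id] = score(text)
            seen.add(msg_id)
    return results


def proxy_retrieve(fetch, msg_id, score):
    """Proxy style: classify one message as the client retrieves it."""
    text = fetch(msg_id)
    return text, score(text)


inbox = {"1": "Cheap viagra here", "2": "Lunch tomorrow?"}
seen = set()
first_pass = filter_pass(inbox, seen, toy_score)    # classifies both messages
second_pass = filter_pass(inbox, seen, toy_score)   # nothing new to classify
text, prob = proxy_retrieve(inbox.get, "2", toy_score)
```

A real filter would run filter_pass on a timer against the server; the
proxy needs no timer because the client's own requests drive it.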
> It seems the user interface, classifier bits and storage should be
> essentially identical.
The user interface is nearly identical. The shared part is in
UserInterface.py, with the separate subclasses in ProxyUI.py (POP3)
and ImapUI.py. The majority of the code in ImapUI.py deals with
presenting a list of folders from an IMAP server to the user, to
select which should be scanned for messages to classify/train (this
probably wouldn't be necessary with a proxy). The majority of the
code in ProxyUI.py deals with the browser-based training interface
(which the IMAP filter doesn't have - you just put messages in the
appropriate folders on the server).
The classifier and storage bits are pretty much identical (storage.py
and FileCorpus.py respectively).
> Any ideas on the shortest route to a core server that provides the
> user, training and storage interfaces? Start from scratch? Rip the
> POP3 stuff out of sb_server.py? Rip the IMAP stuff out of
> sb_imapfilter.py? I'd really hate to reinvent the wheel since we seem
> to have two wheels already.
> Once that core server is available, adapting to different environments
> should be possible by plugging in specific protocol adapters
Definitely don't start with sb_imapfilter.py - it's basically a
scanner, not an on-demand-classifier.
Probably the best place to start would be with the State class in
sb_server.py. There are some POP3-specific parts in there, but
personally I would be happy if they were abstracted out (e.g. a State
class and a POP3ProxyState subclass). I could do that
(promptly ;)). What you then have are:
* State.bayes (the classifier)
* State.hamCorpus, State.spamCorpus, State.unknownCorpus (storage
  of 'messages': untrained messages in unknownCorpus, and trained
  messages, which expire, in ham/spamCorpus)
* Training via moving messages between corpora.
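
Roughly, the split could look like the sketch below. This is only an
illustration of the shape, not the existing sb_server.py code: the
attribute names follow the list above, but Corpus, takeMessage, train
and the servers parameter are hypothetical in-memory stand-ins, and the
actual classifier update is left out.

```python
# Hypothetical sketch: attribute names mirror the list above, but the
# bodies are in-memory stand-ins, not the sb_server.py implementation.

class Corpus:
    """Minimal in-memory message store keyed by message id."""

    def __init__(self):
        self.messages = {}

    def addMessage(self, msg_id, text):
        self.messages[msg_id] = text

    def takeMessage(self, msg_id):
        # Remove the message from this corpus and hand it back.
        return self.messages.pop(msg_id)


class State:
    """Protocol-neutral part: the classifier plus the three corpora."""

    def __init__(self, bayes=None):
        self.bayes = bayes              # the classifier (None in this sketch)
        self.hamCorpus = Corpus()
        self.spamCorpus = Corpus()
        self.unknownCorpus = Corpus()

    def train(self, msg_id, is_spam):
        """Train by moving a message out of unknownCorpus."""
        text = self.unknownCorpus.takeMessage(msg_id)
        target = self.spamCorpus if is_spam else self.hamCorpus
        target.addMessage(msg_id, text)


class POP3ProxyState(State):
    """POP3-only details live in the subclass."""

    def __init__(self, bayes=None, servers=()):
        super().__init__(bayes)
        self.servers = list(servers)    # hypothetical: (host, port) pairs


state = POP3ProxyState(servers=[("pop.example.com", 110)])
state.unknownCorpus.addMessage("m1", "Subject: hello\n\nhi")
state.train("m1", is_spam=False)        # m1 moves from unknown to ham
```

A website front end would then use State directly (or its own small
subclass) and never touch the POP3 parts.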
Once you've got something that looks like a message, you can do
something like sb_server's onRetr for classification and storage
(I've cut bits that probably aren't relevant):
"""
import email
import spambayes.message

# messageText is the raw message text; state is the sb_server State instance.
msg = email.message_from_string(messageText,
                                _class=spambayes.message.SBHeaderMessage)
msg.setId(state.getNewMessageName())
# Now find the spam disposition and add the header.
(prob, clues) = state.bayes.spamprob(msg.tokenize(), evidence=True)
msg.addSBHeaders(prob, clues)
cls = msg.GetClassification()
state.RecordClassification(cls, prob)
# Cache the message. Write the message into the Unknown cache.
makeMessage = state.unknownCorpus.makeMessage
message = makeMessage(msg.getId(), msg.as_string())
state.unknownCorpus.addMessage(message)
"""
For the user interface, you can just create a
UserInterface.UserInterface subclass (needs a Home page/method, and
an __init__ method). Actually, you probably want
ProxyUI.ProxyUserInterface as-is, with a different set of options to
offer in the configuration pages (the parm_ini_map and adv_map used
in the __init__). (There would be a "No POP3 proxies running"
message on the main page, but you could ignore that or subclass
appropriately).
It's a long time since I've worked with the browser interface code,
but I'm pretty sure that this would give you what you want.
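
If you want to try the parse/score/annotate flow from the onRetr excerpt
before wiring in SpamBayes, something like the following stdlib-only
sketch may help. Everything SpamBayes-specific here is an assumption:
toy_spamprob stands in for state.bayes.spamprob(), and the header name
and the 0.20/0.90 cutoffs are placeholders that you should check against
your own options.

```python
import email

# Assumed values: the header name and the cutoffs are placeholders for
# this sketch; check your SpamBayes options for the real ones.
HEADER = "X-Spambayes-Classification"
HAM_CUTOFF, SPAM_CUTOFF = 0.20, 0.90


def toy_spamprob(msg):
    """Hypothetical scorer standing in for state.bayes.spamprob()."""
    body = msg.get_payload()
    return 0.95 if "free money" in body.lower() else 0.05


def classify_text(message_text, spamprob=toy_spamprob):
    """Parse raw text, score it, and record the verdict in a header."""
    msg = email.message_from_string(message_text)
    prob = spamprob(msg)
    if prob >= SPAM_CUTOFF:
        verdict = "spam"
    elif prob <= HAM_CUTOFF:
        verdict = "ham"
    else:
        verdict = "unsure"
    msg[HEADER] = verdict
    return msg, prob


raw = "From: a@example.com\nSubject: hi\n\nFree money inside!\n"
msg, prob = classify_text(raw)
```

Swapping toy_spamprob for the real classifier call, and caching the
result in unknownCorpus, gets you back to the excerpt.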
Cheers,
Tony