[spambayes-dev] Standalone SpamBayes classifier for websites

Tony Meyer spambayes-dev at tangomu.com
Wed May 16 09:29:57 CEST 2007


> Alas, sb_server.py and sb_imapfilter.py don't seem to share a lot of
> code (save for using Dibbler to build the web user interface).  Is
> that true?

Somewhat.  Unfortunately, when I originally wrote sb_imapfilter I
didn't use IMAP myself, so when deciding whether the IMAP solution
should be a 'filter' (i.e. periodically connect to an IMAP server and
classify messages) or a proxy (i.e. intercept connections to the
server and classify on the fly), I went with the majority vote.  In
many ways a proxy would be the simpler solution, and would certainly
resemble sb_server a lot more (I've written an IMAP proxy for another
project, based on the sb_server POP3 proxy, and there is a lot of
overlap).  OTOH, there are advantages (particularly for training) to
the filter method.

> It seems the user interface, classifier bits and storage should be
> essentially identical.

The user interface is nearly identical.  The shared part is in  
UserInterface.py, with the separate subclasses in ProxyUI.py (POP3)  
and ImapUI.py.  The majority of the code in ImapUI.py deals with  
presenting a list of folders from an IMAP server to the user, to  
select which should be scanned for messages to classify/train (this  
probably wouldn't be necessary with a proxy).  The majority of the  
code in ProxyUI.py deals with the browser-based training interface  
(which the IMAP filter doesn't have - you just put messages in the  
appropriate folders on the server).

The classifier and storage bits are pretty much identical (storage.py  
and FileCorpus.py respectively).

> Any ideas on the shortest route to a core server that provides the
> user, training and storage interfaces?  Start from scratch?  Rip the
> POP3 stuff out of sb_server.py?  Rip the IMAP stuff out of
> sb_imapfilter.py?  I'd really hate to reinvent the wheel since we
> seem to have two wheels already.
> Once that core server is available, adapting to different environments
> should be possible by plugging in specific protocol adapters

Definitely don't start with sb_imapfilter.py - it's basically a  
scanner, not an on-demand-classifier.

Probably the best place to start would be with the State class in  
sb_server.py.  There are some POP3-specific parts in there, but  
personally I would be happy if they were abstracted out (e.g. a State  
class and a POP3ProxyState subclass).  I could do that  
(promptly ;)).  What you then have are:

  * State.bayes (the classifier)
  * State.hamCorpus, State.spamCorpus, State.unknownCorpus (storage
    of 'messages': untrained messages in unknownCorpus, and trained
    messages (expiring) in ham/spamCorpus).
  * Training via moving messages between corpora.
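
The split suggested above might look roughly like the following.  This
is only a sketch under stated assumptions: ToyClassifier, Corpus, State
and POP3ProxyState are simplified stand-ins written for illustration,
not the real classes from sb_server.py or FileCorpus.py, and the
on_add/on_remove hooks, the servers attribute and the move_message
helper are hypothetical names.

```python
class ToyClassifier:
    """Stand-in for state.bayes: it only counts what it was trained on."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0

    def learn(self, text, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, text, is_spam):
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1


class Corpus:
    """Stand-in corpus: message id -> text, with optional training hooks."""

    def __init__(self, on_add=None, on_remove=None):
        self.messages = {}
        self.on_add = on_add
        self.on_remove = on_remove

    def add(self, msg_id, text):
        self.messages[msg_id] = text
        if self.on_add is not None:
            self.on_add(text)

    def remove(self, msg_id):
        text = self.messages.pop(msg_id)
        if self.on_remove is not None:
            self.on_remove(text)
        return text


class State:
    """Protocol-neutral core: one classifier and three corpora."""

    def __init__(self):
        self.bayes = ToyClassifier()
        self.hamCorpus = self._trained_corpus(is_spam=False)
        self.spamCorpus = self._trained_corpus(is_spam=True)
        self.unknownCorpus = Corpus()  # untrained messages, no hooks

    def _trained_corpus(self, is_spam):
        # Adding a message trains on it; removing it untrains.
        return Corpus(
            on_add=lambda text: self.bayes.learn(text, is_spam),
            on_remove=lambda text: self.bayes.unlearn(text, is_spam),
        )


class POP3ProxyState(State):
    """Everything POP3-specific would live here instead of in State."""

    def __init__(self, servers=()):
        super().__init__()
        self.servers = list(servers)  # hypothetical POP3-only attribute


def move_message(msg_id, source, dest):
    """Train (or retrain) by moving a message between corpora."""
    dest.add(msg_id, source.remove(msg_id))


state = POP3ProxyState()
state.unknownCorpus.add("msg-1", "cheap pills")
move_message("msg-1", state.unknownCorpus, state.spamCorpus)  # train as spam
move_message("msg-1", state.spamCorpus, state.hamCorpus)      # correct it
```

The point of the shape is that a non-POP3 front end would instantiate
State directly and never touch the subclass.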

Once you've got something that looks like a message, you can do  
something like sb_server's onRetr for classification and storage  
(I've cut bits that probably aren't relevant):

"""
import email
import spambayes.message

# Assumes messageText holds the raw message text and state is an
# sb_server State instance.
msg = email.message_from_string(messageText,
                                _class=spambayes.message.SBHeaderMessage)
msg.setId(state.getNewMessageName())

# Now find the spam disposition and add the header.
(prob, clues) = state.bayes.spamprob(msg.tokenize(), evidence=True)
msg.addSBHeaders(prob, clues)
cls = msg.GetClassification()
state.RecordClassification(cls, prob)

# Cache the message.  Write the message into the Unknown cache.
makeMessage = state.unknownCorpus.makeMessage
message = makeMessage(msg.getId(), msg.as_string())
state.unknownCorpus.addMessage(message)
"""
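
For a website, "something that looks like a message" could be built
from the submitted form fields.  Below is a minimal sketch using only
the email package from a current Python 3 standard library; the
function name, its parameters and the X-Remote-Addr header are
assumptions made for illustration and are not part of SpamBayes.

```python
import email
from email.message import EmailMessage


def form_to_message_text(author, subject, body, remote_addr=None):
    """Wrap web form fields in RFC 2822 style text for a classifier."""
    msg = EmailMessage()
    msg["From"] = author          # expects something address-like
    msg["Subject"] = subject
    if remote_addr:
        # Hypothetical extra header so the tokenizer sees the origin.
        msg["X-Remote-Addr"] = remote_addr
    msg.set_content(body)
    return msg.as_string()


messageText = form_to_message_text(
    "Jane Doe <jane@example.com>",
    "Great post",
    "Buy cheap pills now\n",
    remote_addr="192.0.2.10",
)

# The result parses back like any other message.
parsed = email.message_from_string(messageText)
```

Header values containing line breaks are rejected by the email
package, so form input would need to be cleaned before it is used as
a header.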

For the user interface, you can just create a  
UserInterface.UserInterface subclass (needs a Home page/method, and  
an __init__ method).  Actually, you probably want  
ProxyUI.ProxyUserInterface as-is, with a different set of options to  
offer in the configuration pages (the parm_ini_map and adv_map used  
in the __init__).  (There would be a "No POP3 proxies running"
message on the main page, but you could ignore that or subclass
appropriately.)
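
The shape of such a subclass might be sketched as follows.  Everything
here is an assumption for illustration: the UserInterface class below
is a simplified stand-in (the real constructor signature should be
checked in UserInterface.py), WebsiteUserInterface is a hypothetical
name, the onHome method name and the layout of the option maps should
be checked against ProxyUI.py, and the real pages are built with
Dibbler rather than returned as strings.

```python
from types import SimpleNamespace


class UserInterface:
    """Simplified stand-in for UserInterface.UserInterface."""

    def __init__(self, bayes, config_parms=(), adv_parms=()):
        self.classifier = bayes
        self.parm_ini_map = config_parms
        self.advanced_options_map = adv_parms


# (section, option) pairs to offer on the configuration page; an entry
# whose option is None is used here as a heading.  Illustrative only.
parm_ini_map = (
    ("Storage Options", None),
    ("Storage", "persistent_storage_file"),
    ("Statistics Options", None),
    ("Categorization", "ham_cutoff"),
    ("Categorization", "spam_cutoff"),
)
adv_map = ()


class WebsiteUserInterface(UserInterface):
    """Hypothetical UI with no POP3 options and its own home page."""

    def __init__(self, state):
        super().__init__(state.bayes, parm_ini_map, adv_map)
        self.state = state

    def onHome(self):
        # Summary shown in place of the POP3 proxy status.
        waiting = len(self.state.unknownCorpus)
        return "<h1>SpamBayes</h1><p>%d messages awaiting training</p>" % waiting


# Stand-in state object: a classifier placeholder and two cached messages.
state = SimpleNamespace(bayes=object(), unknownCorpus=["msg-1", "msg-2"])
ui = WebsiteUserInterface(state)
page = ui.onHome()
```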

It's a long time since I've worked with the browser interface code,  
but I'm pretty sure that this would give you what you want.

Cheers,
Tony

