[Spambayes] How well does sb_imapfilter.py work?

Tony Meyer tameyer at ihug.co.nz
Thu Aug 19 01:29:46 CEST 2004


> I tried running sb_imapfilter.py -b and setup my 
> configuration. I then ran sb_imapfilter.py -t to train. It 
> took a very long time ... and then it just died.

Stuff about the dying is at the end of this message.  Taking a long time -
you were processing 1200 messages, which involves retrieving the message
from the server and writing it back once, so that can take a while.  I don't
know what "a very long time" is, of course, or how fast 'fast' is in terms
of the connection.  You're unlikely to often train on that many messages (my
whole database is less than 600 messages, spam *and* ham), so it wouldn't
normally be a problem (and typically sb_imapfilter would run in the
background, either with the -l option or via a cron script, so you wouldn't
even notice training).

> I looked at 
> the stats and it showed that about 600 of each type (spam and 
> ham) had been trained, though, so I figured I could try 
> running it against my inbox using sb_imapfilter.py -c. I 
> noticed after a while that it hadn't moved any messages to 
> the spam or unsure folders, but that there were a lot of 
> messages being duplicated in the inbox (so I stopped it).

With the 1.0 sb_imapfilter, messages are duplicated.  IMAP is a terrible
protocol - you can't edit messages, and you can't move them.  You can't even
delete them (just mark them for deletion and delete *all* messages so marked
in a folder).  sb_imapfilter writes a new version of each message it sees
with an ID header (the 1.1 sb_imapfilter does not do this in almost all
cases).  When messages are classified, it also writes another copy (1.1
still needs to do this), either in the Inbox (it has the classification
headers) or in the unsure/spam folder.  The old versions are marked for
deletion (your mailer may or may not indicate this to you).

You can get sb_imapfilter to purge the mailbox (deleting messages marked
with the \Deleted flag) as it goes, but this will delete any messages that
you have yourself marked for deletion, too.  It's also not undoable, so it's
wise to make sure that sb_imapfilter is running properly before you turn
that on.
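
To make the copy-then-flag dance concrete, here is a minimal sketch using
Python's standard imaplib module.  It is not the sb_imapfilter code:
replace_message and its parameters are names made up for this illustration,
and it assumes the caller has already logged in and selected the source
folder read-write.

```python
import imaplib
import time


def replace_message(conn, msg_num, new_bytes, dest_folder, expunge=False):
    """'Edit' or 'move' a message the only way IMAP allows.

    conn is an imaplib.IMAP4-style object with a folder selected;
    msg_num is the message's sequence number as a string.
    (Hypothetical helper, for illustration only.)
    """
    # 1. Write the new version as a brand-new message in the destination.
    stamp = imaplib.Time2Internaldate(time.time())
    typ, _ = conn.append(dest_folder, None, stamp, new_bytes)
    if typ != "OK":
        return False  # copy failed, so leave the original untouched
    # 2. Flag the old version; it stays in the folder until an expunge.
    conn.store(msg_num, "+FLAGS", "\\Deleted")
    # 3. Optional purge.  EXPUNGE removes *every* message in the selected
    #    folder that carries \Deleted, not just the one flagged above.
    if expunge:
        conn.expunge()
    return True
```

Leaving expunge off by default mirrors the advice above: the flagged
originals remain recoverable until something purges the folder.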

I don't know why mail wasn't turning up in the unsure/spam folder (unless
you simply hadn't come across any non-ham mail yet).  Testing sb_imapfilter
on a folder with just a few messages (including some spam) would be a good
idea.  You can also turn on the evidence/clues header, and look at that to
see why messages were classified as they were.  The output of the script
will also say how many messages were classified as each type.

> So, before I continue with my experiments ... has anyone had 
> any luck with the IMAP filter?

Some people, yes.  It is the youngest of the main scripts, and I suspect the
least used, so it does have more rough edges.  Patches are always gratefully
accepted!

> I also tried running the script in Linux, but it doesn't seem 
> to like to run unless you're root, and the Web server isn't 
> loading.

There shouldn't be any need to run sb_imapfilter.py as root.  What happened
when you tried?  Perhaps non-root doesn't have access to Python (which would
be odd)?

Is port 8880 busy, perhaps?  You can use '-o html_ui:port:8881' on the
command line to change the port to (eg) 8881 or anything else you like.
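
If you want to check whether something else already holds the port, a
quick test along these lines should do.  This is only a sketch with the
standard socket module; port_is_free is a name invented here, and it checks
the loopback address only, which is an assumption about how the web
interface binds.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind to (host, port), i.e. nothing holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use" (or a permissions error
            # for low-numbered ports when not running as root).
            return False
        return True


if __name__ == "__main__":
    print("8880 free?", port_is_free(8880))
```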

> I'm trying to figure out where it's looking for the 
> config file now so hopefully I can avoid the Web interface altogether.

If you use the '-t' or '-c' options on the command line with
sb_imapfilter.py the web interface doesn't start up.  The configuration file
is found either in the location specified by the BAYESCUSTOMIZE environment
variable, if you have set it up, or a file bayescustomize.ini in the current
directory, or a file .spambayesrc in your home directory, or (with Windows
only) a file SpamBayes\Proxy\bayescustomize.ini in your Windows 'Application
Data' directory.
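
Spelled out as code, that search looks roughly like the sketch below.  The
function names are invented for this illustration, and treating the list as
a strict first-match-wins order is an assumption taken from the order of
the sentence above; the real option-loading code may behave differently
(for instance by combining more than one file).

```python
import os


def candidate_config_files(environ, cwd, home, appdata=None):
    """List the places described above, in the order they were described."""
    candidates = []
    if environ.get("BAYESCUSTOMIZE"):
        candidates.append(environ["BAYESCUSTOMIZE"])
    candidates.append(os.path.join(cwd, "bayescustomize.ini"))
    candidates.append(os.path.join(home, ".spambayesrc"))
    if appdata:  # Windows only: the 'Application Data' directory
        candidates.append(
            os.path.join(appdata, "SpamBayes", "Proxy", "bayescustomize.ini"))
    return candidates


def first_existing(paths):
    """Return the first path that is an existing file, or None."""
    for path in paths:
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    paths = candidate_config_files(os.environ, os.getcwd(),
                                   os.path.expanduser("~"),
                                   os.environ.get("APPDATA"))
    print(first_existing(paths))
```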

> Also, out of curiosity, has anyone compared the efficacy of 
> Spam Bayes with DSPAM? That's the other software package I'm 
> going to be trying out.

Not to my knowledge (and I've seen very few filter comparisons worth
anything.  The most typical problem when one of the compared filters is
SpamBayes is not dealing with the 'unsure' range properly (whatever
'properly' might be <wink>)).  I'm sure people would be interested if you
wanted to post comparisons here.

> SpamBayes IMAP Filter Version 0.4 (May 2004)
> and engine SpamBayes Engine Version 0.3 (January 2004).
[...]
> TypeError: string payload expected: <type 'list'>

This is odd.  For some reason sb_imapfilter managed to get the message and
turn it into a message object (i.e. parse it) but then when turning it back
into a string (to put back on the IMAP server) it choked on a malformation.
The error is meant to occur earlier (where it is caught and handled).

This should only occur with rare messages, typically spam, that arrive
malformed in some way.  If sb_imapfilter does stop, you should be able to
just start it up again and it'll continue from where it was up to (or
possibly it will immediately choke on that message again, in which case
you'll have to move that one out of the way).
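
For the curious, the failure can be reproduced with the standard email
package alone.  The sketch below is not what sb_imapfilter does; the helper
names are invented, and the malformed message is built by hand as an
assumption about the kind of damage involved (a message that claims to be
plain text but whose payload is a list of parts).  It shows one defensive
option: catch the TypeError when flattening and skip the message.

```python
from email.message import Message


def make_malformed():
    """Build a message whose headers say text/plain but whose payload is a list."""
    outer = Message()
    outer["Content-Type"] = "text/plain"
    outer.set_payload([Message()])  # a list where a string is expected
    return outer


def flatten_or_none(msg):
    """Turn a message object back into text, or return None if it can't be."""
    try:
        return msg.as_string()
    except TypeError:
        # The generator raises "string payload expected" for such messages.
        return None


if __name__ == "__main__":
    if flatten_or_none(make_malformed()) is None:
        print("skipping malformed message")
```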

You can open a bug report <http://sf.net/projects/spambayes> about this if
you like (please include all the traceback that you posted here).  I'll get
to it when I can (but I'm away for 3.5 weeks from today, so it won't be for
a while).  In any case, the 1.1a1 SpamBayes release, when it comes out, will
include many sb_imapfilter improvements, so this might be handled by those.
Alternatively, using Python 2.4 would remove this problem, because the email
parsing is more robust.

=Tony Meyer

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.


