[Spambayes] big imapfilter.py problem

Mon Apr 28 18:52:26 EDT 2003

> >   File "./spambayes/classifier.py", line 301, in probability
> >     assert hamcount <= nham
> > AssertionError
> > What's the problem now?

(Excuse the verbosity of the answer.)

I believe I've established what the problem is here.  The RFC says that
a message's UID must be unique within the mailbox.  I was not thinking
clearly enough when I read that to realise that, in the context of the
RFC, a mailbox is a single folder, not the whole collection of folders.
This means that the UID on its own can't give a stable, unique
identifier for a message.  An unstable unique identifier could be
obtained by combining the UID and the folder's UID validity value, but
that isn't guaranteed to stay the same over different sessions.
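
To make the collision concrete, here is a minimal sketch of such a
combined key.  The folder names and numbers are made up, and MessageKey
and make_key are hypothetical names for illustration, not anything in
imapfilter.

```python
from typing import NamedTuple


class MessageKey(NamedTuple):
    """Hypothetical composite identifier for one message on one server."""
    folder: str
    uidvalidity: int
    uid: int


def make_key(folder, uidvalidity, uid):
    # Normalise the numeric parts so "42" and 42 compare equal.
    return MessageKey(folder, int(uidvalidity), int(uid))


# Two different messages in two folders may share the same UID.
inbox_msg = make_key("INBOX", 1051500000, 42)
spam_msg = make_key("Spam", 1051500123, 42)
same_uid = inbox_msg.uid == spam_msg.uid        # the bare UID collides
distinct_keys = inbox_msg != spam_msg           # the composite key does not

# If the server later reports a new UID validity value for the folder,
# a key stored in an earlier session no longer matches anything.
inbox_after_reset = make_key("INBOX", 1051600000, 42)
stale = inbox_msg != inbox_after_reset
```

The composite key tells the two folders apart, but a key stored in one
session is only useful for as long as the server keeps reporting the
same UID validity value.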

This explains why not everyone sees this.  One of the servers I use has
unique numbers for any message (judging from what I've seen) - I'm not
sure about the other, but it might too - it's up to the server to
decide how the UIDs are allocated.  On a server that reuses UIDs across
folders, though, imapfilter can end up untraining the wrong message,
and then you'll get lots of ham/spam count errors like the one above.
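
A toy model shows how untraining the wrong message trips that kind of
assertion.  This is a greatly simplified sketch, not the real
classifier.py: ToyClassifier and its methods are hypothetical, and only
the ham side of the bookkeeping is modelled.

```python
from collections import Counter


class ToyClassifier:
    """Simplified stand-in: per-word ham counts plus a ham message total."""

    def __init__(self):
        self.nham = 0              # number of ham messages trained
        self.hamcount = Counter()  # word -> number of ham messages with it

    def learn_ham(self, words):
        self.nham += 1
        for word in set(words):
            self.hamcount[word] += 1

    def unlearn_ham(self, words):
        self.nham -= 1
        for word in set(words):
            # Only decrement words we have actually counted.
            if self.hamcount[word] > 0:
                self.hamcount[word] -= 1

    def check(self, word):
        # Same shape as the invariant in the traceback: hamcount <= nham.
        return self.hamcount[word] <= self.nham


clf = ToyClassifier()
clf.learn_ham(["meeting", "agenda"])     # message A is trained as ham
clf.unlearn_ham(["lottery", "prize"])    # message B is untrained by mistake

# A's words are still counted once, but the ham total has dropped to zero.
invariant_holds = clf.check("meeting")
```

After the mistaken untrain, "meeting" still has a ham count of 1 while
the total is 0, so a check of the form hamcount <= nham fails for it.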

So I'm going to give up using the UID (in any form) as an identifier for
messages, and do what all the other spambayes apps (bar Outlook) do and
add my own id.  I'll store this with the message whenever it's saved.  This
will mean things will be a little slower (have to search for messages
with a header with a certain value, instead of for a message with a
particular uid), but slow and working is better than fast and not.
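
Roughly, the idea looks like the sketch below.  It is only a sketch
under assumptions: the header name, the use of uuid for the id, and the
helper names ensure_id and search_args are all hypothetical, and no
server connection is made.

```python
import email
import uuid

# Hypothetical header name, chosen for this example only.
ID_HEADER = "X-Example-Message-Id"


def ensure_id(raw_message):
    """Return (id, message text), adding the id header if it is missing."""
    msg = email.message_from_string(raw_message)
    if msg[ID_HEADER] is None:
        msg[ID_HEADER] = uuid.uuid4().hex
    return msg[ID_HEADER], msg.as_string()


def search_args(msg_id):
    """Criteria suitable for imaplib's IMAP4.search(None, *args)."""
    return ("HEADER", ID_HEADER, '"%s"' % msg_id)


raw = "From: someone@example.com\nSubject: hello\n\nbody text\n"
msg_id, stored = ensure_id(raw)

# Saving the message again must not change its id.
msg_id_again, _ = ensure_id(stored)
criteria = search_args(msg_id)
```

Because the id travels inside the message, it survives a move between
folders, at the cost of a HEADER search on the server instead of a
direct fetch by UID.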

I'll have to rework a reasonable chunk of imapfilter to do this.  It
will also mean that the entries in the message info databases
(spambayes.messageinfo.db, probably) will be out of date (the ids will
be changing), although the databases should still work.  Any training
done with imapfilter is now suspect, so I wouldn't advise keeping hold
of those dbs (hammie.db etc.).  I'll try to do this ASAP, but real life
is keeping me a bit busy over the next couple of days.

In other news, I've found the problem that was causing the __cmp__
error and fixed it (I committed another error at the same time, but I'll
check in a fix for that soon).  I've also found a place where things are
much slower than they need to be, so performance gains are still easily
possible.

=Tony Meyer