RE: [Spambayes] big imapfilter.py problem
File "./spambayes/classifier.py", line 301, in probability
    assert hamcount <= nham
AssertionError
What's the problem now?
(Excuse the verbosity of the answer.)

I believe I've established what the problem here is. The RFC says that a message's UID must be unique within the mailbox. I was not thinking clearly enough when I read that to realise that, in the context of the RFC, a mailbox is a folder, not the whole collection of folders. This means that a stable, unique identifier for a message can't be built from the UID. An unstable unique identifier could be obtained by combining the UID and the folder's UID validity value, but that isn't guaranteed to stay the same over different sessions.

This explains why not everyone sees this. One of the servers I use has unique numbers for any message (judging from what I've seen). I'm not sure about the other, but it might also; it's up to the server to decide how the UIDs are allocated. On a server where UIDs repeat between folders, though, imapfilter can end up trying to untrain the wrong message, and then you'll get lots of ham/spam count errors.

So I'm going to give up using the UID (in any form) as an identifier for messages, and do what all the other spambayes apps (bar Outlook) do and add my own. I'll store this with the message whenever it's saved. This will mean things will be a little slower (have to search for messages with a header with a certain value, instead of for a message with a particular uid), but slow and working is better than fast and not.

I'll have to rework a reasonable chunk of imapfilter to do this. It will also mean that the message info databases (spambayes.messageinfo.db, probably) will be invalid (the ids will be changing), although they should still work. Any training done with imapfilter is now suspect, so I wouldn't advise keeping hold of those db's (hammie.db etc).

I'll try and do this ASAP, but real life is keeping me a bit busy over the next couple of days.

In other notes, I've found the problem that was causing the __cmp__ error and fixed it (I committed another error at the same time, but I'll check in a fix for that soon).
I've also found a place where things are much slower than they need to be, so performance gains are still easily possible.
=Tony Meyer
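To make the failure mode concrete, here is a minimal sketch of how remembering messages by bare UID could break the hamcount <= nham check. It is a toy model, not spambayes code: ToyClassifier, the folder names and the token lists are all hypothetical, and the real classifier's bookkeeping may differ in detail.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a token-count classifier (not spambayes)."""

    def __init__(self):
        self.nham = 0              # number of ham messages trained
        self.hamcount = Counter()  # per token: ham messages containing it

    def learn_ham(self, tokens):
        self.nham += 1
        for tok in set(tokens):
            self.hamcount[tok] += 1

    def unlearn_ham(self, tokens):
        self.nham -= 1
        for tok in set(tokens):
            if self.hamcount[tok] > 0:
                self.hamcount[tok] -= 1

    def consistent(self):
        # The invariant behind the failing assertion: no token can appear
        # in more ham messages than were trained in total.
        return all(count <= self.nham for count in self.hamcount.values())


# UIDs are only unique per folder, so two folders can both hold a UID 1.
folders = {
    "INBOX": {1: ["meeting", "agenda"]},
    "Lists": {1: ["python", "release"]},
}

# Buggy bookkeeping (assumed for illustration): keyed by UID alone.
trained = {}
clf = ToyClassifier()
clf.learn_ham(folders["INBOX"][1])
trained[1] = "ham"

# Later, UID 1 turns up in another folder, looks "already trained",
# and gets untrained using a different message's tokens.
uid = 1
if uid in trained:
    clf.unlearn_ham(folders["Lists"][uid])

print(clf.nham, clf.consistent())
```

After the mistaken untrain, nham has dropped to 0 while the tokens of the message that really was trained still have a count of 1, so the consistency check fails for them.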
"Meyer, Tony" wrote So I'm going to give up using the UID (in any form) as an identifier for messages, and do what all the other spambayes apps (bar Outlook) do and add my own. I'll store this with the message whenever it's saved. This will mean things will be a little slower (have to search for messages with a header with a certain value, instead of for a message with a particular uid), but slow and working is better than fast and not.
Note that a number of IMAP servers out there support caching of headers. I know we locally configured Cyrus to cache the headers that we care about at work.
Anthony
--
Anthony Baxter <anthony@interlink.com.au>
It's never too late to have a happy childhood.
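The header-based lookup discussed above might look roughly like this with the standard email and imaplib modules. This is only a sketch under assumptions: the header name X-Example-MailId and the helpers stamp() and find_by_id() are made up for illustration and are not imapfilter's actual code. SEARCH HEADER itself is a standard IMAP search key.

```python
import email
import uuid
from email.message import Message

ID_HEADER = "X-Example-MailId"  # hypothetical header name


def stamp(msg: Message) -> str:
    """Return the message's own ID, adding the header if it is missing."""
    existing = msg.get(ID_HEADER)
    if existing:
        return existing
    new_id = uuid.uuid4().hex
    msg[ID_HEADER] = new_id
    return new_id


def find_by_id(conn, folder: str, msg_id: str):
    """Find messages in one folder whose ID header contains msg_id.

    conn is assumed to be a logged-in imaplib.IMAP4 (or IMAP4_SSL) object.
    The UIDs returned are only meaningful inside `folder`.
    """
    conn.select(folder, readonly=True)
    # imaplib does not quote arguments itself, so quote the value here.
    typ, data = conn.uid("SEARCH", "HEADER", ID_HEADER, '"%s"' % msg_id)
    if typ != "OK":
        raise RuntimeError("SEARCH failed: %r" % (data,))
    return (data[0] or b"").split()


# Stamping is idempotent and survives a serialise/parse round trip.
msg = email.message_from_string("Subject: hi\n\nbody\n")
first = stamp(msg)
again = stamp(msg)
reparsed = email.message_from_string(msg.as_string())
```

Since a message already stored on an IMAP server cannot be edited in place, a stamped copy would have to be saved back to the folder, which fits the plan of storing the ID whenever the message is saved. How cheap the header search is depends on the server, which is where header caching helps.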
Meyer, Tony wrote:
So I'm going to give up using the UID (in any form) as an identifier for messages, and do what all the other spambayes apps (bar Outlook) do and add my own. I'll store this with the message whenever it's saved. This will mean things will be a little slower (have to search for messages with a header with a certain value, instead of for a message with a particular uid), but slow and working is better than fast and not.
Performance is fine for me at the moment. For example, this morning 60 messages were classified in 40s. I'm intending to have imapfilter running full time in the background once everything's working, so it will just need to do 2 or 3 messages every 10 minutes. So performance isn't a problem.
Any training done with imapfilter is now suspect, so I wouldn't advise keeping hold of those db's (hammie.db etc).
Again - not a problem. The way things stand I'm having to delete/retrain/edit the DB every couple of days anyway.
BTW, the IMAP server I'm using is Cyrus v1.6.24. Could be useful to know.
Olly
participants (3)
- Anthony Baxter
- Meyer, Tony
- Oliver Maunder