[Spambayes] question regarding training
Coe, Bob
rcoe at CambridgeMA.GOV
Wed Aug 11 16:06:42 CEST 2004
> From: spambayes-bounces at python.org
> [mailto:spambayes-bounces at python.org]On Behalf Of Tony Meyer
> Sent: Tuesday, August 10, 2004 2:32 AM
> To: 'Missy'; spambayes at python.org
> Subject: RE: [Spambayes] question regarding training
>
>
> > I have noticed that on my Spambayes manager, it
> > has way more spam than ham. It also states that it
> > works best when there are equal amounts of both.
> > What can I do to make it work more efficiently?
>
> This is getting to be a FAQ!
My sense is that when users have an imbalance problem, it is overwhelmingly the one this user has, i.e. more spam than ham. A couple of the things I'm about to say depend on that assumption, so I want to state it up front.
> Firstly, if you are not already, then doing "train on mistakes" is a good
> idea. Basically, the only training you do is on mail that ends up in the
> 'unsure' folder, and any false positives (good mail in spam folder) and
> false negatives (vice versa), if there are any. This should reduce the
> imbalance, and make it grow less quickly.
I don't see why. The expectation should be that users will tune their cutoff values so that most of what goes into the unsure folder is spam. If a user then processes every unsure message into the database, this will increase, not decrease, the imbalance.
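To put rough numbers on that, here is a small Python sketch. All the counts are invented for illustration, and the helper names are hypothetical (nothing here comes from Spambayes itself). It assumes a database at 2:1 spam to ham and an unsure folder that is 80% spam.

```python
def spam_to_ham_ratio(nspam, nham):
    """Spam:ham ratio of the training database."""
    return nspam / nham


def train_on_unsure(nspam, nham, n_unsure, spam_fraction):
    """Return the new (nspam, nham) after training on every unsure message.

    spam_fraction is the assumed share of the unsure folder that is spam.
    """
    spam_added = round(n_unsure * spam_fraction)
    ham_added = n_unsure - spam_added
    return nspam + spam_added, nham + ham_added


# Invented starting point: 200 spam, 100 ham (2:1).
before = (200, 100)
# Invented unsure folder: 50 messages, 80% of them spam (4:1).
after = train_on_unsure(*before, n_unsure=50, spam_fraction=0.8)

print(before, spam_to_ham_ratio(*before))
print(after, spam_to_ham_ratio(*after))
```

With these made-up figures the ratio goes up rather than down. That happens because the unsure folder's mix (4:1) is spammier than the database's (2:1), so the outcome depends on where the cutoffs are set.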
> If you get a lot of mail in the 'unsure' folder, you can adjust the
> thresholds (Filtering tab), to try and reduce it.
>
> If you get multiple copies of a spam message, don't "Delete as spam" all of
> them, just one, and move the rest to the spam folder (or Deleted Items)
> manually.
Depending (possibly) on your settings, moving messages to the spam folder, even manually, will process them into the database. Right?
> Don't worry too much about the imbalance as long as things are working well
> enough. Particularly if it's a small imbalance (like 3:1) rather than a
> large one (like 100:1).
>
> (Longer term, the developers are trying to figure out ways to help people
> with this problem, but that's a way off yet).
I'm gonna climb on my soapbox here, even while admitting that I don't know the first thing about Spambayes's actual implementation.
To me, the solution to the problem seems obvious and almost absurdly easy to implement: When the imbalance reaches a certain level (determined by the Spambayes gurus), have the program start training on every nth message it classifies as ham. Do this until the desired balance is restored. Yes, there's a bit of a feedback loop here, in that Spambayes is merely validating its own conclusions. But the users' passivity (or lack thereof) serves as a check on the process. In other words, if Spambayes incorporates a message it misclassified as ham, the user will reclassify it as spam, which will reverse that message's effect on the database.
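A rough Python sketch of the idea, just to make it concrete. This is not how Spambayes is implemented: the class, its method names, the 4:1 threshold, and the "every 3rd message" interval are all assumptions made up for the example, and the counters stand in for real training and untraining.

```python
from dataclasses import dataclass, field


@dataclass
class BalancingTrainer:
    """Hypothetical auto-balancer; not part of the Spambayes API."""
    nspam: int = 0
    nham: int = 0
    max_ratio: float = 4.0   # assumed imbalance threshold (spam:ham)
    every_nth: int = 3       # assumed sampling interval
    auto_trained: set = field(default_factory=set)
    _ham_seen: int = field(default=0, repr=False)

    def imbalanced(self) -> bool:
        """True while spam outnumbers ham by more than max_ratio."""
        return self.nspam > self.max_ratio * max(self.nham, 1)

    def on_classified_ham(self, msg_id) -> bool:
        """Call for each message the filter scores as ham.

        Returns True if the message was auto-trained as ham.
        """
        if not self.imbalanced():
            self._ham_seen = 0          # balance restored: stop sampling
            return False
        self._ham_seen += 1
        if self._ham_seen % self.every_nth != 0:
            return False                # not the nth message: skip it
        self.nham += 1                  # stands in for real ham training
        self.auto_trained.add(msg_id)
        return True

    def on_user_reclassified_as_spam(self, msg_id) -> None:
        """The user's check: undo a bad auto-train, then train as spam."""
        if msg_id in self.auto_trained:
            self.auto_trained.remove(msg_id)
            self.nham -= 1              # stands in for untraining as ham
        self.nspam += 1


trainer = BalancingTrainer(nspam=100, nham=10)
trained = sum(trainer.on_classified_ham(i) for i in range(100))
print(trained, trainer.nham, trainer.imbalanced())
```

Auto-training stops by itself once the ratio drops back to the threshold, and a message the user later moves to spam has its ham training reversed before it is counted as spam.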
I don't know how big the problem really is. My database is just over 2 to 1 spam, well within Tony's definition of "small", and I have no classification problems worth mentioning. But for those cases where the imbalance has become large, I suggest that my idea may be worth trying.
Bob
MIS Department, City of Cambridge
831 Massachusetts Ave, Cambridge MA 02139 · 617-349-4217 · fax 617-349-6165