[Spambayes] Many users on domain coming up as "possibly spam"

Wed Oct 20 22:33:24 CEST 2004

[Bob Coe]
>> I don't understand why this very simple and well
>> understood problem is so resistant to solution.

[Kenny Pitt]
> It was called the "experimental_ham_spam_imbalance_adjustment"
> option [...]

I believe the reason it is 'resistant' is that there isn't much testing
done with SpamBayes anymore.  Developing a new adjustment solution is
certainly possible, but on top of the time needed to understand and figure
out the maths, it needs lots of people running tests (i.e. timcv.py) and
posting their results before it is good enough to get into the source.
Unfortunately, there aren't many people running the tests anymore, and
no-one is really looking at the maths.

Another factor is that, AFAIK, SpamBayes works well as is for the
developers, so there isn't really an incentive to work on this, even among
those who might try to work on the maths (or look at other filters, to
borrow ideas).

The third factor, and the most telling for me personally, is that the
training regime appears to have a huge effect on accuracy, so some sort of
balancing factor could be built into the training regime itself.  "Train to
exhaustion" (tte) is one such regime (and is used, quite successfully I
believe, by Skip), but there could be others.  Initial testing indicates
that although 'train on mistakes' (or fpfnunsure) is good, and better
than 'train on everything', it can be beaten by other methods.
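
For anyone unfamiliar with the idea, here is a rough sketch of what a
train-to-exhaustion loop does.  It is only an illustration: ToyClassifier is
a made-up stand-in (plain smoothed naive Bayes), not the SpamBayes
classifier, train_to_exhaustion is a hypothetical helper rather than the
real tte script, and the 0.2/0.9 cutoffs and max_rounds limit are values
assumed for the example.  Each pass scores every message and trains only on
those that come out wrong or unsure, and the passes repeat until one
finishes with nothing left to train on.

```python
import math
from collections import Counter


class ToyClassifier:
    """Tiny smoothed naive-Bayes scorer; a stand-in, NOT SpamBayes itself."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.nmsgs = {"ham": 0, "spam": 0}

    def train(self, tokens, is_spam):
        label = "spam" if is_spam else "ham"
        self.nmsgs[label] += 1
        # Count each token once per message.
        self.counts[label].update(set(tokens))

    def score(self, tokens):
        """Return a spam probability in [0, 1]; 0.5 when untrained."""
        log_odds = 0.0
        for tok in set(tokens):
            # Add-one smoothing keeps unseen tokens away from 0 and 1.
            p_spam = (self.counts["spam"][tok] + 1) / (self.nmsgs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.nmsgs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


def train_to_exhaustion(messages, ham_cutoff=0.2, spam_cutoff=0.9,
                        max_rounds=10):
    """messages: list of (tokens, is_spam).  Returns (classifier, rounds)."""
    clf = ToyClassifier()
    round_no = 0
    for round_no in range(1, max_rounds + 1):
        trained = 0
        for tokens, is_spam in messages:
            s = clf.score(tokens)
            correct = s >= spam_cutoff if is_spam else s <= ham_cutoff
            if not correct:
                # A mistake or an unsure: train on it (possibly again).
                clf.train(tokens, is_spam)
                trained += 1
        if trained == 0:
            # A full pass with nothing to train on: exhausted.
            break
    return clf, round_no
```

Calling train_to_exhaustion(messages) with a list of (tokens, is_spam) pairs
returns the trained classifier and the number of passes it took.  Note that
a hard message can be trained more than once, which is where the balancing
effect comes from: whichever class the classifier is struggling with gets
more training.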

Testing these is very time-consuming, though.  I've tried several
auto-balancing methods, running them through the incremental.py testing
script, and haven't found any (other than tte) that are successful.

All of this has to happen before the changes make it into CVS, let alone the
Outlook plug-in (which is really designed for mistake-based training), or a
release - and for quite a while the focus was getting a 1.0 release done,
which was bug fixes only, not new features.

I'm sure that 1.1a1 will have something to try to address this problem (as
an experimental option, probably).  What I'm personally working on is a way
to have sb_server/the Outlook plug-in work with different training regimes,
particularly tte, as I think that offers the most help.

Anyway, that's my take on why the 'resistance'.

=Tony Meyer

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.