[Spambayes] Feature idea.

Tim Peters tim.one at comcast.net
Sun Feb 22 22:49:57 EST 2004


[John Gagon]
> I really love SpamBayes and have sold my whole company on it. It
> works great even for those ditzy blondes in reception who tend to
> try to unsubscribe to every spam they get in their box.
>
> I have a suggestion for SpamBayes. This is in regard to the
> threshold feature, i.e., you can raise or lower the score filtering
> criteria (i.e., for Spam/Unsure and Inboxes, etc.).
>
> Over time, I would suspect the messages, statistically would create
> a "camel" two hump curve. ie: two sets of distributions (I know
> there is a more technical term for that in statistics but it slips my
> mind atm).

The distribution is bimodal, but not really like camel humps.  See the third
graph at:

    http://spambayes.sourceforge.net/background.html


> Over time, the humps would grow and the minimum would shift left a little
> as more and more clever spams are eliminated to the right side of the
> distribution.
>
> I would suspect the best place to set your thresholds would be
> between the ham and spam distribution humps.

As the graph shows, "the humps" (modes) are typically at 0.00 (rounded) and
1.00 (rounded), so "between them" is certainly good advice, but also advice
impossible not to follow <wink>.

> Or have your unsure zone be so many points away from that minimum. It
> would be nice then to have a checkbox to enable automatic adjustment
> of the filtering criteria. (I.e., over time, mine has gone down from 75%
> spam scores and above to 15% and above, since I have a large hump
> after 15% and a smaller ham hump before the 15% mark.) IOW, the
> filter is getting very good and goes lower as it goes, but I'm
> having to manually do statistics and adjust the filter so as to get
> very good accuracy out of SpamBayes.

I don't see much hope for auto-adjustment:  email mixes vary wildly across
people; personal tolerances for false positive (FP), false negative (FN),
and Unsure rates vary similarly; and training strategies vary almost as
much.  The SpamBayes scoring algorithm also systematically scores perfectly
ambiguous messages at 0.50 on the nose (which accounts for the third-highest
hump, near 0.50 in the graph).  I prefer to call those Unsure.  Sounds like
you prefer to call them Spam.  I expect *most* people who don't like Unsure
want to call them Ham, to avoid turning an ambiguous message into a false
positive.
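
To make the three preferences concrete, here is a minimal sketch of
two-cutoff classification.  It is not SpamBayes source code:  the function
name classify, the parameter names, and the 0.20/0.90 default values are
assumptions chosen for illustration, and the exact boundary handling (< vs
<=) is a guess too.  It only shows how the same 0.50 score lands in a
different bucket depending on where one person puts the cutoffs.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a score in [0.0, 1.0] to 'ham', 'unsure', or 'spam'.

    Hypothetical helper for illustration; the cutoff defaults are
    assumed values, not taken from the message above.
    """
    if not 0.0 <= ham_cutoff <= spam_cutoff <= 1.0:
        raise ValueError("need 0.0 <= ham_cutoff <= spam_cutoff <= 1.0")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0.0, 1.0]")
    if score < ham_cutoff:
        return "ham"
    if score < spam_cutoff:
        return "unsure"
    return "spam"


ambiguous = 0.50  # the score a perfectly ambiguous message gets

# Wide Unsure zone: the ambiguous message stays Unsure.
print(classify(ambiguous))                                    # unsure

# Both cutoffs at 0.15 (no Unsure zone): it is called Spam.
print(classify(ambiguous, ham_cutoff=0.15, spam_cutoff=0.15))  # spam

# Both cutoffs high (no Unsure zone): it is called Ham.
print(classify(ambiguous, ham_cutoff=0.90, spam_cutoff=0.90))  # ham
```

Setting the two cutoffs equal removes the Unsure zone entirely, which is
where the choice between risking a false positive and risking a false
negative becomes purely a matter of personal tolerance.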



