On Wed, May 24, 2006 12:48, skip at pobox.com said:
>     Amedee> However, English is not my mother language and most of my
>     Amedee> correspondence is in Dutch.  As a consequence, most common
>     Amedee> English words are quite uncommon for me. The result is that
>     Amedee> common English words will score a bit above 0.5. Perhaps not
>     Amedee> much, but enough to be significant after a while.
> Thanks, I didn't realize that.  Do you have an example in your training
> database you can share with us (both message and word scores) where you
> think the English disclaimer text has tipped the scales and caused a ham
> message to later be scored as spam?  If you simply train on one or two of
> those misclassified hams does the problem go away?  How skewed is your
> training database (number of spams vs number of hams)?  Have you considered
> throwing out your current training database and starting fresh?
> One thing that might help is to further break messages which score as spam
> into "low" and "high" spam.  Based on my current settings that gives me
> these four categories:
>     ham         0.00-0.14
>     unsure      0.15-0.59
>     low spam    0.60-0.74
>     high spam   0.75-1.00
> High spam is tossed without further consideration.  Ham is sorted in the
> appropriate mailbox by procmail.  Unsure and low spam messages each wind up
> in their own mailboxes for further consideration.  I train on most unsure
> messages but only train on lospams which are actually ham.
> My suspicion is that if you have ham messages which are erroneously winding
> up as spam they are at the very low end of the spam scale.  It might be
> sufficient to move your spam threshold up a bit so they are more likely to
> land in the unsure category.
> Skip


I think you have hit the mark there.
I already use something like your lospam/hiham.
I have 5 categories: high ham, low ham, unsure, low spam, high spam.
High ham goes to procmail and high spam goes to /dev/null.
And indeed, the misclassified hams all wind up in unsure or low spam.
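
For reference, here is a rough Python sketch of the four score bands Skip
lists above. The cutoffs are the ones from his message; the function name
and the half-open treatment of the boundaries (0.15 counts as unsure, 0.60
as low spam, and so on) are my own assumptions, not anything taken from
SpamBayes itself.

```python
# Cutoffs taken from the table in Skip's message. The names below are
# hypothetical and only used for this illustration.
HAM_CUTOFF = 0.15        # scores below this are ham
LOW_SPAM_CUTOFF = 0.60   # scores from 0.15 up to here are unsure
HIGH_SPAM_CUTOFF = 0.75  # scores from 0.60 up to here are low spam


def categorize(score):
    """Map a spam score in [0.0, 1.0] to one of the four bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score < HAM_CUTOFF:
        return "ham"
    if score < LOW_SPAM_CUTOFF:
        return "unsure"
    if score < HIGH_SPAM_CUTOFF:
        return "low spam"
    return "high spam"
```

Raising HIGH_SPAM_CUTOFF or LOW_SPAM_CUTOFF in a setup like this is the
"move your spam threshold up a bit" adjustment: borderline hams then land
in a band that still gets reviewed instead of being thrown away.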


