[spambayes-dev] subjective assessment of bigrams

Toby Dickenson tdickenson at devmail.geminidataloggers.co.uk
Wed Jan 7 06:15:16 EST 2004


I've been using bigrams since 2003-12-18, and thought you might be interested
in some subjective feedback. I am using my overnight-train-on-everything
regime, with 14000 hams and 2000 spams.

* My database size grew from 10M to 80M. Overnight training runs extended from 
5 minutes to 20 minutes.

* A much larger proportion of spams now score 0.99 or over (I filter these 
into a folder that I never normally look at). Spams that score 0.98 or lower 
I filter into a 'probable spam' folder and check manually every week; I am 
seeing a much smaller proportion of messages in this category.

* I have seen a qualitative change in the type of spam that gets classified as 
unsure. Most of my unsures used to be very small messages, spams selling 
something I might otherwise be interested in, or other ones where 'unsure' 
made sense. The filter had not missed a Nigerian or porn spam for many 
months... until I enabled bigrams. With bigrams, a few have scored between 
0.50 and 0.55. I tried untraining some of them, then reclassifying with 
bigrams turned off; they all scored above 0.90.
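
The database growth in the first point is roughly what one would expect when
every adjacent pair of words becomes a token of its own. The following is only
an illustrative sketch, not the SpamBayes code: the function name with_bigrams
and the "bi:" prefix are assumptions made for this example.

```python
def with_bigrams(tokens):
    """Yield each token, plus one pair token for every adjacent pair.

    Hypothetical helper for illustration; the "bi:" prefix is an assumption.
    """
    previous = None
    for token in tokens:
        yield token
        if previous is not None:
            # Pair token built from the previous word and the current one.
            yield "bi:%s %s" % (previous, token)
        previous = token


words = "cheap meds online cheap meds now".split()
unigrams = set(words)
enhanced = set(with_bigrams(words))

# Distinct tokens that would need a database entry in each mode.
print(len(unigrams), len(enhanced))
```

Even in this tiny example the number of distinct tokens doubles; over a large
corpus the number of distinct word pairs can grow much faster than the number
of distinct words.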

I am happy to experiment if anyone has any suggestions.

-- 
Toby Dickenson



