[spambayes-dev] subjective assessment of bigrams

Skip Montanaro skip at pobox.com
Wed Jan 7 08:05:12 EST 2004

    Toby> I've been using bigrams since 2003-12-18, and thought you may be
    Toby> interested in some subjective feedback. I am using my
    Toby> overnight-train-on-everything regime, with 14000 hams and 2000
    Toby> spams.

Wow!  Any chance you could whack off the oldest 12,000 or so hams to bring
your ham:spam ratio back into balance?
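
A rough sketch of that kind of trimming, in case it helps.  It assumes each
message is available as a (timestamp, message id) pair; the names below are
made up for illustration and are not part of spambayes.  You would then
retrain from scratch on whatever is left.

```python
def trim_oldest(messages, keep):
    """Return the `keep` newest messages.

    `messages` is a list of (timestamp, msg_id) pairs.  This layout is an
    assumption made for the sketch, not a spambayes data structure.
    """
    newest_first = sorted(messages, key=lambda m: m[0], reverse=True)
    return newest_first[:keep]


# Hypothetical corpus: 14000 hams, one per "day", oldest first.
hams = [(day, "ham-%d" % day) for day in range(14000)]

# Drop the oldest 12000 so the ham count matches the 2000 spams.
kept = trim_oldest(hams, keep=2000)
```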

    Toby> * My database size grew from 10M to 80M. Overnight training runs
    Toby>   extended from 5 minutes to 20 minutes

This isn't surprising given the number of messages in your database.
Bigrams *will* bloat your database.  I think that to use them effectively,
you should probably run with a fairly small training database.  I have a bit
over 500 each of ham and spam at this point (I've been experimenting with
some automatic training, so my database grew considerably until I figured
some things out) and currently have a DBDictClassifier database of 10.6MB.
The pickle file grows roughly linearly with the number of keys in
the dictionary, while the DBDictClassifier file grows in marked jumps,
roughly doubling when it needs to resize, then remaining nearly constant in
size until a fairly large number of new keys are added.
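
To see why the key count climbs so quickly, here is a toy sketch.  It is not
the spambayes tokenizer: the whitespace split, the function name and the
"bi:" prefix are all assumptions made for the example.  The same four words
in three different orders yield four unigram keys but many more keys once
adjacent pairs are added.

```python
def with_bigrams(words):
    """Return the unigram tokens plus one token per adjacent word pair.

    The "bi:" prefix is only a label chosen for this sketch.
    """
    tokens = list(words)
    tokens.extend("bi:%s %s" % (a, b) for a, b in zip(words, words[1:]))
    return tokens


# Three toy messages using the same four words in different orders.
messages = [
    "buy cheap pills now".split(),
    "cheap pills buy now".split(),
    "now buy pills cheap".split(),
]

uni_keys = set()   # keys a unigram-only database would hold
bi_keys = set()    # keys with bigrams enabled
for words in messages:
    uni_keys.update(words)
    bi_keys.update(with_bigrams(words))

print(len(uni_keys), len(bi_keys))
```

Word order does not matter to the unigram set, but every new ordering can
contribute new pairs, so the bigram key set keeps growing as more messages
are trained.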

    Toby> * A much larger proportion of spams now score 0.99 or over (I
    Toby>   filter these into a folder that I never normally look
    Toby>   at). Spams that score 0.98 or lower I filter into a 'probable
    Toby>   spam' folder and check manually every week; I am seeing a much
    Toby>   smaller proportion of messages in this category.

I have been using bigrams for a while as well and find a lot more spam winds
up with a 0.99/1.00 score (which after a few days of checking I reroute to
/dev/null).  I've been lazy the past couple days though, and haven't paid
any attention to my unsures or probable spam files.  (I have enough other
"good" mail to read to keep me busy, thank-you-very-much.)

    Toby> * I have seen a qualitative change in the type of spam that gets
    Toby>   classified as unsure. Most of my unsures used to be very small
    Toby>   messages, spams selling something I might otherwise be
    Toby>   interested in, or other ones where 'unsure' made sense. It had
    Toby>   never missed a nigerian or porn spam for many months....  until
    Toby>   I enabled bigrams. With bigrams, a few have scored between 0.50
    Toby>   and 0.55. I tried untraining some of them, then reclassifying
    Toby>   with bigrams turned off; they all scored above 0.90.

This hasn't been a problem for me, but it's not entirely surprising.  The
Nigerian spams tend to be a lot more chatty than most sales pitches.  I
think they would tend to have many bigrams that turn up in regular text,
but not in the standard late-night-auto-sales-commercial type of text most
spam uses.
