[spambayes-dev] Trigraphs as indicators of invalid subject words
Skip Montanaro
skip at pobox.com
Fri Jun 6 15:45:58 EDT 2003
[ ... on using trigraphs as clues to identify bogus words in message
subjects ... ]
>> Now you could turn things around and say the subject contained an
>> invalid word. That might be a useful clue for Spambayes.
Scott> That was my idea. Find a way to use the non-wordness to
Scott> penalize, rather than favor a message.
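For anyone following along, here is a rough sketch of what that idea might look like. It is not the actual Spambayes tokenizer: the helper names (trigraphs, build_known_trigraphs, subject_tokens) and the tiny word list are made up for illustration, and it assumes a word counts as "invalid" when it contains a three-letter sequence never seen in a reference word list.

```python
def trigraphs(word):
    """Return the set of three-letter substrings of a lowercased word."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}


def build_known_trigraphs(wordlist):
    """Collect every trigraph that occurs in a reference word list."""
    known = set()
    for word in wordlist:
        known |= trigraphs(word)
    return known


def subject_tokens(subject, known):
    """Yield one token per subject word, plus a single synthetic
    "subject:invalid word" token if any word has an unknown trigraph.

    Hypothetical helper, not part of Spambayes.
    """
    saw_invalid = False
    for word in subject.split():
        yield "subject:" + word
        # Only letters are checked, so punctuation does not trigger the clue.
        letters = "".join(ch for ch in word if ch.isalpha())
        if len(letters) >= 3 and not trigraphs(letters) <= known:
            saw_invalid = True
    if saw_invalid:
        yield "subject:invalid word"


# Toy reference list; a real one would be a full dictionary.
known = build_known_trigraphs(["free", "money", "offer", "now"])
tokens = list(subject_tokens("Free money xqzv", known))
```

The synthetic token is emitted at most once per subject, so the classifier sees it as a single clue whose spam probability is learned from training like any other token.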
I tried it and found it had essentially no effect. That doesn't mean it
isn't a good idea. It's just that Spambayes is already so good that there
isn't much room for improvement. I just ran a 10x10 cross validation test
using 500 spams and 500 hams in each test set. Each run trained on nine
sets each of hams and spams (4500 messages of each), tested against the
remaining set of each, then repeated with a different set held out as the
test set. Over all runs it scored 16 of 5000 hams incorrectly (false
positives - 0.32%), scored 40 of 5000 spams incorrectly (false negatives -
0.80%) and was unsure about 573 of 10000 messages (5.73%). When I added in
Scott's idea, implemented as a synthetic
"subject:invalid word" token, the false positives and false negatives didn't
change. The unsures crept up to 574.
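To make the arithmetic concrete, here is a small sketch of the fold rotation and the rates quoted above. The names (rotate_folds, rate, NUM_FOLDS, PER_FOLD) are mine for illustration and are not taken from the Spambayes test drivers; the counts are the ones reported in this message.

```python
NUM_FOLDS = 10   # ten sets of hams and ten sets of spams
PER_FOLD = 500   # messages per set, for hams and for spams


def rotate_folds(num_folds):
    """Yield (train_folds, test_fold) pairs, holding out each fold once."""
    for test_fold in range(num_folds):
        train_folds = [f for f in range(num_folds) if f != test_fold]
        yield train_folds, test_fold


def rate(count, total):
    """Express count as a percentage of total."""
    return 100.0 * count / total


runs = list(rotate_folds(NUM_FOLDS))
train_per_class = len(runs[0][0]) * PER_FOLD   # 9 * 500 messages per class

hams_tested = NUM_FOLDS * PER_FOLD             # every ham is tested once
spams_tested = NUM_FOLDS * PER_FOLD            # every spam is tested once

fp_rate = rate(16, hams_tested)                        # false positives
fn_rate = rate(40, spams_tested)                       # false negatives
unsure_before = rate(573, hams_tested + spams_tested)  # without the token
unsure_after = rate(574, hams_tested + spams_tested)   # with the token

print("fp %.2f%%  fn %.2f%%  unsure %.2f%% -> %.2f%%"
      % (fp_rate, fn_rate, unsure_before, unsure_after))
```

One extra unsure out of 10000 messages is well within what you would expect from noise, which is why I call the effect essentially nil.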
This was run on a new training database (12700+ hams and 8600+ spams) which
I haven't exhaustively combed for errors, so it's possible there are still
some mistakes of mine in there (placing a ham message in the spam training
set for example), but it is essentially the same data which I use to train
Spambayes and classify messages on a daily basis, so I think it's fairly
clean.
Skip