[spambayes-dev] Tricky false positive: US states
Richie Hindle
richie at entrian.com
Fri Oct 3 05:12:01 EDT 2003
Here's an interesting false positive: I asked an American colleague a
question about US state codes, and he emailed me a copy of this page from
the US Post Office website:
http://www.usps.com/ncsc/lookups/usps_abbreviations.html
Now that scored as pretty solid spam for me (0.99075) because all the
state names are slight spam clues - most of my spam comes from the USA.
Here's a snippet of the X-Spambayes-Evidence header:
'lock': 0.73; 'louisiana': 0.73; 'marshall': 0.73; 'missouri': 0.73;
'mount': 0.73; 'nebraska': 0.73; 'ohio': 0.73; 'parkway': 0.73;
'pennsylvania': 0.73; 'plz': 0.73; 'rad': 0.73; 'square': 0.73;
'tennessee': 0.73; 'texas': 0.73; 'trl': 0.73; 'valley': 0.73;
and so on. All those fifty slightly spammy state names add up to a big
spam score.
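To see how a pile of weak clues turns into a near-certain verdict, here is
a minimal sketch of chi-squared combining, which is the scheme I believe the
classifier uses by default. The function names (chi2q, combine) and the
fifty identical 0.73 clues are illustrative assumptions, not the real
classifier code or the real clue list for this message.

```python
import math

def chi2q(x2, v):
    """Upper-tail probability of a chi-squared value x2 with v degrees
    of freedom. Only valid for even v (hypothetical helper name)."""
    assert v > 0 and v % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0, so clamp it.
    return min(total, 1.0)

def combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(spam_stat, 2 * n)  # evidence for spam
    h = 1.0 - chi2q(ham_stat, 2 * n)   # evidence for ham
    return (s - h + 1.0) / 2.0

# Assumption: fifty clues, all at 0.73, standing in for the state names.
one_clue = combine([0.73])
five_clues = combine([0.73] * 5)
fifty_clues = combine([0.73] * 50)
print(one_clue, five_clues, fifty_clues)
```

One clue at 0.73 just scores 0.73, but the score climbs as more clues
lean the same way, and with fifty of them it ends up well above 0.9.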
Most of them are hapaxes, but that's not very relevant - it's just a
result of not having a very big training set (~600 messages).
Not sure whether there's anything we can do about it (or even whether we
should consider doing anything about it) but I thought it was interesting.
[ Ah, no, hang on, I *do* have an idea, but it's mostly outside the remit
of Spambayes. Mail that never went outside my organisation shouldn't be
marked as spam. All the Received headers show the mail moving within my
organisation. So I want some kind of plug-in system whereby I can use
the Spambayes tokeniser, header analysis and so on to make my own
decisions that override the classifier. Once my army of winged monkeys
has finished their Python training course I'll get them onto it. ]
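Until the monkeys graduate, here is a rough sketch of the kind of override
I mean. Everything in it is an assumption for illustration: the
".example.org" suffix stands in for my organisation's domain, the function
names are made up, no such plug-in hook exists in Spambayes today, and the
0.2/0.9 cutoffs are just what I take the defaults to be. It only uses the
standard email and re modules. Note that Received headers are easy to
forge, so a real version would need to trust only the hops added by our own
servers.

```python
import email
import re

# Hypothetical: the domain suffix that marks a host as internal.
INTERNAL_SUFFIX = ".example.org"

# Pull the host names that follow "from" and "by" in a Received header.
_HOST_RE = re.compile(r"\b(?:from|by)\s+([A-Za-z0-9.-]+)", re.IGNORECASE)

def is_internal_only(msg, suffix=INTERNAL_SUFFIX):
    """True if every Received header names only internal hosts."""
    received = msg.get_all("Received", [])
    if not received:
        return False  # no routing information, so make no claim
    for header in received:
        hosts = _HOST_RE.findall(header)
        if not hosts:
            return False  # unparseable hop, so play safe
        for host in hosts:
            if not host.lower().rstrip(".").endswith(suffix):
                return False
    return True

def classify_with_override(msg, score, ham_cutoff=0.2, spam_cutoff=0.9):
    """Apply the internal-mail rule first, then the usual cutoffs."""
    if is_internal_only(msg):
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"

raw = (
    "Received: from mail1.example.org by mail2.example.org;"
    " Fri, 3 Oct 2003 05:00:00 -0400\n"
    "Subject: US state codes\n"
    "\n"
    "Alabama AL, Alaska AK, ...\n"
)
msg = email.message_from_string(raw)
print(classify_with_override(msg, 0.99075))
```

With a rule like this, the state-codes mail would be filed as ham despite
its score, because it never left the building.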
--
Richie Hindle
richie at entrian.com