[spambayes-dev] Tricky false positive: US states

Richie Hindle richie at entrian.com
Fri Oct 3 05:12:01 EDT 2003


Here's an interesting false positive: I asked an American colleague a
question about US state codes, and he emailed me a copy of this page from
the US Post Office website:

  http://www.usps.com/ncsc/lookups/usps_abbreviations.html

Now that scored as pretty solid spam for me (0.99075) because all the
state names are slight spam clues - most of my spam comes from the USA.
Here's a snippet of the X-Spambayes-Evidence header:

  'lock': 0.73; 'louisiana': 0.73; 'marshall': 0.73; 'missouri': 0.73;
  'mount': 0.73; 'nebraska': 0.73; 'ohio': 0.73; 'parkway': 0.73;
  'pennsylvania': 0.73; 'plz': 0.73; 'rad': 0.73; 'square': 0.73;
  'tennessee': 0.73; 'texas': 0.73; 'trl': 0.73; 'valley': 0.73;

and so on.  All those fifty slightly-spammy state names add up to a big
spam score.
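
To illustrate how many weak clues can swamp the score, here is a rough
sketch of chi-squared (Fisher-style) combining, the kind of scheme
Spambayes uses to merge per-token probabilities.  It is simplified (no
underflow protection), the names chi2q and combine are made up for this
example, and it is fed only 0.73 clues, so it won't reproduce the
0.99075 above exactly.

```python
import math


def chi2q(x2, v):
    """Survival function of chi-squared with an even number v of degrees
    of freedom: the probability that chi2(v) >= x2."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)


def combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1].

    Illustrative only: the real classifier also selects which clues to
    use and guards against floating-point underflow.
    """
    n = len(probs)
    if n == 0:
        return 0.5
    # S is near 1 when the clues collectively look spammy,
    # H is near 1 when they collectively look hammy.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


one_clue = combine([0.73])          # a single mild clue stays mild
fifty_clues = combine([0.73] * 50)  # fifty of them pile up
```

A single 0.73 clue combines to 0.73, but fifty of them together push
the score well up towards 1.0, which is the effect seen with the state
names.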

Most of them are hapaxes, but that's not very relevant - it's just a
result of not having a very big training set (~600 messages).

Not sure whether there's anything we can do about it (or even whether we
should consider doing anything about it) but I thought it was interesting.

[ Ah, no, hang on, I *do* have an idea, but it's mostly outside the remit
  of Spambayes.  Mail that never went outside my organisation shouldn't be
  marked as spam.  All the Received headers show the mail moving within my
  organisation.  So I want some kind of plug-in system whereby I can use
  the Spambayes tokeniser, header analysis and so on to make my own
  decisions that override the classifier.  Once my army of winged monkeys
  has finished their Python training course I'll get them onto it. ]
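
A minimal sketch of what such an override might look like, using only
the standard library's email package.  Everything here is an assumption
for illustration: the ".example.com" suffix stands in for the
organisation's own domain, and is_internal_only and final_score are
hypothetical names, not part of Spambayes.  Received headers are
free-form and easy to forge, so a real version would trust only the
headers added by its own servers.

```python
import email
import re

# Hypothetical: replace with the organisation's own domain suffix.
INTERNAL_SUFFIX = ".example.com"

# Pull the hostnames that follow "from" and "by" in a Received header.
_HOST_RE = re.compile(r"\b(?:from|by)\s+([A-Za-z0-9.-]+)")


def is_internal_only(raw_message, suffix=INTERNAL_SUFFIX):
    """Return True if every Received header names only internal hosts."""
    msg = email.message_from_string(raw_message)
    received = msg.get_all("Received") or []
    if not received:
        # No trace information at all: don't assume anything.
        return False
    for header in received:
        hosts = _HOST_RE.findall(header)
        if not hosts:
            return False
        for host in hosts:
            if not host.lower().rstrip(".").endswith(suffix):
                return False
    return True


def final_score(raw_message, classifier_score):
    """Override the classifier for mail that never left the organisation."""
    if is_internal_only(raw_message):
        return 0.0
    return classifier_score
```

The point of the design is that the classifier still runs and produces
its score as usual; the site-specific rule sits on top and only
overrides it when the header evidence is unambiguous.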

-- 
Richie Hindle
richie at entrian.com



