[Tracker-discuss] Some observations about the spam filter

skip at pobox.com
Tue Aug 12 02:42:43 CEST 2008


I just worked my way through the current pile of SpamBayes messages.  There
were actually a couple of spam messages.  (At least I'm fairly certain they
were spam.  They were in French, didn't appear to have anything to do with
Python, and were in HTML format.)

A few things jumped out at me:

    1. It looks like synthetic tokens are being generated in both
       detectors/spambayes.py and extensions/spambayes.py.  Each has a
       somewhat different version of an extract_classinfo() function.  Can
       we get away with a single version of that function?

    2. Many messages mention a Subversion revision number.  The numbers are
       almost always different, so they are of little use as tokens
       themselves.  We should generate a synthetic token which indicates
       whether or not a submission contained what looked like a revision.
       I'll check something in for that shortly, once I understand how I
       should deal with item #1.

    3. If the body of a message were "My dog has fleas.", it would be
       presented to the spam filter as "content:My dog has fleas."  That is,
       the first word is always prefixed with the string "content:".  I
       can't tell where that's getting applied, but we should get rid of it.
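
For item 2, a rough sketch of what such a synthetic token might look like
follows.  The regular expression, the function name revision_token() and the
token strings are all assumptions made up for illustration; none of this is
taken from the tracker code, and the real pattern would need tuning against
actual submissions.

```python
import re

# Hypothetical pattern: matches things like "r65642", "rev 65642",
# "rev. 65642" or "revision 65642".  The 3-6 digit range is a guess.
_REVISION_RE = re.compile(
    r"\b(?:r|rev(?:ision)?\.?\s*)(\d{3,6})\b", re.IGNORECASE
)


def revision_token(text):
    """Return one synthetic token saying whether text appears to
    mention a revision number.  The number itself is discarded."""
    if _REVISION_RE.search(text):
        return "has_revision:yes"
    return "has_revision:no"


if __name__ == "__main__":
    print(revision_token("Fixed in r65642."))
    print(revision_token("My dog has fleas."))
```

The point is that the classifier sees the same token for every message that
mentions a revision, instead of a different, never-repeated number each time.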

Skip

