[Tracker-discuss] Some observations about the spam filter

skip at pobox.com skip at pobox.com
Sun Aug 24 20:02:56 CEST 2008

On August 11 I wrote:

    me> I just worked my way through the current pile of SpamBayes messages.
    me> There were actually a couple spams.  (At least I'm fairly certain
    me> they were spam.  They were in French, didn't appear to have anything
    me> to do with Python and were in HTML format.)

    me> A couple things jumped out at me:

    me> 1. It looks like synthetic tokens are being generated in both
    me>    detectors/spambayes.py and extensions/spambayes.py.  They both
    me>    have somewhat different versions of an extract_classinfo()
    me>    function.  Can we get away with a single version of that
    me>    function?

    me> 2. Many messages mention a Subversion revision number.  These are
    me>    almost always different.  We should generate a synthetic token
    me>    which indicates whether or not a submission contained what looked
    me>    like a revision.  I'll check something in for that shortly once I
    me>    understand how I should deal with item #1.

    me> 3. If the body of the message was "My dog has fleas." it would be
    me>    presented to the spam filter as "content:My dog has fleas."  That
    me>    is, the first word is always prefixed by the string "content:".
    me>    I can't tell where that's getting applied, but we should get rid
    me>    of it.
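
To make item 3 concrete, here is a minimal sketch of one way that symptom
could arise.  It is purely illustrative: naive_tokens() and plain_tokens()
are hypothetical names, and nothing here is taken from the tracker or
SpamBayes code, where the real prefixing has not been located.

```python
def naive_tokens(field, text):
    # Hypothetical: glue the field name onto the text, then split on
    # whitespace.  Only the first word ends up carrying the prefix.
    return (field + ":" + text).split()


def plain_tokens(text):
    # Same split without the prefix, which is what item 3 asks for.
    return text.split()


print(naive_tokens("content", "My dog has fleas."))
print(plain_tokens("My dog has fleas."))
```

The first call yields "content:My" as its first token, so that one word
would never match the bare "My" seen in other messages.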

I've not seen a reply about this.  I realize Martin is on holiday.  Does
anyone else who has seen this note have an opinion?  I created issue 215 with
a patch for detectors/spambayes.py to add a hasrev token:
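
For anyone who wants the idea of item 2 without opening the issue, the
following is a rough sketch only and is not the patch itself.  The regular
expression, the function name revision_token() and the token spellings
"hasrev:yes" / "hasrev:no" are all assumptions made for illustration.

```python
import re

# Assumed spellings of a Subversion revision: "r65432", "rev 65432",
# "revision 65432".  Three to six digits is an arbitrary guess.
_REV_RE = re.compile(r"\b(?:r|rev(?:ision)?\s*)\d{3,6}\b", re.IGNORECASE)


def revision_token(text):
    """Return one synthetic token saying whether text mentions a revision.

    The classifier then sees the same token for every revision number,
    instead of a different, never-repeated token per message.
    """
    return "hasrev:yes" if _REV_RE.search(text) else "hasrev:no"


print(revision_token("Fixed in r65432."))
print(revision_token("My dog has fleas."))
```

The point of collapsing the number into a yes/no token is that the
revision itself carries no signal, while its presence might.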




