[Tracker-discuss] Some observations about the spam filter
brett at python.org
Sun Aug 24 20:19:41 CEST 2008
On Sun, Aug 24, 2008 at 11:02 AM, <skip at pobox.com> wrote:
> On August 11 I wrote:
> me> I just worked my way through the current pile of SpamBayes messages.
> me> There were actually a couple spams. (At least I'm fairly certain
> me> they were spam. They were in French, didn't appear to have anything
> me> to do with Python and were in HTML format.)
> me> A couple things jumped out at me:
> me> 1. It looks like synthetic tokens are being generated in both
> me> detectors/spambayes.py and extensions/spambayes.py. They both
> me> have somewhat different versions of an extract_classinfo()
> me> function. Can we get away with a single version of that
> me> function?
> me> 2. Many messages mention a Subversion revision number. These are
> me> almost always different. We should generate a synthetic token
> me> which indicates whether or not a submission contained what looked
> me> like a revision. I'll check something in for that shortly once I
> me> understand how I should deal with item #1.
> me> 3. If the body of the message was "My dog has fleas." it would be
> me> presented to the spam filter as "content:My dog has fleas." That
> me> is, the first word is always prefixed by the string "content:".
> me> I can't tell where that's getting applied, but we should get rid
> me> of it.
> I've not seen a reply about this. I realize Martin is on holiday. Has
> anyone else who has seen this note got an opinion? I created issue 215 with
> a patch for detectors/spambayes.py to add a hasrev token:
I personally don't know enough about SpamBayes or the Roundup setup to
have an opinion. But basically it all sounds fine with me as long as
the spammers don't realize what we are doing.
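
For readers following the thread, item 2 above (one synthetic token that
records whether a submission mentions a Subversion revision) could be
sketched roughly as follows. This is only an illustration, not the patch
attached to issue 215: the function name revision_tokens, the regular
expression, and the "hasrev:yes" / "hasrev:no" spellings are all
assumptions made here.

```python
import re

# Hypothetical pattern (an assumption, not taken from the tracker code):
# matches "r65432", "rev 65432" or "revision 65432", with 3 to 6 digits.
_REV_RE = re.compile(r"\b(?:r|rev(?:ision)?\s*)(\d{3,6})\b", re.IGNORECASE)


def revision_tokens(text):
    """Return one synthetic token saying whether text mentions a revision.

    The revision number itself is deliberately thrown away: since it is
    almost always different, it carries no signal for the classifier,
    while the mere presence of one might.
    """
    if _REV_RE.search(text):
        return ["hasrev:yes"]
    return ["hasrev:no"]


if __name__ == "__main__":
    print(revision_tokens("Fixed in r65432."))
    print(revision_tokens("My dog has fleas."))
```

Emitting a token in both cases lets the classifier learn from the
absence of a revision as well as from its presence.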