[Tracker-discuss] On spammish submissions

skip at pobox.com skip at pobox.com
Wed Mar 21 04:47:39 CET 2007

Okay, now that Brett straightened me out on access to the tracker I think I
might actually be able to use bugs.python.org.  (It looks a lot different
after successfully logging in.)

On the topic of suppressing spammy form submissions, I implemented a simple
scheme using SpamBayes for the Mojam and Musi-Cal sites.  The first thing
that happens for any submission is that the
contents are converted to a stream of tokens (generally the "words" of the
submission), with synthetic tokens added which indicate something about the
quality of the submission.  For example, in my case the spammers were
hitting the concert submission form.  Whether or not I could find lat/long
coordinates for the "city" was a pretty good differentiator of valid and
spammy submissions.  Also, the presence of URLs in most spammy submissions
allowed SpamBayes to pick them apart and generate various useful tokens.
Once tokenized, the submission was scored by SpamBayes on a scale from 0.0
(definitely valid) to 1.0 (definitely spam).  If it fell under my predefined
ham threshold (0.15 worked fine) I accepted it.  If not, I mailed the
original input along to myself for later review.  If it was spam but had
scored less than the spam threshold (I chose 0.60) I would add it to the
spam data for later retraining.  If it scored above that threshold, I just
tossed it out.  Similarly, if something scored above 0.15 but was actually
okay (often with an input error) I would correct any errors, toss the input
into the ham data, retrain and resubmit it.
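
To make the flow above a bit more concrete, here is a rough sketch of the
tokenizing and thresholding steps.  It is not the Mojam code and it does not
call SpamBayes.  The names tokenize, triage, city_has_coords and the
"url:" / "synthetic:geo:" token prefixes are made up for illustration, and
how scores exactly equal to a threshold are treated is my assumption.  Only
the 0.15 and 0.60 cutoffs come from the description above.

```python
import re

HAM_CUTOFF = 0.15   # below this: accept the submission
SPAM_CUTOFF = 0.60  # at or above this: confident spam (boundary is assumed)

URL_RE = re.compile(r"https?://([^/\s]+)", re.IGNORECASE)
WORD_RE = re.compile(r"[a-z0-9']+")


def tokenize(fields, city_has_coords):
    """Turn a form submission into a list of tokens.

    fields          -- dict mapping form field names to submitted text
    city_has_coords -- True if a lat/long lookup for the city succeeded
                       (the lookup itself is outside this sketch)
    """
    tokens = []
    for value in fields.values():
        text = value.lower()
        # One token per URL host, so hosts seen in spam can be learned.
        for host in URL_RE.findall(text):
            tokens.append("url:" + host)
        # Plain "word" tokens.
        tokens.extend(WORD_RE.findall(text))
    # Synthetic token describing the quality of the submission.
    tokens.append("synthetic:geo:" + ("found" if city_has_coords else "missing"))
    return tokens


def triage(score):
    """Map a score in [0.0, 1.0] to 'ham', 'unsure' or 'spam'.

    Anything other than 'ham' gets mailed to the reviewer.  'unsure'
    items are the ones worth adding to the training data after review.
    """
    if score < HAM_CUTOFF:
        return "ham"
    if score < SPAM_CUTOFF:
        return "unsure"
    return "spam"
```

In a real setup the score passed to triage would come from the classifier
after it has seen the output of tokenize.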

The same basic approach would work here.  Admins would get emails for
anything above the "okay" threshold and could decide what, if anything,
needed doing.  In the Admin menu you might have "Edit Spam", "Edit Ham" and
"Retrain" items.  If necessary you'd add the questionable input to the spam
or ham data and poke the retrain button.
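
As a rough illustration of what might sit behind those menu items, here is a
small sketch of a training store with "add to ham/spam" and "retrain"
operations.  Everything in it is hypothetical: the TrainingStore class and
its methods are invented for this example, and the scorer is a bare-bones
naive Bayes stand-in, not SpamBayes, which combines token probabilities in
its own way.

```python
import math
from collections import Counter


class TrainingStore:
    """Holds reviewed ham/spam token lists and a toy scorer (illustrative)."""

    def __init__(self):
        self.ham = []    # token lists an admin marked as valid
        self.spam = []   # token lists an admin marked as spam
        self._ham_counts = Counter()
        self._spam_counts = Counter()
        self._nham = 0
        self._nspam = 0

    def add(self, tokens, is_spam):
        """The 'Edit Spam' / 'Edit Ham' step: file a reviewed submission."""
        (self.spam if is_spam else self.ham).append(list(tokens))

    def retrain(self):
        """The 'Retrain' step: rebuild token counts from the stored data."""
        # Each token is counted at most once per submission.
        self._ham_counts = Counter(t for msg in self.ham for t in set(msg))
        self._spam_counts = Counter(t for msg in self.spam for t in set(msg))
        self._nham = len(self.ham)
        self._nspam = len(self.spam)

    def score(self, tokens):
        """Return a value from 0.0 (looks valid) to 1.0 (looks like spam)."""
        log_odds = 0.0
        for t in set(tokens):
            # Add-one smoothing so unseen tokens stay neutral.
            p_spam = (self._spam_counts[t] + 1) / (self._nspam + 2)
            p_ham = (self._ham_counts[t] + 1) / (self._nham + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp to keep math.exp from overflowing on extreme inputs.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


store = TrainingStore()
store.add(["pills", "url:cheap-pills.example", "synthetic:geo:missing"], True)
store.add(["concert", "springfield", "synthetic:geo:found"], False)
store.retrain()
```

Until retrain is called the scorer knows nothing and rates every input as
0.5, so newly filed examples only take effect once the button is poked.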

If there's interest in the technique I can work up some more concrete code
which can be integrated into the tracker.


