[Tracker-discuss] On spammish submissions

Wed Mar 21 07:56:05 CET 2007

On 3/20/07, skip at pobox.com <skip at pobox.com> wrote:
> Okay, now that Brett straightened me out on access to the tracker I think I
> might actually be able to use bugs.python.org.  (It looks a lot different
> after successfully logging in.)
>
> On the topic of suppressing spammy form submissions, I implemented a simple
> scheme using SpamBayes for the Mojam and Musi-Cal sites.  The basic idea is
> pretty simple.  The first thing that happens for any submission is that the
> contents are converted to a stream of tokens (generally the "words" of the
> submission), with synthetic tokens added which indicate something about the
> quality of the submission.  For example, in my case the spammers were
> hitting the concert submission form.  Whether or not I could find lat/long
> coordinates for the "city" was a pretty good differentiator of valid and
> spammy submissions.  Also, the presence of URLs in most spammy submissions
> allowed SpamBayes to pick them apart and generate various useful tokens.
> Once tokenized, the submission was scored by SpamBayes on a scale from 0.0
> (definitely valid) to 1.0 (definitely spam).  If it fell under my predefined
> ham threshold (0.15 worked fine) I accepted it.  If not, I mailed the
> original input along to myself for later review.  If it turned out to be
> spam but scored less than the spam threshold (I chose 0.60) I would add it
> to the spam data for later retraining.  If it scored above that, I just
> tossed it out.  Similarly, if something scored
> above 0.15 but was actually okay (often with an input error) I would correct
> any errors, toss the input into the ham data, retrain and resubmit it.
>
> The same basic approach would work here.  Admins would get emails for
> anything above the "okay" threshold and could decide what, if anything,
> needed doing.  In the Admin menu you might have "Edit Spam", "Edit Ham" and
> "Retrain" items.  If necessary you'd add the questionable input to the spam
> or ham data and poke the retrain button.
>
> If there's interest in the technique I can work up some more concrete code
> which can be integrated into the tracker.
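
A minimal sketch of the scheme described in the quoted message, for
illustration only.  It does not use the SpamBayes API: the scorer is
passed in as a plain callable, and the names tokenize, triage, handle and
the "synthetic:" token prefix are hypothetical.  Only the two thresholds
(0.15 and 0.60) and the two quality signals (URLs, lat/long lookup) come
from the message.  Treating a score of exactly 0.60 as spam is an
assumption.

```python
import re

HAM_CUTOFF = 0.15   # threshold quoted in the message
SPAM_CUTOFF = 0.60  # threshold quoted in the message

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def tokenize(text, city_has_coordinates):
    """Yield the words of a submission plus synthetic quality tokens."""
    urls = URL_RE.findall(text)
    # Plain word tokens, taken from the text with URLs blanked out.
    for word in WORD_RE.findall(URL_RE.sub(" ", text)):
        yield word.lower()
    # Synthetic tokens (hypothetical names): URL count, capped at 3,
    # and whether a lat/long lookup for the "city" succeeded.
    yield "synthetic:url-count:%d" % min(len(urls), 3)
    yield "synthetic:geo:" + ("found" if city_has_coordinates else "missing")


def triage(score):
    """Map a score in [0.0, 1.0] to one of three outcomes."""
    if score < HAM_CUTOFF:
        return "accept"   # accepted without review
    if score < SPAM_CUTOFF:
        return "unsure"   # mailed for review; candidate for retraining
    return "spam"         # mailed for review, then tossed (>= is assumed)


def handle(text, city_has_coordinates, score_tokens):
    """Tokenize, score with the supplied callable, and triage."""
    tokens = list(tokenize(text, city_has_coordinates))
    return triage(score_tokens(tokens))
```

In a real setup score_tokens would wrap a trained classifier; here any
callable that takes a token list and returns a float will do, for example
handle("some text", True, lambda tokens: 0.05).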

Hmm, sounds reasonable.  And the Bayesian filtering thing should work
out well here since the subject matter, as a whole, is rather
specific.  =)  Otherwise we are going to need captchas or something
similar for activating accounts.  Plus a Python solution is just
nice.

-Brett