[spambayes-dev] Any MoinMoin experts here?
skip at pobox.com
Sat Jan 27 03:47:11 CET 2007
Is there anyone here with experience working with the MoinMoin code base? I
think using SpamBayes to deflect spam instead of the current
BadContent/LocalBadContent approach would be useful. I wrote a couple
messages to the moin-users mailing list, but received no responses. (In
scanning the archive I don't see my message. Must have disappeared in a
black hole.) In case someone's interested, here's what I wrote in my second
post:

We all know wikis get spammed. I'm not up to speed on the latest
versions of MoinMoin, but I think the concept used at least through the
1.3 series (the use of BadContent and LocalBadContent pages) is
fundamentally flawed since it relies on users to manually maintain a list of
"bad" words. You're always trying to catch up with the spammers.

Instead, let me suggest that you incorporate a SpamBayes-based
classifier into MoinMoin. I did this recently for a couple other
websites I manage (Mojam and Musi-Cal - not wikis). It worked
marvelously there. I now reject 100% of the spam submissions and also
catch submission mistakes by good users that I would never have caught
before.

Here's how I envision it working. Whenever a form submission happens,
the new page is scored against the current SpamBayes database. If it
scores as possible or probable spam, it is automatically reverted
to the last revision that scores as okay, and the full URL of the
suspect revision is mailed to everyone in AdminGroup. An admin reviews that
URL. If it's okay, the URL is added to the HamPages page. If not, it's
added to the SpamPages page (both suitably protected for AdminGroup
write only and not themselves checked by SpamBayes). Whenever those
pages are saved the entire database is retrained from scratch. This
should not generally be a problem, as there will probably only be a few
pages in the database, so retraining should be quick. It should also be
a relatively rare occurrence. If the suspect page was actually ham,
after retraining, score it again. It should score as ham now. If so,
just revert to it. If not, add it to the HamPages page a second time.

I'm not entirely sure how to handle new pages which are spam, but I
think you should be able to automatically DeletePage them, then revive
them later if they turn out to be good.
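
To make the workflow above concrete, here is a rough sketch of the save-time policy. It is only an illustration: ToyClassifier is a tiny naive-Bayes stand-in, not the SpamBayes API, and handle_save, retrain, notify_admins and the two cutoff values are hypothetical names and numbers chosen for the example. Nothing here is MoinMoin code.

```python
import math
import re
from collections import Counter

# Assumed cutoffs for this sketch; tune them for a real wiki.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def tokenize(text):
    """Lowercased word tokens; a real tokenizer would be much richer."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class ToyClassifier:
    """Tiny naive-Bayes scorer standing in for SpamBayes (not its real API)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.pages = {True: 0, False: 0}

    def learn(self, text, is_spam):
        self.counts[is_spam].update(tokenize(text))
        self.pages[is_spam] += 1

    def score(self, text):
        """Return a spam probability between 0 and 1."""
        log_odds = 0.0
        for tok in tokenize(text):
            p_spam = (self.counts[True][tok] + 1) / (self.pages[True] + 2)
            p_ham = (self.counts[False][tok] + 1) / (self.pages[False] + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-30.0, min(30.0, log_odds))  # avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(-log_odds))


def retrain(ham_pages, spam_pages):
    """Rebuild the whole database from the HamPages/SpamPages contents."""
    clf = ToyClassifier()
    for text in ham_pages:
        clf.learn(text, is_spam=False)
    for text in spam_pages:
        clf.learn(text, is_spam=True)
    return clf


def handle_save(revisions, new_text, clf, notify_admins):
    """Apply the policy to one submission; revisions is oldest-first."""
    score = clf.score(new_text)
    revisions.append(new_text)  # every save becomes a revision
    if score < HAM_CUTOFF:
        return "accepted"
    # Suspect save: tell the admins which revision to review.
    verdict = "spam" if score >= SPAM_CUTOFF else "unsure"
    notify_admins(len(revisions) - 1, verdict, score)
    # Walk back to the newest earlier revision that scores as ham.
    for old_text in reversed(revisions[:-1]):
        if clf.score(old_text) < HAM_CUTOFF:
            revisions.append(old_text)  # revert by saving a copy on top
            return "reverted"
    return "deleted"  # brand-new page: nothing to revert to


ham = ["meeting notes for the python sprint",
       "how to configure the wiki parser"]
spam = ["cheap pills buy viagra online",
        "buy cheap casino online now"]
clf = retrain(ham, spam)
```

When an admin adds a wrongly flagged page to HamPages, calling retrain(ham + [page_text], spam) and scoring the page again should give a lower score, though with so little training data it may take the "second time" mentioned above before it drops under the ham cutoff. In a real integration notify_admins would mail the revision URL to AdminGroup.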
This all said, I can help from the SpamBayes side of things (write the
tokenizer, suggest some synthetic tokens that might help improve the
discrimination of ham and spam), but I'm not familiar with the MoinMoin
code base, certainly not the latest versions. It's unlikely that I
could implement it quickly on that side of things. If someone familiar
with MoinMoin's code base would like to team up with me on this, let me
know. Together we should be able to knock this off very quickly.
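
As a taste of what I mean by synthetic tokens, here is a small sketch of a wiki-page tokenizer that emits a couple of them alongside the ordinary words. The token names ("url-host:", "url-count:") and the bucket boundaries are made up for illustration; they are not tokens SpamBayes generates today.

```python
import re

# Capture the host of each http(s) URL; the trailing \S* swallows the path.
URL_RE = re.compile(r"https?://([^/\s\])>]+)\S*", re.IGNORECASE)
WORD_RE = re.compile(r"[a-z0-9']+")


def _bucket(n):
    """Collapse a count into a coarse bucket so similar pages share a token."""
    if n == 0:
        return "0"
    if n <= 2:
        return "1-2"
    if n <= 9:
        return "3-9"
    return "10+"


def tokenize_page(text):
    """Yield word tokens plus hypothetical synthetic tokens for a wiki page."""
    hosts = [h.lower() for h in URL_RE.findall(text)]
    prose = URL_RE.sub(" ", text)  # keep URL fragments out of the word tokens
    for word in WORD_RE.findall(prose.lower()):
        yield word
    # Synthetic token: one per distinct external host linked from the page.
    for host in sorted(set(hosts)):
        yield "url-host:" + host
    # Synthetic token: roughly how many external links the page carries.
    yield "url-count:" + _bucket(len(hosts))


tokens = list(tokenize_page(
    "Buy now http://pills.example.com/buy and http://pills.example.com/more"))
```

The idea is that a page stuffed with links to the same few hosts picks up strong clues even when the spammer varies the surrounding words.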
Skip