[Tracker-discuss] spam auditor checked in

Wed Jul 25 13:08:49 CEST 2007

skip at pobox.com wrote:
>     Erik> The xmlrpc server has been installed on psf.upfronthosting.co.za
>     Erik> as detailed in your message, using a cvs checkout from an hour
>     Erik> ago. Seems to work.
>
> Note that we switched from CVS to Subversion a couple days ago.  I don't
> think there are any significant differences yet (only my trivial test
> checkins), but you should track the Subversion repository.  
Ah. Good thing :-). http://spambayes.sourceforge.net/download.html needs 
an update, though.
>     Erik> Now I think it needs training. Ideas on how to do that?
>
> Yes, there are two ways to train.  First, there are train and train_mime
> methods in the XML-RPC server.  Second, and certainly more convenient to
> start with,
I'm a programmer. For me, an xmlrpc interface is always more convenient 
than a web interface :-).

>  point your web browser at the URL the server displays when it
> starts up, probably http://localhost:8880/.
I got that running, yes. And I fully agree that it's better if the 
spambayes server is running on localhost, as we don't want too many 
external dependencies. As it's now up and running on localhost, feel free 
to turn off the instance on www.webfast.com.

>     Erik> Also, I'm a bit confused on how the detector works - could you
>     Erik> explain the arguments the XMLRPC method expects? Is the first
>     Erik> argument supposed to be a string, or something else?
>
> The score method takes three arguments, a dictionary representing the form
> submission contents, a possibly empty list of extra tokens which you
> generate, and a list of attachment dictionaries.  See the docstring for
> spambayes.XMLRPCPlugin.form_to_mime.
>   
Ah! Now I understand how it works. I was looking in 
scripts/sb_xmlrpcserver.py which is installed in the bin/ directory. I 
should have been looking in XMLRPCPlugin.py. Is sb_xmlrpcserver.py 
perhaps deprecated and on the list of things to be removed?
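
For the record, here is a minimal sketch of how I understand a call to 
the score method, assuming a modern Python with xmlrpc.client. The 
helper names score_submission and connect are mine, not part of 
spambayes, and the endpoint URL is site-specific, so it is left as a 
parameter. The layout of the attachment dictionaries is described in the 
form_to_mime docstring and is not guessed at here.

```python
import xmlrpc.client


def connect(url):
    """Return a proxy for the XML-RPC server at `url`.

    The URL is an assumption of this sketch; use whatever address the
    server reports when it starts up.
    """
    return xmlrpc.client.ServerProxy(url)


def score_submission(server, form, extra_tokens=(), attachments=()):
    """Ask the server to score one form submission.

    `server` is any object with a `score` method, normally the proxy
    returned by connect(). The three arguments follow the description
    in this thread: a form dictionary, a possibly empty list of extra
    tokens, and a list of attachment dictionaries.
    """
    # XML-RPC marshals plain dicts and lists, so normalise the inputs.
    return server.score(dict(form), list(extra_tokens), list(attachments))
```

A call would then look like score_submission(connect(url), form, 
["user:anonymous"]), where form holds the submitted fields.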

> I also put my test script on the webfast server:
>
>     http://www.webfast.com/~skip/checkmimemsg.py
>
> My intention is that file uploads are transferred in the attachments
> dictionary as compound data while the normal form data are transferred in
> the form dictionary.  The extra_tokens list should consist of synthetic
> tokens your detector generates, such as "user:anonymous" or "user:skip" to
> indicate the login status or "userage:N" where N is something like the log
> of the number of seconds since the logged in user was registered.
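
The synthetic tokens described above could be generated roughly like 
this. It is only a sketch: the function name is made up, and taking the 
base-10 logarithm rounded down is my assumption, since the text only 
says "something like the log".

```python
import math
import time


def build_extra_tokens(username=None, registered_at=None, now=None):
    """Build synthetic tokens describing who submitted the form.

    `registered_at` and `now` are POSIX timestamps in seconds.
    """
    tokens = ["user:%s" % (username or "anonymous")]
    if username and registered_at is not None:
        if now is None:
            now = time.time()
        # Clamp to one second so the logarithm is always defined.
        age = max(now - registered_at, 1.0)
        # Assumption: base-10 log, rounded down to an integer bucket.
        tokens.append("userage:%d" % int(math.log10(age)))
    return tokens
```

Bucketing the age by its logarithm keeps the number of distinct tokens 
small, so the classifier sees each one often enough to learn from it.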
>
> One thing I'm unclear how to do is to recover from a submission which is
> misclassified as spam.  You somehow need to recover the contents of that
> form from somewhere and resubmit the contents.  I sort of think this has to
> happen in the detector.
>   
Hmm.. In a complete system, I think it should work as follows:

*) An attribute, 'spambayes_score', is added to the file and msg classes 
(in schema.py). Guess what this attribute will hold.. :-). A boolean 
attribute 'spambayes_misclassified' should also be added.

*) A detector is added that reacts to instances of the file and msg 
classes. When it fires, it contacts the Spambayes XMLRPC Server and 
gets a score based on the contents and some synthetic tokens.

*) The web pages of the tracker should be modified so that file and msg 
instances classified as spam are not displayed to anonymous users. 
Instead a message should be displayed that tells the user that the file 
or msg has been classified as spam, and that the user should log in and 
press a button to alert a coordinator if the message is incorrectly 
classified.

*) The web pages should, for logged-in users, display a button that 
allows ordinary users to alert administrators that a msg/file is 
misclassified, by setting the 'spambayes_misclassified' attribute. A 
detector should send mail to coordinators when this happens.

*) For coordinators, the web pages should provide buttons for "train as 
ham" and "train as spam", and when one of these is pressed, the 
'spambayes_misclassified' bool should be set to false. For the training 
buttons to work, one or two new web actions are needed. They are written 
as python scripts in the extensions directory of the tracker.

*) The detectors sending e-mail to various e-mail lists (and to the nosy 
list) should not send mail when a message is classified as spam. 
However, if a message was misclassified as spam, they should in an ideal 
world re-send the message when the message is retrained as ham. The 
latter might be tricky, though.

*) Issues that only have msg/file instances that are spam should 
probably not be displayed in the tracker.
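
The detector and display steps in the list above might be sketched as 
follows. This is not the checked-in auditor, just an illustration: 
score_func is a hypothetical callable wrapping the XML-RPC score call, 
SPAM_CUTOFF is a threshold I picked for the example, and I am assuming 
the score comes back as a number between 0 and 1.

```python
import xmlrpc.client

SPAM_CUTOFF = 0.9  # assumed threshold, not taken from this thread


def make_spam_auditor(score_func):
    """Return an auditor that stores a spam score on new msg/file nodes.

    `score_func(form, extra_tokens)` is a hypothetical wrapper around
    the server's score method.
    """
    def audit_spam(db, cl, nodeid, newvalues):
        # Auditors receive the values about to be stored in `newvalues`.
        content = newvalues.get("content", "")
        username = db.user.get(db.getuid(), "username")
        try:
            score = score_func({"content": content},
                               ["user:%s" % username])
        except (OSError, xmlrpc.client.Error):
            # Scoring server unavailable: accept the submission unscored.
            return
        newvalues["spambayes_score"] = score
        newvalues["spambayes_misclassified"] = False
    return audit_spam


def is_hidden_from_anonymous(score, cutoff=SPAM_CUTOFF):
    """True if an item with this score should be hidden from anonymous users."""
    return score is not None and score >= cutoff
```

In a tracker, an init(db) function in the detectors directory would 
register the auditor with db.msg.audit('create', ...) and 
db.file.audit('create', ...). The page templates could then use 
something like is_hidden_from_anonymous to decide what to show.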

This is quite a lot of work, of course, especially if you're new to 
roundup. Let me think about this to see if we can come up with 
something simpler.

Regards,
\Ef