[Python-Dev] Getting started with GBayes testing

Brad Clements bkc@murkworks.com
Thu, 05 Sep 2002 10:13:50 -0400


On 4 Sep 2002 at 20:24, Guido van Rossum wrote:

> Pretty soon, a SF project will be created (Barry has already gotten
> the request in).  We'll gladly add you to the list of developers.

I look forward to it.

> > I'm particularly interested in how to allow HTML-only messages
> > (reduce false positives).  I'm getting a lot of personal mail in
> > that format, unfortunately.
> 
> You train it with an equal number of spam and non-spam ("ham") that
> you received.  Just make sure the ham training messages contain enough
> representatives of the html-only mail.

This is one way to do it, but I was planning on experimenting with tokenizer methods 
that strip out HTML tags, leaving only the text. 
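
A minimal sketch of the tag-stripping idea, assuming Python's standard html.parser
module. The names TextExtractor and strip_html are made up for illustration; this is
not the tokenizer in timtest.py, only one possible pre-processing step in front of it.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.chunks.append(data)


def strip_html(html):
    """Return the visible text of an HTML fragment with whitespace collapsed.

    Hypothetical helper: text chunks are joined with a space, so markup in the
    middle of a word (for example <b>H</b>ello) will split that word.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    print(strip_html('<p><font color="red" size="7">BUY</font> now &amp; save</p>'))
```

The red colour and the large font size disappear along with the tags, so a classifier
fed this output sees only the words of the message.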

My feeling is that the presentation of "the message" is independent of the message 
itself, so whether I get a message as plain text, HTML, or RTF, only the actual content 
is important, not the markup method. Though I suppose using lots of red and large fonts 
might be an indicator of spam, the text of the message should still suffice.

Tim's comments in timtest.py hint that stripping tags isn't a catastrophe for false 
negatives, but he's not planning on doing that for use on technical lists.

I would like to pursue general client-side filtering of spam, so I do need to contend with 
that.

btw, Tim's comment:


> # So if a message is multipart/alternative with both text/plain and text/html
> # branches, we ignore the latter, else newbies would never get a message
> # through.  If a message is just HTML, it has virtually no chance of getting
> # through

Tells me (spammer hat on) that I can send a message with a non-spammish text-only 
part and a spammy HTML part. Most "non-techie" email clients automatically display 
the HTML version when available, but Tim's implementation will ignore it.

Most "average users" never even see the text-only part of multipart messages. In Tim's 
application, that's okay since he's going to use the text-only part anyway. But for my 
purposes, I need to consider both portions. So it's simpler for me to strip the HTML, 
combine that text with the text-only part, and then "test" the combined parts.
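
A rough sketch of that combining step, assuming Python's standard email and
html.parser modules. The names html_to_text and combined_text are hypothetical, and
the tag stripping here is deliberately crude; the result would still have to be handed
to whatever tokenizer and classifier are in use.

```python
from email import message_from_string
from email.policy import default
from html.parser import HTMLParser


class _Text(HTMLParser):
    """Collect every text node of an HTML document (no tag filtering)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Hypothetical helper: drop the tags and collapse whitespace."""
    parser = _Text()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def combined_text(raw_message):
    """Return the text of every text/plain and text/html part, joined.

    Unlike a filter that ignores the HTML branch of multipart/alternative,
    this looks at both branches, so a spammy HTML part cannot hide behind
    an innocent plain-text part.
    """
    msg = message_from_string(raw_message, policy=default)
    pieces = []
    for part in msg.walk():
        # Containers carry no text themselves; attachments are left alone.
        if part.is_multipart() or part.is_attachment():
            continue
        ctype = part.get_content_type()
        if ctype == "text/plain":
            pieces.append(part.get_content())
        elif ctype == "text/html":
            pieces.append(html_to_text(part.get_content()))
    return "\n".join(pieces)
```

Calling combined_text on the raw source of a message gives one string covering both
branches, which can then be scored as a single unit.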

Well, these are just musings. I'll be looking for the SF project.

-Brad


Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements