[Python-Dev] Getting started with GBayes testing
Brad Clements
bkc@murkworks.com
Thu, 05 Sep 2002 10:13:50 -0400
On 4 Sep 2002 at 20:24, Guido van Rossum wrote:
> Pretty soon, a SF project will be created (Barry has already gotten
> the request in). We'll gladly add you to the list of developers.
I look forward to it.
> > I'm particularly interested in how to allow html only messages
> > (reduce false positives). I'm getting a lot of personal mail in
> > that format, unfortunately.
>
> You train it with an equal number of spam and non-spam ("ham") that
> you received. Just make sure the ham training messages contain enough
> representatives of the html-only mail.
This is one way to do it, but I was planning on experimenting with tokenizer methods
that strip out HTML tags, leaving only the text.
My feeling is that the presentation of "the message" is independent of the message
itself, so whether I get a message as text, HTML, or RTF, only the actual content is
important, not the markup method. Though I suppose using lots of red and large fonts
might be an indicator of spam, the text of the message should still suffice.
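
To make the tag-stripping idea concrete, here is a minimal sketch of such a tokenizer
step. It is not code from timtest.py; the names TagStripper, strip_html and tokenize
are made up for illustration, and it assumes a present-day Python 3 with the standard
html.parser module. It keeps only text nodes, drops <script> and <style> bodies, and
splits on whitespace.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keep the text nodes of an HTML document, drop the markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; etc. into plain characters
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text between tags lands here; ignore it inside script/style blocks.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string, with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


def tokenize(html):
    """Assumed tokenizer: lowercase, whitespace-split words of the stripped text."""
    return strip_html(html).lower().split()
```

For example, tokenize('<font color="red">BUY <b>NOW</b></font>') yields the words
"buy" and "now" and nothing about the font. One consequence of joining text nodes with
spaces is that a word split by inline tags comes out as two tokens, which may or may
not be what a real filter wants.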
Tim's comments in timtest.py hint that stripping tags isn't a catastrophe for false
negatives (f-n's), but he's not planning on doing that for use on technical lists.
I would like to pursue general client-side filtering of spam, so I do need to contend with
that.
btw, Tim's comment:
> # So if a message is multipart/alternative with both text/plain and text/html
> # branches, we ignore the latter, else newbies would never get a message
> # through. If a message is just HTML, it has virtually no chance of getting
> # through
tells me (spammer hat on) that I can send a message with a non-spammish text-only part
and a spam HTML part. Most "non-techie" email clients automatically display the HTML
version when available, but Tim's implementation will ignore it.
Most "average users" never even see the text-only part of multipart messages. In Tim's
application, that's okay since he's going to use the text-only part anyway. But for my
purposes, I need to consider both portions. So it's simpler for me to strip html and
combine that text with the text-only part and then "test" the combined parts.
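
A rough sketch of that "strip and combine" step, assuming a present-day Python 3 and
the standard email and html.parser modules. The function name combined_text and the
helper class are hypothetical, not part of Tim's code. It walks every text/* part of
a message, strips tags from the HTML parts, and returns one string to hand to
whatever tokenizer and classifier come next.

```python
from email import message_from_string
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Hypothetical helper: collect text nodes and ignore all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def combined_text(raw_message):
    """Concatenate the text/plain parts and the tag-stripped text/html parts."""
    msg = message_from_string(raw_message)
    pieces = []
    for part in msg.walk():
        # Containers (multipart/*) and attachments (image/*, ...) are skipped.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back rather than drop the part.
            text = payload.decode("latin-1", errors="replace")
        subtype = part.get_content_subtype()
        if subtype == "html":
            pieces.append(strip_html(text))
        elif subtype == "plain":
            pieces.append(text)
    return "\n".join(pieces)
```

With this, a multipart/alternative message whose text/plain branch is innocuous and
whose text/html branch carries the pitch contributes the words of both branches to the
test, so the "spammer hat" trick described above no longer hides the HTML part.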
Well, these are just musings; I'll be looking for the SF project.
-Brad
Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
AOL-IM: BKClements