[Python-Dev] Getting started with GBayes testing

Thu, 05 Sep 2002 13:57:17 -0400

[Followups directed to spambayes@python.org
 http://mail.python.org/mailman-21/listinfo/spambayes
]

[Brad Clements]
> ...
> My feeling is that the presentation of "the message" is independent of the
> message itself, so if I get a message in Text, HTML, RTF only the actual
> content is important, not the markup method.

Everything's A Clue.  Everything that gets ignored partly blinds the
classifier, so the question isn't whether there's a difference, it's how
much of a difference it makes.

> Though I suppose using lots of red and large fonts might be an
> indicator of spam, the text of the message should still suffice.

Indeed, Graham reported that the hex color code for bright red was one of
the strongest spam indicators in his database.
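
To make the "partly blinds the classifier" point concrete, here is a minimal
sketch. The two tokenizer functions and the sample string are hypothetical,
made up for illustration; they are not the tokenizer in timtest.py, nor
Graham's. The sketch only shows that a color code inside a tag can become a
token if markup is kept, and cannot if tags are stripped first.

```python
import re

# Hypothetical token pattern for illustration: runs of letters, digits,
# and a few punctuation characters. Not taken from any real classifier.
TOKEN_RE = re.compile(r"[a-z0-9#$'-]+")

def tokens_with_markup(html):
    """Tokenize the raw text, so tag names and attribute values are kept."""
    return TOKEN_RE.findall(html.lower())

def tokens_without_markup(html):
    """Crudely drop anything that looks like a tag, then tokenize the rest."""
    stripped = re.sub(r"<[^>]*>", " ", html)
    return TOKEN_RE.findall(stripped.lower())

# Made-up sample, not from any real corpus.
sample = '<font color="#ff0000" size=7>FREE</font> offer'

kept = tokens_with_markup(sample)
dropped = tokens_without_markup(sample)
print("#ff0000" in kept, "#ff0000" in dropped)
```

Whether the extra tokens help or hurt on a given corpus is exactly the kind
of thing only a test run can settle.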

> Tim's comments in timtest.py hint that stripping tags isn't a
> catastrophe for f-n's, but he's not planning on doing that for use on
> technical lists.

When HTML-only email is a 99.99% spam indicator on a tech list, it would be
crazy to ignore that clue.  But note that the comments *also* say I'd be
delighted to remove HTML tags even there if some other way of slashing the
f-n rate is proven to work (and most people who have tried it say that
mining more header lines does do it -- but then I haven't seen anything from
them about how they do when they ignore the header lines.  I was happy to
ignore header lines in order to get *some* kind of handle on how well one
could do on "pure content", and it turned out that works remarkably well).

>> # So if a message is multipart/alternative with both text/plain
>> # and text/html branches, we ignore the latter, else newbies would never
>> # get a message through.  If a message is just HTML, it has virtually no
>> # chance of getting through

> Tells me (spammer hat on) that I can send message with a
> non-spammish text only part, and a spam html part since most
> "non-techie" email client users automatically display the html
> version when available, however Tim's implementation will ignore it.

Sure.  It *certainly* isn't a problem on my test data (as witnessed by the
measured error rates).  If the nature of the world changes, the code has to
adapt along with it.  But 90% of the spam I receive (and I get a lot) is
still trivial to recognize from a mere glance at the subject line, and I
don't buy that spammers are a class of ubergeek with formidable skill.
Response rates are a percentage game, and more so than anti-spammers I
expect spammers are keen to go for high-percentage wins at the expense of
esoterica.
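
For reference, the policy in the quoted code comment can be sketched with the
standard email package roughly as follows. This is an illustration under
assumptions, not the code in timtest.py: the function name
text_parts_to_score is hypothetical, and it only selects which parts would be
looked at, leaving tokenizing and scoring out entirely.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def text_parts_to_score(msg):
    """Return the text/* parts to examine under the quoted policy.

    Hypothetical helper: if the message is multipart/alternative and has
    both text/plain and text/html branches, the HTML branch is ignored.
    Otherwise every text part is kept, so an HTML-only message is still
    seen as HTML.
    """
    # walk() yields the message itself and then all of its subparts.
    parts = [p for p in msg.walk() if p.get_content_maintype() == "text"]
    if msg.get_content_type() == "multipart/alternative":
        subtypes = {p.get_content_subtype() for p in parts}
        if {"plain", "html"} <= subtypes:
            parts = [p for p in parts if p.get_content_subtype() != "html"]
    return parts

# Usage: one message with both branches, one that is HTML only.
both = MIMEMultipart("alternative")
both.attach(MIMEText("hello", "plain"))
both.attach(MIMEText("<b>hello</b>", "html"))
html_only = MIMEText("<b>buy now</b>", "html")

print([p.get_content_type() for p in text_parts_to_score(both)])
print([p.get_content_type() for p in text_parts_to_score(html_only)])
```

The spammer-hat attack described above is visible here: whatever sits in the
text/html branch of the first message never reaches the classifier.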

> Most "average users" never even see the text-only part of
> multipart messages. In Tim's application, that's okay since he's going
> to use the text-only part anyway. But for my purposes, I need to consider
> both portions. So it's simpler for me to strip html and combine that text
> with the text-only part and then "test" the combined parts.

Not unreasonable <wink>, but testing remains the only way to decide.  It's
rare you can out-think a fraction of a percent!
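
The strip-and-combine approach described in the quoted text could look
something like the sketch below. The names _TextOnly, strip_html and
combined_text are hypothetical, and the tag stripping is deliberately naive
(the contents of script and style elements would come through as text). It
also assumes each part declares a charset Python knows about; a real
implementation would need to guard against that.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Collect character data and discard the tags around it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(html):
    """Return the text content of an HTML string, whitespace-normalized."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())

def combined_text(msg):
    """Join all text parts of a message, stripping markup from HTML parts."""
    pieces = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer-decoded
        charset = part.get_content_charset() or "ascii"  # assumed known
        text = payload.decode(charset, errors="replace")
        if part.get_content_subtype() == "html":
            text = strip_html(text)
        pieces.append(text)
    return "\n".join(pieces)

# Usage: an innocent plain branch next to a different HTML branch.
msg = MIMEMultipart("alternative")
msg.attach(MIMEText("plain words", "plain"))
msg.attach(MIMEText("<b>Buy</b> now", "html"))
print(combined_text(msg))
```

With this, both branches feed the classifier, at the cost of throwing away
the markup itself as a clue.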