[Spambayes]

Tue Dec 17 01:04:01 EST 2002

[Robert Woodhead]
> I was a bit surprised that you guys haven't run across the embedding
> tricks before.

I don't know that we haven't, just that only one such managed to get itself
classified as Unsure in my personal email so far.  That was discussed at
length here when it happened.  It got a high ham score for *me* because one
of the news stories it included was about the DC-area snipers, and since I
live in the area I had lots of ham from friends and relatives talking about
that too.  The other putative ham it included wasn't notably hammy to my
classifier, and would not have saved the msg from being called spam -- the
spammy parts were extremely spammy.

> In my spam parsing, I have the parser spit out not only the words,
> but also the tokens internal to a tag (< and > are considered
> whitespace), and catenate those words broken up by tags.
>
> So
>
> foo<!-- derf -->bar foo<bork>baz <font color=#FFFFFF>
>
> results in output:
>
> derf foobar bork foobaz font color ffffff

OTOH, we go out of our way to strip almost all evidence of tags, lest every
HTML email be classified as spam.

> Seems to work well.  The state machine for doing this is trivial.
> And the extra stuff you glean from the interior of tags is likely
> to be significant.
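
A minimal sketch of the kind of state machine Robert describes, for
concreteness.  This is not his code and not what spambayes does; the function
name, the lowercasing, and the rule that tag-internal tokens are runs of
letters and digits are all assumptions, chosen to reproduce his sample output.
It does not handle a ">" appearing inside a comment or a quoted attribute.

```python
import re

# Assumption: tokens inside a tag are runs of letters and digits.
TAG_TOKEN = re.compile(r"[a-z0-9]+")

def tokenize_with_tags(text):
    """Emit tag-internal tokens, and join words that tags split apart."""
    tokens = []
    pending = []          # characters of the current word outside any tag
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == "<":
            # Inside a tag: emit its tokens now, but leave the pending
            # word open so "foo<x>bar" comes out as "foobar".
            end = text.find(">", i + 1)
            if end == -1:
                end = n   # unterminated tag: treat the rest as tag content
            tokens.extend(TAG_TOKEN.findall(text[i + 1:end].lower()))
            i = end + 1
        elif ch.isspace():
            # Whitespace outside a tag ends the current word.
            if pending:
                tokens.append("".join(pending).lower())
                pending = []
            i += 1
        else:
            pending.append(ch)
            i += 1
    if pending:
        tokens.append("".join(pending).lower())
    return tokens

sample = "foo<!-- derf -->bar foo<bork>baz <font color=#FFFFFF>"
print(" ".join(tokenize_with_tags(sample)))
```

On Robert's sample line this prints the same token sequence he lists above.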

For a long time we had an option not to strip HTML tags, because in the
early days my comp.lang.python test found that extremely helpful (not
surprising!  there are virtually no legit HTML msgs on tech mailing lists,
while lots of spam is HTML).  The downside was that every one of the few
legit HTML c.l.py msgs became a false positive (FP), and a larger number of
legit non-HTML c.l.py msgs talking *about* HTML became FPs too.

As other parts of the algorithms improved, the advantage of these "killer
clues" eventually fell to nothing, and then below nothing because of their
bad effects on the FP rate.  This was all quantified by experiments at the
time.

Later I put a bit back in, to capture specific suspicious tags (like
"<script") and suspicious parameters (like "width=0").  This nailed a
particular class of extremely brief virus-related email I was getting at the
time, but didn't make any difference to large-test results.
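
A rough sketch of that compromise: strip the tags, but first emit a synthetic
token for each specific suspicious clue.  This is illustrative only, not the
spambayes tokenizer.  The function name and the "html tag:" / "html param:"
token spellings are made up here, and the clue lists contain just the two
examples mentioned above.

```python
import re

TAG = re.compile(r"<[^>]*>")
# Only the two clues named in the post; a real list would be tuned by experiment.
SUSPICIOUS_TAG = re.compile(r"<\s*(script)\b", re.IGNORECASE)
SUSPICIOUS_PARAM = re.compile(r"\b(width)\s*=\s*[\"']?0\b", re.IGNORECASE)

def tokenize_stripping_tags(html):
    """Strip tags, keeping only tokens for specific suspicious clues."""
    tokens = []
    for tag in TAG.findall(html):
        m = SUSPICIOUS_TAG.match(tag)
        if m:
            tokens.append("html tag:" + m.group(1).lower())
        for p in SUSPICIOUS_PARAM.finditer(tag):
            tokens.append("html param:%s=0" % p.group(1).lower())
    # Everything else about the tags is thrown away.
    text = TAG.sub(" ", html)
    tokens.extend(text.lower().split())
    return tokens

print(tokenize_stripping_tags('<b>Hello</b> <img src="x.gif" width=0><script>'))
```

Because the synthetic tokens contain a space, they can't collide with ordinary
words, so a plain-text msg that merely talks *about* HTML never produces them.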