[Spambayes] Re: Spambayes Digest, Vol 52, Issue 26

T. Alexander Popiel popiel at wolfskeep.com
Mon Dec 16 12:16:53 EST 2002


In message:  <a05200f6cba23ccd3425e@[192.168.1.103]>
             Robert Woodhead <trebor@animeigo.com> writes:
>
>I was a bit surprised that you guys haven't run across the embedding 
>tricks before.  In my spam parsing, I have the parser spit out 
>not only the words, but also the tokens internal to a tag (< and > 
>are considered whitespace), and catenate those words broken up by 
>tags.

>Seems to work well.  The state machine for doing this is trivial. 
>And the extra stuff you glean from the interior of tags is likely to 
>be significant.
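
For concreteness, here is a minimal sketch of the kind of state machine Robert describes. It is one reading of his description, not his actual parser: the function name tokenize and the exact flushing rules are assumptions made for illustration. Text inside a tag is split on whitespace and emitted as tokens, and a body word interrupted by a tag is left open so that its pieces are catenated.

```python
def tokenize(text):
    """Return body words plus tag-interior tokens from an HTML-ish string.

    Hypothetical sketch: '<' and '>' only switch state, so a body word
    split by a tag is joined back together.
    """
    tokens = []      # finished tokens
    word = []        # characters of the body word being built
    inner = []       # characters of the current tag-interior token
    in_tag = False

    def flush(buf):
        # Emit the buffered characters as one token, if there are any.
        if buf:
            tokens.append("".join(buf))
            buf.clear()

    for ch in text:
        if in_tag:
            if ch == ">":
                flush(inner)
                in_tag = False
            elif ch.isspace():
                flush(inner)
            else:
                inner.append(ch)
        elif ch == "<":
            in_tag = True    # the body word stays open across the tag
        elif ch.isspace():
            flush(word)
        else:
            word.append(ch)

    flush(inner)             # handles an unterminated tag at end of input
    flush(word)
    return tokens


print(tokenize("buy vi<b>ag</b>ra now"))
```

With this input the tag tokens b and /b are emitted alongside the rejoined word viagra. Note that in this sketch every tag glues its neighbours together, so "one<br>two" would come out as a single word.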

We (I use the term loosely, since I didn't do any of the work)
did some stuff with paying attention to HTML tags.  You're right,
the effects _were_ significant: significantly bad.  It made it
impossible to talk _about_ specific HTML or send a mail in HTML
without being called spam.  The (highly correlated) HTML markers
all got associated so strongly with spam that any HTML presence
was instant damnation.

Depending on what sort of people send you mail, this may or may not
be a problem. ;-)

I suspect some interesting stuff could be done by deciding to pay
attention to all but a select set of HTML tags, while treating
<br>, <p>, <hr>, and other similar basic formatting tags as
whitespace.  It would be interesting to try to determine the set
of tags to ignore based on a collection of HTML ham vs. HTML spam...
but I don't have such a collection, and since I've already got a
0.6% unsure rate with no errors, I'm not too motivated.
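
A rough sketch of how that experiment might look, assuming you had lists of HTML ham and HTML spam as strings. Nothing here comes from the Spambayes code: the names tag_names, pick_ignorable_tags and strip_ignored, the naive regex, and both thresholds are made up for illustration. The idea is to call a tag ignorable when it shows up in a fair share of both ham and spam at about the same rate, and to turn only those tags into whitespace.

```python
import re
from collections import Counter

# Naive tag matcher (an assumption; it breaks on '>' inside attribute values).
TAG_RE = re.compile(r"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def tag_names(html):
    """Return the set of lower-cased tag names present in one message."""
    return {m.group(1).lower() for m in TAG_RE.finditer(html)}


def pick_ignorable_tags(ham_msgs, spam_msgs, min_rate=0.3, max_gap=0.2):
    """Pick tags that are common in both corpora at similar rates.

    Assumes both lists are non-empty. The thresholds are arbitrary
    placeholders, not tuned values.
    """
    ham = Counter(t for msg in ham_msgs for t in tag_names(msg))
    spam = Counter(t for msg in spam_msgs for t in tag_names(msg))
    ignorable = set()
    for tag in set(ham) | set(spam):
        ham_rate = ham[tag] / len(ham_msgs)
        spam_rate = spam[tag] / len(spam_msgs)
        common = min(ham_rate, spam_rate) >= min_rate
        if common and abs(ham_rate - spam_rate) <= max_gap:
            ignorable.add(tag)
    return ignorable


def strip_ignored(html, ignorable):
    """Replace ignorable tags with a space; leave every other tag intact."""
    def repl(m):
        return " " if m.group(1).lower() in ignorable else m.group(0)
    return TAG_RE.sub(repl, html)


ham = ["<p>hello<br>there</p>", "<p>meeting at <b>noon</b></p>"]
spam = ["<p>buy<br>now</p><font color=red>cheap</font>",
        "<p><font size=7>win</font></p>"]
ignore = pick_ignorable_tags(ham, spam)
print(sorted(ignore))
print(strip_ignored(spam[0], ignore))
```

On this toy data p and br end up ignorable while font survives as a token source. A real corpus would need far more messages than this before the rates meant anything.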

I now well understand why Tim Peters lost interest in algorithm
tweaking; until the amount of spam leaking through increases by an
order of magnitude, I'm probably just going to ignore it, as I
ignored it for the five years before last summer.  The good is the
enemy of the perfect, too.

- Alex


