[Spambayes] New web training interface for pop3proxy
Sun Nov 24 08:01:08 2002
> Another tag that is probably huge and worthless is
> <script>...</script>, often couched in a huge comment. (But do
> scripts even occur in emailed html?)
Yes, and especially in spam. The mere presence of <script and/or </script
generates tokens now (see virus_re).
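The idea can be sketched as follows. This is not the actual virus_re from the Spambayes source; the regex, the function name, and the "virus:" token format are illustrative assumptions about how mere tag presence could be turned into classifier tokens:

```python
import re

# Illustrative sketch (not Spambayes' actual virus_re): the mere
# presence of <script or </script, case-insensitive, yields a
# synthetic token regardless of what the script body contains.
script_presence_re = re.compile(r"<\s*(/?)\s*script\b", re.IGNORECASE)

def script_tokens(text):
    """Yield a hypothetical token for each <script>/</script> occurrence."""
    for match in script_presence_re.finditer(text):
        yield "virus:</script" if match.group(1) else "virus:<script"

html = "<SCRIPT>window.open('spam')</script>"
tokens = list(script_tokens(html))
```

Because the classifier only needs evidence of the tag, not its contents, a prefix match like this is enough.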
> We should probably use another cheap ass gimmick to get rid of those
I checked in code to get rid of <style and <!-- gimmicks in a different way.
Leaving <script> guts alone allows the classifier to see common bits of spam
script, so it's probably helpful to keep them.
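A rough sketch of that split, stripping style blocks and comments while leaving script guts intact (these regexes are my own illustration, not the checked-in Spambayes code):

```python
import re

# Illustration only: remove <style>...</style> blocks and <!-- comments -->,
# but leave <script> guts alone so the classifier can still see
# common bits of spam script.
style_re = re.compile(r"<\s*style\b.*?</style\s*>", re.IGNORECASE | re.DOTALL)
comment_re = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_style_and_comments(text):
    text = style_re.sub(" ", text)
    text = comment_re.sub(" ", text)
    return text

sample = "hi<style>p{}</style><!-- junk --><script>evil()</script>bye"
cleaned = strip_style_and_comments(sample)
```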
> then use the cheap ass regex to get rid of the rest of the html.
> One other problem with the regex that I see is that it doesn't
> seem to handle tags with ill placed whitespace very well... like < a
It doesn't handle them at all, and intentionally not, because it has no idea
whether it's even looking at HTML. Note that the most likely values
attached to href= are picked up anyway, though (scanning for embedded URLs
is done without regard to context -- whether attached to href= or src= or
just sitting in plain text or whatever, we tokenize 'em). The special tags
we look for (like <script> and <iframe>) do allow for leading whitespace,
because there's scant chance those will match other kinds of text by
accident.
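Both points above can be sketched in a few lines. The exact regexes are assumptions for illustration, not the Spambayes tokenizer's own:

```python
import re

# Illustration: embedded URLs are tokenized wherever they appear,
# with no regard to whether they sit after href=, after src=, or in
# plain text.
url_re = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

# The handful of special tags allow leading whitespace after '<',
# since "< iframe" is very unlikely to occur as ordinary text.
special_tag_re = re.compile(r"<\s*(?:/\s*)?(?:script|iframe)\b", re.IGNORECASE)

text = 'click < iframe src="http://spam.example/buy"> or see http://spam.example/buy'
urls = url_re.findall(text)          # found in both contexts
specials = special_tag_re.findall(text)  # matches despite the space
```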
> A whitespace normalization substitution regex might be well advised.
> Taking out whitespace after a < would change a < b to a <b, not
> altering its meaning from a clue perspective, and would change < a
> href=... to <a href=..., making it recognizable to the cheap-ass
> gimmick regex.
This very email shows why that's not advisable: it would pick up accidental
instances of "<" and consider them to be "HTML tags" ending with one of the
">"s I'm using to quote your text; your own "< a" in the quoted text above
would be mangled that way.
I'm afraid real parsing can't be done by a cheap-ass gimmick.
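The pitfall is easy to demonstrate. Here's the hypothetical whitespace-normalization pass (not anything in Spambayes) applied to text like the quoted lines above:

```python
import re

# Hypothetical normalization pass (NOT in Spambayes): delete
# whitespace after every '<'. On ordinary prose this manufactures
# tag-like text out of accidental '<' characters.
def normalize(text):
    return re.sub(r"<\s+", "<", text)

quoted = "a < b, and tags like < a href=..."
result = normalize(quoted)
# "a < b" becomes "a <b", which a cheap tag-stripping regex would then
# treat as the start of a tag and delete through the next '>'.
```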
> There was some talk earlier about gleaning clues from some tags, like
> background, font, color, etc. kind of things... any more thought along
> those lines?
Not here -- it wouldn't catch any spam in any of my test data that isn't
already getting caught. The only white-on-white Unsure I've seen would have
been called spam instead then, but that would require more than just noting
which color and background values were being used (since I've only seen this
once, they would be hapaxes, unique to that single spam).
Real parsing is probably inevitable someday, though. It's too easy to fool
a cheap-ass gimmick. But for now, almost nothing does. BTW, real parsing
is much harder than just using a real parser <0.9 wink>, because so much
HTML is ill-formed.
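For a taste of what "real parsing" of ill-formed HTML involves, here is a sketch using the modern stdlib html.parser (which postdates this post; Spambayes in 2002 would have used sgmllib). Even a tolerant parser silently makes recovery decisions about unclosed and mismatched tags:

```python
from html.parser import HTMLParser

# Sketch: extract the text content from ill-formed HTML. The parser
# doesn't crash, but note it must quietly decide what the unclosed
# <p>/<b> and the stray </i> "mean".
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

ill_formed = "<p>buy <b>now<p>cheap</i> pills"  # unclosed and mismatched tags
extractor = TextExtractor()
extractor.feed(ill_formed)
text = "".join(extractor.chunks)
```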