[Spambayes] how spambayes handles image-only spams
tim.one at comcast.net
Sun Sep 7 21:04:43 EDT 2003
> But as you mentioned, the "everything is a token" approach hasn't
> worked all that well with SpamBayes and HTML. Perhaps it would fare
> much better with a "sliding window" tokenizer such as that used by
> CRM-114. I'm betting CRM-114 would hammer "img src http" and other
> image-related tag sequences with my corpus,
I bet it would too, although you seem to be wishing away the "=value" parts
of tags here; I don't think CRM114 ignores them.
> while leaving <P>, <BR> and the like alone.
So give it a try and report back <0.1 wink>.
> Has anyone looked at implementing a sparse windowing system similar to
> CRM-114's in SpamBayes?
Yes, and there's a lot about it in the archives. It's probably hard to find
now, but too detailed to be worth the effort of summarizing here.
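
For readers who haven't seen the idea, here is a minimal sketch of a sparse
sliding-window tokenizer in the general spirit of what is being discussed. It
is not CRM114's or SpamBayes's actual code: the function name, the window size
of 5, and the "<skip>" placeholder are all assumptions made for illustration.
Each new token is paired with every subset of the tokens preceding it in the
window, so a sequence like "img src http" becomes a feature of its own.

```python
from itertools import product


def sparse_window_features(tokens, window=5, skip="<skip>"):
    """Yield sparse phrase features from a sliding window over tokens.

    Hypothetical helper for illustration only. For each token, every
    subset of the (up to window - 1) preceding tokens is combined with
    it; dropped positions are replaced by the skip placeholder.
    """
    for i, newest in enumerate(tokens):
        older = tokens[max(0, i - window + 1):i]
        # One feature per keep/drop pattern over the older tokens.
        for mask in product((False, True), repeat=len(older)):
            parts = [tok if keep else skip for tok, keep in zip(older, mask)]
            parts.append(newest)
            yield " ".join(parts)


if __name__ == "__main__":
    for feature in sparse_window_features(["img", "src", "http"]):
        print(feature)
```

With a full window of 5, each token produces 16 features instead of 1, which
is where the extra CPU and database cost comes from.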
> Intuitively, it seems like this would do much better with HTML tags
For the kinds of people (unlike you) who get just a little bit of HTML ham,
I expect it would be even worse than the first spambayes was for them. Last
year, at the start, spambayes didn't throw out HTML decorations, and that
was a disaster, as we've covered, for most spambayes users, who have little
HTML ham. I expect CRM114 would penalize the mere presence of HTML even more
heavily than that spambayes did.
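
To make "throwing out HTML decorations" concrete, here is a rough sketch of
the idea: drop the tags before splitting into words, so that <P> and <BR>
never become tokens at all. This is an assumption-laden illustration, not the
real spambayes tokenizer; the regex and the function name are made up here,
and real-world HTML needs more care than one regex gives it.

```python
import re

# Naive tag pattern: anything from "<" to the next ">".
TAG_RE = re.compile(r"<[^>]*>")


def tokenize_without_tags(text):
    """Strip HTML tags, then split into lowercase whitespace-delimited words."""
    return TAG_RE.sub(" ", text).lower().split()


if __name__ == "__main__":
    print(tokenize_without_tags("<P>Buy <B>now</B><BR>"))
```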
> as well as mail header information, at the expense of CPU and DB
Both of those, yes. The use of hashing (which current CRM114 may or may not
do anymore) also caused baffling mistakes, where "baffling" == "makes no
sense to humans".
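
A small sketch of why hashed features can be baffling: once features are
reduced to bucket numbers in a fixed-size table, unrelated features can land
in the same bucket and share statistics, and the bucket number can't be turned
back into a readable clue. The CRC32 hash and the tiny table size below are
assumptions chosen to make collisions easy to see; they are not a description
of what CRM114 actually does.

```python
import zlib
from collections import defaultdict


def bucket(feature, n_buckets):
    """Map a feature string to a bucket index (illustrative hash choice)."""
    return zlib.crc32(feature.encode("utf-8")) % n_buckets


def find_collisions(features, n_buckets):
    """Return {bucket: [features]} for buckets shared by more than one feature."""
    by_bucket = defaultdict(list)
    for feature in features:
        by_bucket[bucket(feature, n_buckets)].append(feature)
    return {b: fs for b, fs in by_bucket.items() if len(fs) > 1}


if __name__ == "__main__":
    words = ["word%d" % i for i in range(20)]
    # 20 features into 8 buckets: some buckets must hold several features.
    for b, fs in sorted(find_collisions(words, 8).items()):
        print(b, fs)
```

Any spam evidence counted for one feature in a shared bucket is silently
applied to the others in it, which is one way a score can end up making no
sense to a human reading the message.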