[Spambayes] how spambayes handles image-only spams
Tim Peters
tim.one at comcast.net
Sun Sep 7 21:04:43 EDT 2003
[Ryan Malayter]
> ...
> But as you mentioned, the "everything is a token" approach hasn't
> worked all that well with SpamBayes and HTML. Perhaps it would fare
> much better with a "sliding window" tokenizer such as that used by
> CRM-114. I'm betting CRM-114 would hammer "img src http" and other
> image-related tag sequences with my corpus,
I bet it would too, although you seem to be wishing away the "=value" parts
of tags here; I don't think CRM114 ignores them.
> while leaving <P>, <BR> and the like alone.
So give it a try and report back <0.1 wink>.
> Has anyone looked at implementing a sparse windowing system similar to
> CRM-114's in SpamBayes?
Yes, and there's a lot about it in the archives; it's probably hard to find
now, but too detailed to be worth the effort of summarizing here.
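
For readers who have not seen the idea, here is a minimal sketch of a sparse sliding-window tokenizer in the spirit of what is being discussed. It is not taken from CRM114 or from spambayes: the function name, the window size of 5, and the "<skip>" placeholder are all assumptions made for illustration. Each feature keeps the newest token plus some subset of the tokens just before it, so a sequence like "img src http" becomes a feature in its own right.

```python
SKIP = "<skip>"  # hypothetical placeholder for a token left out of a feature


def sparse_window_features(tokens, window=5):
    """Return sparse-window features for a token list.

    For each position, the newest token is always kept, and every subset
    of the (up to) window-1 preceding tokens is tried; tokens not chosen
    are replaced by SKIP so their position is still recorded.
    """
    features = []
    for end in range(len(tokens)):
        start = max(0, end - window + 1)
        prior = tokens[start:end]
        newest = tokens[end]
        # Each bitmask selects which of the prior tokens are kept.
        for mask in range(1 << len(prior)):
            kept = [tok if mask & (1 << i) else SKIP
                    for i, tok in enumerate(prior)]
            features.append(" ".join(kept + [newest]))
    return features


if __name__ == "__main__":
    for feature in sparse_window_features(["img", "src", "http"]):
        print(feature)
```

With a full window of 5 tokens, each position yields 16 features, which is where the extra CPU and database cost comes from.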
> Intuitively, it seems like this would do much better with HTML tags
For the kinds of people (unlike you) who get just a little bit of HTML ham,
I expect it would be even worse than the first spambayes was for them (last
year, at the start, spambayes didn't throw out HTML decorations; and that
was a disaster, as we've covered, for most spambayes users who have little
HTML ham; I expect CRM114 would penalize the mere presence of HTML even more
heavily than that spambayes did).
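
To make "throwing out HTML decorations" concrete, here is a rough sketch of stripping tags before tokenizing. It is an illustration only, not the actual spambayes tokenizer: the regular expression and both function names are assumptions, and real HTML needs more care than this.

```python
import re

# Hypothetical, deliberately simple pattern: anything between < and >.
TAG_RE = re.compile(r"<[^>]*>")


def strip_html_tags(text):
    """Replace each HTML tag with a space so adjacent words stay separate."""
    return TAG_RE.sub(" ", text)


def tokenize(text):
    """Lowercase, whitespace-split tokens of the text with tags removed."""
    return strip_html_tags(text).lower().split()


if __name__ == "__main__":
    print(tokenize("<P>Buy <B>now</B><BR>"))
```

With the tags gone, the mere presence of <P> or <BR> cannot count against a message from someone who rarely receives HTML ham.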
> as well as mail header information, at the expense of CPU and DB
> storage.
Both of those, yes. The use of hashing (which current CRM114 may or may not
do anymore) also caused baffling mistakes, where "baffling" == "makes no
sense to humans".
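
A small sketch of why hashed features can produce mistakes that make no sense to humans: once features are folded into a fixed number of buckets, unrelated features can land in the same bucket and share one set of counts. The hash function (CRC32) and the tiny table size below are assumptions chosen to make collisions easy to see; they are not CRM114's actual hash or table size.

```python
import zlib
from collections import defaultdict


def bucket(feature, n_buckets):
    """Map a feature string to a bucket index (CRC32 is an assumed stand-in)."""
    return zlib.crc32(feature.encode("utf-8")) % n_buckets


def find_collisions(features, n_buckets):
    """Return {bucket index: [features]} for buckets holding 2+ features."""
    by_bucket = defaultdict(list)
    for feature in features:
        by_bucket[bucket(feature, n_buckets)].append(feature)
    return {b: fs for b, fs in by_bucket.items() if len(fs) > 1}


if __name__ == "__main__":
    words = ["img", "src", "http", "meeting", "python",
             "viagra", "wink", "lunch", "mortgage"]
    # Nine features in eight buckets must collide somewhere.
    for b, fs in find_collisions(words, n_buckets=8).items():
        print(b, fs)
```

Any features that share a bucket are indistinguishable to the classifier, so evidence learned from one is silently applied to the other.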