[Spambayes] how spambayes handles image-only spams

Tim Peters tim.one at comcast.net
Sun Sep 7 21:04:43 EDT 2003


[Ryan Malayter]
> ...
> But as you mentioned, the "everything is a token" approach hasn't
> worked all that well with SpamBayes and HTML. Perhaps it would fare
> much better with a "sliding window" tokenizer such as that used by
> CRM-114. I'm betting CRM-114 would hammer "img src http" and other
> image-related tag sequences with my corpus,

I bet it would too, although you seem to be wishing away the "=value" parts
of tags here; I don't think CRM114 ignores them.

> while leaving <P>, <BR> and the like alone.

So give it a try and report back <0.1 wink>.

> Has anyone looked at implementing a sparse windowing system similar to
> CRM-114's in SpamBayes?

Yes, and there's a lot about it in the archives. It's probably hard to find
now, but it's too detailed to be worth the effort of summarizing here.
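
For readers who haven't seen the idea, here is a minimal sketch of a sparse
sliding-window tokenizer, loosely in the spirit of what CRM114 is described as
doing. It is an illustration only, not CRM114's or spambayes's actual code:
the function name, the "<skip>" placeholder, and the default window size are
all assumptions made for the example.

```python
from itertools import combinations

def sparse_window_features(tokens, window=5, skip="<skip>"):
    """Yield sparse phrase features for every window position.

    Hypothetical helper: each feature starts with the token at the
    current position and keeps some subset of the tokens that follow it
    in the window; positions that are dropped become a placeholder so
    word order and distance are preserved.
    """
    for i in range(len(tokens)):
        win = tokens[i:i + window]
        first, rest = win[0], win[1:]
        # Enumerate every subset of the trailing positions.
        for r in range(len(rest) + 1):
            for keep in combinations(range(len(rest)), r):
                phrase = [first] + [
                    rest[j] if j in keep else skip
                    for j in range(len(rest))
                ]
                yield " ".join(phrase)

features = list(sparse_window_features(["img", "src", "http"], window=3))
for f in features:
    print(f)
```

With a full 5-token window, the leading token alone produces 2**4 = 16
features, which is where the extra CPU and database cost comes from.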

> Intuitively, it seems like this would do much better with HTML tags

For the kinds of people (unlike you) who get just a little bit of HTML ham,
I expect it would be even worse than the first spambayes was.  Last year, at
the start, spambayes didn't throw out HTML decorations, and, as we've
covered, that was a disaster for most spambayes users, who have little HTML
ham.  I expect CRM114 would penalize the mere presence of HTML even more
heavily than that early spambayes did.

> as well as mail header information, at the expense of CPU and DB
> storage.

Both of those, yes.  The use of hashing (which current CRM114 may or may not
still do) also caused baffling mistakes, where "baffling" == "makes no sense
to humans".
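
One plausible mechanism for that kind of mistake, assumed here rather than
stated in the message, is bucket collisions: when features are stored by hash
in a fixed number of buckets, two unrelated features can land in the same
bucket and share one set of counts. The sketch below shows the general effect
only. The bucket count, the choice of CRC32, and the sample features are
hypothetical and are not taken from CRM114.

```python
import zlib
from collections import defaultdict

def bucket(feature, n_buckets):
    """Map a feature string to a bucket index (illustrative hash choice)."""
    return zlib.crc32(feature.encode("utf-8")) % n_buckets

def find_collisions(features, n_buckets):
    """Return {bucket index: features sharing it} for crowded buckets."""
    by_bucket = defaultdict(set)
    for f in features:
        by_bucket[bucket(f, n_buckets)].add(f)
    return {b: sorted(fs) for b, fs in by_bucket.items() if len(fs) > 1}

# Made-up features; nine of them cannot fit into eight buckets without
# at least two sharing one.
features = [
    "img src http", "dear python-list", "free money",
    "unsubscribe here", "patch attached", "meeting at noon",
    "click <skip> now", "traceback most recent", "viagra",
]
collisions = find_collisions(features, n_buckets=8)
for b, fs in sorted(collisions.items()):
    print(b, fs)
```

Any features listed together would be indistinguishable to a classifier that
keeps only per-bucket counts, so evidence trained on one shows up when scoring
the other.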



