[Spambayes] how spambayes handles image-only spams
Bill Yerazunis
wsy at merl.com
Sun Sep 7 21:25:00 EDT 2003
From: "Tim Peters" <tim.one at comcast.net>
[Ryan Malayter]
> ...
> But as you mentioned, the "everything is a token" approach hasn't
> worked all that well with SpamBayes and HTML. Perhaps it would fare
> much better with a "sliding window" tokenizer such as that used by
> CRM-114. I'm betting CRM-114 would hammer "img src http" and other
> image-related tag sequences with my corpus,
I bet it would too, although you seem to be wishing away the "=value" parts
of tags here; I don't think CRM114 ignores them.
It both ignores them and does NOT ignore them. That's the beauty
of the sparse binary polynomial hash - if you have the phrase:
beware the jabberwock my son
you get all these token features:
beware
beware the
beware jabberwock
beware the jabberwock
beware my
beware the my
beware jabberwock my
beware the jabberwock my
beware son
beware the son
beware jabberwock son
beware the jabberwock son
beware my son
beware the my son
beware jabberwock my son
beware the jabberwock my son
So you both would and would not get features corresponding to the =value
stuff. Both the woulds and the would-nots get trained in, but since the
would-nots statistically dominate the woulds, you have no need to worry
about the woulds; they'd (eventually) be groomed out of the database as
room is needed for other tokens.
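For the curious, here is a rough Python sketch that reproduces the sixteen
features listed above. It is only an illustration of the enumeration, not
CRM114's actual code; the function name sbph_features is made up, and real
CRM114 hashes the phrases rather than storing them as strings.

```python
def sbph_features(window):
    """Keep the first word and every subset of the remaining words,
    preserving word order (hypothetical helper, for illustration only)."""
    first, rest = window[0], window[1:]
    features = []
    # Count through the bit masks 0 .. 2**len(rest) - 1;
    # bit i set means rest[i] is kept in the phrase.
    for mask in range(2 ** len(rest)):
        kept = [w for i, w in enumerate(rest) if mask & (1 << i)]
        features.append(" ".join([first] + kept))
    return features


features = sbph_features("beware the jabberwock my son".split())
for f in features:
    print(f)
```

A five-word window gives 2^4 = 16 features, in the same order as the list
above.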
For the kinds of people (unlike you) who get just a little bit of HTML ham,
I expect it would be even worse than the first spambayes was for them (last
year, at the start, spambayes didn't throw out HTML decorations; and that
was a disaster, as we've covered, for most spambayes users who have little
HTML ham; I expect CRM114 would penalize the mere presence of HTML even more
heavily than that spambayes did).
Only as heavily as the statistics indicate. CRM114 has no ingrained notion
of "good" or "bad", only a statistical aggregation of those ideas. If
YOUR mail has a lot of HTML spam, then yeah, it will weight against HTML.
If you have a lot of good HTML mail, then there will not be any significant
prejudice.
You will also get weights according to the kind of HTML tags you get.
If your spam uses a lot of tables and font colors, but your regular email
does not, then <table>, <td>, <font>, etc. will have "spam" statistics
but <p>, <br>, and <h1> won't.
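To make the point concrete, here is a toy Python sketch of per-tag
statistics. Everything in it is an assumption for illustration: the
messages are invented, the function name tag_spam_ratio is hypothetical,
and the simple spam/(spam+ham) ratio is NOT the formula CRM114 uses. It
only shows that the weight of a tag comes from your own mail.

```python
from collections import Counter


def tag_spam_ratio(spam_msgs, ham_msgs):
    """Fraction of the messages containing each tag that were spam.
    Each message is given as a list of tag names (toy representation)."""
    spam_counts = Counter(tag for msg in spam_msgs for tag in set(msg))
    ham_counts = Counter(tag for msg in ham_msgs for tag in set(msg))
    ratios = {}
    for tag in set(spam_counts) | set(ham_counts):
        s, h = spam_counts[tag], ham_counts[tag]
        ratios[tag] = s / (s + h)
    return ratios


# Invented toy corpus: spam uses tables and fonts, ham does not.
spam = [["table", "td", "font", "p"], ["table", "font", "br"]]
ham = [["p", "br"], ["p", "br", "h1"]]

ratios = tag_spam_ratio(spam, ham)
for tag in sorted(ratios):
    print(tag, round(ratios[tag], 2))
```

With this made-up corpus, table, td, and font come out fully "spammy",
while p, br, and h1 do not.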
> as well as mail header information, at the expense of CPU and DB
> storage.
Both of those, yes. The use of hashing (which current CRM114 may or may not
do anymore) also caused baffling mistakes, where "baffling" == "makes no
sense to humans".
It still uses hashing, but now with effectively a 64-bit key that _is_
checked. The chance of a hash clash is something like 10^-20, a number
so small that I don't know the pseudo-Greek prefix for it, but it's smaller
than a nano-nano-chance.
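As a back-of-the-envelope check, assuming the 64-bit key behaves like a
uniformly random value (an idealization, not a measured property of
CRM114's hash), the chance that two distinct features share a key is 2^-64:

```python
# Assumption: an ideal, uniformly distributed 64-bit key, so any two
# distinct features collide with probability 2**-64.
KEY_BITS = 64
clash_probability = 2.0 ** -KEY_BITS
print(f"{clash_probability:.2e}")
```

That is about 5.4 x 10^-20, in line with the rough figure above and well
below a nano-nano (10^-18).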
-Bill Yerazunis