[Spambayes] how spambayes handles image-only spams

Bill Yerazunis wsy at merl.com
Sun Sep 7 21:25:00 EDT 2003


   From: "Tim Peters" <tim.one at comcast.net>

   [Ryan Malayter]
   > ...
   > But as you mentioned, the "everything is a token" approach hasn't
   > worked all that well with SpamBayes and HTML. Perhaps it would fare
   > much better with a "sliding window" tokenizer such as that used by
   > CRM-114. I'm betting CRM-114 would hammer "img src http" and other
   > image-related tag sequences with my corpus,

   I bet it would too, although you seem to be wishing away the "=value" parts
   of tags here; I don't think CRM114 ignores them.

It both ignores them and does NOT ignore them.  That's the beauty
of the sparse binary polynomial hash - if you have the phrase:

   beware the jabberwock my son

you get all these token features:

    beware
    beware the
    beware jabberwock
    beware the jabberwock
    beware my
    beware the my
    beware jabberwock my
    beware the jabberwock my
    beware son
    beware the son
    beware jabberwock son
    beware the jabberwock son
    beware my son
    beware the my son
    beware jabberwock my son
    beware the jabberwock my son

so you both would and would not get features corresponding to the =value stuff.
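
As a rough sketch, the feature list above can be reproduced by keeping the
first word of a five-word window and then taking every in-order subset of
the other four. The function name `sbph_features` is made up for this
illustration, and the sketch emits plain strings; it does not attempt the
hashing step or any other detail of the real CRM114 code.

```python
def sbph_features(window):
    """Return every in-order sub-phrase of `window` that keeps its first word.

    Illustrative only: a real implementation would hash each feature
    instead of building strings.
    """
    anchor, rest = window[0], window[1:]
    features = []
    # Each bit of `mask` says whether the corresponding later word is kept.
    for mask in range(1 << len(rest)):
        kept = [word for i, word in enumerate(rest) if mask & (1 << i)]
        features.append(" ".join([anchor] + kept))
    return features


features = sbph_features("beware the jabberwock my son".split())
for feature in features:
    print(feature)
```

With a five-word window this yields 2^4 = 16 features, in the same order as
the list above.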

Now, both the woulds and would-nots get trained in, but since the 
would-nots would statistically dominate the woulds, you have no need
to worry about the woulds; they'd be (eventually) groomed out of the
database as room was needed for other tokens.

   For the kinds of people (unlike you) who get just a little bit of HTML ham,
   I expect it would be even worse than the first spambayes was for them (last
   year, at the start, spambayes didn't throw out HTML decorations; and that
   was a disaster, as we've covered, for most spambayes users who have little
   HTML ham; I expect CRM114 would penalize the mere presence of HTML even more
   heavily than that spambayes did).

Only as heavily as the statistics indicate.  CRM114 has no ingrained notion
of "good" or "bad", only a statistical aggregation of those ideals.  If
YOUR mail has a lot of HTML spam, then yeah, it will weigh against HTML.
If you have a lot of good HTML mail, then there will not be any significant
prejudice.

You will also get weights according to the kind of HTML tags you get.
If your spam uses a lot of tables and font colors, but your regular email
does not, then <table>, <td>, <font>, etc. will have "spam" statistics
but <p>, <br>, and <h1> won't.
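
To make the per-tag idea concrete, here is a toy sketch that counts which
HTML tags show up in spam versus ham and turns that into a simple fraction.
The four messages are invented for the example, the names `tag_counts` and
`spam_fraction` are hypothetical, and the plain frequency ratio stands in
for whatever scoring the real classifier uses.

```python
import re
from collections import Counter

# Matches the name of an opening tag such as <table> or <font color=red>.
TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")


def tag_counts(messages):
    """Count, per tag, how many messages contain that tag at least once."""
    counts = Counter()
    for msg in messages:
        counts.update({tag.lower() for tag in TAG_RE.findall(msg)})
    return counts


def spam_fraction(tag, spam_counts, ham_counts, n_spam, n_ham):
    """Share of the tag's presence rate that comes from spam (0.5 if unseen)."""
    s = spam_counts[tag] / n_spam
    h = ham_counts[tag] / n_ham
    return s / (s + h) if (s + h) else 0.5


# Made-up toy corpus: spam uses tables and fonts, ham uses plain markup.
spam = [
    "<table><td><font color=red>BUY</font></td></table>",
    "<font size=7>WIN</font><table></table>",
]
ham = [
    "<p>Lunch?</p><br>",
    "<h1>Notes</h1><p>See attached.</p>",
]
spam_c, ham_c = tag_counts(spam), tag_counts(ham)

for tag in ("table", "font", "p", "h1"):
    print(tag, spam_fraction(tag, spam_c, ham_c, len(spam), len(ham)))
```

A user whose good mail was full of tables would see the <table> fraction
move toward the middle instead, which is the point: the weights come from
the mail, not from the tag.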

   > as well as mail header information, at the expense of CPU and DB
   > storage.

   Both of those, yes.  The use of hashing (which current CRM114 may or may not
   do anymore) also caused baffling mistakes, where "baffling" == "makes no
   sense to humans".

It still uses hashing, but now with effectively a 64-bit key that _is_
checked.  The chance of a hash clash is something like 10^-20, a number
so small that I don't know the pseudogreek prefix for it, but it's smaller
than a nano-nano-chance.
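
A quick back-of-the-envelope check of that figure, assuming the 64-bit keys
are uniformly distributed and looking only at the chance that two given
distinct features land on the same key (not the chance of any clash
anywhere in a large database):

```python
KEY_BITS = 64  # effective key width mentioned above

# Probability that two specific distinct features share one 64-bit key,
# under the assumption of uniformly distributed keys.
p_clash = 2.0 ** -KEY_BITS

NANO_NANO = 1e-9 * 1e-9  # 10^-18

print(f"{p_clash:.2e}")     # 5.42e-20
print(p_clash < NANO_NANO)  # True
```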

     -Bill Yerazunis


