[Spambayes] how spambayes handles image-only spams

Tim Peters tim.one at comcast.net
Sun Sep 7 21:42:54 EDT 2003


[Bill Yerazunis]
> It both ignores them and does NOT ignore them.  That's the beauty
> of the sparse binary polynomial hash - if you have the phrase:
>
>    beware the jabberwock my son
>
> you get all these token features:
>
>     beware
>     beware the
>     beware jabberwock
>     beware the jabberwock
>     beware my
>     beware the my
>     beware jabberwock my
>     beware the jabberwock my
>     beware son
>     beware the son
>     beware jabberwock son
>     beware the jabberwock son
>     beware my son
>     beware the my son
>     beware jabberwock my son
>     beware the jabberwock my son
>
> so you both would, and would not get features corresponding to =value
> stuff.

I don't know -- it depends on how you split the bytestream into units.  When
Ryan hypothesized he'd see "img src http", he was assuming all sorts of
stuff, such as that you drop the "<" presumably preceding "img", drop the
'="' presumably following "src", fold case, and so on.
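
To make the dependence on splitting concrete, here is a minimal sketch. The
tokenize() rule (fold case, split on anything that is not a letter or digit)
is only one hypothetical choice, and sbph_features() simply reproduces the
16-feature list quoted above; neither is taken from CRM114's or spambayes'
actual code, and the example URL is made up.

```python
import re


def tokenize(text):
    """Fold case and keep only runs of letters and digits.

    Hypothetical splitting rule: it throws away '<', '=', '"', ':' and '/',
    which is the kind of assumption needed to end up with "img src http".
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def sbph_features(window):
    """Return every in-order sub-phrase of `window` that keeps its first token.

    For a 5-token window this gives 2**4 = 16 features, in the same order
    as the quoted "beware the jabberwock my son" list.
    """
    first, rest = window[0], window[1:]
    features = []
    for mask in range(2 ** len(rest)):
        # Bit i of mask decides whether rest[i] is kept in this feature.
        picked = [tok for i, tok in enumerate(rest) if mask & (1 << i)]
        features.append(" ".join([first] + picked))
    return features


if __name__ == "__main__":
    print(tokenize('<IMG SRC="http://example.com/a.gif">')[:3])
    for feature in sbph_features(tokenize("beware the jabberwock my son")):
        print(feature)
```

A different splitting rule (say, whitespace only) would keep '<IMG' and
'SRC="http://...' as units, and the phrase features would change with it.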

> Now, both the woulds and would-nots get trained in, but since the
> would-nots would statistically dominate the woulds, you have no need
> to worry about the woulds; they'd be (eventually) groomed out of the
> database as room was needed for other tokens.

>> For the kinds of people (unlike you) who get just a little bit
>> of HTML ham, I expect it would be even worse than the first spambayes
>> was for them (last year, at the start, spambayes didn't throw out HTML
>> decorations; and that was a disaster, as we've covered, for most
>> spambayes users who have little HTML ham; I expect CRM114 would penalize
>> the mere presence of HTML even more heavily than that spambayes did).

> Only as heavily as the statistics indicate.

Of course, but I said "for those who get just a little HTML ham" twice in
the quoted paragraph.  I don't think the context was unclear.

> CRM114 has no ingrained notion of "good" or "bad", only a statistical
> aggregation of those ideals.

Same here.

> If YOUR mail has a lot of HTML spam, then yeah, it will weight against.
> If you have a lot of good HTML mail, then there will not be any
> significant prejudice.

The quoted paragraph was talking about people with little HTML ham.

> You will also get weights according to the kind of HTML tags you get.
> If your spam uses a lot of tables and font colors, but your regular
> email does not, then <table>, <td>, <font> etc will have "spam"
> statistics but <p>, <br>, and <h1> won't.

Try it and see what happens.  spambayes only uses unigrams, and we had
plenty of experience with this.  CRM114 will have all those unigrams too,
plus a giant (compared to just the unigrams) collection of "hey, there was
HTML!" pairs, triples, etc.  The unigrams alone are enough to kill the
classifier's usefulness for people with little (but not no) HTML ham.

> ...
> It still uses hashing, but now with effectively a 64-bit key that _is_
> checked.

Don't know what it means to check a 64-bit key.  Maybe it means you're using
64-bit hash codes now, and that a hash code leads to a chain of original
tokens hashing to that code, and "check" means the incoming token is
compared to the entries in the hash chain.  Or maybe it doesn't <wink>.
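
For illustration only, the guess above (a hash code leads to a chain of
original tokens, and "check" means comparing the incoming token against that
chain) might look like the sketch below. The CheckedCounts class and the
hash64() helper are hypothetical names, and blake2b is just a stand-in for
whatever 64-bit hash is really used; nothing here is known to match CRM114.

```python
import hashlib


def hash64(token):
    """Map a token to a 64-bit integer (stand-in hash, not CRM114's)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class CheckedCounts:
    """Token counts keyed by hash code, with the original token verified."""

    def __init__(self, hash_fn=hash64):
        self._hash = hash_fn
        self._chains = {}  # hash code -> list of [token, count] entries

    def add(self, token):
        chain = self._chains.setdefault(self._hash(token), [])
        for entry in chain:
            if entry[0] == token:  # the "check": compare the real token
                entry[1] += 1
                return
        chain.append([token, 1])  # new token, possibly a hash clash

    def count(self, token):
        for stored, n in self._chains.get(self._hash(token), []):
            if stored == token:
                return n
        return 0
```

With the check in place, two tokens that clash on their hash code still get
separate counts, at the cost of storing the original tokens.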

>  The chance of a hash clash is something like 10^-20th,

The chance of a hash clash on a pair of 64-bit keys is 2**-64, of course,
but you can expect to see the first collision after hashing O(2**32) tokens.

> a number so small that I don't know the pseudogreek prefix for it but
> it's smaller than a nano-nano-chance.

The Birthday Paradox lowers it back into the comfortably small billions
<wink>.
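
The arithmetic behind that can be sketched as follows, assuming an ideal
64-bit hash that spreads tokens uniformly. The function names are made up for
this example, and the formulas are the standard birthday-problem
approximations rather than anything measured.

```python
import math

HASH_BITS = 64
SPACE = 2 ** HASH_BITS  # number of distinct 64-bit hash codes


def pair_clash_probability():
    """Chance that two given tokens share a hash code: 2**-64."""
    return 1 / SPACE


def collision_probability(n):
    """Approximate chance that at least two of n hashed tokens clash.

    Uses 1 - exp(-n*(n-1) / (2*SPACE)); expm1 keeps precision for small n.
    """
    return -math.expm1(-n * (n - 1) / (2 * SPACE))


def expected_tokens_until_first_collision():
    """Approximate expected number of tokens hashed before the first clash."""
    return math.sqrt(math.pi * SPACE / 2)


if __name__ == "__main__":
    print(pair_clash_probability())
    print(collision_probability(2 ** 32))
    print(expected_tokens_until_first_collision())
```

The per-pair chance is about 5e-20, but after 2**32 tokens the chance of at
least one clash is already close to 40%, and the expected first clash arrives
after roughly 5.4 billion tokens.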



