[Spambayes] how spambayes handles image-only spams

Tue Sep 9 13:39:06 EDT 2003

>>>> E pur si moivre, dude.  E pur si moivre.

>>> There are more kinds of email users in heaven and earth than are
>>> dreamt of in your classifier, Bill.
    [Bill Yerazunis]

>>> That's one thing I _like_ about this list.  At least y'all are
>>> moderately literate.  :-)

>> Perhaps, but I actually had no idea what epursimoivre meant <wink>.

[Bill Yerazunis]
> E pur si moivre --> "Nevertheless, it _does_ move".  Apocryphally,
> what Galileo said after being tortured continuously by the Spanish
> Inquisition for four days and finally recanting his observational
> evidence that the Earth moved.

Ah!  I think memory is mixing Italian with French here.  s/moivre/muove/ and
google goes from 0 hits to thousands.  That's easy to remember because then
"e pur si muove" is an anagram of "pursue movie" <wink>.

>>> Well, on the grounds that the SpamAssassin corpus is a little
>>> less biased, I re-ran the tests against the .css files that the
>>> SA test corpus generates (using the TOE learning strategy).
>>> Accuracy on this corpus is just over 98% for crm114, and barely
>>> 70% for me-the-human.  The results for SA test corpus:
>>>
>>> Token  Spam Nonspam
>>> <p>     143     144
>>> <br>    380     289
>>> <td>     67     119
>>> <font   305     281
>>> <a      218     346
>>>
>>> So, it seems that "font" is somewhat spammy, and so is "br",
>>> but <a and <td aren't, and <p> is totally equivocal.
>>> Does this help?  :)

[Tim]
>> It helps Ryan's thesis that HTML isn't uncommon in ham.  If the
>> stats had been more like
>>
>>      Token  Spam Nonspam
>>      <p>     143      10
>>      <br>    380       5
>>      <td>     67       0
>>      <font   305       0
>>      <a      218       4
>>
>> which is much closer to my actual ham-spam breakdown wrt HTML, I
>> believe CRM114 would have had a very hard time classifying the ham
>> correctly (note that there would also be literally dozens of other
>> distinct unique-to-HTML strings in that ham too, all with high
>> spam counts and low ham counts).  *This* kind of distribution is
>> what spambayes had lots of experience with, and is what caused us
>> to throw away HTML decorations.

[Bill]
> Well, what happens in CRM114 is not that the HTML causes confusion, it
> does get factored in, but when you have a nearly 1:1 ratio in the
> hits, it basically doesn't make any difference to the end value.

Sure.  We keep talking past each other here, and I don't know why.
spambayes doesn't strip HTML because it's afraid of HTML; it strips it
specifically to avoid penalizing ham for the mere presence of HTML, for those
people (like most of the spambayes developers) who have very little (but still
some) HTML ham.  It wasn't the 1::1 corpora that drove the decision, it was the
100::1 corpora (which is in fact typical of my own email).
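
To make the 1::1 versus 100::1 point concrete, here is a rough sketch that
turns the two count tables quoted above into per-token spam probabilities and
sums their log-odds.  It is not the spambayes or CRM114 code: the smoothing is
a Robinson-style estimate with illustrative values for s and x, the log-odds
sum is plain naive Bayes rather than either project's real combining scheme,
and it assumes the spam and ham training sets are the same size so raw counts
can be compared directly.  The names token_prob and html_log_odds are made up
for this example.

```python
import math

# Token counts copied from the thread: token -> (spam count, nonspam count).
SA_CORPUS = {"<p>": (143, 144), "<br>": (380, 289), "<td>": (67, 119),
             "<font": (305, 281), "<a": (218, 346)}
SKEWED = {"<p>": (143, 10), "<br>": (380, 5), "<td>": (67, 0),
          "<font": (305, 0), "<a": (218, 4)}


def token_prob(spam, ham, s=1.0, x=0.5):
    """Smoothed estimate of P(spam | token) from raw counts.

    s is the weight given to the prior and x is the prior probability for
    a token with no evidence; both values here are illustrative only.
    Assumes equally sized spam and ham training sets.
    """
    return (s * x + spam) / (s + spam + ham)


def html_log_odds(table):
    """Sum of per-token log-odds for a message containing every token."""
    total = 0.0
    for spam, ham in table.values():
        p = token_prob(spam, ham)
        total += math.log(p / (1.0 - p))
    return total


if __name__ == "__main__":
    for name, table in (("SA corpus", SA_CORPUS), ("skewed", SKEWED)):
        for token, (spam, ham) in table.items():
            print(f"{name:10} {token:6} p={token_prob(spam, ham):.3f}")
        print(f"{name:10} total log-odds = {html_log_odds(table):.2f}")
```

With the SA corpus counts every tag lands fairly close to 0.5 and the summed
log-odds stay small, which is Bill's point.  With the skewed counts every tag
scores above 0.9, so a ham message that merely contains these five tags starts
out with a large push toward spam before any of its actual content is looked
at.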

> How well does SpamBayes do on the SpamAssassin test corpus?

I haven't tried it, and I don't recall anyone else here trying it.  It
wasn't interesting to me because spambayes was initially developed to filter
Mailman mailing lists, and later expanded to single-user classifiers.  Those
both have a kind of focus that aggregating many people's ham would lose, and
finding out how badly that loses would be of only idle academic interest to
me.  It's of more interest to other people here, though, so I'll let them do
the work <wink>.