[Spambayes] how spambayes handles image-only spams

Bill Yerazunis wsy at merl.com
Tue Sep 9 11:19:53 EDT 2003


   From: "Tim Peters" <tim.one at comcast.net>

   >>> E pur si moivre, dude.  E pur si moivre.

   >> There are more kinds of email users in heaven and earth than are
   >> dreamt of in your classifier, Bill.

   [Bill Yerazunis]
   > That's one thing I _like_ about this list.  At least y'all are
   > moderately literate.  :-)

   Perhaps, but I actually had no idea what epursimoivre meant <wink>.

E pur si moivre --> "Nevertheless, it _does_ move".  Apocryphally,
what Galileo said after being tortured continuously by the Roman
Inquisition for four days and finally recanting his observational
evidence that the Earth moved.

   > Well, on the grounds that the SpamAssassin corpus is a little less
   > biased, I re-ran the tests against the .css files that the SA test
   > corpus generates (using the TOE learning strategy).  Accuracy on this
   > corpus is just over 98% for crm114, and barely 70% for me-the-human.
   >
   > The results for SA test corpus:
   >
   > Token  Spam Nonspam
   > <p>     143     144
   > <br>    380     289
   > <td>     67     119
   > <font   305     281
   > <a      218     346
   >
   > So, it seems that "font" is somewhat spammy, and so is "br",
   > but <a and <td aren't, and <p> is totally equivocal.
   >
   > Does this help?  :)

   It helps Ryan's thesis that HTML isn't uncommon in ham.  If the stats had
   been more like

     Token  Spam Nonspam
     <p>     143      10
     <br>    380       5
     <td>     67       0
     <font   305       0
     <a      218       4

   which is much closer to my actual ham-spam breakdown wrt HTML, I believe
   CRM114 would have had a very hard time classifying the ham correctly (note
   that there would also be literally dozens of other distinct unique-to-HTML
   strings in that ham too, all with high spam counts and low ham counts).
   *This* kind of distribution is what spambayes had lots of experience with,
   and is what caused us to throw away HTML decorations.

Well, what happens in CRM114 is not that the HTML causes confusion.  It
does get factored in, but when a token's spam and nonspam hits are
nearly 1:1, it makes basically no difference to the end value.
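
To make the 1:1 point concrete, here is a rough sketch in Python.  It is
NOT CRM114's actual scoring math (nor SpamBayes's); it just computes the
naive fraction spam/(spam+nonspam) for each token, on the assumption that
the raw hit counts are directly comparable.  The names spam_fraction,
SA_CORPUS and SKEWED are made up for this illustration.  The first table
is the SA corpus counts above, the second is Tim's hypothetical breakdown.

```python
# Token hit counts as (spam hits, nonspam hits), copied from the tables above.
SA_CORPUS = {
    "<p>": (143, 144),
    "<br>": (380, 289),
    "<td>": (67, 119),
    "<font": (305, 281),
    "<a": (218, 346),
}

# Tim's hypothetical counts, where HTML almost never shows up in ham.
SKEWED = {
    "<p>": (143, 10),
    "<br>": (380, 5),
    "<td>": (67, 0),
    "<font": (305, 0),
    "<a": (218, 4),
}


def spam_fraction(spam_hits, nonspam_hits):
    """Naive per-token estimate: share of all hits that came from spam.

    Assumption: raw counts are comparable between the two classes.
    A token never seen at all is treated as neutral (0.5).
    """
    total = spam_hits + nonspam_hits
    if total == 0:
        return 0.5
    return spam_hits / total


if __name__ == "__main__":
    for name, counts in (("SA corpus", SA_CORPUS), ("skewed", SKEWED)):
        print(name)
        for token, (spam, nonspam) in counts.items():
            print(f"  {token:<6} {spam_fraction(spam, nonspam):.3f}")
```

With the SA corpus counts every token lands fairly close to 0.5 (<p> is
almost exactly 0.5), so it contributes next to nothing either way.  With
the skewed counts every token comes out above 0.9, which is the case
where HTML in a ham message would pull hard toward spam.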

How well does SpamBayes do on the SpamAssassin test corpus?

    -Bill Yerazunis


