[Spambayes] how spambayes handles image-only spams
wsy at merl.com
Tue Sep 9 11:19:53 EDT 2003
From: "Tim Peters" <tim.one at comcast.net>
>>> E pur si moivre, dude. E pur si moivre.
>> There are more kinds of email users in heaven and earth than are
>> dreamt of in your classifier, Bill.
> That's one thing I _like_ about this list. At least y'all are
> moderately literate. :-)
Perhaps, but I actually had no idea what epursimoivre meant <wink>.
E pur si moivre --> "Nevertheless, it _does_ move". Apocryphally,
what Galileo said after being tortured continuously by the Roman
Inquisition for four days and finally recanting his observational
evidence that the Earth moved.
> Well, on the grounds that the SpamAssassin corpus is a little less
> biased, I re-ran the tests against the .css files that the SA test
> corpus generates (using the TOE learning strategy). Accuracy on this
> corpus is just over 98% for crm114, and barely 70% for me-the-human.
> The results for SA test corpus:
> Token    Spam   Nonspam
> <p>       143       144
> <br>      380       289
> <td>       67       119
> <font     305       281
> <a        218       346
> So, it seems that "font" is somewhat spammy, and so is "br",
> but <a and <td aren't, and <p> is totally equivocal.
> Does this help? :)
It helps Ryan's thesis that HTML isn't uncommon in ham. If the stats had
been more like
Token    Spam   Nonspam
<p>       143        10
<br>      380         5
<td>       67         0
<font     305         0
<a        218         4
which is much closer to my actual ham-spam breakdown wrt HTML, I believe
CRM114 would have had a very hard time classifying the ham correctly (note
that there would also be literally dozens of other distinct unique-to-HTML
strings in that ham too, all with high spam counts and low ham counts).
*This* kind of distribution is what spambayes had lots of experience with,
and is what caused us to throw away HTML decorations.
Well, what happens in CRM114 is not that the HTML causes confusion. It
does get factored in, but when a token has a nearly 1:1 ratio of spam
to nonspam hits, it makes basically no difference to the end value.
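
To put rough numbers on the 1:1 point, here is a minimal sketch that turns
the two tables in this thread into a per-token "share of hits from spam".
This is a deliberately naive ratio, not the scoring that CRM114 or
spambayes actually uses, and it assumes the spam and nonspam sets are about
the same size (the thread does not say). The names spam_fraction and
summarize are made up for the illustration.

```python
# (spam hits, nonspam hits) per token, copied from the tables in this thread.
observed = {
    "<p>": (143, 144),
    "<br>": (380, 289),
    "<td>": (67, 119),
    "<font": (305, 281),
    "<a": (218, 346),
}
hypothetical = {
    "<p>": (143, 10),
    "<br>": (380, 5),
    "<td>": (67, 0),
    "<font": (305, 0),
    "<a": (218, 4),
}


def spam_fraction(spam, nonspam):
    """Share of a token's hits that came from spam; 0.5 means no evidence.

    Assumption: both corpora are roughly equal in size, so raw counts
    can be compared without normalizing by message totals.
    """
    total = spam + nonspam
    if total == 0:
        return 0.5  # token never seen: treat as neutral
    return spam / total


def summarize(table):
    """Map each token to its spam fraction."""
    return {tok: spam_fraction(s, h) for tok, (s, h) in table.items()}


for name, table in (("observed", observed), ("hypothetical", hypothetical)):
    print(name)
    for tok, frac in summarize(table).items():
        print(f"  {tok:<6} {frac:.3f}")
```

With the SA corpus counts every tag lands between roughly 0.36 and 0.57, so
none of them pulls a message far from neutral. With the hypothetical counts
every tag comes out above 0.9, which is the situation where any HTML at all
in a ham message counts heavily against it.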
How well does SpamBayes do on the SpamAssassin test corpus?