[Spambayes] how spambayes handles image-only spams
tim.one at comcast.net
Tue Sep 9 00:11:02 EDT 2003
>>> E pur si moivre, dude. E pur si moivre.
>> There are more kinds of email users in heaven and earth than are
>> dreamt of in your classifier, Bill.
> That's one thing I _like_ about this list. At least y'all are
> moderately literate. :-)
Perhaps, but I actually had no idea what epursimoivre meant <wink>.
> Well, on the grounds that the SpamAssassin corpus is a little less
> biased, I re-ran the tests against the .css files that the SA test
> corpus generates (using the TOE learning strategy). Accuracy on this
> corpus is just over 98% for crm114, and barely 70% for me-the-human.
> The results for SA test corpus:
> Token   Spam  Nonspam
> <p>      143      144
> <br>     380      289
> <td>      67      119
> <font    305      281
> <a       218      346
> So, it seems that "font" is somewhat spammy, and so is "br",
> but <a and <td aren't, and <p> is totally equivocal.
> Does this help? :)
It helps Ryan's thesis that HTML isn't uncommon in ham. If the stats had
been more like
Token   Spam  Nonspam
<p>      143       10
<br>     380        5
<td>      67        0
<font    305        0
<a       218        4
which is much closer to my actual ham-spam breakdown wrt HTML, I believe
CRM114 would have had a very hard time classifying the ham correctly (note
that there would also be literally dozens of other distinct unique-to-HTML
strings in that ham too, all with high spam counts and low ham counts).
*This* kind of distribution is what spambayes had lots of experience with,
and is what caused us to throw away HTML decorations.
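
To make the contrast between the two tables concrete, here is a rough sketch
in Python. It is not spambayes's or CRM114's actual scoring. It assumes the
spam and nonspam corpora are about the same size (no message totals are given
above), so a token's raw spam share is just spam / (spam + nonspam). The names
spam_share, sa_corpus and html_heavy_spam are made up for illustration.

```python
def spam_share(spam_count, ham_count):
    """Fraction of a token's occurrences that were in spam.

    Assumes equally sized spam and ham corpora (an assumption, not a
    fact from the thread). A token never seen at all gets a neutral 0.5.
    """
    total = spam_count + ham_count
    if total == 0:
        return 0.5
    return spam_count / total


# Counts reported for the SpamAssassin test corpus: (spam, nonspam).
sa_corpus = {
    "<p>": (143, 144),
    "<br>": (380, 289),
    "<td>": (67, 119),
    "<font": (305, 281),
    "<a": (218, 346),
}

# The made-up counts from the second table, where HTML is rare in ham.
html_heavy_spam = {
    "<p>": (143, 10),
    "<br>": (380, 5),
    "<td>": (67, 0),
    "<font": (305, 0),
    "<a": (218, 4),
}

for name, table in (("SA corpus", sa_corpus),
                    ("HTML-heavy spam", html_heavy_spam)):
    print(name)
    for token, (spam, ham) in table.items():
        print(f"  {token:<6} {spam_share(spam, ham):.3f}")
```

Under that assumption, every tag in the first table lands somewhere near 0.5,
while every tag in the second lands above 0.9. A ham message containing
several such tags would then carry that many strong spam clues before any of
its text was looked at.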