[Spambayes] how spambayes handles image-only spams

Tue Sep 9 00:11:02 EDT 2003

>>> E pur si moivre, dude.  E pur si moivre.

>> There are more kinds of email users in heaven and earth than are
>> dreamt of in your classifier, Bill.

[Bill Yerazunis]
> That's one thing I _like_ about this list.  At least y'all are
> moderately literate.  :-)

Perhaps, but I actually had no idea what epursimoivre meant <wink>.

> ...
> Well, on the grounds that the SpamAssassin corpus is a little less
> biased, I re-ran the tests against the .css files that the SA test
> corpus generates (using the TOE learning strategy).  Accuracy on this
> corpus is just over 98% for crm114, and barely 70% for me-the-human.
>
> The results for SA test corpus:
>
> Token  Spam Nonspam
> <p>     143     144
> <br>    380     289
> <td>     67     119
> <font   305     281
> <a      218     346
>
> So, it seems that "font" is somewhat spammy, and so is "br",
> but <a and <td aren't, and <p> is totally equivocal.
>
> Does this help?  :)

It helps Ryan's thesis that HTML isn't uncommon in ham.  If the stats had
been more like

  Token  Spam Nonspam
  <p>     143      10
  <br>    380       5
  <td>     67       0
  <font   305       0
  <a      218       4

which is much closer to my actual ham-spam breakdown wrt HTML, I believe
CRM114 would have had a very hard time classifying the ham correctly (note
that there would also be literally dozens of other distinct unique-to-HTML
strings in that ham too, all with high spam counts and low ham counts).
*This* kind of distribution is what spambayes had lots of experience with,
and is what caused us to throw away HTML decorations.
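
To make the contrast between the two tables concrete, here is a rough
sketch of how raw counts like these turn into per-token spam scores.  It
is an illustration only, not what either spambayes or CRM114 actually
runs:  it assumes the spam and ham corpora are the same size (the counts
above don't say), and it uses a Robinson-style smoothing step whose
prior (0.5) and strength (0.45) are assumed values.  The name
token_spam_prob is made up for this example.

```python
# Sketch: per-token spam score from raw spam/ham counts.
# ASSUMPTIONS (not from the original post): equal-sized spam and ham
# corpora, and Robinson-style smoothing with prior=0.5, strength=0.45.

def token_spam_prob(spam_count, ham_count, nspam=1, nham=1,
                    prior=0.5, strength=0.45):
    """Return a smoothed estimate that a message containing the token is spam."""
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    total = spam_ratio + ham_ratio
    if total == 0:
        # Never-seen token: fall back to the prior.
        return prior
    raw = spam_ratio / total
    n = spam_count + ham_count
    # Pull the raw estimate toward the prior; the pull fades as n grows.
    return (strength * prior + n * raw) / (strength + n)

# Counts Bill reported for the SA test corpus: (spam, nonspam).
sa_corpus = {
    "<p>": (143, 144),
    "<br>": (380, 289),
    "<td>": (67, 119),
    "<font": (305, 281),
    "<a": (218, 346),
}

# The hypothetical counts from the reply, where HTML is rare in ham.
html_rare_in_ham = {
    "<p>": (143, 10),
    "<br>": (380, 5),
    "<td>": (67, 0),
    "<font": (305, 0),
    "<a": (218, 4),
}

def score_table(table):
    """Map each token to its smoothed spam score."""
    return {tok: token_spam_prob(s, h) for tok, (s, h) in table.items()}

if __name__ == "__main__":
    for name, table in (("SA corpus", sa_corpus),
                        ("HTML rare in ham", html_rare_in_ham)):
        print(name)
        for tok, prob in score_table(table).items():
            print("  %-6s %.3f" % (tok, prob))
```

Under those assumptions, every tag in the first table lands fairly close
to 0.5 (weak evidence either way), while every tag in the second table
lands well above 0.9, so a single HTML ham would be hit by a pile of
strong spam clues at once.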