[Spambayes] Graphs on my website

Bill Yerazunis wsy at merl.com
Wed Mar 5 08:52:19 EST 2003


   From: niek at haunter.student.utwente.nl (Niek Bergboer)

   In a ROC-curve (Receiver Operating Characteristic curve), you plot the
   correct positive rate (y-axis) against the false positive rate (x-axis). The
   points on the curve are given by using e.g. different spam:ham
   ratios. A ROC-curve doesn't necessarily provide more information, but
   it is a rather standard way to present results in (more or less)
   binary classification. The term ROC originates from RADAR detection
   results, AFAIK.

   A problem that needs to be addressed in making ROC-curves for
   spambayes is how to handle unsures: disregarding them completely in
   the ROC curve seems reasonable, but then one probably also needs a
   correct.pos.rate vs. unsures rate curve.

The ROC curves I've seen are all plots of correct% v. incorrect% with
the parameterization variable being some controllable threshold that's
an input to the system; the closer the "knee" in the curve comes to
the origin, the better the discrimination, and the parameter value(s)
at the point of closest approach are the optimal operating parameters.
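
As a rough illustration of that threshold sweep, here is a minimal
Python sketch. It assumes both axes are error rates (false positives
and false negatives), so that the origin is the ideal point. The
function names and the scores are made up for the example; none of
this is SpamBayes code.

```python
import math

def error_curve(ham_scores, spam_scores, thresholds):
    """Return one (threshold, false_pos_rate, false_neg_rate) point per threshold.

    A message is called spam when its score is >= the threshold.
    """
    points = []
    for t in thresholds:
        # Ham wrongly called spam.
        fp = sum(1 for s in ham_scores if s >= t) / len(ham_scores)
        # Spam wrongly called ham.
        fn = sum(1 for s in spam_scores if s < t) / len(spam_scores)
        points.append((t, fp, fn))
    return points

def closest_to_origin(points):
    """Pick the point with the smallest Euclidean distance to (0, 0)."""
    return min(points, key=lambda p: math.hypot(p[1], p[2]))

# Hypothetical classifier scores, purely illustrative.
ham_scores = [0.01, 0.05, 0.10, 0.30, 0.60]
spam_scores = [0.40, 0.80, 0.90, 0.95, 0.99]
thresholds = [i / 10 for i in range(11)]

curve = error_curve(ham_scores, spam_scores, thresholds)
best = closest_to_origin(curve)
print("threshold=%.1f  fp=%.2f  fn=%.2f" % best)
```

Several thresholds can tie for closest approach; min() simply returns
the first one it finds.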

In the case of SpamBayes, where there's a distinct "third class",
I'd suggest _three_ curves:

    Ham v. Unsure
    Unsure v. Spam
    Ham v. Spam

This would plot the confusion on all three axes, and make it clear that
you can drive the third one (ham v. spam) really close to the
origin (which is good) by expanding the size of the Unsure class.
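
A small Python sketch of that trade-off, assuming a two-cutoff scheme
where scores below ham_cutoff are ham, scores at or above spam_cutoff
are spam, and everything in between is unsure. The function, the
dictionary keys, and the scores are hypothetical, chosen only to show
the effect.

```python
def three_way_rates(ham_scores, spam_scores, ham_cutoff, spam_cutoff):
    """Return the pairwise confusion rates for a ham/unsure/spam split."""
    def classify(score):
        if score < ham_cutoff:
            return "ham"
        if score >= spam_cutoff:
            return "spam"
        return "unsure"

    ham_calls = [classify(s) for s in ham_scores]
    spam_calls = [classify(s) for s in spam_scores]
    total = len(ham_scores) + len(spam_scores)
    return {
        # Ham v. Spam confusion, in both directions.
        "ham_as_spam": ham_calls.count("spam") / len(ham_scores),
        "spam_as_ham": spam_calls.count("ham") / len(spam_scores),
        # Ham v. Unsure and Unsure v. Spam confusion.
        "ham_as_unsure": ham_calls.count("unsure") / len(ham_scores),
        "spam_as_unsure": spam_calls.count("unsure") / len(spam_scores),
        # Overall size of the Unsure class.
        "unsure": (ham_calls.count("unsure") + spam_calls.count("unsure")) / total,
    }

# Hypothetical scores, purely illustrative.
ham_scores = [0.01, 0.05, 0.10, 0.30, 0.60]
spam_scores = [0.40, 0.80, 0.90, 0.95, 0.99]

# No Unsure band at all: every message is forced into ham or spam.
narrow = three_way_rates(ham_scores, spam_scores, 0.50, 0.50)
# A wide Unsure band soaks up the borderline messages.
wide = three_way_rates(ham_scores, spam_scores, 0.35, 0.70)

for name, rates in (("narrow", narrow), ("wide", wide)):
    print(name, rates)
```

With these made-up scores, widening the band moves the two borderline
messages out of the ham v. spam errors and into Unsure, which is
exactly why the ham v. spam curve alone can look better than the
system really is.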

       -Bill Yerazunis ( CRM114 spy :-) )
    


