[Python-checkins] python/nondist/sandbox/spambayes classifier.py,1.6,1.7

Sun, 01 Sep 2002 00:22:29 -0700

Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv10839

Modified Files:
	classifier.py 
Log Message:
Added a comment block about HAMBIAS experiments.  There's no clearer
example of trading off precision against recall, and you can favor either
at the expense of the other to any degree you like by fiddling with this knob.
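
The trade-off shows up directly in the unique-message counts reported in the comment block this checkin adds. Here is a small sketch that recomputes the change factors relative to HAMBIAS 2.0. The counts are copied from that comment block; the RESULTS layout and the helper name change_factor are assumptions made up for this illustration.

```python
# Unique false positives (fp) and false negatives (fn) over all 20 runs,
# keyed by HAMBIAS.  The numbers are copied from the comment block.
RESULTS = {
    1.0: {"fp": 174, "fn": 166},
    2.0: {"fp": 23, "fn": 337},
    3.0: {"fp": 5, "fn": 702},
}

def change_factor(baseline, value):
    """Return (direction, factor) describing how value differs from baseline.

    Hypothetical helper: the factor is always >= 1, and the direction says
    whether the count went "up" or "down" relative to the baseline.
    """
    if value >= baseline:
        return "up", value / baseline
    return "down", baseline / value

base = RESULTS[2.0]
for bias in (1.0, 3.0):
    for kind in ("fp", "fn"):
        direction, factor = change_factor(base[kind], RESULTS[bias][kind])
        print("HAMBIAS %.1f: %s goes %s by a factor of %.1f"
              % (bias, kind, direction, factor))
```

In both directions, the error type that improves does so only at the cost of the other one getting worse.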


Index: classifier.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/classifier.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** classifier.py	1 Sep 2002 00:05:41 -0000	1.6
--- classifier.py	1 Sep 2002 07:22:27 -0000	1.7
***************
*** 9,12 ****
--- 9,28 ----
  from heapq import heapreplace
  
+ # The count of each word in ham is artificially boosted by a factor of
+ # HAMBIAS, and similarly for SPAMBIAS.  Graham uses 2.0 and 1.0.  Final
+ # results are very sensitive to the HAMBIAS value.  On my 5x5 c.l.py
+ # test grid (20,000 hams and 13,750 spams split into 5 pairs), I did 20
+ # test runs (for each pair, training on that pair then scoring against
+ # the other 4 pairs) and counted all the unique msgs ever identified as
+ # a false negative or positive.  Compared to HAMBIAS 2.0:
+ #
+ # At HAMBIAS 1.0
+ #    total unique false positives goes up   by a factor of 7.6 ( 23 -> 174)
+ #    total unique false negatives goes down by a factor of 2   (337 -> 166)
+ #
+ # At HAMBIAS 3.0
+ #    total unique false positives goes down by a factor of 4.6 ( 23 ->   5)
+ #    total unique false negatives goes up   by a factor of 2.1 (337 -> 702)
+ 
  HAMBIAS  = 2.0
  SPAMBIAS = 1.0
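
For readers unfamiliar with the knob, here is a minimal sketch of how a ham bias can enter a per-word spam probability. It follows the formula in Paul Graham's "A Plan for Spam" and is not the code in classifier.py. The function name word_spamprob, its parameters, and the 0.01/0.99 clamps are assumptions for this example.

```python
HAMBIAS = 2.0
SPAMBIAS = 1.0

def word_spamprob(hamcount, spamcount, nham, nspam,
                  hambias=HAMBIAS, spambias=SPAMBIAS):
    """Graham-style estimate that a message containing a word is spam.

    hamcount, spamcount: occurrences of the word in each training corpus
    nham, nspam: number of training messages in each corpus
    (Hypothetical signature, for illustration only.)
    """
    # Boost each raw count by its bias, then cap the ratio at 1.0.
    hamratio = min(1.0, hambias * hamcount / nham)
    spamratio = min(1.0, spambias * spamcount / nspam)
    if hamratio + spamratio == 0.0:
        raise ValueError("word never seen in training data")
    prob = spamratio / (hamratio + spamratio)
    # Clamp so that no single word is treated as absolute proof.
    return max(0.01, min(0.99, prob))

# A word seen 10 times in 1000 hams and 10 times in 1000 spams:
# raising the ham bias pulls its spam probability further below 0.5.
for bias in (1.0, 2.0, 3.0):
    print(bias, round(word_spamprob(10, 10, 1000, 1000, hambias=bias), 3))
```

With a larger ham bias, a word needs proportionally more spam evidence to count as spammy. That fits the pattern in the comment block, where false positives fall and false negatives rise as HAMBIAS grows.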