[Spambayes] A curious max_discriminators result

Tim Peters tim.one@comcast.net
Sun, 22 Sep 2002 05:49:01 -0400


My usual large test here.  "Before":

"""
[Classifier]
use_robinson_probability: True
use_robinson_combining: True
max_discriminators: 1500

[TestDriver]
spam_cutoff: 0.575
"""

"After" is identical except for changing max_discriminator to 16 (this had
never been tried before (by me) with both use_robinson_xyz options True):

"""
false positive percentages
    0.000  0.050  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.000  0.000  tied
    0.050  0.100  lost  +100.00%
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.050  0.050  tied

won   0 times
tied  6 times
lost  4 times

total unique fp went from 3 to 7 lost  +133.33%
mean fp % went from 0.015 to 0.035 lost  +133.33%

false negative percentages
    0.071  0.143  lost  +101.41%
    0.286  0.214  won    -25.17%
    0.071  0.000  won   -100.00%
    0.214  0.143  won    -33.18%
    0.143  0.214  lost   +49.65%
    0.214  0.071  won    -66.82%
    0.214  0.143  won    -33.18%
    0.357  0.071  won    -80.11%
    0.357  0.071  won    -80.11%
    0.000  0.000  tied

won   7 times
tied  1 times
lost  2 times

total unique fn went from 27 to 15 won    -44.44%
mean fn % went from 0.192857142857 to 0.107142857143 won    -44.44%
"""

I'm starting to believe that any tweak whatsoever will reduce the f-n rate.
BTW, this also tells me that I'm getting almost all the good there is to get
(in my data) out of the 16 most extreme clues.  OTOH, Neil reported that
reducing max_discriminators (from 1,500) hurt him, but didn't quantify it
(like how much it hurt, or how much he reduced it).
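
As a rough illustration of what "the 16 most extreme clues" means, the sketch
below ranks words by how far their spam probability lies from the neutral 0.5
and keeps the top N.  This is an assumption-laden sketch, not the classifier's
actual code: the function name most_extreme_clues is invented here, and the
'the' entry is a made-up bland word added only to show something being
dropped.  The other probabilities are copied from the clue list later in this
message.

```python
def most_extreme_clues(word_probs, max_discriminators=16):
    """Return up to max_discriminators (word, prob) pairs, strongest first.

    "Strongest" here means farthest from 0.5 in either direction
    (an assumption about the selection rule, for illustration only).
    """
    ranked = sorted(word_probs.items(),
                    key=lambda item: abs(item[1] - 0.5),
                    reverse=True)
    return ranked[:max_discriminators]


clues = {
    'header:Organization:1': 0.0124308,
    'webmasters': 0.981481,
    'thanks.': 0.119632,
    'notification': 0.885737,
    'the': 0.52,   # hypothetical near-neutral word, not from the real data
}

for word, prob in most_extreme_clues(clues, max_discriminators=3):
    print(word, prob)
```

With a limit of 3, the two strongest clues and 'notification' survive, while
'thanks.' and the neutral word are ignored.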

Further result:  if I raised spam_cutoff to 0.60 in the "after" run, 6(!) of
the 7 fps would go away, and 15 fns would be gained, leaving a grand total of
1 fp and 30 fn.  Bizarre:  this would let both the Nigerian scam quote(!)
and the lady with the obnoxious sig through.  The only ham it would stop is
this, which I've never posted before:

"""
Data/Ham/Set2/4059.txt
prob = 0.609747546095
prob('header:Organization:1') = 0.0124308
prob('script') = 0.0167547
prob('thanks.') = 0.119632
prob('mark') = 0.133815
prob('notification') = 0.885737
prob('protection') = 0.897292
prob('traffic') = 0.909737
prob('anti') = 0.909836
prob('information') = 0.911857
prob('please') = 0.929992
prob('address') = 0.946832
prob('age') = 0.951843
prob('email') = 0.960494
prob('davidson') = 0.964517
prob('verification') = 0.975702
prob('webmasters') = 0.981481

From: huy4huy@cs.com (HUY4HUY)
Newsgroups: comp.lang.python
Subject: need scripts...help
Lines: 13
NNTP-Posting-Host: ladder06.news.cs.com
X-Admin: news@cs.com
Date: 2 Jun 1999 18:47:40 GMT
Organization: CompuServe (http://www.compuserve.com/)
Message-ID: <19990602144740.17875.00000120@ng-cf1.news.cs.com>
Path: news!uunet!ffx.uu.net!newsfeed.us.ibm.net!ibm.net!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!newsfeed.cwix.com!152.163.199.19!portc03.blue.aol.com!portc03.blue.cs.com!audrey03.news.cs.com!not-for-mail
Xref: news comp.lang.python:64176
To: python-list@python.org

I need a program that would automate an age verification system, I need the
script to handle all aspects.

It should have easy sign up for webmasters and surfers, anti cheater devices as
well as the real time monitoring of all system activities, email notification
to the webmaster joining to the admin, logs all traffic trough the system and
gathers email addresses, built in password protection for admin area, no SSL's,
no htaccess...this is an example of the format I need.  I will appreciate it,
if anyone can help me out......thanks
--------------------------------------------------------------------
Mark Davidson             inert28@hotmail.com
please inform with the information to the email address above, thanks.
"""