[Spambayes] Re: For the bold

Rob Hooft rob@hooft.net
Sun, 06 Oct 2002 20:25:20 +0200


I made a number of changes to rmspik.py:

  - The "chance" function was replaced by something a bit more
    scientific (this helps!).
  - There are new parameters in the source code (I'm hoping someone else
    can make these configurable through the .ini file).

# surefactor: the ratio between the two p's above which we're sure a
# message belongs to one of the two populations.  Raising this number
# increases the "unsures" on both sides and decreases the "sure fp" and
# "sure fn" rates.  A value of 1000 works well for me; at 10000 you get
# slightly fewer sure fp/fn at the cost of a lot more middle ground; at
# 10 you have much less work on the middle ground but ~50% more "sure
# false" scores.  This variable operates on messages that are "a bit of
# both ham and spam".
surefactor = 100

# pminhamsure: the minimal pham at which we say it's surely ham.
# Lowering this value gives fewer "unsure ham" and more "sure ham"; it
# might, however, result in more "sure fn".  0.01 works well, but to
# accept a bit more fn, I set it to 0.005.  This variable operates on
# messages that are "neither ham nor spam, but a bit more ham than
# spam".
pminhamsure = 0.005

# pminspamsure: the minimal pspam at which we say it's surely spam.
# Lowering this value gives fewer "unsure spam" and more "sure spam"; it
# might, however, result in more "sure fp".  Since most people find fp
# worse than fn, this value should most probably be higher than
# pminhamsure.  0.01 works well, but to accept a bit fewer fp, I set it
# to 0.02.  This variable operates on messages that are "neither ham
# nor spam, but a bit more spam than ham".
pminspamsure = 0.02
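
To make the interplay of the three thresholds above concrete, here is a minimal sketch of how they might combine into a four-way verdict. This is only one reading of the comments, not the code in rmspik.py: the function name classify() and the four label strings are made up for illustration, and the ratio test is written as a multiplication so that a p of exactly zero does not cause a division error.

```python
# Hypothetical sketch: classify() and its labels are illustrative
# assumptions, not taken from rmspik.py.
surefactor = 100
pminhamsure = 0.005
pminspamsure = 0.02


def classify(pham, pspam):
    """Return 'sure ham', 'unsure ham', 'sure spam' or 'unsure spam'."""
    if pham >= pspam:
        # Leaning ham: to be sure, pham must beat pspam by the required
        # ratio AND be large enough in absolute terms.
        sure = pham >= surefactor * pspam and pham >= pminhamsure
        return "sure ham" if sure else "unsure ham"
    # Leaning spam: same two tests with the spam-side floor.
    sure = pspam >= surefactor * pham and pspam >= pminspamsure
    return "sure spam" if sure else "unsure spam"
```

Under this reading, a message with pham = 0.5 and pspam = 0.1 stays "unsure ham" because of surefactor (a bit of both), while one with pham = 0.003 and a negligible pspam stays "unsure ham" because of pminhamsure (neither, but a bit more ham).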


# usetail: if False, use complete distributions to renormalize the
# Z-scores; if True, use only the worst tail value. I get worse results
# if I set this to True, so the default is False.
usetail = False

# medianoffset: If True, set the median of the zham and zspam to 0
# before calculating rmsZ. If False, do not shift the data and hence
# assume that 0 is the center of the population. True seems to help for
# my data.
medianoffset = True
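
As a rough illustration of what medianoffset changes, the sketch below computes a root-mean-square over a list of Z-scores, either around the median or around 0. It is an assumption about the shape of the calculation, not a copy of rmspik.py: the function name rms_z() and its signature are hypothetical, and the real script may apply the shift to the zham and zspam populations at a different stage.

```python
import math
import statistics


def rms_z(zscores, medianoffset=True):
    """Root-mean-square of Z-scores (hypothetical helper, not from rmspik.py).

    With medianoffset=True the scores are shifted so their median is 0
    before squaring; with False, 0 is assumed to be the population center.
    """
    if not zscores:
        raise ValueError("need at least one Z-score")
    center = statistics.median(zscores) if medianoffset else 0.0
    mean_square = sum((z - center) ** 2 for z in zscores) / len(zscores)
    return math.sqrt(mean_square)
```

For a population whose Z-scores are not centered on 0, the shifted value is smaller than the unshifted one, which is the difference the option is meant to expose.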

I'd like to invite everyone to play with this. It takes only a few 
seconds to run once the .pik is set up using "clgen"!

I'll post some of my results under separate cover.

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/