[Spambayes] Re: For the bold
Rob Hooft
rob@hooft.net
Sun, 06 Oct 2002 20:25:20 +0200
I made a number of changes to rmspik.py:
- The "chance" function was replaced by something a bit more
scientific (this helps!).
- There are new parameters in the source code (I'm hoping someone else
can make these configurable through the .ini file).
# surefactor: the ratio of the two p's above which we're sure a message
# belongs to one of the two populations. Raising this number increases
# the "unsures" on both sides, decreasing the "sure fp" and "sure fn"
# rates. A value of 1000 works well for me; at 10000 you get slightly
# fewer sure fp/fn at the cost of a lot more middle ground; at 10 you
# have much less work on the middle ground but ~50% more "sure false"
# scores. This variable operates on messages that are "a bit of both
# ham and spam".
surefactor = 100
# pminhamsure: The minimal pham at which we say it's surely ham.
# Lowering this value gives fewer "unsure ham" and more "sure ham"; it
# might however result in more "sure fn". 0.01 works well, but to accept
# a bit more fn, I set it to 0.005. This variable operates on messages
# that are "neither ham nor spam; but a bit more ham than spam".
pminhamsure = 0.005
# pminspamsure: The minimal pspam at which we say it's surely spam.
# Lowering this value gives fewer "unsure spam" and more "sure spam"; it
# might however result in more "sure fp". Since most people find fp
# worse than fn, this value should most probably be higher than
# pminhamsure. 0.01 works well, but to reduce fp a bit further, I set
# it to 0.02. This variable operates on messages that are "neither ham
# nor spam; but a bit more spam than ham".
pminspamsure = 0.02
# usetail: if False, use complete distributions to renormalize the
# Z-scores; if True, use only the worst tail value. I get worse results
# if I set this to True, so the default is False.
usetail = False
# medianoffset: If True, set the median of the zham and zspam to 0
# before calculating rmsZ. If False, do not shift the data and hence
# assume that 0 is the center of the population. True seems to help for
# my data.
medianoffset = True
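To make the interplay of the three thresholds concrete, here is a minimal sketch of one way they could combine into a four-way verdict. It is an illustration only and is not taken from rmspik.py: the function name classify, its signature, and the exact order of the tests are assumptions; only the parameter names and default values come from the settings above.

```python
# Hypothetical decision rule, NOT the code in rmspik.py.
# pham / pspam are assumed to be the two per-message probabilities.

def classify(pham, pspam, surefactor=100,
             pminhamsure=0.005, pminspamsure=0.02):
    """Return 'sure ham', 'unsure ham', 'unsure spam' or 'sure spam'."""
    if pham >= pspam:
        # Leaning ham: "sure" only if pham is not too small (neither
        # ham nor spam) and clearly dominates pspam (not a bit of both).
        sure = pham >= pminhamsure and pham > surefactor * pspam
        return "sure ham" if sure else "unsure ham"
    # Leaning spam: same two tests with the spam-side minimum.
    sure = pspam >= pminspamsure and pspam > surefactor * pham
    return "sure spam" if sure else "unsure spam"


if __name__ == "__main__":
    for pham, pspam in [(0.5, 0.0001), (0.5, 0.1), (0.001, 0.01)]:
        print(pham, pspam, classify(pham, pspam))
```

Under this reading, raising surefactor moves messages from "sure" to "unsure" on both sides, while the two pmin values only affect the side they are named after.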
I'd like to invite everyone to play with this. It takes only a few
seconds to run once the .pik is set up using "clgen"!
I'll post some of my results under separate cover.
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/