[spambayes-dev] subjective assessment of bigrams

Skip Montanaro skip at pobox.com
Wed Jan 7 09:48:42 EST 2004


    Toby> I am using my overnight-train-on-everything regime, with 14000
    Toby> hams and 2000 spams.

    >> Wow!  Any chance you could whack off the oldest 12,000 or so hams to
    >> bring your ham:spam ratio back into balance?

    Toby> I'm not sure if you intended a ;-) in there. I did try that a while
    Toby> ago (before bigrams) with no subjective difference.

Maybe.  Maybe not. ;-) I didn't think of it as smiley territory, but I
suppose it could be interpreted that way.  Most folks running with a very
unbalanced training database experience problems.

Can you easily run the cross-validation tests?  If so, you might try
breaking your ham and spam up into a structure suitable for running timcv.
That means one message per file, laid out in Data/Ham/SetN and Data/Spam/SetN
directories.  You can build this structure easily using the splitndirs.py
script in the utilities directory.
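
To make the layout concrete, here is a rough sketch of what such a split
amounts to.  This is not splitndirs.py itself, just an illustration; it
assumes your messages are already stored one per file in a single source
directory, and the function name split_into_sets and its parameters are made
up for this example.

```python
import random
import shutil
from pathlib import Path


def split_into_sets(src_dir, dest_root, kind, nsets=10, seed=0):
    """Copy every file in src_dir into dest_root/<kind>/Set1 .. SetN.

    kind is "Ham" or "Spam".  Messages are shuffled with a fixed seed and
    then dealt out round-robin, so set sizes differ by at most one.
    Returns the number of messages placed in each set.
    """
    files = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    random.Random(seed).shuffle(files)

    set_dirs = [Path(dest_root) / kind / f"Set{i + 1}" for i in range(nsets)]
    for d in set_dirs:
        d.mkdir(parents=True, exist_ok=True)

    counts = [0] * nsets
    for idx, path in enumerate(files):
        k = idx % nsets
        shutil.copy2(path, set_dirs[k] / path.name)
        counts[k] += 1
    return counts


# Hypothetical usage (paths are placeholders):
#   split_into_sets("all_ham", "Data", "Ham")
#   split_into_sets("all_spam", "Data", "Spam")
```

With 14000 hams and 2000 spams, ten sets work out to about 1400 hams and 200
spams per set.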

Then run timcv.py, using the --HamTrain option to vary the number of hams
trained per set from 200 up to 1400 (with ten sets, each spam dir will have
about 200 messages and each ham dir about 1400).  Compare the results as you
vary the arg to --HamTrain and see how (if) things change as the number of
hams used per set changes.  (Also, use the -s flag to set a constant random
number seed, so that if you run twice with the same --HamTrain arg you select
the same subset of messages and get the same results.)
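
Something along these lines would generate the series of runs.  It only
builds and prints the command lines rather than executing them.  The
--HamTrain and -s options are the ones mentioned above; the -n option for the
number of sets is from memory, so treat it as an assumption and check
timcv.py's usage message before relying on it.  The helper name
timcv_commands, the step size of 400 and the seed value are arbitrary choices
for illustration.

```python
def timcv_commands(ham_train_sizes, nsets=10, seed=1):
    """Build one timcv.py command line per --HamTrain value.

    The same seed is used for every run so that repeating a run with the
    same --HamTrain value selects the same subset of messages.
    """
    cmds = []
    for n in ham_train_sizes:
        cmds.append([
            "python", "timcv.py",
            "-n", str(nsets),        # assumed flag: number of SetN dirs
            "-s", str(seed),         # constant random number seed
            "--HamTrain", str(n),    # hams used per set
        ])
    return cmds


# 200, 600, 1000, 1400 hams per set
cmds = timcv_commands(range(200, 1401, 400))
for cmd in cmds:
    print(" ".join(cmd))
```

Save the output of each run to its own file so the results can be compared
afterwards.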

Skip



