[Spambayes] More experiments with weaktest.py
Rob Hooft
rob@hooft.net
Sun Nov 10 12:11:46 2002
Tim Peters wrote:
> [Rob Hooft]
>
>>These were results of weaktest with default parameters:
>
>
> Very interesting! I'll have to try that too. Note that in my live email
> experiment here, I'm (except for the very start) also scoring/training msgs
> in (with small lapses) the order they arrive. It's been reported before
> that this helps; although I still haven't run a controlled experiment on
> that, my *impression* is that it does help.
I toyed with the idea, but that would involve parsing all messages once
before starting, and sorting them by date. Putting them in a set to
"randomize" the order is much easier, so I was lazy.
> Setting ham_cutoff as low as 10 is for the
> truly paranoid <0.9 wink>.
Very much so. For my "production" systems, I have ham_cutoff at 40...
> I hope you're at least gaining some respect for how much work went into
> picking the defaults <wink>.
I was just arriving when it happened. But that was on a completely
different classifier, so I'm still convinced these need to be thoroughly
tested.
>>I am back with the defaults, but I'd still like to do an automated
>>optimization of everything simultaneously. Might try that.
> Now *that* could be a useful system regardless of scheme. I've tended to do
> hill-climbing across one dimension at a time, occasionally moving batches of
> params random amounts at once (to see whether that kicks it out of a
> stubborn local minimum).
Hm. That sounds so enthusiastic that I just might commit what I
worked on last night. Some more info:
* No, I have not used a "Simulated Annealing" or "Threshold Accepting"
yet. Please keep in mind that each step in the optimization takes
between 3 minutes (1 set on my home PC) and 15 minutes (10 sets on my
work PC). This would be way too costly. Just minimization it will be.
* I tried to use "Simplex optimization" (let a multidimensional
triangle walk through phase space) on the "Total cost" parameter.
This was simply disastrous. Phase space consists of plateau regions
that are exactly flat, joined by huge ridges. Think about that one
spam that goes from a 0.11 to a 0.09 score: it will add $9.80 in one
bang to the cost. This field is impossible to optimize.
* I designed a new "Flex cost" field. That one does away with the
"unsure cost". The cost of a message is 0.0 at its own cutoff, and
increases linearly towards its "false" cost at the other cutoff,
and increases further to the other end. Hm. Unreadable. A table:
    Score    Spam with this    Ham with this
             score costs       score costs
    0.00     $ 1.29            $ 0.00
    0.20     $ 1.00            $ 0.00
    0.55     $ 0.50            $ 5.00
    0.90     $ 0.00            $10.00
    1.00     $ 0.00            $11.43
This field is much smoother than the total cost field, so I was
hoping that pure minimization would do. Obviously, the flex cost is
much, much higher than the total cost because unsures are so much
more expensive. The flex cost field will also be less sensitive to
the {sp|h}am_cutoff parameters than the total cost field, because
there are no sudden cost jumps.
* Results are not great; I need to experiment more before reporting
on them.
* I just committed:
weaktest.py: introduction of the flexcost measure
optimize.py: simplex optimization (needs Numeric python; sorry)
weakloop.py: run weaktest.py repeatedly under simplex optimization
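
To compare the two cost fields side by side, here is a minimal sketch that
reproduces the table above. It is not the code in weaktest.py. It assumes
ham_cutoff = 0.20 and spam_cutoff = 0.90 (the scores at which the table
reaches $0.00), a false-positive cost of $10.00, a false-negative cost of
$1.00, and an unsure cost of $0.20. All function and constant names are
made up for illustration.

```python
# Assumed settings, chosen to match the table; not read from any real config.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90
FP_COST = 10.00      # ham classified as spam
FN_COST = 1.00       # spam classified as ham
UNSURE_COST = 0.20   # assumption: cost of any message in the unsure band


def total_cost(score, is_spam):
    """Step-shaped cost: flat plateaus with jumps at the two cutoffs."""
    if score < HAM_CUTOFF:
        return FN_COST if is_spam else 0.0
    if score >= SPAM_CUTOFF:
        return 0.0 if is_spam else FP_COST
    return UNSURE_COST


def flex_cost(score, is_spam):
    """Linear cost: 0 at the message's own cutoff, the full "false" cost at
    the other cutoff, and growing at the same rate beyond it."""
    width = SPAM_CUTOFF - HAM_CUTOFF
    if is_spam:
        return max(0.0, (SPAM_CUTOFF - score) / width * FN_COST)
    return max(0.0, (score - HAM_CUTOFF) / width * FP_COST)


if __name__ == "__main__":
    for score in (0.00, 0.20, 0.55, 0.90, 1.00):
        print("%.2f  spam $%5.2f  ham $%5.2f"
              % (score, flex_cost(score, True), flex_cost(score, False)))
```

With these assumed numbers, a ham moving from 0.89 to 0.91 adds $9.80 to
the total cost in one step, but only about $0.29 to the flex cost.
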
Regards,
Rob Hooft
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/