[Spambayes] More experiments with weaktest.py

Sun Nov 10 12:11:46 2002

Tim Peters wrote:
> [Rob Hooft]
> 
>>These were results of weaktest with default parameters:
> 
> 
> Very interesting!  I'll have to try that too.  Note that in my live email
> experiment here, I'm (except for the very start) also scoring/training msgs
> in (with small lapses) the order they arrive.  It's been reported before
> that this helps; although I still haven't run a controlled experiment on
> that, my *impression* is that it does help.

I toyed with the idea, but that would involve parsing all messages once 
before starting, and sorting them on date. Putting them in a set to 
"randomize" the order is much easier, so I was lazy.

> Setting ham_cutoff as low as 10 is for the
> truly paranoid <0.9 wink>.

Very much so. For my "production" systems, I have ham_cutoff at 40...

> I hope you're at least gaining some respect for how much work went into
> picking the defaults <wink>.

I was just arriving when it happened. But that was on a completely 
different classifier, so I'm still convinced these need to be thoroughly 
tested.

>>I am back with the defaults, but I'd still like to do an automated
>>optimization of everything simultaneously. Might try that.

> Now *that* could be a useful system regardless of scheme.  I've tended to do
> hill-climbing across one dimension at a time, occasionally moving batches of
> params random amounts at once (to see whether that kicks it out of a
> stubborn local minimum).

Hm. That sounds so enthusiastic that I just might commit what I have 
been working on through the night. Some more info:

  * No, I have not used "Simulated Annealing" or "Threshold Accepting"
    yet. Please keep in mind that each step in the optimization takes
    between 3 minutes (1 set on my home PC) and 15 minutes (10 sets on my
    work PC). Methods that need that many steps would be way too costly,
    so plain minimization it will be.
  * I tried to use "Simplex optimization" (let a multidimensional
    triangle walk through phase space) on the "Total cost" parameter.
    This was simply disastrous. Phase space consists of plateau regions
    that are exactly flat, joined by huge ridges. Think about that one
    spam that goes from a 0.11 to a 0.09 score: it will add $9.80 in one
    bang to the cost. This field is impossible to optimize.
  * I designed a new "Flex cost" field. That one does away with the
    "unsure cost". The cost of a message is 0.0 at its own cutoff, and
    increases linearly towards its "false" cost at the other cutoff,
    and increases further to the other end. Hm. Unreadable. A table:

           Score    Spam with this   Ham with this
                      score costs     score costs
            0.00         $ 1.29          $ 0.00
            0.20         $ 1.00          $ 0.00
            0.55         $ 0.50          $ 5.00
            0.90         $ 0.00          $10.00
            1.00         $ 0.00          $11.43

    This field is much smoother than the total cost field, so I was
    hoping that pure minimization would do. Obviously, the flex cost is
    much, much higher than the total cost because unsures are so much
    more expensive. The flex cost field will also be less sensitive to
    the {sp|h}am_cutoff parameters than the total cost field, because
    there are no sudden cost jumps.
  * Results are not great; I need to experiment more before reporting
    on them.
  * I just committed:
      weaktest.py: introduction of the flexcost measure
      optimize.py: simplex optimization (needs Numeric python; sorry)
      weakloop.py: run weaktest.py repeatedly under simplex optimization
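
The flex cost table above can be reproduced with a few lines of Python.
This is only a sketch and not the code committed in weaktest.py: the
function name flex_cost and its parameter names are made up for
illustration. The cutoffs (0.20 and 0.90) and the "false" costs ($1.00 for
a missed spam, $10.00 for a misfiled ham) are read off the table. The sketch
assumes the cost stays at zero on the correct side of a message's own
cutoff.

```python
def flex_cost(score, is_spam, ham_cutoff=0.20, spam_cutoff=0.90,
              fp_cost=10.00, fn_cost=1.00):
    """Illustrative flex cost of a single message, in dollars.

    All names and default values here are assumptions taken from the
    table in this message, not from weaktest.py itself.
    """
    width = spam_cutoff - ham_cutoff
    if is_spam:
        # Spam is free at spam_cutoff and costs fn_cost at ham_cutoff.
        distance = spam_cutoff - score
        full_cost = fn_cost
    else:
        # Ham is free at ham_cutoff and costs fp_cost at spam_cutoff.
        distance = score - ham_cutoff
        full_cost = fp_cost
    # Linear in the distance from the message's own cutoff, continuing
    # past the other cutoff; never negative on the correct side.
    return max(0.0, distance / width * full_cost)


if __name__ == "__main__":
    print("Score   Spam cost   Ham cost")
    for score in (0.00, 0.20, 0.55, 0.90, 1.00):
        print("%5.2f    $%5.2f     $%5.2f"
              % (score, flex_cost(score, True), flex_cost(score, False)))
```

Because the cost is piecewise linear in the score, a message drifting
across a cutoff changes the total by a small amount instead of a jump,
which is what makes the field smoother for a minimizer.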

Regards,

Rob Hooft
-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/