[Spambayes] More experiments with weaktest.py

Tim Peters tim.one@comcast.net
Mon Nov 11 00:59:05 2002


[Tim notes that his mistake-only training works in the order msgs
 come in]

[Rob Hooft]
> I toyed with the idea, but that would involve parsing all messages once
> before starting, and sorting them on date. Putting them in a set to
> "randomize" the order is much easier, so I was lazy.

That's fine.  For purposes of comparing this against previous tests, I
expect it's even good, since they were randomized too.

> ...
> Hm. That sounds so enthusiastic that I just might commit what I have
> gone through this night.

You did, and I thank you!  Note that there were already three Simplex pkgs
linked from

    http://www.python.org/topics/scicomp/numbercrunching.html

but I know how much fun it is to write such stuff again <wink>.

> Some more info:
>
>   * No, I have not used a "Simulated Annealing" or "Threshold Accepting"
>     yet. Please keep in mind that each step in the optimization takes
>     between 3 minutes (1 set on my home PC) and 15 minutes (10 sets on my
>     work PC). This would be way too costly. Just minimization it will be.

Understood.

>   * I tried to use "Simplex optimization" (let a multidimensional
>     triangle walk through phase space) on the "Total cost" parameter.
>     This was simply disastrous. Phase space consists of plateau regions
>     that are exactly flat, joined by huge ridges. Think about that one
>     spam that goes from a 0.11 to a 0.09 score: it will add $9.80 in one
>     bang to the cost. This field is impossible to optimize.

Yes, it's a sum of step functions in the end, and at every point "the
derivative" is either 0 or infinite, depending on where you are and which
direction you look.  Making a new "smooth" cost measure was thoroughly
appropriate:

>   * I designed a new "Flex cost" field. That one does away with the
>     "unsure cost". The cost of a message is 0.0 at its own cutoff, and
>     increases linearly towards its "false" cost at the other cutoff,
>     and increases further to the other end. Hm. Unreadable.

The code is clear enough, though.  What I didn't understand is why each term
in the flexcost is divided by the difference between the (fixed per run)
cutoff levels:   / (SPC - HC).  That seems to systematically penalize, e.g.,
ham_cutoff=.4 and spam_cutoff=0.8 compared to ham_cutoff=0.1 and
spam_cutoff=0.9 (the former divides every term by 0.4, the latter by 0.8).
In the limit, if someone wanted a binary classifier (ham_cutoff ==
spam_cutoff), any mistake would be charged an infinite penalty.

> A table:
>
>            Score    Spam with this   Ham with this
>                       score costs     score costs
>             0.00         $ 1.29          $ 0.00

It's hard to see where that comes from.  Assuming ham_cutoff is 0.2 and
spam_cutoff is 0.9, a spam scoring 0.0 presumably works out to $1 *
(0.9-0.0)/(0.9-0.2)?

>             0.20         $ 1.00          $ 0.00
>             0.55         $ 0.50          $ 5.00
>             0.90         $ 0.00          $10.00
>             1.00         $ 0.00          $11.43
>
>      This field is much more smooth than the total cost field, so I was
>      hoping that pure minimization will do. Obviously, the flex cost is
>      much, much higher than the total cost because unsures are so much
>      more expensive. The flex cost field will also be less sensitive to
>      the {sp|h}am_cutoff parameters than the total cost field, because
>      there are no sudden cost jumps.
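Working backward from the table, the flex cost appears to charge a spam a growing fraction of the $1 FN cost as its score falls below spam_cutoff, and a ham a growing fraction of the $10 FP cost as its score rises above ham_cutoff.  A minimal sketch (flex_cost is a hypothetical name for illustration, not necessarily what weaktest.py calls it):

```python
def flex_cost(score, is_spam, ham_cutoff=0.2, spam_cutoff=0.9,
              fp_cost=10.0, fn_cost=1.0):
    """Per-message flex cost, reconstructed from the table above.

    Both terms are scaled by 1/(spam_cutoff - ham_cutoff) -- the
    division questioned above, which blows up if the cutoffs coincide.
    """
    span = spam_cutoff - ham_cutoff
    if is_spam:
        # grows linearly below spam_cutoff, past $1 at score 0.0
        return fn_cost * max(spam_cutoff - score, 0.0) / span
    # grows linearly above ham_cutoff, past $10 at score 1.0
    return fp_cost * max(score - ham_cutoff, 0.0) / span

# Reproduces the table: a spam at 0.00 costs 1 * (0.9-0.0)/(0.9-0.2),
# about $1.29; a ham at 1.00 costs 10 * (1.0-0.2)/(0.9-0.2), about $11.43.
```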

Well, if ham_cutoff==spam_cutoff, then (as above) any mistake will raise a
ZeroDivisionError, so it's sure sensitive there <wink>.  I suspect it
might work better if the "/(SPC-HC)" business were simply removed?

>    * Results are not great. I need to experiment more before reporting
>      on them.
>    * I just committed:
>       weaktest.py: introduction of the flexcost measure
>       optimize.py: simplex optimization (needs Numeric python; sorry)
>       weakloop.py: run weaktest.py repeatedly under simplex optimization
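For readers who haven't met it, the "multidimensional triangle walk" is the Nelder-Mead downhill-simplex method.  Here's a self-contained toy sketch of the idea (not the committed optimize.py, which uses Numeric; the quadratic objective below merely stands in for one weaktest.py run, which really takes minutes per evaluation):

```python
def nelder_mead(f, x0, step=0.1, iters=200,
                alpha=1.0, gamma=2.0, rho=0.5, sigma=0.5):
    """Tiny downhill-simplex minimizer, for illustration only.

    Walks a simplex of n+1 points through parameter space by
    reflecting, expanding, contracting, and shrinking it.
    """
    n = len(x0)
    # Initial simplex: x0 plus one point perturbed per dimension.
    simplex = [list(x0)]
    for i in range(n):
        pt = list(x0)
        pt[i] += step
        simplex.append(pt)
    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # Centroid of all points except the worst.
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        # Reflect the worst point through the centroid.
        refl = [centroid[i] + alpha * (centroid[i] - worst[i])
                for i in range(n)]
        if f(best) <= f(refl) < f(simplex[-2]):
            simplex[-1] = refl
        elif f(refl) < f(best):
            # Reflection was great: try expanding further.
            exp = [centroid[i] + gamma * (refl[i] - centroid[i])
                   for i in range(n)]
            simplex[-1] = exp if f(exp) < f(refl) else refl
        else:
            # Reflection was poor: contract toward the centroid.
            contr = [centroid[i] + rho * (worst[i] - centroid[i])
                     for i in range(n)]
            if f(contr) < f(worst):
                simplex[-1] = contr
            else:
                # Last resort: shrink everything toward the best point.
                simplex = [best] + [
                    [best[i] + sigma * (p[i] - best[i]) for i in range(n)]
                    for p in simplex[1:]]
    simplex.sort(key=f)
    return simplex[0]

# Toy smooth objective with its minimum at (0.1, 0.9), standing in for
# a (ham_cutoff, spam_cutoff) search over a flex-cost-like surface.
found = nelder_mead(lambda p: (p[0] - 0.1) ** 2 + (p[1] - 0.9) ** 2,
                    [0.5, 0.5])
```

The plateau-and-ridge problem with "Total cost" shows up directly here: reflection and expansion compare f-values, so on an exactly flat plateau every move looks equally (un)attractive and the simplex just shrinks in place.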

I've been running weakloop.py over two sets of my c.l.py data while typing
this.  That's 2*2000 = 4000 ham, and 2*1400 = 2800 spam, for 6800 total
msgs.  It's been thru the whole business about 25 times now.  At the start,

Trained on 88 ham and 66 spam
fp: 0 fn: 0
Total cost: $30.80
Flex cost: $212.3120
x=0.5000 p=0.1000 s=0.4500 sc=0.900 hc=0.200 212.31

It's having a hard time doing better than that.  The best so far seems to be

Trained on 82 ham and 66 spam
fp: 0 fn: 0
Total cost: $29.60
Flex cost: $200.0924
x=0.5011 p=0.1026 s=0.4515 sc=0.901 hc=0.205 200.09

which is so close to the starting point that it's hard to believe it's
finding something "real".  It *does* seem to be in a nasty local minimum,
though, as the next attempt was:

Trained on 118 ham and 69 spam
fp: 1 fn: 0
Total cost: $47.20
Flex cost: $344.7334
x=0.4989 p=0.1038 s=0.4531 sc=0.900 hc=0.209 344.73

I'm afraid it looks like it's eventually going to converge on the most
delicate possible settings that barely manage to avoid that 1 FP.



