[Spambayes] cmp.py with mean and dev comparison

Tim Peters tim.one@comcast.net
Sun, 22 Sep 2002 14:30:11 -0400


[Brad Clements]
> I hacked rates.py to extract the ham/spam mean and sdev from each run

Cool!

> then changed cmp.py to output the means and changes, sdev and
> changes for each run.
>
> Note that the mean/sdev and diffs are output under "false
> positive/false negative", but the numbers aren't for the "false.." part,
> just lined up with the run.
>
> It's ugly,

It's not ugly, but it's too wide.  Print the new stuff on new lines of their
own instead.

> but in my case it shows a decrease in ham mean (I like), but also a
> decrease in spam mean (don't like).

If the former decrease is larger than the latter decrease, the overall
separation increases, and that's the important thing (provided you believe
increasing separation is a good thing <wink>).
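
To make that arithmetic concrete, here's a tiny sketch. The means are made-up
numbers, not taken from any actual run, and `separation` is a hypothetical
helper that isn't part of rates.py or cmp.py. It assumes spam scores higher
than ham, so separation is just the spam mean minus the ham mean.

```python
def separation(ham_mean, spam_mean):
    """Distance between the spam and ham score means (spam assumed higher)."""
    return spam_mean - ham_mean

# Hypothetical means: ham drops by 4.0, spam drops by only 1.0.
before = separation(ham_mean=30.0, spam_mean=80.0)
after = separation(ham_mean=26.0, spam_mean=79.0)

# The ham decrease outweighs the spam decrease, so the gap widens.
print(before, after, after > before)
```

Both means went down, yet the gap between them grew, because the ham mean fell
further than the spam mean did.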

> But for both ham and spam, enabling recv lines has also reduced
> the sdev, which I think is a good step.

Me too.

> Or .. is this over-analyzing?

You must be joking <wink>.

[horribly line-wrapped too-wide output elided]

> ===================================================================
> RCS file: /cvsroot/spambayes/spambayes/rates.py,v
> retrieving revision 1.6
> diff -r1.6 rates.py
> 41c41,42
> <     interesting = filter(lambda line: line.startswith('-> '), ifile)
> ---
> >     interesting = filter(lambda line: line.startswith('-> ') or (
> >         line.find('sample sdev') != -1), ifile)

Don't do that:  change the lines you want to suck out to begin with '-> '
instead, like all the other lines this script processes.  Else it becomes an
unmaintainable mass of ad hoc and error-prone "pattern matching" rules.  It
*used* to do that, and I spent a fair amount of time changing all the
"interesting lines" to start with "-> " instead.  More, I changed the
statistics lines to start with "-> <stat> ".

IOW, just make the new code look as much as possible like the code that's
already there.  I think this is a good change if you do that.
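
A minimal sketch of the idea, assuming the test driver is changed so that the
mean/sdev lines carry the "-> <stat> " prefix themselves. The sample lines
below are invented for illustration (the real wording of the output may well
differ), and the helper names are hypothetical, not what rates.py uses.

```python
# Invented sample of test-driver output; only the prefixes matter here.
SAMPLE = """\
some chatter the summary script should ignore
-> tested 100 hams & 100 spams
-> <stat> Ham scores: sample sdev 5.2
an unmarked line that merely mentions sample sdev
""".splitlines()

def interesting_lines(lines):
    """Keep only lines that start with the '-> ' marker."""
    return [line for line in lines if line.startswith('-> ')]

def stat_lines(lines):
    """Keep only the statistics lines, marked '-> <stat> '."""
    return [line for line in lines if line.startswith('-> <stat> ')]

for line in interesting_lines(SAMPLE):
    print(line)
```

With a single prefix test, the unmarked line that happens to mention "sample
sdev" is ignored, whereas a substring search for it would have picked that line
up by accident.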