[spambayes-dev] spammy subject lines

Tony Meyer tameyer at ihug.co.nz
Mon Oct 13 21:36:33 EDT 2003

> > So, I (and anyone else ;) should do timcv.py,
> That's right, and with -n10 if possible.

Is there a recommended number of messages in each bucket (or min and max
numbers)?  I think I remember seeing 500 mentioned at one point, but I can't
remember where (and am too lazy to search).

> Right.  Also table.py, because, unfortunately, cmp.py 
> predates the idea of unsures, and doesn't tell us anything
> about the effect on the unsure rate.

Would a better move be to update cmp.py so that it does know about unsures?
Or is this really just not worth the effort, in your opinion?
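
For what it's worth, the bookkeeping itself looks small.  A rough sketch of
the sort of thing I mean, and not what cmp.py or table.py actually do: the
function name is made up for illustration, the 0.20 and 0.90 cutoffs are
assumed, and so is the choice of which side of each cutoff counts as unsure.

```python
def classification_rates(ham_scores, spam_scores,
                         ham_cutoff=0.20, spam_cutoff=0.90):
    """Return fp, fn and unsure rates as percentages.

    Hypothetical helper: the cutoffs and the boundary handling below
    are assumptions, not taken from the spambayes test scripts.
    """
    def bucket(score):
        # Three-way split: below ham_cutoff is ham, at or above
        # spam_cutoff is spam, everything in between is unsure.
        if score < ham_cutoff:
            return "ham"
        if score >= spam_cutoff:
            return "spam"
        return "unsure"

    ham_buckets = [bucket(s) for s in ham_scores]
    spam_buckets = [bucket(s) for s in spam_scores]

    # fp and fn are relative to the ham and spam counts respectively;
    # the unsure rate is relative to all messages scored.
    fp = 100.0 * ham_buckets.count("spam") / len(ham_scores)
    fn = 100.0 * spam_buckets.count("ham") / len(spam_scores)
    unsure = (100.0 * (ham_buckets + spam_buckets).count("unsure")
              / (len(ham_scores) + len(spam_scores)))
    return {"fp": fp, "fn": fn, "unsure": unsure}


# Made-up scores, purely to show the shape of the output.
rates = classification_rates([0.01, 0.05, 0.30, 0.95],
                             [0.99, 0.92, 0.50, 0.10])
print(rates)
```

Running that over the "before" and "after" scores would give an unsure rate
to compare alongside the fp and fn rates.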

[explanation of stats cut]

I really should amalgamate your comments (here and previously) into the
TESTING.TXT file to make it easier for newcomers to understand the results.
Maybe I'll try to do a bit of that later on today.  Thanks for the
clarification :)

> > I *think* ;) that this is back to a slight win for the change...
> I agree, although seeing the details gives cause to worry 
> some about the effect on ham:  the ham sdev increased overall, and the 
> effects on ham mean and ham sdev varied wildly across runs.  OTOH, the
> "before" numbers for ham mean and ham sdev varied wildly across runs
> already.  That gives cause to worry some about the data <wink>.

This is the opposite of what you saw happening, too, wasn't it?  (That
business with spambayes-dev giving an extra token.)  There's very little
list mail in the data I used, so that could be one reason, and it's also a
reason to worry about the data to a certain extent.  I do *get* a lot of
list mail, but I don't keep it around, and I don't use spambayes to filter
it either (the lists have so little spam coming through that it's not worth
it).

Of course, my data doesn't really tell us anything until we can compare it
to someone else's...hopefully the OP, at least, will give this a go.

=Tony Meyer

