[Spambayes] 'sender' and 'reply-to' tokenising.

Tim Peters tim.one@comcast.net
Sat Nov 2 04:57:08 2002


[Tim, praising Anthony's enthusiastic attempts at analysing test data]
> I'm tempted to drop them!  mean/sdev were useful under schemes with real
> systematic overlap between the population scores, but chi-combining is
> so extreme that overlaps simply aren't due to random effects.

[Anthony Baxter]
> So we're back with the problem we had with the Graham method, that
> it's really really hard to analyse tokenizer changes because of the
> lack of meaningful test data?

The problem I had with Graham-combining is that the more and better the
training data you had, the more embarrassing its errors became:  the middle
ground kept getting smaller, and eventually everything scored as 0.0 or 1.0,
whether right or wrong.  chi-combining reliably scores highly ambiguous
msgs near 0.5, and its middle ground (a) is very accurate about when it's
confused, and (b) doesn't degenerate as training data increases.
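For reference, the heart of chi-combining fits in a few lines of Python.
This is a simplified sketch (the real classifier.py also guards against
underflow when the probability products get tiny, and clamps per-token
probabilities away from 0 and 1; "chi_combine" is an invented name):

    from math import exp, log

    def chi2Q(x2, v):
        # Survival function of the chi-squared distribution with v
        # degrees of freedom (v must be even):  prob(chisq >= x2).
        assert v & 1 == 0
        m = x2 / 2.0
        total = term = exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def chi_combine(probs):
        # probs: nonempty list of per-token spam probabilities, each
        # strictly inside (0, 1).
        n = len(probs)
        # If the p's were uniform random, -2*sum(ln p) would follow a
        # chi-squared distribution with 2n degrees of freedom; a big
        # deviation is evidence the msg isn't "random".
        S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
        H = 1.0 - chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)
        # S and H both near 1 (strong but conflicting evidence), or both
        # near 0 (no evidence), yield a score near 0.5 -- which is why
        # ambiguous msgs land in the middle instead of at an endpoint.
        return (S - H + 1.0) / 2.0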

> Is it worth trying the tests with gary-combining to see if the tokenizer
> changes actually make things better or worse?
>
> I don't think we're going to see any "easy big wins" from the
> tokenizer - but trying to figure out whether incremental changes
> are positive or negative seems like it's going to be hard if
> we can only use fp/fn numbers.

The FP/FN/unsure rates are the only numbers that matter in the end, and
under chi-combining it's *much* easier to stare at mistakes and find
commonalities.  Given a reasonable amount of training data, errors almost
never score at 0.0 or 1.0 under chi, which makes it plausible that tokenizer
changes can redeem them.  This requires more work but is more rewarding.
For example, it was easy to identify exactly what about tokenizing Reply-To
saved 3 FP in my python.org test, and that suggested a focused area for
further work.  Precisely because there are very likely no big wins
remaining, progress now has to come from thinking about mistakes, finding
cheap ways to avoid them, and then running tests to ensure that new gimmicks
don't hurt anything else.  As with the only good effect I found from
Reply-To in my python.org test, I expect most such gimmicks will boil down
to letting the classifier see more of the msg -- but not so much that highly
correlated words lead to extreme mistakes.
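To make "gimmick" concrete, something of this flavor is only a few lines
in the tokenizer.  This is a hypothetical sketch -- the token spellings
and function name are invented for illustration, not what tokenizer.py
actually emits:

    import email

    def tokenize_reply_to(msg):
        # Turn the Reply-To header into a couple of synthetic tokens,
        # so the classifier sees it without unleashing a flood of
        # highly correlated words.
        addr = msg.get('reply-to', '').lower()
        if not addr:
            yield 'reply-to:none'
            return
        yield 'reply-to:addr:' + addr
        if '@' in addr:
            # A coarser token too:  a never-before-seen address can
            # still contribute evidence via its domain.
            yield 'reply-to:domain:' + addr.rsplit('@', 1)[1]

    msg = email.message_from_string('Reply-To: someone@example.com\n\nbody\n')
    print(list(tokenize_reply_to(msg)))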

There's still a lot of header info we ignore by default, and we still ignore
almost everything in almost all HTML tags, and almost everything in almost
all non-text/* sections, so there's still plenty of room for small
improvements.  Looking for something that increases the mean spread by 0.1%
when the means are already 16 sdev apart is a waste of time now, though.
Looking for something that cuts an FP without hurting FN or unsure is
golden.
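As one illustration of "seeing more of the msg", before throwing HTML tags
away we could at least fish the URLs out of them.  Another hypothetical
sketch (the regexes and token spelling are invented, and it ignores
comments, script bodies, and entity decoding):

    import re

    href_re = re.compile(r'''href\s*=\s*["']?([^"'>\s]+)''', re.IGNORECASE)
    tag_re = re.compile(r'<[^>]*>')

    def tokenize_html(body):
        # Harvest href targets before stripping tags, so the classifier
        # sees where the msg wants you to click.
        for url in href_re.findall(body):
            yield 'url:' + url.lower()
        # Then drop the tags and tokenize the visible text as usual.
        for word in tag_re.sub(' ', body).split():
            yield word.lower()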

progress-is-harder-now-but-that's-a-sign-of-success-ly y'rs  - tim