[Spambayes] Some more experiences with the Outlook plugin

Tue Nov 12 05:55:19 2002

[Tim]
>> In the list you gave below, there are very few hapaxes (I recognize
>> them from the probabilities; I should probably add code to the client
>> to display the raw counts too):

[MarkH]
> certainly would be useful.

That's been checked in now.

> Without the maths background, I find it interesting to ignorantly
> speculate on these ratios.  Tim's analysis:

>> '(and' is nearly "33 times closer" to 0 than '"remove"' is to 1,
>> and that makes the accidental appearance of a ham word in spam much
>> more powerful than the systematic appearance of a spam word in spam.

> makes me wonder why the classifier can't exploit the ham:spam
> ratio to give weighted results.

I think it's already doing the best it can here.  It's like I've met a
thousand Americans and 2 Australians, so from all I've *seen* I have to
conclude you're all beer-swilling, Ducati-riding, chain-smoking pigs.  But
that's really not enough evidence for me to *marry* an Australian, just
enough to think highly of 'em <wink>.

> Or from another POV, what would happen if we artificially
> boosted the ratio by training on each spam multiple times?

Nobody knows.  The "by-counting" spamprob estimate wouldn't change at all:
that's already computed by ratios instead of by absolute counts.  If a word
appears in 3 of 4 spam, it gets exactly the same by-counting estimate as a
word that appears in 15,000,000 of 20,000,000 spam.  The difference would be
solely in how much the Bayesian adjustment pushed the by-counting estimate
towards 0.5:  the greater the total number of msgs a word has been seen in,
the more willing the Bayesian adjustment is to leave the by-counting
estimate alone.
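
A minimal sketch of the two steps described above, for anyone who wants to poke at the numbers. The function names are made up for illustration, the ham counts are invented (the example above only gives spam counts), and the 0.45 default for unknown_word_strength is an assumption; this is not lifted from the classifier source.

```python
def by_counting_spamprob(spamcount, hamcount, nspam, nham):
    """Ratio-based estimate: uses fractions of each class, not absolute counts.

    Assumes the word has been seen at least once, so the denominator is nonzero.
    """
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    return spamratio / (spamratio + hamratio)


def bayesian_adjust(prob, n, strength=0.45, unknown_prob=0.5):
    """Pull prob towards unknown_prob; the pull weakens as n grows.

    n is the total number of msgs the word has been seen in.
    strength stands in for unknown_word_strength (0.45 is assumed here).
    """
    return (strength * unknown_prob + n * prob) / (strength + n)


def spamprob(spamcount, hamcount, nspam, nham):
    """Return (by-counting estimate, adjusted estimate) for one word."""
    raw = by_counting_spamprob(spamcount, hamcount, nspam, nham)
    return raw, bayesian_adjust(raw, spamcount + hamcount)


# Word in 3 of 4 spam; the "1 of 4 ham" part is invented for illustration.
small = spamprob(3, 1, 4, 4)
# Same fractions at a vastly larger scale.
large = spamprob(15_000_000, 5_000_000, 20_000_000, 20_000_000)
# Training on each spam 5 times: spam counts scale, ham counts do not.
repeated = spamprob(3 * 5, 1, 4 * 5, 4)

for name, (raw, adjusted) in [("small", small), ("repeated", repeated), ("large", large)]:
    print(f"{name:9s} by-counting={raw:.4f} adjusted={adjusted:.6f}")
```

All three cases share the same by-counting estimate; only the adjusted value moves, and it moves closer to the by-counting estimate as the word's total count grows.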

Much the same effect *could* be gotten via reducing option
unknown_word_strength instead.  That also makes the Bayesian adjustment more
willing to take the by-counting estimate at face value.
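
Here is a rough sketch of why lowering unknown_word_strength acts like having more data. It assumes the adjustment has the form (s*x + n*p) / (s + n), with s the strength, x = 0.5, n the number of msgs the word was seen in, and p the by-counting estimate; the function name and the specific numbers are made up for illustration.

```python
def bayesian_adjust(prob, n, strength, unknown_prob=0.5):
    """Assumed form of the adjustment: a weighted average of prob and unknown_prob."""
    return (strength * unknown_prob + n * prob) / (strength + n)


raw = 0.75   # by-counting estimate for some word (illustrative)
n = 4        # total msgs the word has been seen in (illustrative)
k = 4        # factor by which every count is multiplied

baseline = bayesian_adjust(raw, n, strength=0.45)
more_data = bayesian_adjust(raw, k * n, strength=0.45)
weaker_prior = bayesian_adjust(raw, n, strength=0.45 / k)

print(f"baseline     {baseline:.6f}")
print(f"more data    {more_data:.6f}")
print(f"weaker prior {weaker_prior:.6f}")
```

Under that assumed form, multiplying n by k and dividing the strength by k give the same adjusted value. Training on each spam several times only scales the spam part of n, which is why the two knobs are "much the same" rather than identical.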

Most of the people who helped pick a good default value for
unknown_word_strength didn't have a strong imbalance in ham:spam.  Maybe you
need a lower value, but I expect it's much better for such people *not* to
train on so much ham.  Training on small random samples, plus mistakes and
unsures, may well be a better approach.

If you've been following the latest experiments, it turns out you can get
very good results with a tiny fraction of the msgs people *have* been
training on.  My personal classifier right now has been trained on only
about 100 msgs total, close to 1:1 ham:spam.  This has weaknesses too, but
not nearly as bad as I guessed in advance (it doesn't seem *any* more prone
to making flat-out mistakes, but the Unsures are hilarious <wink>).

> I speculate due to my experience with these large ratios, and the
> fact that *every* one of these mails came through my Inbox.  Many
> messages are from python.org's mailman - thus, the *true* ratio of
> ham:spam through my mail account is much higher than the ham:spam
> ratio left once the mailing list traffic is removed.  Even though
> the total spam is the same, the system will score better or worse
> depending on the amount of ham I throw at it.  It isn't intuitive
> to me why this need be so.

Only because if it has a lot more ham than spam, it has much more reason to
be confident about hamprobs than spamprobs.  I suppose the Bayesian
adjustment could be fiddled so that it didn't "believe" it *could* be more
confident about either class than is justified by the class for which it has
the least amount of evidence.  I'm not exactly sure of the details, but it's
intuitively clear to me so will be obvious when I wake up <wink>.  That would
prevent the strange result in the example, but:

1. Training on the spam again still wouldn't do you much good, because
   if the ratio was 18:1 before training, it would still be close to
   18:1 after training, so it still wouldn't have much reason to
   "believe" the new spamprobs.

and

2. It would make most of the ham you trained on essentially a waste
   of time and space:  by construction, it wouldn't believe the ham
   stats any more than it believed the spam stats.

We know a lot more at this point about how the system behaves if you don't
have a strong imbalance.