[spambayes-dev] Tricky false positive: US states
skip at pobox.com
Mon Oct 6 15:50:34 EDT 2003
Tim> You can consider spambayes as asking a large number of consultants
Tim> (tokens) whether they think your new message is spam. In fact,
Tim> with a little squinting, you can view most learning algorithms that
Tim> way. The strength of a spamprob (its distance from the neutral
Tim> 0.5) is a measure of how confident "a consultant" is about their
Tim> judgment. If one consultant says "well, it looks spammy to me, but
Tim> I wouldn't bet my life on it", and that's all you know, you're
Tim> probably not willing to bet anything that they're right (and a
Tim> single spamprob of 0.73 is indeed in the Unsure range for most
Tim> people). But if 100 consultants all say that same thing, any
Tim> learning algorithm (including a real person!) is going to be quite
Tim> confident that the odds of them all being wrong are tiny.
I like this non-technical explanation a lot. I think the hand-waving
description on the website should incorporate this notion.
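
To make the consultant analogy concrete, here is a minimal sketch of
chi-squared (Fisher-style) combining, which is roughly the kind of scheme
spambayes uses to merge per-token spamprobs. The function names chi2q and
combine are hypothetical, and the code leaves out details a real classifier
needs (clue selection, underflow protection for very long clue lists). It
only illustrates how one lukewarm opinion stays lukewarm while a hundred
identical lukewarm opinions become near certainty.

```python
import math

def chi2q(x2, df):
    """Survival function of the chi-squared distribution for even df."""
    assert df % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(spamprobs):
    """Combine per-token spam probabilities into one score in [0, 1].

    Assumes every probability is strictly between 0 and 1 and that the
    tokens are independent, which is exactly the assumption under
    discussion in this thread.
    """
    n = len(spamprobs)
    # S is large when the clues are jointly too spammy to be chance.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in spamprobs), 2 * n)
    # H is large when the clues are jointly too hammy to be chance.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in spamprobs), 2 * n)
    return (s - h + 1.0) / 2.0

one = combine([0.73])            # a single mildly spammy consultant
hundred = combine([0.73] * 100)  # one hundred saying the same thing
print(one, hundred)
```

With a single clue the combined score is just that clue's spamprob. With one
hundred identical clues the score lands well above 0.99, which is only
justified if the clues really are independent.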
Tim> That's what happened here. The rub is that getting the same
Tim> judgment from 100 consultants isn't *really* more reliable than
Tim> getting it from one consultant unless the consultants are
Tim> independent -- if they are independent, very high confidence is
Tim> fully justified. In this case, the consultants are all related,
Tim> biased in the same direction for a reason.
This might be worth investigating. Can't we compute the correlation between
two tokens by keeping track of how frequently they appear in the same
message? If we know "chicago" and "illinois" are very strongly correlated,
we can potentially choose to ignore one or the other. This could reduce the
size of the database substantially, and also work toward a situation where
we believed more strongly -- with some justification -- that our consultants'
recommendations were accurate; that a politician wasn't paying them off
behind the scenes, figuratively speaking.
It would appear that this is an O(n*n) problem in the number of distinct
tokens n, since to accurately decide correlation between any two tokens we
have to consider how each token correlates with all others. The problem size
can probably be reduced in various ways to avoid performing a full comparison.
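
As a rough sketch of the bookkeeping this would need, the code below counts,
per message, which tokens appear together and derives a simple overlap
measure from those counts. Everything here is an assumption for illustration:
the function names are hypothetical, the Jaccard index stands in for
"correlation", and nothing like this is claimed to exist in spambayes. One
way it shrinks the problem is that only pairs that actually co-occur in some
message are ever stored, so the cost is quadratic in the number of distinct
tokens per message rather than in the whole vocabulary.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(messages):
    """Count token and token-pair occurrences, once per message."""
    token_count = Counter()
    pair_count = Counter()
    for tokens in messages:
        unique = sorted(set(tokens))  # sorted so each pair has one key
        token_count.update(unique)
        pair_count.update(combinations(unique, 2))
    return token_count, pair_count

def jaccard(a, b, token_count, pair_count):
    """Fraction of messages containing a or b that contain both."""
    key = (a, b) if a < b else (b, a)
    both = pair_count[key]
    either = token_count[a] + token_count[b] - both
    return both / either if either else 0.0

# Toy corpus: each message is already tokenized.
msgs = [
    ["chicago", "illinois", "meeting"],
    ["chicago", "illinois", "pizza"],
    ["illinois", "chicago"],
    ["meeting", "agenda"],
]
tc, pc = cooccurrence(msgs)
print(jaccard("chicago", "illinois", tc, pc))
print(jaccard("chicago", "meeting", tc, pc))
```

In this toy corpus "chicago" and "illinois" always appear together, so one of
them could be a candidate for dropping, while "chicago" and "meeting" overlap
only occasionally. A real corpus would still need a threshold and some way to
cap the number of pairs tracked.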