[spambayes-dev] Tricky false positive: US states
Richie Hindle
richie at entrian.com
Tue Oct 7 03:32:47 EDT 2003
[Tim]
> The rub is that getting the same
> judgment from 100 consultants isn't *really* more reliable than
> getting it from one consultant unless the consultants are
> independent -- if they are independent, very high confidence is
> fully justified. In this case, the consultants are all related,
> biased in the same direction for a reason.
[Skip]
> This might be worth investigating. Can't we compute the correlation between
> two tokens by keeping track of how frequently they appear in the same
> message? If we know "chicago" and "illinois" are very strongly correlated,
> we can potentially choose to ignore one or the other.
That may well be useful, but it wouldn't have helped in the case of my US
states email. The state names tend to appear alone in a message (one
spammer comes from Illinois, another from California, another from Maine)
but then reinforce each other on the rare occasions that they *do* appear
together. They are correlated in the real world, not in my training set.
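
To make that concrete, here is a rough sketch of the kind of co-occurrence
measure Skip describes. It is not SpamBayes code: the function names, the
choice of a Jaccard-style score and the toy training messages are all
assumptions made up for illustration. It only shows that tokens which never
share a message in training score zero, however related they are in reality.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_stats(messages):
    """Count how many messages contain each token and each token pair.

    `messages` is an iterable of token lists; a token counts at most
    once per message. (Hypothetical helper, not part of SpamBayes.)
    """
    token_counts = Counter()
    pair_counts = Counter()
    for tokens in messages:
        unique = sorted(set(tokens))
        token_counts.update(unique)
        # Pairs are stored as sorted tuples so (a, b) and (b, a) match.
        pair_counts.update(combinations(unique, 2))
    return token_counts, pair_counts


def jaccard(a, b, token_counts, pair_counts):
    """Fraction of the messages containing either token that contain both."""
    both = pair_counts[tuple(sorted((a, b)))]
    either = token_counts[a] + token_counts[b] - both
    return both / either if either else 0.0


# Toy training set (invented): the city/state pair travels together,
# but each state name shows up alone in its own message.
training = [
    ["chicago", "illinois", "offer"],
    ["chicago", "illinois", "mortgage"],
    ["california", "offer"],
    ["maine", "mortgage"],
    ["illinois", "chicago"],
]

token_counts, pair_counts = cooccurrence_stats(training)
chicago_illinois = jaccard("chicago", "illinois", token_counts, pair_counts)
california_maine = jaccard("california", "maine", token_counts, pair_counts)

print("chicago/illinois:", chicago_illinois)
print("california/maine:", california_maine)
```

With data shaped like this, "chicago" and "illinois" look strongly related
and one could be dropped, but "california" and "maine" show no measured
relationship at all, so nothing would stop them reinforcing each other when
they finally turn up in the same message.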
--
Richie Hindle
richie at entrian.com