[spambayes-dev] Tricky false positive: US states
Paul Wagland
spambayes at kungfoocoder.org
Tue Oct 7 08:14:48 EDT 2003
On Tue, 2003-10-07 at 13:58, Skip Montanaro wrote:
> >> This might be worth investigating. Can't we compute the correlation
> >> between two tokens by keeping track of how frequently they appear in
> >> the same message? If we know "chicago" and "illinois" are very
> >> strongly correlated, we can potentially choose to ignore one or the
> >> other. This could reduce the size of the database substantially
>
> Tony> Wouldn't this increase the size of the database? Rather than just
> Tony> record how many times "chicago" and "illinois" appeared, we'd have
> Tony> to have a count for how many times they appeared together, and
> Tony> appeared with every other word in the database. Or am I missing
> Tony> something?
>
> I was thinking more along the lines of deleting "chicago" or "illinois" (but
> not necessarily both) if they were strongly correlated. Furthermore, I was
> thinking that the correlation would be done at training time, not scoring
> time. That would probably wreak havoc with incremental training (though
> maybe you could keep a separate correlation database to assist there). I
> assumed the core classifier would remain the same, just that the tokens in
> the training database would (hopefully) be more independent predictors.
Yes, but the problem as I understand it is that in the training list the
states were not correlated with each other. So no amount of
high-correlation removal would help, simply because it is only the
occasional HAM that does have the correlation.
That is, in the SPAM it might say "order now from western australia" or
"order now from new south wales". Assume that people talk about these
states a lot ;-) Then under your scheme we would drop either western or
australia, and two of new, south and wales. If I then go and list the
states of Australia:
Western Australia, Queensland, New South Wales, Victoria, South
Australia and Tasmania (plus two territories, Northern Territory and the
A.C.T.)
Then this would still be marked as spammy, simply because western
australia does not have a high correlation with new south wales...
This is what the original problem was.
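
To make that concrete, here is a rough Python sketch of the pruning idea
as I read it. None of this is SpamBayes code: cooccurrence() and
prune_correlated() are names I made up, the tokenizer is a plain
whitespace split, and the correlation measure (overlap of the sets of
messages containing each token) is just an assumption for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(messages):
    """Count how many messages contain each token and each token pair."""
    single, pair = Counter(), Counter()
    for msg in messages:
        tokens = sorted(set(msg.lower().split()))
        single.update(tokens)
        pair.update(combinations(tokens, 2))
    return single, pair


def prune_correlated(single, pair, threshold=0.9):
    """Drop one token from every pair that nearly always appears together.

    Hypothetical measure: messages containing both tokens divided by
    messages containing either one.
    """
    dropped = set()
    for (a, b), both in sorted(pair.items()):
        if a in dropped or b in dropped:
            continue
        if both / (single[a] + single[b] - both) >= threshold:
            dropped.add(b)
    return set(single) - dropped


# The two spam phrases from the example above.
spam = [
    "order now from western australia",
    "order now from new south wales",
]
single, pair = cooccurrence(spam)
kept = prune_correlated(single, pair)

# A ham message that merely lists the states.
ham = ("western australia queensland new south wales "
       "victoria south australia tasmania")
hits = set(ham.split()) & kept
print(sorted(kept))  # spam tokens that survive pruning
print(sorted(hits))  # surviving spam tokens the ham list still contains
```

Pruning thins out each state name on its own, but one token from each
name survives, and the list of states still hits every one of those
survivors.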
I don't see how this can be easily solved... not without introducing a
significant weakness which could be attacked...
Just my two cents,
Paul