[spambayes-dev] Tricky false positive: US states
Skip Montanaro
skip at pobox.com
Tue Oct 7 07:58:03 EDT 2003
>> This might be worth investigating. Can't we compute the correlation
>> between two tokens by keeping track of how frequently they appear in
>> the same message? If we know "chicago" and "illinois" are very
>> strongly correlated, we can potentially choose to ignore one or the
>> other. This could reduce the size of the database substantially.
Tony> Wouldn't this increase the size of the database? Rather than just
Tony> record how many times "chicago" and "illinois" appeared, we'd have
Tony> to have a count for how many times they appeared together, and
Tony> appeared with every other word in the database. Or am I missing
Tony> something?
I was thinking more along the lines of deleting "chicago" or "illinois" (but
not necessarily both) if they were strongly correlated. Furthermore, I was
thinking that the correlation would be computed at training time, not scoring
time. That would probably wreak havoc with incremental training (though
maybe you could keep a separate correlation database to assist there). I
assumed the core classifier would remain the same; the tokens left in the
training database would just (hopefully) be more independent predictors.
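
To make the idea concrete, here is a rough sketch of a training-time pruning
pass. It is only an illustration, not anything in the spambayes code: the
function name prune_correlated, the 0.9 threshold, and the use of Jaccard
overlap (messages containing both tokens divided by messages containing
either) as a stand-in for a real correlation measure are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def prune_correlated(messages, threshold=0.9):
    """Return the set of tokens to drop because they nearly always
    co-occur with a token that is being kept.

    messages: iterable of token lists, one list per training message.
    threshold: hypothetical cutoff on the Jaccard overlap of two tokens.
    """
    # Map each token to the set of message indices that contain it.
    seen_in = defaultdict(set)
    for i, tokens in enumerate(messages):
        for tok in set(tokens):
            seen_in[tok].add(i)

    # Visit the most common tokens first, so that the more frequent
    # token of a correlated pair is the one that survives.
    ordered = sorted(seen_in, key=lambda t: (-len(seen_in[t]), t))

    dropped = set()
    for a, b in combinations(ordered, 2):
        if a in dropped or b in dropped:
            continue
        both = len(seen_in[a] & seen_in[b])
        either = len(seen_in[a] | seen_in[b])
        if both / either >= threshold:
            dropped.add(b)
    return dropped
```

The pairwise loop is quadratic in the number of distinct tokens, so on a real
training set this would need to be restricted somehow (say, to tokens that
appear in at least a handful of messages). The dropped tokens would be left
out of the training database; the classifier itself is untouched.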
Skip