[spambayes-dev] Tricky false positive: US states

Tue Oct 7 07:58:03 EDT 2003

    >> This might be worth investigating.  Can't we compute the correlation
    >> between two tokens by keeping track of how frequently they appear in
    >> the same message?  If we know "chicago" and "illinois" are very
    >> strongly correlated, we can potentially choose to ignore one or the
    >> other.  This could reduce the size of the database substantially.

    Tony> Wouldn't this increase the size of the database?  Rather than just
    Tony> record how many times "chicago" and "illinois" appeared, we'd have
    Tony> to have a count for how many times they appeared together, and
    Tony> appeared with every other word in the database.  Or am I missing
    Tony> something?

I was thinking more along the lines of deleting "chicago" or "illinois" (but
not necessarily both) if they were strongly correlated.  Furthermore, I was
thinking that the correlation would be computed at training time, not at
scoring time, so the pairwise counts would not need to live in the scoring
database.  That would probably wreak havoc with incremental training (though
maybe you could keep a separate correlation database to assist there).  I
assumed the core classifier would remain the same; only the tokens in the
training database would (hopefully) be more nearly independent predictors.
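
For illustration only, here is a rough sketch of what such a training-time
pruning pass might look like.  None of this is SpamBayes code: the function
name, the 0.9 threshold, the min_count cutoff and the choice of the phi
coefficient (Pearson correlation of two presence/absence indicators) as the
correlation measure are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def find_redundant_tokens(messages, threshold=0.9, min_count=2):
    """Return a set of tokens that could be dropped as redundant.

    Hypothetical helper, not part of SpamBayes.  `messages` is an
    iterable of token lists, one list per training message.  Two tokens
    count as redundant when the phi coefficient of their per-message
    presence is at least `threshold`.
    """
    n = 0
    single = Counter()  # number of messages containing each token
    pair = Counter()    # number of messages containing both tokens
    for tokens in messages:
        unique = sorted(set(tokens))  # presence only, not frequency
        n += 1
        single.update(unique)
        pair.update(combinations(unique, 2))

    drop = set()
    for (a, b), both in sorted(pair.items()):
        if a in drop or b in drop:
            continue  # one of the pair is already gone; keep the other
        na, nb = single[a], single[b]
        if na < min_count or nb < min_count:
            continue  # too little evidence to call them correlated
        denom = sqrt(na * (n - na) * nb * (n - nb))
        if denom == 0:
            continue  # a token present in every message has no variance
        phi = (n * both - na * nb) / denom
        if phi >= threshold:
            # Assumed policy: drop the rarer token, or b on a tie.
            drop.add(a if na < nb else b)
    return drop


if __name__ == "__main__":
    training = [
        ["chicago", "illinois", "meeting"],
        ["chicago", "illinois", "viagra"],
        ["meeting", "python"],
        ["viagra", "python"],
    ]
    print(find_redundant_tokens(training))
```

Note that the pair counter grows with the square of the number of distinct
tokens per message, which is Tony's concern.  In this sketch the cost is
paid only during the training pass, and the pair counts can be thrown away
(or kept in a separate database for incremental training) once the
redundant tokens have been chosen.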

Skip