[spambayes-dev] Tricky false positive: US states

T. Alexander Popiel popiel at wolfskeep.com
Tue Oct 7 15:27:11 EDT 2003

In message:  <16258.43595.88989.410608 at montanaro.dyndns.org>
             Skip Montanaro <skip at pobox.com> writes:
>I was thinking more along the lines of deleting "chicago" or "illinois" (but
>not necessarily both) if they were strongly correlated.

I think that the problem with this is that to find out that chicago and
illinois are correlated, you need to keep accurate records of each
separately, plus records of when they appeared together.
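
As a rough sketch of that bookkeeping (this is not SpamBayes code; the
class name, method names, and the choice of the phi coefficient as the
correlation measure are all assumptions made for illustration), one could
count the messages containing each token and the messages containing each
pair of tokens:

```python
from collections import Counter
from itertools import combinations
from math import sqrt


class CooccurrenceCounts:
    """Hypothetical record keeper: per-token and per-pair message counts."""

    def __init__(self):
        self.nmessages = 0
        self.token_counts = Counter()  # messages containing each token
        self.pair_counts = Counter()   # messages containing both tokens

    def train(self, tokens):
        # Count each token at most once per message; sorting gives every
        # pair a canonical (a, b) order with a < b.
        unique = sorted(set(tokens))
        self.nmessages += 1
        self.token_counts.update(unique)
        self.pair_counts.update(combinations(unique, 2))

    def phi(self, a, b):
        # Phi coefficient of the 2x2 presence/absence table for a and b.
        n = self.nmessages
        na, nb = self.token_counts[a], self.token_counts[b]
        nab = self.pair_counts[tuple(sorted((a, b)))]
        denom = sqrt(na * (n - na) * nb * (n - nb))
        if denom == 0:
            return 0.0  # undefined when a token is in all or no messages
        return (n * nab - na * nb) / denom


counts = CooccurrenceCounts()
for msg in (["chicago", "illinois"],
            ["chicago", "illinois", "viagra"],
            ["viagra"],
            ["hello"]):
    counts.train(msg)
```

Here counts.phi("chicago", "illinois") needs all three records: the count
for each token alone and the count for the pair.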

>Furthermore, I was thinking that the correlation would be done at training
>time, not scoring time.  That would probably wreak havoc with incremental
>training (though maybe you could keep a separate correlation database to
>assist there).

I'd also expect it to be done at training time, but since I'm a heavy
user of incremental training (training on almost every message as it
arrives), that doesn't relieve the record-keeping burden. :-(
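
To put a number on that burden, here is a back-of-the-envelope sketch. It
assumes (my assumption, not something settled in this thread) that
correlation is tracked for every pair of distinct tokens in a message;
the function name is made up:

```python
from math import comb


def updates_per_message(tokens):
    """Counter updates needed to train on one message, assuming one
    count per distinct token plus one per unordered pair of tokens."""
    k = len(set(tokens))
    return k + comb(k, 2)


small = updates_per_message(["chicago", "illinois", "pizza", "chicago"])
large = updates_per_message(["tok%d" % i for i in range(200)])
```

With incremental training, that cost is paid on every arriving message,
and the pair term grows quadratically with the number of distinct tokens.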

- Alex

