[spambayes-dev] Tricky false positive: US states
T.A.Meyer at massey.ac.nz
Mon Oct 6 22:37:32 EDT 2003
[Tim's consultant metaphor cut]
> I like this non-technical explanation a lot. I think the
> hand-waving description on the website should incorporate this notion.
+1. Maybe Anthony would do this if we pushed him nicely enough?
> This might be worth investigating. Can't we compute the
> correlation between two tokens by keeping track of how
> frequently they appear in the same message? If we know
> "chicago" and "illinois" are very strongly correlated, we can
> potentially choose to ignore one or the other. This could
> reduce the size of the database substantially.
Wouldn't this increase the size of the database? Rather than just
recording how many times "chicago" and "illinois" each appeared, we'd
have to keep a count of how many times they appeared together, and do
the same for every other pair of words in the database. Or am I
missing something?
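As a rough illustration of the size concern (a minimal sketch only, not SpamBayes code; the function name and the toy messages are made up for this example), counting per-message co-occurrence might look like this:

```python
from collections import Counter
from itertools import combinations


def count_tokens_and_pairs(messages):
    """Count how many messages contain each token and each unordered token pair.

    Hypothetical helper for illustration; `messages` is a list of token lists.
    """
    token_counts = Counter()
    pair_counts = Counter()
    for tokens in messages:
        # De-duplicate within a message and sort so each pair has one canonical key.
        unique = sorted(set(tokens))
        token_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return token_counts, pair_counts


# Toy data, invented for the example.
messages = [
    ["chicago", "illinois", "meeting"],
    ["chicago", "illinois", "pizza"],
    ["boston", "meeting"],
]

token_counts, pair_counts = count_tokens_and_pairs(messages)
print(len(token_counts), "token counts")   # 5 token counts
print(len(pair_counts), "pair counts")     # 6 pair counts
print(pair_counts[("chicago", "illinois")])  # 2
```

Even on three tiny messages the pair table is already bigger than the token table. With n distinct tokens there can be up to n*(n-1)/2 pairs, so any saving from dropping one of two correlated tokens would have to outweigh that, unless the pair counts are only kept temporarily or only for a restricted set of tokens.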
I'm happy to run some tests if someone else writes the code :)