[spambayes-dev] Tricky false positive: US states

Meyer, Tony T.A.Meyer at massey.ac.nz
Mon Oct 6 22:37:32 EDT 2003


[Tim's consultant metaphor cut]

[Skip]
> I like this non-technical explanation a lot.  I think the 
> hand-waving description on the website should incorporate this notion.

+1.  Maybe Anthony would do this if we pushed him nicely enough?

> This might be worth investigating.  Can't we compute the 
> correlation between two tokens by keeping track of how 
> frequently they appear in the same message?  If we know 
> "chicago" and "illinois" are very strongly correlated, we can 
> potentially choose to ignore one or the other.  This could 
> reduce the size of the database substantially.

Wouldn't this increase the size of the database?  Rather than just
recording how many times "chicago" and "illinois" each appeared, we'd
have to keep a count of how many times they appeared together, and of
how many times each appeared with every other word in the database.
Or am I missing something?
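
To make the size question concrete, here is a rough sketch of the
bookkeeping involved.  It is not SpamBayes code: the function name
count_tokens_and_pairs and the toy messages are made up for
illustration, and it assumes each message has already been split into
tokens.  It keeps the usual per-token counts alongside a count for
every pair of tokens seen in the same message.

```python
from collections import Counter
from itertools import combinations


def count_tokens_and_pairs(messages):
    """Return (token_counts, pair_counts) for an iterable of token lists.

    Hypothetical helper for illustration only; not part of SpamBayes.
    """
    token_counts = Counter()
    pair_counts = Counter()
    for tokens in messages:
        # Count each token at most once per message, in a stable order
        # so that a pair is always stored as (smaller, larger).
        unique = sorted(set(tokens))
        token_counts.update(unique)
        # Every unordered pair of distinct tokens in this message.
        pair_counts.update(combinations(unique, 2))
    return token_counts, pair_counts


# Toy data, invented for this example.
messages = [
    ["chicago", "illinois", "meeting"],
    ["chicago", "illinois", "viagra"],
    ["springfield", "illinois", "meeting"],
]
token_counts, pair_counts = count_tokens_and_pairs(messages)

print(len(token_counts))                       # 5 distinct tokens
print(len(pair_counts))                        # 7 distinct pairs
print(pair_counts[("chicago", "illinois")])    # 2 messages contain both
```

Even on three tiny messages the pair table is already larger than the
token table, and with n distinct tokens it can hold up to n*(n-1)/2
entries in the worst case.  Dropping one of a correlated pair afterwards
might shrink the token table, but the pair counts would still have to
be collected somewhere first.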

I'm happy to run some tests if someone else writes the code :)

=Tony Meyer


