[spambayes-dev] Tricky false positive: US states
T.A.Meyer at massey.ac.nz
Tue Oct 7 22:47:04 EDT 2003
> Another possibility is to partition words into equivalence
> classes based on human knowledge (such as we applied to
> Richie's example), picking an arbitrary member of each class
> as its (fixed) canonical representative, and replacing each
> word with its class's representative.
I suppose one way to do this would be to do a dictionary lookup for
tokens, and store each word's definition as the token in place of the
word itself - if there are multiple definitions, then it would be stored
once per definition (unless someone's clever enough to figure out the
context). For example:
"alabama" -> "A state of the United States"
"alaska" -> "A state of the United States"
"program" -> "A listing of the order of events", "A scheduled radio or
television show", "A course of academic study; a curriculum", "A set of
coded instructions that enables a machine, especially a computer, to
perform a desired sequence of operations"
"schedule" -> "A listing of the order of events", ...
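A rough sketch of that lookup in Python, just to make the idea concrete. The DEFINITIONS table and the definition_tokens name are made up for illustration (they aren't part of spambayes), and a real version would need an actual dictionary source instead of a hand-written table:

```python
# Hypothetical lookup table standing in for a real dictionary source.
DEFINITIONS = {
    "alabama": ["A state of the United States"],
    "alaska": ["A state of the United States"],
    "program": [
        "A listing of the order of events",
        "A scheduled radio or television show",
    ],
    "schedule": ["A listing of the order of events"],
}


def definition_tokens(tokens, definitions=DEFINITIONS):
    """Yield each token's definition(s); unknown tokens pass through unchanged."""
    for token in tokens:
        # One output token per definition, so an ambiguous word is stored
        # several times (no attempt is made to work out the context).
        for replacement in definitions.get(token.lower(), [token]):
            yield replacement
```

With this, "alabama" and "alaska" collapse to the same token, and "program" and "schedule" share one of theirs, while words missing from the table are left alone.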
It would be interesting to try out some of these ideas, even given that
token independence hasn't been shown to help. If no-one has the time to
write anything now, we could add a summary of this thread to the