[spambayes-dev] Tricky false positive: US states

Meyer, Tony T.A.Meyer at massey.ac.nz
Tue Oct 7 22:47:04 EDT 2003


> Another possibility is to partition words into equivalence 
> classes based on human knowledge (such as we applied to 
> Richie's example), picking an arbitrary member of each class 
> as its (fixed) canonical representative, and replacing each 
> word with its class's representative.

I suppose one way to do this would be to look each token up in a
dictionary and store the word's definition in place of the token. If a
word has multiple definitions, one token would be stored per definition
(unless someone's clever enough to figure out the context). For example:

"alabama" -> "A state of the United States"
"alaska" -> "A state of the United States"
"program" -> "A listing of the order of events", "A scheduled radio or
television show", "A course of academic study; a curriculum", "A set of
coded instructions that enables a machine, especially a computer, to
perform a desired sequence of operations"
"schedule" -> "A listing of the order of events", ...
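
A minimal sketch of that lookup, purely for illustration. The
DEFINITIONS table and the canonicalize() function are hypothetical names
invented here, not part of spambayes, and a real version would need an
actual dictionary source rather than a hand-built dict:

```python
# Hypothetical hand-built table; a real one would come from a dictionary source.
DEFINITIONS = {
    "alabama": ["A state of the United States"],
    "alaska": ["A state of the United States"],
    "program": [
        "A listing of the order of events",
        "A scheduled radio or television show",
    ],
    "schedule": ["A listing of the order of events"],
}


def canonicalize(tokens, definitions=DEFINITIONS):
    """Yield each token's definition(s); unknown tokens pass through unchanged."""
    for token in tokens:
        senses = definitions.get(token.lower())
        if senses is None:
            # No dictionary entry: keep the original token as-is.
            yield token
        else:
            # One output token per sense, since no context is used to disambiguate.
            for sense in senses:
                yield sense
```

With this, "alabama" and "alaska" both come out as the same token, and
"program" and "schedule" share one of theirs, so evidence for any member
of a class counts for the whole class.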

It would be interesting to try out some of these ideas, even given that
token independence hasn't been shown to help.  If no one has the time to
write anything now, we could add a summary of this thread to the
NEWTRICKS.txt file.

=Tony Meyer


