    Seth> Another possible meta-token that might help detect word salad
    Seth> (probably what Skip had in mind):

    Seth>   percentage of unique word tokens that are not significant

I see a chicken-and-egg situation developing when we try to compute these
sorts of numbers.  Start with an empty database.  Train on a ham message.  No
words are significant at that point, so having no significant word tokens is
a hammy clue.  Train on a spam.  By definition all words in the database at
this point are significant, so only words not yet seen will be deemed not
significant.

Lather, rinse, repeat.

Maybe after you're done training on all available messages you can toss all
these percentage tokens and make a second pass over your messages computing
only those tokens.  Are there better ways to compute tokens such as this
which depend on the contribution of other messages in the database?
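
To make the two-pass idea concrete, here is a rough sketch.  Everything in
it is an assumption made up for illustration: the function names, the
"pct-insignificant:N" token format, and especially the significance test (a
word counts as significant when its naive spam ratio falls outside
0.5 +/- band).  It is not the classifier's real scoring code.

```python
from collections import Counter


def train_words(messages):
    """First pass: count how many ham and spam messages contain each word."""
    ham, spam = Counter(), Counter()
    nham = nspam = 0
    for text, is_spam in messages:
        words = set(text.lower().split())
        if is_spam:
            spam.update(words)
            nspam += 1
        else:
            ham.update(words)
            nham += 1
    return ham, spam, nham, nspam


def is_significant(word, ham, spam, nham, nspam, band=0.1):
    """Hypothetical test: the naive spam ratio lies outside 0.5 +/- band.

    Words never seen in training are treated as not significant.
    """
    h = ham[word] / nham if nham else 0.0
    s = spam[word] / nspam if nspam else 0.0
    if h + s == 0:
        return False
    return abs(s / (h + s) - 0.5) > band


def pct_token(text, counts, step=10):
    """Build the meta-token: percentage of unique words that are not
    significant, rounded down to a multiple of `step`."""
    words = set(text.lower().split())
    if not words:
        return "pct-insignificant:none"
    dull = sum(1 for w in words if not is_significant(w, *counts))
    bucket = int((100 * dull / len(words)) // step) * step
    return "pct-insignificant:%d" % bucket


def train_meta(messages, counts):
    """Second pass: with the word counts frozen, count only the meta-token."""
    ham, spam = Counter(), Counter()
    for text, is_spam in messages:
        (spam if is_spam else ham)[pct_token(text, counts)] += 1
    return ham, spam


if __name__ == "__main__":
    msgs = [("meeting tomorrow lunch", False), ("buy cheap pills", True)]
    counts = train_words(msgs)
    print(pct_token("meeting cheap zebra qux", counts))
    print(train_meta(msgs, counts))
```

Note that in the second pass every training message is measured against a
database that already contains its own words, so the percentages seen in
training may run lower than those for mail the classifier has never seen.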


