[spambayes-dev] was [Spambayes] date for new release to handle image spam?
sethg at goodmanassociates.com
Sat Feb 3 21:48:24 CET 2007
Another possible meta-token that might help detect word salad
(probably what Skip had in mind):

    percentage of unique word tokens that are not significant

Whether this would actually help classify word salad better is
anyone's guess. I would hope that your own correspondents have
some messages in the training set, so that a larger fraction of
their obscure words would be significant clues than you would
expect from random text from other sources.
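
A minimal sketch of how that percentage might be computed, in Python.
The names insignificant_fraction, spamprob, and min_strength are made
up for illustration and are not SpamBayes API. "Not significant" is
assumed here to mean a token whose spam probability lies within
min_strength of the neutral value 0.5.

```python
def insignificant_fraction(tokens, spamprob, min_strength=0.1):
    """Fraction of unique tokens that are weak clues.

    spamprob is a hypothetical callable mapping a token to its spam
    probability in [0, 1]; it stands in for whatever the trained
    classifier provides.
    """
    unique = set(tokens)
    if not unique:
        return 0.0
    # A token counts as weak when its probability is close to neutral.
    weak = sum(1 for tok in unique if abs(spamprob(tok) - 0.5) < min_strength)
    return weak / len(unique)


# Invented per-token probabilities; unseen words score a neutral 0.5.
probs = {"viagra": 0.99, "meeting": 0.02, "the": 0.5}
frac = insignificant_fraction(
    ["viagra", "the", "the", "zebra", "meeting"],
    lambda tok: probs.get(tok, 0.5),
)
```

Duplicates are collapsed first, so a message that repeats one neutral
word many times is not penalized for it.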
Using a percentage rather than an absolute number may avoid bias
towards large or small messages. Then again, having both
percentage and total-number versions of this meta-token may prove
useful for some users' training sets, as their legitimate mail may
tend towards large or small messages. If one version or the other
is not useful for a given user, that meta-token will probably turn
out not to be significant and will be excluded from the overall
score.

Using meta-information is a little scary, since the underlying
tokens already contribute to the overall spam score, so a
meta-token derived from them risks counting the same evidence
twice. I think the trick is to devise meta-tokens that describe
overall message characteristics and are relatively independent of
individual token scores.
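
To make the two versions concrete, here is a rough Python sketch that
turns the counts into a pair of synthetic tokens. Everything in it is
an assumption for illustration: the function name, the "meta:" token
strings, and the bucketing scheme (multiples of 10 for the percentage,
powers of two for the count). Bucketing is used on the assumption that
the classifier treats tokens as discrete strings, so raw numbers would
almost never repeat across messages.

```python
def salad_meta_tokens(n_unique, n_weak):
    """Yield hypothetical meta-tokens from two counts.

    n_unique: number of unique word tokens in the message
    n_weak:   how many of those are not significant clues
    """
    if n_unique == 0:
        return
    pct = 100 * n_weak // n_unique
    # Percentage version: round down to a multiple of 10.
    yield "meta:weak-pct:%d" % (pct // 10 * 10)
    # Absolute version: bit_length groups counts by power of two,
    # so 16..31 share one bucket, 32..63 the next, and so on.
    yield "meta:weak-count:2**%d" % n_weak.bit_length()


example = list(salad_meta_tokens(40, 30))
```

Both tokens would then be trained like any other token, leaving it to
the classifier to decide whether either one is a useful clue for a
given user's mail.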