[spambayes-dev] was date for new release ...
Seth Goodman
sethg at goodmanassociates.com
Mon Feb 5 01:43:51 CET 2007
skip at pobox.com wrote on Saturday, February 03, 2007 3:17 PM -0600:
> Seth> Another possible meta-token that might help detect word salad
> Seth> (probably what Skip had in mind):
>
> Seth> percentage of unique word tokens that are not significant
>
> I see a chicken-and-egg situation developing when we try to compute
> this sort of number. Start with an empty database. Train on a ham
> message. No words are significant at that point, so having no
> significant word tokens is a hammy clue. Train on a spam. By
> definition all words in the database at this point are significant,
> so only words not yet seen will be deemed not significant.
It definitely has chicken-and-egg properties.
>
> Lather, rinse, repeat.
>
> Maybe after you're done training on all available messages you can
> toss all these percentage tokens and make a second pass over your
> messages computing only those tokens. Are there better ways to
> compute tokens such as these, which depend on the contributions of
> other messages in the database?
I hope so. This is fundamentally different from drawing an inference
from previously observed word frequencies. Numeric-value meta-tokens
are not the result of binary experiments. They exist for every message,
whether ham or spam, and they take real values. We don't know their
underlying distribution. The problem is to estimate the probability
that a message containing a token with a given numeric value is ham or
spam, based on the values of that token observed in trained ham and
spam.
This is a very raw idea, not even half-baked. I think the problem
becomes tractable if we assume the token values are Gaussian
distributed, even if we believe they aren't. It should then be possible
to estimate the likelihood that a given token value came from a spam
message, based on the distribution of that token's values in both
trained ham and trained spam. If the distribution is Gaussian, we only
need to know the mean and variance for each class.
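Just to make that concrete, here is a rough sketch (not real SpamBayes
code; the names and numbers are made up) of turning one observed token
value into a spam probability from per-class Gaussian fits, assuming
equal priors for ham and spam:

    import math

    def gaussian_pdf(x, mean, var):
        """Density of a normal distribution with the given mean and
        variance, evaluated at x (var must be > 0)."""
        return (math.exp(-(x - mean) ** 2 / (2.0 * var))
                / math.sqrt(2.0 * math.pi * var))

    def spam_probability(value, ham_mean, ham_var, spam_mean, spam_var):
        """Estimate P(spam | value) from the two per-class Gaussians,
        assuming ham and spam are equally likely a priori."""
        ham_like = gaussian_pdf(value, ham_mean, ham_var)
        spam_like = gaussian_pdf(value, spam_mean, spam_var)
        total = ham_like + spam_like
        if total == 0.0:
            return 0.5  # value far from both distributions; no evidence
        return spam_like / total

    # e.g. a "percentage of insignificant unique words" token:
    # spam_probability(0.85, ham_mean=0.40, ham_var=0.02,
    #                        spam_mean=0.75, spam_var=0.03)

That probability could then be fed to the combining step like any other
token's spamprob, though I haven't thought through whether that's
statistically sound.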
If this turns out to work at all, we wouldn't need much information in
the database. For each numeric-value token you model this way, you need
at least the mean and variance for each of ham and spam. To untrain a
value, I think you could get away with keeping only the intermediate
values used to calculate the variance, and I vaguely recall there being
two of them. If you want to support arbitrary real values, these are
all floats, with the possibility that the intermediate variables need to
be double precision.
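To sketch what that bookkeeping might look like (again, purely
illustrative): keep a running count, sum, and sum of squares per class,
which is enough to recover the mean and variance and to untrain a value
later.

    class RunningStats:
        """Per-class statistics for one numeric-value token: enough to
        compute mean and variance, and to untrain a previously trained
        value without storing every observation."""

        def __init__(self):
            self.n = 0
            self.total = 0.0
            self.total_sq = 0.0

        def train(self, value):
            self.n += 1
            self.total += value
            self.total_sq += value * value

        def untrain(self, value):
            self.n -= 1
            self.total -= value
            self.total_sq -= value * value

        def mean(self):
            return self.total / self.n

        def variance(self):
            # E[x^2] - (E[x])^2; compact, though it can lose precision
            # when the values are large and nearly equal
            m = self.mean()
            return self.total_sq / self.n - m * m

So the per-token storage cost would be two of these, one for ham and one
for spam, six numbers in all.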
--
Seth Goodman