tameyer at ihug.co.nz
Wed Feb 25 01:38:03 EST 2004
> Oh, ok. Before I start from scratch, then, maybe I should
> ask about spams with hundreds of random or gibberish words.
> Do these muck up the databases? They seem to be classified ok,
> and the clues aren't any of these gibberish tokens, but I know
> I trained on a few of these and it seems like they would skew
> the statistics. What's the recommendation on this type of spam?
AFAIK, the jury is still out on this one. One school of thought is that if
the words really are randomly selected (from a dictionary, for example),
then the most likely outcome is that you'll never have seen the word before,
so it'll be ignored (or, if you train on it, that you'll never see it
again, and it won't matter). Then there's a chance that the word is a spam
clue (or, if you train on it, that the next time it appears will be in spam).
Finally there's the chance that the word is a ham clue (appears next in ham). So
it's no big deal, and may even help classification. Whether this is true or
not is still open to question, I think.
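To put rough numbers on that argument, here is a minimal sketch. It is not SpamBayes' actual code: the function names are made up for illustration, the smoothing is a Robinson-style adjustment, and the constants (s=0.45, x=0.5) as well as the dictionary size and word counts are assumptions chosen for the example.

```python
def robinson_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Smoothed spam probability for one token (Robinson-style adjustment).

    s (strength of the prior) and x (probability assumed for an unknown
    word) are assumed values, not read from any real configuration.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never-seen token: neutral, so it carries no evidence
    spam_ratio = spam_count / max(nspam, 1)
    ham_ratio = ham_count / max(nham, 1)
    p = spam_ratio / (spam_ratio + ham_ratio)
    return (s * x + n * p) / (s + n)


def chance_word_seen_before(dictionary_size, words_per_spam, trained_spams):
    """Chance that one uniformly random dictionary word already appeared
    among the random words of the trained 'word salad' spams."""
    draws = words_per_spam * trained_spams
    return 1.0 - (1.0 - 1.0 / dictionary_size) ** draws


if __name__ == "__main__":
    # Hypothetical setup: 100,000-word dictionary, 200 random words per
    # spam, 5 such spams trained, 100 spam and 100 ham messages overall.
    print("unseen token:      ", robinson_prob(0, 0, 100, 100))
    print("seen once, in spam:", robinson_prob(1, 0, 100, 100))
    print("chance of reuse:   ", chance_word_seen_before(100_000, 200, 5))
```

Under these assumptions, a random word trained once as spam does become a fairly strong clue (about 0.84), but the chance that any given random word in a later message is one you already trained on is only about 1%. Nearly all of the salad in the next message is unseen and scores a neutral 0.5. Real word lists aren't uniform, of course, so treat this as an illustration and not a measurement.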
I can't be bothered hand-selecting the email that I use to train, so,
personally, I just use whatever comes up and have no idea whether any of
this 'word salad' is in there or not. Basically, at the moment, it's up to you.
Sorry this isn't more help!
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.