[Spambayes] Retraining

Jim McAtee jmcatee at mediaodyssey.com
Thu Feb 26 15:01:03 EST 2004


----- Original Message ----- 
From: "Tony Meyer" <tameyer at ihug.co.nz>
To: "'Fred Mertz'" <fred at lucy.com>; <spambayes at python.org>
Sent: Tuesday, February 24, 2004 11:38 PM
Subject: RE: [Spambayes] Retraining


>> Oh, ok.  Before I start from scratch, then, maybe I should
>> ask about spams with hundreds of random or gibberish words.
>> Do these muck up the databases?  They seem to be classified ok,
>> and the clues aren't any of these gibberish tokens, but I know
>> I trained on a few of these and it seems like they would skew
>> the statistics.  What's the recommendation on this type of spam?
>
>AFAIK, the jury is still out on this one.  One school of thought is that if
>the words really are randomly selected (from a dictionary, for example),
>then the highest chance is that you'll never have seen the word before and
>so it'll be ignored (or if you train on it, then that you'll never see it
>again, and it won't matter).  Then there's a chance that the word is spam
>(or if you train on it, that the next time it appears will be in spam).
>Finally there's the chance that the word is ham (appears next in ham).  So
>it's no big deal, and may even help classification.  Whether this is true or
>not is still open to question, I think.


I'm seeing a fair number of relatively targeted "random" words that are
helping to get quite a few messages just under the spam threshold.  If a
spammer is harvesting email addresses from a mailing list, especially a
technical one, this technique is particularly easy - and dare I say,
particularly effective.  They can even throw words back to you from one of
your own postings.
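
To illustrate the effect, here is a minimal sketch of Robinson-style token
smoothing with chi-squared combining, the general approach SpamBayes takes.
The constants, function names, and token counts are assumptions made up for
the illustration and are not copied from the SpamBayes source.  It shows that
never-seen gibberish leaves a score untouched, while words that have been
trained as ham pull it well below a typical spam cutoff.

```python
import math

# Assumed tuning values for this sketch (not read from any real config).
UNKNOWN_PROB = 0.5       # prior probability for a token never seen before
UNKNOWN_STRENGTH = 0.45  # how strongly that prior resists the evidence
MIN_STRENGTH = 0.1       # tokens within this distance of 0.5 are ignored


def token_prob(spam_count, ham_count, nspam, nham):
    """Smoothed spam probability for one token."""
    n = spam_count + ham_count
    if n == 0:
        return UNKNOWN_PROB  # never trained on: completely neutral
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    p = spam_ratio / (spam_ratio + ham_ratio)
    return (UNKNOWN_STRENGTH * UNKNOWN_PROB + n * p) / (UNKNOWN_STRENGTH + n)


def chi2q(x2, dof):
    """Upper tail of the chi-squared distribution (dof must be even)."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def score(probs):
    """Combine token probabilities into one score: 0 = ham, 1 = spam."""
    clues = [p for p in probs if abs(p - 0.5) >= MIN_STRENGTH]
    if not clues:
        return 0.5
    n = len(clues)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in clues), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in clues), 2 * n)
    return (s - h + 1.0) / 2.0


# Hypothetical training data: 100 spam and 100 ham messages.
spammy = token_prob(20, 0, 100, 100)   # token seen only in spam
hammy = token_prob(0, 20, 100, 100)    # token seen only in ham
unseen = token_prob(0, 0, 100, 100)    # random dictionary word

spam_only = score([spammy] * 5)
with_gibberish = score([spammy] * 5 + [unseen] * 200)
with_ham_words = score([spammy] * 5 + [hammy] * 5)
print(spam_only, with_gibberish, with_ham_words)
```

In this toy setup the 200 unseen words change nothing, because neutral
tokens are dropped before combining.  Five words lifted from the recipient's
own ham are enough to drag an otherwise certain spam to the middle of the
range, which is the "unsure" region rather than spam.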

>Finally there's the chance that the word is ham (appears next in ham).  So
>it's no big deal, and may even help classification.

I'm not sure I understand how classifying ham words as spam can have any
possible benefit...





