[Spambayes] Latest spammer trick stymied - QUESTION

Tim Peters tim.one at comcast.net
Mon Mar 31 20:28:43 EST 2003


[bill parducci]
> <bayesian ignorance shields up>
> doesn't the degree of granularity here dilute the information? in other
> words, 'com' and 'junk' are extremely common, while 'myspam.com' less so
> and 'check.myspam.com' completely unique. since neutral tokens are
> ignored, words like these may not be considered, while the following
> most likely would be considered:
>
>>   url:myspam.com

That's decent, but likely no better than url:myspam.

>>   url:check.myspam.com
>>   url:check.myspam.com/ad
>>   url:check.myspam.com/ad/junk

Those are probably one-shot hapaxes (tokens seen only once in training, so
worthless except for catching copies of the same spam).  If you own a domain
xyz.com, then you can make up all the ABC.xyz.com targets you like, and
spammers generally do.  ABC doesn't repeat often except in copies of the
same spam.
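
To make the two granularities concrete, here is a minimal sketch. It is not
the Spambayes tokenizer: piece_tokens, prefix_tokens and hapax_fraction are
hypothetical names, and taking the last two host labels as "the domain" is a
simplifying assumption that is wrong for suffixes like co.uk. It only shows
how small pieces and cumulative prefixes differ in how often they repeat
across messages.

```python
from collections import Counter
from urllib.parse import urlsplit


def _host_and_segments(url):
    # Assumption: a bare "host/path" string is treated as an http URL.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    segments = [s for s in parts.path.split("/") if s]
    return host, segments


def piece_tokens(url):
    """Fine-grained: one token per host label and per path segment."""
    host, segments = _host_and_segments(url)
    return ["url:" + p for p in host.split(".") + segments if p]


def prefix_tokens(url):
    """Coarse-grained: domain, full host, then each longer path prefix."""
    host, segments = _host_and_segments(url)
    labels = host.split(".")
    tokens = []
    if len(labels) >= 2:
        # Naive guess at the registered domain: the last two labels.
        tokens.append("url:" + ".".join(labels[-2:]))
    if len(labels) != 2 and host:
        tokens.append("url:" + host)
    prefix = host
    for seg in segments:
        prefix += "/" + seg
        tokens.append("url:" + prefix)
    return tokens


def hapax_fraction(token_lists):
    """Fraction of distinct tokens that occur in exactly one message."""
    seen_in = Counter()
    for tokens in token_lists:
        seen_in.update(set(tokens))  # count messages, not occurrences
    if not seen_in:
        return 0.0
    return sum(1 for n in seen_in.values() if n == 1) / len(seen_in)
```

Feeding both tokenizers the same set of URLs and comparing
hapax_fraction([piece_tokens(u) for u in urls]) with
hapax_fraction([prefix_tokens(u) for u in urls]) gives a rough idea of how
many tokens each scheme produces that can never repeat.  It says nothing
about classification accuracy; that still needs a real test run.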

> therefore, in the case of url parsing, it would seem that less
> [granularity] is more [accuracy].

Test and measure.



