[Spambayes] Latest spammer trick stymied - QUESTION
Tim Peters
tim.one at comcast.net
Mon Mar 31 20:28:43 EST 2003
[bill parducci]
> <bayesian ignorance shields up>
> doesn't the degree of granularity here dilute the information? in other
> words, 'com' and 'junk' are extremely common, while 'myspam.com' less so
> and 'check.myspam.com' completely unique. since neutral tokens are
> ignored, words like these may not be considered, while the following
> most likely would be considered:
>
>> url:myspam.com
That's decent, but likely no better than url:myspam.
>> url:check.myspam.com
>> url:check.myspam.com/ad
>> url:check.myspam.com/ad/junk
Those are probably one-shot hapaxes (i.e., worthless, except for catching
copies of the same spam). If you own a domain xyz.com, then you can make up
all the ABC.xyz.com targets you like, and spammers generally do. ABC
doesn't repeat often except in copies of the same spam.
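
To make the granularity levels above concrete, here is a minimal sketch that emits one token per level for a single URL, coarsest first. The function name url_tokens and the "url:" token scheme are hypothetical, for illustration only; this is not the Spambayes tokenizer. It also naively assumes the registered domain is the last two host labels, which is wrong for names like example.co.uk.

```python
from urllib.parse import urlsplit

def url_tokens(url):
    """Return candidate tokens for one URL, from coarse to fine.

    Hypothetical scheme for illustration; not the Spambayes tokenizer.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    tokens = []
    if len(labels) >= 2:
        # Coarse: the bare name ("myspam"), then name plus TLD ("myspam.com").
        # Assumes a two-label registered domain.
        tokens.append("url:" + labels[-2])
        tokens.append("url:" + ".".join(labels[-2:]))
    if len(labels) > 2:
        # Finer: the full host, including any made-up subdomain.
        tokens.append("url:" + host)
    # Finest: the host followed by each successive path prefix.
    prefix = host
    for segment in parts.path.split("/"):
        if segment:
            prefix += "/" + segment
            tokens.append("url:" + prefix)
    return tokens

for token in url_tokens("http://check.myspam.com/ad/junk"):
    print(token)
```

The first two tokens are the ones likely to repeat across different spams from the same domain; the rest depend on whatever the spammer made up for that mailing.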
> therefore, in the case of url parsing, it would seem that less
> [granularity] is more [accuracy].
Test and measure.
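
One cheap thing to measure first is how many of the tokens a given scheme produces are hapaxes, i.e. seen in only one message. The sketch below assumes each message has already been reduced to a list of tokens; hapax_fraction and the toy data are made up for illustration. It is only a proxy: the real test is comparing error rates on your own ham and spam with each scheme.

```python
from collections import Counter

def hapax_fraction(messages):
    """Fraction of distinct tokens that appear in exactly one message.

    `messages` is an iterable of token lists, one list per message.
    Hypothetical helper for illustration.
    """
    counts = Counter()
    for tokens in messages:
        # Count each token once per message, not once per occurrence.
        counts.update(set(tokens))
    if not counts:
        return 0.0
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes / len(counts)

# Toy data: the same three messages tokenized two ways.
coarse = [["url:myspam"], ["url:myspam"], ["url:other"]]
fine = [["url:a.myspam.com"], ["url:b.myspam.com"], ["url:other.com"]]

print(hapax_fraction(coarse))
print(hapax_fraction(fine))
```

On this toy data the coarse scheme leaves half its distinct tokens as hapaxes and the fine scheme leaves all of them; whether that carries over to real mail is exactly what has to be measured.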