[Spambayes] Latest spammer trick stymied - QUESTION

bill parducci bill at parducci.net
Mon Mar 31 17:15:02 EST 2003

T. Alexander Popiel wrote:
>>take the example: http://check.myspam.com/ad/junk?random=fsldkjflksj
> That example would yield the tokens:
>   proto:http
>   url:check
>   url:myspam
>   url:com
>   url:ad
>   url:junk
>   url:random
>   url:fsldkjflksj

<bayesian ignorance shields up>
doesn't the degree of granularity here dilute the information? in other 
words, 'com' and 'junk' are extremely common, while 'myspam.com' is less so 
and 'check.myspam.com' is unique. since neutral tokens are ignored, 
common pieces like 'com' and 'junk' may never be considered, while the 
following most likely would be:

>   url:myspam.com
>   url:check.myspam.com
>   url:check.myspam.com/ad
>   url:check.myspam.com/ad/junk

therefore, in the case of url parsing, it would seem that less 
[granularity] is more [accuracy].
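
to make the idea concrete, here is a rough python sketch of the coarser 
tokenization i have in mind. it is not how spambayes tokenizes urls; the 
function name coarse_url_tokens is made up, and it assumes the base domain 
is always the last two labels of the host (which is wrong for things like 
co.uk). it also simply drops the query string.

```python
from urllib.parse import urlsplit


def coarse_url_tokens(url):
    """Hypothetical tokenizer: emit host suffixes and cumulative host+path prefixes."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    tokens = ["proto:" + parts.scheme]
    if not host:
        return tokens

    labels = host.split(".")
    # Assumption: the base domain is the last two labels (e.g. myspam.com).
    # Walk from the base domain out to the full host name.
    start = max(len(labels) - 2, 0)
    for i in range(start, -1, -1):
        tokens.append("url:" + ".".join(labels[i:]))

    # Extend the full host with one path segment at a time.
    # The query string (?random=...) is deliberately ignored.
    prefix = host
    for segment in parts.path.split("/"):
        if segment:
            prefix += "/" + segment
            tokens.append("url:" + prefix)
    return tokens


if __name__ == "__main__":
    for tok in coarse_url_tokens(
        "http://check.myspam.com/ad/junk?random=fsldkjflksj"
    ):
        print(tok)
```

for the example url this gives proto:http plus the four url: tokens listed 
above, so the random junk in the query string never becomes a token at all.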


