[Spambayes] Latest spammer trick stymied - QUESTION

Tim Peters tim.one at comcast.net
Mon Mar 31 19:37:16 EST 2003


[T. Alexander Popiel]
> ...
> This already happens to some extent, though I think there could
> be better handling of the composite hostname and directory path...
> to wit, I suspect that adding the following tokens would help:
>
>   url:myspam.com
>   url:check.myspam.com
>   url:check.myspam.com/ad
>   url:check.myspam.com/ad/junk
>
> I haven't tested this yet, but I further suspect that I will have
> Tim Peters' problem: my results are already good enough that I won't
> be able to say anything conclusive about it.

Mining embedded URLs was the first tokenization enhancement added to the
project, and it instantly cut the false negative rate in half -- that
remains the single biggest win we ever got.  At first, it was fancier than
it is now.  The scheme got simpler over time, as testing showed no
significant difference in results as more gimmicks got thrown out.

Note that we actually generate more tokens than meet the eye for spam like:

"""
X-Message-Info: JGTYoYF78jEHjJx36Oi8+Q1OJDRSDidP
Received: from wildlife.com ([4.40.47.205]) by mc9-f10.bay6.hotmail.com with
	Microsoft SMTPSVC(5.0.2195.5600);	 Sun, 30 Mar 2003 23:44:18 -0800
Date: Sun, 30 Mar 2003 01:37:18 -0300
From: "Ella Schotte" <skoocea at wildlife.com>
To: <tim_one at email.msn.com>
Message-ID: <20030330013718.9ltGDlkp5jmJ at wildlife.com>
Content-Type: text/plain
Subject: with Daughter
Return-Path: skoocea at wildlife.com
X-OriginalArrivalTime: 31 Mar 2003 07:44:18.0807 (UTC)
	FILETIME=[56139870:01C2F759]


http://jeajeeceap.lewdmother.com
"""


The complete list of tokens the Outlook client generates by default for
that message is:

'cc:none'
'content-type:text/plain'
'from:addr:skoocea'
'from:addr:wildlife.com'
'from:name:ella schotte'
'header:Date:1'
'header:From:1'
'header:Message-ID:1'
'header:Received:1'
'header:Return-Path:1'
'header:Subject:1'
'header:To:1'
'message-id:@wildlife.com'
'noheader:abuse-reports-to'
'noheader:errors-to'
'noheader:importance'
'noheader:in-reply-to'
'noheader:mime-version'
'noheader:organization'
'noheader:reply-to'
'noheader:user-agent'
'noheader:x-abuse-info'
'noheader:x-complaints-to'
'noheader:x-face'
'proto:http'
'reply-to:none'
'sender:none'
'subject: '
'subject:Daughter'
'subject:with'
'to:2**0'
'to:addr:email.msn.com'
'to:addr:tim_one'
'to:no real name:2**0'
'url:com'
'url:jeajeeceap'
'url:lewdmother'
'x-mailer:none'
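
The 'proto:' and 'url:' entries in that list come from the lone URL in the
body.  A minimal sketch of that kind of simple URL mining is below.  It is
not the project's tokenizer: the regular expressions and the name
simple_url_tokens are assumptions chosen only to reproduce the four
URL-derived tokens shown above.

```python
import re

# Assumed patterns, for illustration only.
URL_RE = re.compile(r"(?P<proto>https?|ftp)://(?P<rest>[^\s<>\"']+)",
                    re.IGNORECASE)
# Break the remainder of the URL on anything that is not a word-ish character.
SPLIT_RE = re.compile(r"[^A-Za-z0-9_-]+")


def simple_url_tokens(text):
    """Return a 'proto:' token plus one 'url:' token per piece of each URL."""
    tokens = []
    for match in URL_RE.finditer(text):
        tokens.append("proto:" + match.group("proto").lower())
        for piece in SPLIT_RE.split(match.group("rest")):
            if piece:
                tokens.append("url:" + piece.lower())
    return tokens


url_tokens = simple_url_tokens("http://jeajeeceap.lewdmother.com")
```

Sorted, url_tokens matches the 'proto:http', 'url:com', 'url:jeajeeceap'
and 'url:lewdmother' entries in the list.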

Currently, in my home classifier, only 7 of those have spamprobs outside of
(.4, .6), so 31 tokens are ignored.  If "minimal headers" becomes a popular
spam gimmick, that will boost the spamprobs of the assorted "noheader:xyz"
and "xyz:none" tokens, to the point where they're no longer ignored.
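
The "ignored" rule can be sketched like this.  The bounds come from the
(.4, .6) interval mentioned above; the function name significant_tokens and
the spamprob values in the example are invented purely for illustration and
are not taken from any real classifier.

```python
def significant_tokens(spamprobs, low=0.4, high=0.6):
    """Keep only tokens whose spamprob lies outside the open interval (low, high)."""
    return {tok: p for tok, p in spamprobs.items() if p <= low or p >= high}


# Made-up spamprobs, for illustration only.
example = {
    "url:lewdmother": 0.99,
    "from:addr:wildlife.com": 0.20,
    "header:Date:1": 0.50,
    "noheader:x-face": 0.45,
}
kept = significant_tokens(example)
ignored = set(example) - set(kept)
```

With these invented numbers the two strong tokens are kept and the two
bland ones are ignored.  If a "noheader:" token's spamprob drifted up to
.6 or beyond, it would start to count.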



