[Spambayes] Slice o' life

Tim Peters tim.one@comcast.net
Wed Oct 16 05:33:01 2002


[Tim]
> ...
> This has been my first chance to play with mining the headers for real:
>
> """
> [Tokenizer]
> mine_received_headers: True
> basic_header_tokenize: True
>
> [Classifier]
> use_chi_squared_combining: True
> """

And now I note the first systematic weakness:  I scored my own "spam"
folder, and discovered 5 spam with scores of 0.0.  They all have one thing
in common:  they're spam that SpamAssassin didn't catch, and came to me via
a python.org mailing list.

It turns out that python.org, Mailman, and SpamAssassin put sooooooooo many
unique "Hey, I had my fingers in this!" clues in the headers that virtually any
message coming thru python.org has a relatively huge collection of
killer-strong ham clues (just listing headers containing such clues):

Received: from mail.python.org (mail.python.org [12.155.117.29]) ...
Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org)
	by mail.python.org with esmtp (Exim 4.05)	...
Received: from [168.103.194.76] (helo=wvwrbn)	by mail.python.org ...
Subject: [Python-Help] Mp3sa  hwnf
Sender: python-help-admin@python.org
To: help@python.org
Errors-to: python-help-admin@python.org
Precedence: bulk
X-BeenThere: python-help@python.org
X-warning: 168.103.194.76 in blacklist at list.dsbl.org
 (http://dsbl.org/listing.php?168.103.194.76)
X-Spam-Status: No, hits=3.8 required=5.0
	tests=BASE64_ENC_TEXT,CTYPE_JUST_HTML
X-Spam-Level: ***
X-Mailman-Version: 2.0.13 (101270)
List-Post: <mailto:python-help@python.org>
List-Subscribe: <http://mail.python.org/mailman/listinfo/python-help>,
	<mailto:python-help-request@python.org?subject=subscribe>
List-Unsubscribe: <http://mail.python.org/mailman/listinfo/python-help>,
	<mailto:python-help-request@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/mailman/private/python-help/>
List-Help: <mailto:python-help-request@python.org?subject=help>
List-Id: Expert volunteers answer Python-related questions
 <python-help.python.org>

This was an HTML msg that appeared to be pushing a Turkish MP3 site.  It's
not a dead-easy msg to score, but I also got a copy from another email
account, and it scored 0.64 there (instead of 0 via python.org).  I guess I
go back to ignoring various header lines again ...
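
A rough sketch of why a pile of header clues can drag a spam all the way down under chi-squared combining. This is a simplified Fisher-style combiner written for illustration, not the classifier's actual code. The clue probabilities and the names `chi2Q`, `chi2_combine`, `header_clues` and `body_clues` are assumptions made up for the example.

```python
import math

def chi2Q(x2, v):
    """Probability that a chi-squared variable with v degrees of
    freedom exceeds x2.  Only valid for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair over 1.0.
    return min(total, 1.0)

def chi2_combine(probs):
    """Combine per-clue spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    # Sum logs rather than multiplying, to avoid underflow.
    ln_s = sum(math.log(1.0 - p) for p in probs)
    ln_h = sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_s, 2 * n)  # evidence for spam
    H = 1.0 - chi2Q(-2.0 * ln_h, 2 * n)  # evidence for ham
    return (S - H + 1.0) / 2.0

# Hypothetical numbers: 30 strong ham clues from list-added headers,
# 5 strong spam clues from the message body.
header_clues = [0.01] * 30
body_clues = [0.99] * 5

with_headers = chi2_combine(header_clues + body_clues)
body_only = chi2_combine(body_clues)
print("with header clues:", with_headers)
print("body clues only:  ", body_only)
```

With these made-up inputs the header clues pull the combined score far below 0.5, while the body clues alone score close to 1. That's the same direction as the 0 vs 0.64 difference above, although the real clue sets are of course different.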