[Spambayes] Tokenizing ideas (images, attachments)

Harri Pesonen harri.pesonen at wicom.com
Wed Aug 27 11:22:54 EDT 2003


Yeah, I read that FAQ; I'm just learning Python at the moment. I don't
see any url: tokens. I use the Outlook plugin, so perhaps the problem is
there: maybe it does not use the HTMLBody property?

Btw, what do header:Received:1 and header:User-Agent:1 mean? Does
SpamBayes have an internal blacklist? Also Date:1, From:1,
MIME-Version:1 etc., what do they mean? :-)

Spam Score: 0.997882

word                                spamprob         #ham  #spam
'*H*'                               5.3187e-005         -      -
'*S*'                               0.995817            -      -
'subjectcharset:iso-8859-1'         0.15272            57     12
'subject: - '                       0.241103           64     24
'reply-to:none'                     0.340317          349    214
'header:Date:1'                     0.616644          232    444
'header:From:1'                     0.617705          232    446
'header:MIME-Version:1'             0.61773           207    398
'to:no real name:2**0'              0.648347          170    373
'header:Return-Path:1'              0.680694          175    444
'header:Message-ID:1'               0.6998            151    419
'to:addr:merlin.fi'                 0.723659          101    315
'subject:!'                         0.733392           13     43
'from:addr:aboydhd'                 0.82569             0      1
'from:addr:merlin.net.au'           0.82569             0      1
'from:name:amalia boyd'             0.82569             0      1
'message-id:@merlin.net.au'         0.82569             0      1
'subject:Chance'                    0.82569             0      1
'subject:Last'                      0.82569             0      1
'subject:blowout'                   0.82569             0      1
'subject:inventory'                 0.82569             0      1
'subject:  '                        0.916657            4     55
'subject:Citrate'                   0.924304            0      3
'subject:Sildenafil'                0.924304            0      3
'header:Received:1'                 0.958977            5    145
'header:User-Agent:1'               0.986969            0     20
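
The overall score at the top of this table is consistent with combining the two starred rows as (S - H + 1) / 2. That formula is my understanding of how SpamBayes' chi-squared combining reports its final score; it is not stated anywhere in this message, so treat it as an assumption. A quick check using only the numbers shown above:

```python
# Values copied from the '*H*' and '*S*' rows of the clue table.
H = 5.3187e-05   # '*H*' row
S = 0.995817     # '*S*' row

# Assumed combining rule: map (S - H), which lies in [-1, 1], onto [0, 1].
score = (S - H + 1.0) / 2.0

print("%.6f" % score)   # 0.997882, matching "Spam Score: 0.997882"
```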

Message Stream:

X-MS-Mail-Gibberish: Microsoft Mail Internet Headers Version 2.0
Received: from thing.de ([67.122.162.175]) by postman.merlin.fi with Microsoft
	SMTPSVC(5.0.2195.6713); Wed, 27 Aug 2003 01:37:36 +0300
User-Agent: Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101
From: "Amalia Boyd" <aboydhd at merlin.net.au>
Date: Tue, 26 Aug 2003 18:33:58 +0000
Message-ID: <3F4BA816.730F2157 at merlin.net.au>
To: harri.pesonen at merlin.fi
MIME-Version: 1.0
Subject: =?iso-8859-1?b?TGFzdCBDaGFuY2UgLSBTaWxkZW5hZmlsIENpdHJhdGUgIGludmVudG9yeSBibG93b3V0IQ==?=
Content-Type: text/html
Content-Transfer-Encoding: 8bit
Return-Path: aboydhd at merlin.net.au
X-OriginalArrivalTime: 26 Aug 2003 22:37:36.0988 (UTC)
	FILETIME=[A637F9C0:01C36C22]

Message Tokens:

33 unique tokens

'cc:none'
'content-type:text/plain'
'from:addr:aboydhd'
'from:addr:merlin.net.au'
'from:name:amalia boyd'
'header:Date:1'
'header:From:1'
'header:MIME-Version:1'
'header:Message-ID:1'
'header:Received:1'
'header:Return-Path:1'
'header:Subject:1'
'header:To:1'
'header:User-Agent:1'
'message-id:@merlin.net.au'
'reply-to:none'
'sender:none'
'subject: '
'subject:  '
'subject: - '
'subject:!'
'subject:Chance'
'subject:Citrate'
'subject:Last'
'subject:Sildenafil'
'subject:blowout'
'subject:inventory'
'subjectcharset:iso-8859-1'
'to:2**0'
'to:addr:harri.pesonen'
'to:addr:merlin.fi'
'to:no real name:2**0'
'x-mailer:none'
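
The header:<Name>:<N> tokens in the list above look like per-header occurrence counts: each of those headers appears once in the message stream, and each token ends in :1. As far as I know there is no built-in blacklist behind them; the spamprob for such a token would simply come from the #ham and #spam training counts shown in the clue table. A minimal sketch of how count tokens of this shape could be produced, assuming Python's standard email package (the function name and the optional `names` filter are made up for illustration and are not SpamBayes' own code):

```python
from collections import Counter
from email import message_from_string


def header_count_tokens(raw_message, names=None):
    """Yield 'header:<Name>:<count>' for each header name in the message.

    `names` is a hypothetical optional set of lowercase header names to
    restrict the output to; None means "count every header".
    """
    msg = message_from_string(raw_message)
    # msg.keys() lists every header line, so repeated headers repeat here.
    counts = Counter(msg.keys())
    for name, n in sorted(counts.items()):
        if names is None or name.lower() in names:
            yield "header:%s:%d" % (name, n)


sample = (
    "Received: from a.example by b.example\n"
    "Received: from c.example by a.example\n"
    "User-Agent: ExampleAgent/1.0\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
tokens = list(header_count_tokens(sample))
print(tokens)
```

With a message relayed through two servers, the sketch emits header:Received:2 rather than header:Received:1, so the count itself becomes a clue.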

-----Original Message-----
From: Meyer, Tony [mailto:T.A.Meyer at massey.ac.nz] 
Sent: 27 August 2003 10:08
To: Harri Pesonen; spambayes at python.org
Subject: RE: [Spambayes] Tokenizing ideas (images, attachments)


> Why not tokenize image URLs?
[...]
> While SpamBayes detected this message just fine,

There's a reason why not ;)

> Many times the message is empty or almost
> empty, containing only an image URL.

Note that any URL, including image ones, is tokenized.  If you look at
the clues for a message like the one you used as an example, you should
see some url: tokens.
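
As a rough illustration of what tokenizing a URL can mean, here is a minimal sketch that pulls URLs out of a message body (image sources included) and emits one token per component. The regular expression, the function name, and the exact splitting are assumptions made for this example; the real tokenizer's rules may well differ.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple URL matcher: stops at whitespace, quotes, and angle brackets.
URL_RE = re.compile(r"""(?:https?|ftp)://[^\s<>"']+""", re.IGNORECASE)


def url_tokens(text):
    """Yield 'proto:' and 'url:' tokens for every URL found in text."""
    for match in URL_RE.finditer(text):
        parts = urlsplit(match.group(0))
        yield "proto:" + parts.scheme.lower()
        # Host first, then each non-empty path segment.
        pieces = [parts.netloc.lower()]
        pieces += [p for p in parts.path.split("/") if p]
        for piece in pieces:
            yield "url:" + piece


body = '<img src="http://images.example.com/promo/banner.gif">'
print(list(url_tokens(body)))
```

Even a message whose body is nothing but an image link yields several tokens this way, which is why an almost-empty message can still be scored.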

It has been suggested that tokenizing (textual) information at the end
of the URL would be worthwhile (this includes a token if the URL 404s).
We tested this out (look at the urlslurper.py file), but didn't have
enough people testing to integrate it into the main code (as a
default-to-off option).  Death2Spam (see the related page) does this,
though, and Richard swears by it.
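
To make the idea concrete, here is a hedged sketch of "slurping" a URL: fetch what it points to, emit a special token if it 404s, and otherwise tokenize the text found there. This is not the urlslurper.py code; the function name, the "slurp:" prefix, and the `fetch` callable (which returns a status code and the page text) are all invented for the example, and the fetcher is passed in so the sketch runs without network access.

```python
import re


def slurp_tokens(url, fetch):
    """Yield tokens describing the content behind url.

    fetch is a caller-supplied (hypothetical) callable:
    fetch(url) -> (status_code, text).
    """
    status, text = fetch(url)
    if status == 404:
        # A dead link is itself a clue worth recording.
        yield "slurp:404"
        return
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        yield "slurp:" + word


def fake_fetch(url):
    """Stand-in fetcher for the example; no real HTTP request is made."""
    if url.endswith("/gone"):
        return 404, ""
    return 200, "Cheap pills now"


print(list(slurp_tokens("http://example.com/offer", fake_fetch)))
print(list(slurp_tokens("http://example.com/gone", fake_fetch)))
```

In real use the fetcher would need timeouts and caching, since fetching every URL in every message is slow and tells the sender the address is live.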

In any case, the best thing is to try these (or any other) ideas out.
See FAQ 6.1:

<file:///D:/cvs/spambayes/website/faq.html#why-don-t-you-implement-cool-tokenizer-trick-x>

=Tony Meyer


