[Spambayes] Tokenizing ideas (images, attachments)
Harri Pesonen
harri.pesonen at wicom.com
Wed Aug 27 11:22:54 EDT 2003
Yeah, I read that FAQ; I'm just learning Python at the moment. I don't
see any url: tokens. I use the Outlook plugin, so perhaps the problem is
there: does it not use the HTMLBody property?
Btw, what do header:Received:1 and header:User-Agent:1 mean? Does
SpamBayes have an internal blacklist? And what about Date:1, From:1,
MIME-Version:1, etc.? :-)
Spam Score: 0.997882
word                         spamprob     #ham  #spam
'*H*'                        5.3187e-005  -     -
'*S*'                        0.995817     -     -
'subjectcharset:iso-8859-1'  0.15272      57    12
'subject: - '                0.241103     64    24
'reply-to:none'              0.340317     349   214
'header:Date:1'              0.616644     232   444
'header:From:1'              0.617705     232   446
'header:MIME-Version:1'      0.61773      207   398
'to:no real name:2**0'       0.648347     170   373
'header:Return-Path:1'       0.680694     175   444
'header:Message-ID:1'        0.6998       151   419
'to:addr:merlin.fi'          0.723659     101   315
'subject:!'                  0.733392     13    43
'from:addr:aboydhd'          0.82569      0     1
'from:addr:merlin.net.au'    0.82569      0     1
'from:name:amalia boyd'      0.82569      0     1
'message-id:@merlin.net.au'  0.82569      0     1
'subject:Chance'             0.82569      0     1
'subject:Last'               0.82569      0     1
'subject:blowout'            0.82569      0     1
'subject:inventory'          0.82569      0     1
'subject: '                  0.916657     4     55
'subject:Citrate'            0.924304     0     3
'subject:Sildenafil'         0.924304     0     3
'header:Received:1'          0.958977     5     145
'header:User-Agent:1'        0.986969     0     20
Message Stream:
X-MS-Mail-Gibberish: Microsoft Mail Internet Headers Version 2.0
Received: from thing.de ([67.122.162.175]) by postman.merlin.fi with Microsoft
    SMTPSVC(5.0.2195.6713); Wed, 27 Aug 2003 01:37:36 +0300
User-Agent: Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101
From: "Amalia Boyd" <aboydhd at merlin.net.au>
Date: Tue, 26 Aug 2003 18:33:58 +0000
Message-ID: <3F4BA816.730F2157 at merlin.net.au>
To: harri.pesonen at merlin.fi
MIME-Version: 1.0
Subject: =?iso-8859-1?b?TGFzdCBDaGFuY2UgLSBTaWxkZW5hZmlsIENpdHJhdGUgIGludmVudG9yeSBibG93b3V0IQ==?=
Content-Type: text/html
Content-Transfer-Encoding: 8bit
Return-Path: aboydhd at merlin.net.au
X-OriginalArrivalTime: 26 Aug 2003 22:37:36.0988 (UTC) FILETIME=[A637F9C0:01C36C22]
Message Tokens:
33 unique tokens
'cc:none'
'content-type:text/plain'
'from:addr:aboydhd'
'from:addr:merlin.net.au'
'from:name:amalia boyd'
'header:Date:1'
'header:From:1'
'header:MIME-Version:1'
'header:Message-ID:1'
'header:Received:1'
'header:Return-Path:1'
'header:Subject:1'
'header:To:1'
'header:User-Agent:1'
'message-id:@merlin.net.au'
'reply-to:none'
'sender:none'
'subject: '
'subject: '
'subject: - '
'subject:!'
'subject:Chance'
'subject:Citrate'
'subject:Last'
'subject:Sildenafil'
'subject:blowout'
'subject:inventory'
'subjectcharset:iso-8859-1'
'to:2**0'
'to:addr:harri.pesonen'
'to:addr:merlin.fi'
'to:no real name:2**0'
'x-mailer:none'
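
On the header:Name:N clues in the dump above: they appear to count how many times each header name occurs in the message (this one has exactly one Received line, hence header:Received:1). Below is a minimal Python sketch of that idea using only the standard email package. The function name header_count_tokens is hypothetical, and this is an illustration, not SpamBayes's actual tokenizer code.

```python
from collections import Counter
from email import message_from_string


def header_count_tokens(raw_message):
    """Yield one 'header:<Name>:<count>' token per distinct header name."""
    msg = message_from_string(raw_message)
    # msg.keys() lists a header name once per occurrence, so counting the
    # names gives the number of times each header appears.
    counts = Counter(msg.keys())
    for name, n in sorted(counts.items()):
        yield "header:%s:%d" % (name, n)


# A cut-down version of the headers shown in the message stream above.
sample = (
    "Received: from thing.de by postman.merlin.fi\n"
    "User-Agent: Mozilla/5.001\n"
    "Date: Tue, 26 Aug 2003 18:33:58 +0000\n"
    "MIME-Version: 1.0\n"
    "\n"
    "body\n"
)
tokens = list(header_count_tokens(sample))
```

A message relayed through several servers would carry several Received lines and so produce a different token, e.g. header:Received:3, which is why the count can be a useful clue in its own right.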
-----Original Message-----
From: Meyer, Tony [mailto:T.A.Meyer at massey.ac.nz]
Sent: 27. elokuuta 2003 10:08
To: Harri Pesonen; spambayes at python.org
Subject: RE: [Spambayes] Tokenizing ideas (images, attachments)
> Why not tokenize image URLs?
[...]
> While SpamBayes detected this message just fine,
There's a reason why not ;)
> Many times the message is empty or almost
> empty, containing only an image URL.
Note that any URL, including image ones, is tokenized. If you look at
the clues for a message like the one you used as an example, you should
see some url: tokens.
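
To make the idea concrete, here is a rough Python sketch of pulling URLs out of a message body and breaking them into url: tokens. The regular expression, the function name url_tokens, and the exact token format (including the proto: token) are assumptions for illustration; the real tokenizer may split URLs differently.

```python
import re
from urllib.parse import urlsplit

# Assumption: only http/https URLs are of interest, and a URL ends at
# whitespace, an angle bracket, or a quote character.
URL_RE = re.compile(r"""https?://[^\s<>"']+""", re.IGNORECASE)


def url_tokens(text):
    """Yield a 'proto:' token plus one 'url:' token per piece of each URL."""
    for url in URL_RE.findall(text):
        parts = urlsplit(url)
        yield "proto:" + parts.scheme.lower()
        # Split host and path on anything that is not a letter or digit.
        for piece in re.split(r"[^A-Za-z0-9]+", parts.netloc + parts.path):
            if piece:
                yield "url:" + piece.lower()


html = '<img src="http://example.com/images/offer.gif">'
tokens = list(url_tokens(html))
```

With a tokenizer along these lines, an otherwise empty message that contains only an image link still contributes clues, because the image URL is treated like any other URL.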
It has been suggested that tokenizing (textual) information at the end
of the URL would be worthwhile (this includes a token if the URL 404s).
We tested this out (look at the urlslurper.py file), but didn't have
enough people testing to integrate it into the main code (as a
default-to-off option). Death2Spam (see the related page) does this,
though, and Richard swears by it.
In any case, the best thing is to try these (or any other) ideas out.
See FAQ 6.1:
<file:///D:/cvs/spambayes/website/faq.html#why-don-t-you-implement-cool-tokenizer-trick-x>
=Tony Meyer