[Spambayes] how spambayes handles image-only spams
Skip Montanaro
skip at pobox.com
Mon Sep 8 15:03:48 EDT 2003
Ryan> This is what I was getting at, here are results from the most
Ryan> recent 1549 messages of each of my own corpora, which are probably
Ryan> biased towards HTML ham:
Ryan>            ham   ham %   spam   spam %
Ryan>   <P>      953   61.5%   1022    66.0%
Ryan>   <BR>    1223   79.0%   1009    65.1%
Ryan>   <TD       67    4.3%    425    27.4%
Ryan>   <font   1250   80.7%   1039    67.1%
Ryan>   <img      53    3.4%    817    52.7%
Ryan>   Total   1549           1549
Ryan> As you can see, because so many people use Outlook, Outlook
Ryan> Express, and Notes to send me ham, HTML tags are present in a
Ryan> great amount of what I receive. (Except of course for <TD, which
Ryan> only seems to be ham when someone is sending excerpts from a
Ryan> spreadsheet to me, and <img, which is only used when people send
Ryan> me photos or joke images.)
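A minimal sketch of how per-tag counts like those in the table could be gathered, assuming each message body has already been decoded into a plain string. The names `TAGS` and `tag_counts` are made up for illustration, and this is not necessarily how Ryan produced his numbers.

```python
# Tag prefixes from the table above; matching is case-insensitive.
TAGS = ["<p>", "<br>", "<td", "<font", "<img"]

def tag_counts(messages):
    """Return {tag: number of messages containing that tag at least once}."""
    counts = {tag: 0 for tag in TAGS}
    for text in messages:
        lowered = text.lower()
        for tag in TAGS:
            if tag in lowered:
                counts[tag] += 1
    return counts

def report(messages):
    """Print one row per tag: tag, message count, percentage of corpus."""
    total = len(messages)
    for tag, n in tag_counts(messages).items():
        print(f"{tag:8s}{n:6d}{100.0 * n / total:7.1f}%")

# Tiny made-up corpus, just to show the call.
sample = [
    "<html><P>hi<BR></html>",
    "plain text only",
    '<td><img src="x.gif"></td>',
]
counts = tag_counts(sample)
report(sample)
```

Running the same function over a ham corpus and a spam corpus gives the two halves of the table. Counting messages rather than tag occurrences is a choice made here to match the percentages shown.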
Do you have any evidence which suggests that SpamBayes is not properly
scoring your mail?
Ryan> My basic argument is that arbitrarily throwing out some HTML
Ryan> tokens in the parser, while leaving others, might make the filter
Ryan> more effective for only certain corpora. What test corpora was
Ryan> this decision based on?
What HTML tokens are kept? Which are thrown out? As far as I know, all HTML
tags are discarded, though URLs are still extracted and tokenized.
I've appended a sample message pulled from mail I received within the last
hour or so. Note that the score shown in its headers is bogus, because I
added these options to my ini file:
    [Classifier]
    max_discriminators: 1000
    minimum_prob_strength: 0.0
to make sure all tokens were included in the debug header. The message
actually scored 0.99 (rounded) using my current training database without
resorting to tokenizing HTML. Note, in particular, that url:gif is a fairly
spammy token for me. If most of the mail you get containing <img> tags is
spam, I suspect url:gif and url:jpg are spammy for you as well.
Note all the url:* synthetic tokens that were generated. Also, note that
the url:imgemail_r?_c1 tokens are generated even though they only appear in
the <img> tags.
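To make the point concrete, here is a rough sketch of how url:* tokens can fall out of an <img> tag. It is not the SpamBayes tokenizer itself: the regular expressions `URL_RE` and `SEP_RE` and the function `url_tokens` are assumptions made for illustration. The input is the first table row of the appended message, still quoted-printable encoded.

```python
import quopri
import re

# Hypothetical patterns, not copied from SpamBayes.
URL_RE = re.compile(r"""(https?|ftp)://([^\s<>"']+)""", re.IGNORECASE)
SEP_RE = re.compile(r"[/;?:@&=+,$.]")

def url_tokens(html):
    """Return the set of proto:* and url:* tokens for every URL in html."""
    tokens = set()
    for proto, rest in URL_RE.findall(html):
        tokens.add("proto:" + proto.lower())
        # Split host and path into pieces on common URL separators.
        for piece in SEP_RE.split(rest):
            if piece:
                tokens.add("url:" + piece.lower())
    return tokens

# First <tr> of the sample message body, as it appears on the wire.
body = (
    b'<td><a href=3D"http://www.it5150.net/gr1.htm"><img src=3D"http://www.i=\n'
    b't5150.net/imgemail_r1_c1.gif" width=3D"577" height=3D"377" border=3D"0"></=\n'
    b'a></td>\n'
)
# Undo the quoted-printable encoding (=3D and soft line breaks) first.
decoded = quopri.decodestring(body).decode("ascii")
print(sorted(url_tokens(decoded)))
```

The pieces this produces (url:www, url:it5150, url:net, url:gr1, url:htm, url:imgemail_r1_c1, url:gif, plus proto:http) all show up in the X-Spambayes-Debug header below, including the ones that only occur inside the <img> tag.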
I think SpamBayes is extracting just about all the useful content it can
from the message already, even from the <img> tags. It appears that
SpamBayes is already generating several URL-related tokens per <img> tag.
Adding an html:img token would simply add one more synthetic token to those
already generated, and probably wouldn't change the way any given message
scores (it wouldn't be much spammier than url:gif or url:jpg).
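One way to get a feel for how little a single extra clue matters is to combine the clue probabilities with and without it. The sketch below assumes chi-squared combining, which as far as I recall is what SpamBayes uses by default. The function names and the 0.89 probability given to a hypothetical html:img token (borrowed from url:gif) are assumptions, and the clue values are the rounded ones from the debug header below, so the result will not exactly reproduce the real score.

```python
import math

def chi2Q(x2, v):
    """P(chi-squared with even v degrees of freedom >= x2)."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_spam = sum(math.log(1.0 - p) for p in probs)
    ln_ham = sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_spam, 2 * n)  # evidence for spam
    H = 1.0 - chi2Q(-2.0 * ln_ham, 2 * n)   # evidence for ham
    return (S - H + 1.0) / 2.0

# Rounded clue probabilities from the X-Spambayes-Debug header.
clues = (
    [0.07, 0.26, 0.27, 0.39, 0.41, 0.44, 0.47, 0.48, 0.49]
    + [0.50] * 13
    + [0.51, 0.53, 0.53, 0.57, 0.63, 0.66, 0.72, 0.75,
       0.82, 0.83, 0.84, 0.84, 0.88, 0.89, 0.99, 0.99]
)
base = combine(clues)
with_img = combine(clues + [0.89])  # hypothetical html:img clue
print(f"without html:img: {base:.2f}  with html:img: {with_img:.2f}")
```

As a sanity check on the last step, the header's '*S*': 0.76 and '*H*': 0.01 give (0.76 - 0.01 + 1) / 2 = 0.875, which is consistent with the reported 0.88.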
Return-Path: <cmaster at mojam.com>
Received: from localhost [127.0.0.1] by localhost with POP3 (fetchmail-6.1.0)
for skip at localhost (single-drop);
Mon, 08 Sep 2003 13:04:51 -0500 (CDT)
Received: (from cmaster at localhost)
by manatee.mojam.com (8.11.6/8.11.6) id h88I1QD16395
for skip at manatee.mojam.com; Mon, 8 Sep 2003 13:01:26 -0500
Received: from thanatos.imocos.com (thanatos.imocos.com [195.126.165.234] (may
be forged))
by manatee.mojam.com (8.11.6/8.11.6) with SMTP id h88I1MG16381
for <webmaster at webfast.com>; Mon, 8 Sep 2003 13:01:22 -0500
Received: from [134.178.134.172]
by thanatos.imocos.com with ESMTP id E576BA30510;
Tue, 09 Sep 2003 02:56:11 +0300
Message-ID: <2$08fr$w2u616$$-p2581-d15u46mzc at 12iumxr>
X-Mailer: MIME-tools 5.503 (Entity 5.501)
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="93F30FF9E3.0F__DFDC0_43"
X-Priority: 3
X-MSMail-Priority: Normal
X-UIDL: 1&9"!2Na"!p2I"!5EQ"!
From: "Jarrod Hendrickson" <d72lpkjdbc at laoficina.com>
To: webmaster at webfast.com
Subject: RE: Join Global Remove NO Spam List
Date: Tue, 09 Sep 03 02:56:11 GMT
Reply-To: "Jarrod Hendrickson" <d72lpkjdbc at laoficina.com>
X-Spambayes-Classification: spam; 0.88
X-Spambayes-Debug: '*H*': 0.01; '*S*': 0.76; 'subject:List': 0.07;
'subject:Spam': 0.26; 'subject:: ': 0.27;
'header:Message-ID:1': 0.39; 'subject:Global': 0.41;
'header:Received:4': 0.44; 'proto:http': 0.47; 'subject: ': 0.48;
'url:net': 0.49; 'header:To:1': 0.50; 'header:Subject:1': 0.50;
'header:From:1': 0.50; 'header:Date:1': 0.50;
'from:addr:d72lpkjdbc': 0.50;
'from:name:jarrod hendrickson': 0.50;
'message-id:@12iumxr': 0.50; 'url:gr1': 0.50;
'url:imgemail_r1_c1': 0.50; 'url:imgemail_r2_c1': 0.50;
'url:it5150': 0.50; 'url:sr1': 0.50;
'header:Return-Path:1': 0.50; 'to:2**0': 0.51;
'to:no real name:2**0': 0.53; 'header:MIME-Version:1': 0.53;
'header:Reply-To:1': 0.57; 'url:www': 0.63;
'subject:Remove': 0.66; 'subject:Join': 0.72;
'content-type:multipart/alternative': 0.75; 'url:htm': 0.82;
'content-type:text/html': 0.83; 'earthenware': 0.84;
'from:addr:laoficina.com': 0.84; 'to:addr:webmaster': 0.88;
'url:gif': 0.89;
'x-mailer:mime-tools 5.503 (entity 5.501)': 0.99;
'to:addr:webfast.com': 0.99
--93F30FF9E3.0F__DFDC0_43
Content-Type: text/html;
Content-Transfer-Encoding: quoted-printable
<html>
<body>earthenware
<table width=3D"400" border=3D"0" cellspacing=3D"0" cellpadding=3D"0">
<tr>
<td><a href=3D"http://www.it5150.net/gr1.htm"><img src=3D"http://www.i=
t5150.net/imgemail_r1_c1.gif" width=3D"577" height=3D"377" border=3D"0"></=
a></td>
</tr>
<tr>
<td><a href=3D"http://www.it5150.net/sr1.htm"><img src=3D"http://www.i=
t5150.net/imgemail_r2_c1.gif" width=3D"577" height=3D"78" border=3D"0"></a=
></td>
</tr>
</table>
</body>
</html>
--93F30FF9E3.0F__DFDC0_43--
Ryan> I think keeping some form of <img as tokens would help
Ryan> my detection of image-only spam, which seems to slip through
Ryan> SpamBayes more often than other types of spam. I also think it
Ryan> would be even better to have a multi-word token something like
Ryan> that produced by the CRM-114 token generator, which could find
Ryan> multi-tag strings like <img*src*http. These suggestions are just
Ryan> based on my knowledge of the algorithms involved and the contents
Ryan> of my corpora, I don't know enough python to really give them a
Ryan> try in SpamBayes (although I'm working on that ;-).
Speculating doesn't make it so. You have to back your "I think"s up with
some tests.
Skip