[Spambayes] how spambayes handles image-only spams

Skip Montanaro skip at pobox.com
Mon Sep 8 15:03:48 EDT 2003


    Ryan> This is what I was getting at. Here are results from the most
    Ryan> recent 1549 messages of each of my own corpora, which are probably
    Ryan> biased towards HTML ham:

    Ryan>         ham    ham %   spam   spam %

    Ryan> <P>     953    61.5%   1022   66.0%
    Ryan> <BR>    1223   79.0%   1009   65.1%
    Ryan> <TD     67     4.3%    425    27.4%
    Ryan> <font   1250   80.7%   1039   67.1%
    Ryan> <img    53     3.4%    817    52.7%

    Ryan> Total   1549           1549

    Ryan> As you can see, because so many people use Outlook, Outlook
    Ryan> Express, and Notes to send me ham, HTML tags are present in a
    Ryan> great deal of what I receive. (Except of course for <TD, which
    Ryan> only seems to be ham when someone is sending excerpts from a
    Ryan> spreadsheet to me, and <img, which is only used when people send
    Ryan> me photos or joke images.)

Do you have any evidence which suggests that SpamBayes is not properly
scoring your mail?  

    Ryan> My basic argument is that arbitrarily throwing out some HTML
    Ryan> tokens in the parser, while leaving others, might make the filter
    Ryan> more effective for only certain corpora. What test corpora were
    Ryan> this decision based on?

What HTML tokens are kept?  Which are thrown out?  As far as I know, all are
discarded, though URLs are checked.

I've appended a sample message pulled from mail I received within the last
hour or so.  Note that the score is bogus.  I added these options to my ini
file:

    [Classifier]
    max_discriminators: 1000
    minimum_prob_strength: 0.0

to make sure all tokens were included in the debug header. The message
actually scored 0.99 (rounded) using my current training database without
resorting to tokenizing HTML.  Note, in particular, that url:gif is a fairly
spammy token for me.  If most of the mail you get containing <img> tags is
spam, I suspect url:gif and url:jpg are spammy for you as well.

Note all the url:* synthetic tokens that were generated.  Also, note that
the url:imgemail_r?_c1 tokens are generated even though they only appear in
the <img> tags.
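
To make the point about the url:* tokens concrete, here is a minimal sketch
of how such tokens could be derived from any URL in the body, whether it sits
in an href or in an <img src>. It is not the actual SpamBayes tokenizer; the
regular expressions and the function name url_tokens are assumptions made for
illustration, and the input is assumed to be already decoded from
quoted-printable.

```python
import re

# Hypothetical patterns, not copied from the SpamBayes source.
URL_RE = re.compile(r"(https?|ftp)://([^\s<>\"']+)", re.IGNORECASE)
SPLIT_RE = re.compile(r"[./:?&=]+")

def url_tokens(text):
    """Return synthetic proto:/url: tokens for every URL found in text."""
    tokens = set()
    for proto, rest in URL_RE.findall(text):
        tokens.add("proto:" + proto.lower())
        # Break the remainder of the URL into host and path pieces.
        for piece in SPLIT_RE.split(rest.lower()):
            if piece:
                tokens.add("url:" + piece)
    return tokens

html = ('<a href="http://www.it5150.net/gr1.htm">'
        '<img src="http://www.it5150.net/imgemail_r1_c1.gif" width="577"></a>')
print(sorted(url_tokens(html)))
```

The URL inside the <img> tag contributes url:imgemail_r1_c1 and url:gif even
though the tag itself is never turned into a token.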

I think SpamBayes is already extracting just about all the useful content it
can from the message, even from the <img> tags: it appears to generate
several URL-related tokens per <img> tag.  Adding an html:img token would
simply add one more synthetic token to all those which are currently
generated, and probably wouldn't change the way any given message scores (it
wouldn't be much spammier than url:gif or url:jpg).
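
For a rough idea of how spammy a hypothetical html:img token would be, the
counts Ryan posted can be run through a per-token estimate like the one
below. This is a sketch, not the SpamBayes classifier itself: it computes the
ratio of spam frequency to total frequency and then pulls it toward a neutral
prior (a Robinson-style adjustment). The function name and the constants 0.45
and 0.5 are assumptions here; check your own option settings.

```python
def token_spam_prob(ham_count, spam_count, n_ham, n_spam,
                    strength=0.45, unknown_prob=0.5):
    """Estimate a token's spam probability from message counts."""
    n = ham_count + spam_count
    if n == 0:
        # Never-seen token: fall back to the neutral prior.
        return unknown_prob
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    # Shrink the raw estimate toward the prior; the effect fades as n grows.
    return (strength * unknown_prob + n * raw) / (strength + n)

# (ham, spam) message counts from Ryan's table, 1549 messages per corpus.
counts = {
    "<P>": (953, 1022),
    "<BR>": (1223, 1009),
    "<TD": (67, 425),
    "<font": (1250, 1039),
    "<img": (53, 817),
}
for tag, (ham, spam) in counts.items():
    print(f"{tag:6s} {token_spam_prob(ham, spam, 1549, 1549):.2f}")
```

On those counts <img comes out at roughly 0.94.  That number comes from
Ryan's corpus, so it is not directly comparable with the 0.89 that url:gif
has in my database.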

Return-Path: <cmaster at mojam.com>
Received: from localhost [127.0.0.1] by localhost with POP3 (fetchmail-6.1.0)
        for skip at localhost (single-drop);
        Mon, 08 Sep 2003 13:04:51 -0500 (CDT)
Received: (from cmaster at localhost)
        by manatee.mojam.com (8.11.6/8.11.6) id h88I1QD16395
        for skip at manatee.mojam.com; Mon, 8 Sep 2003 13:01:26 -0500
Received: from thanatos.imocos.com (thanatos.imocos.com [195.126.165.234] (may
        be forged))
        by manatee.mojam.com (8.11.6/8.11.6) with SMTP id h88I1MG16381
        for <webmaster at webfast.com>; Mon, 8 Sep 2003 13:01:22 -0500
Received: from [134.178.134.172]
        by thanatos.imocos.com with ESMTP id E576BA30510;
        Tue, 09 Sep 2003 02:56:11 +0300
Message-ID: <2$08fr$w2u616$$-p2581-d15u46mzc at 12iumxr>
X-Mailer: MIME-tools 5.503 (Entity 5.501)
MIME-Version: 1.0
Content-Type: multipart/alternative;
        boundary="93F30FF9E3.0F__DFDC0_43"
X-Priority: 3
X-MSMail-Priority: Normal
X-UIDL: 1&9"!2Na"!p2I"!5EQ"!
From: "Jarrod Hendrickson" <d72lpkjdbc at laoficina.com>
To: webmaster at webfast.com
Subject: RE: Join Global Remove NO Spam List
Date: Tue, 09 Sep 03 02:56:11 GMT
Reply-To: "Jarrod Hendrickson" <d72lpkjdbc at laoficina.com>
X-Spambayes-Classification: spam; 0.88
X-Spambayes-Debug: '*H*': 0.01; '*S*': 0.76; 'subject:List': 0.07;
        'subject:Spam': 0.26; 'subject:: ': 0.27;
        'header:Message-ID:1': 0.39; 'subject:Global': 0.41;
        'header:Received:4': 0.44; 'proto:http': 0.47; 'subject: ': 0.48;
        'url:net': 0.49; 'header:To:1': 0.50; 'header:Subject:1': 0.50;
        'header:From:1': 0.50; 'header:Date:1': 0.50;
        'from:addr:d72lpkjdbc': 0.50;
        'from:name:jarrod hendrickson': 0.50;
        'message-id:@12iumxr': 0.50; 'url:gr1': 0.50;
        'url:imgemail_r1_c1': 0.50; 'url:imgemail_r2_c1': 0.50;
        'url:it5150': 0.50; 'url:sr1': 0.50;
        'header:Return-Path:1': 0.50; 'to:2**0': 0.51;
        'to:no real name:2**0': 0.53; 'header:MIME-Version:1': 0.53;
        'header:Reply-To:1': 0.57; 'url:www': 0.63;
        'subject:Remove': 0.66; 'subject:Join': 0.72;
        'content-type:multipart/alternative': 0.75; 'url:htm': 0.82;
        'content-type:text/html': 0.83; 'earthenware': 0.84;
        'from:addr:laoficina.com': 0.84; 'to:addr:webmaster': 0.88;
        'url:gif': 0.89;
        'x-mailer:mime-tools 5.503 (entity 5.501)': 0.99;
        'to:addr:webfast.com': 0.99


--93F30FF9E3.0F__DFDC0_43
Content-Type: text/html;
Content-Transfer-Encoding: quoted-printable

<html>
<body>earthenware
<table width=3D"400" border=3D"0" cellspacing=3D"0" cellpadding=3D"0">
  <tr>
    <td><a href=3D"http://www.it5150.net/gr1.htm"><img src=3D"http://www.i=
t5150.net/imgemail_r1_c1.gif" width=3D"577" height=3D"377" border=3D"0"></=
a></td>
  </tr>
  <tr>
    <td><a href=3D"http://www.it5150.net/sr1.htm"><img src=3D"http://www.i=
t5150.net/imgemail_r2_c1.gif" width=3D"577" height=3D"78" border=3D"0"></a=
></td>
  </tr>
</table>
</body>
</html>

--93F30FF9E3.0F__DFDC0_43--

    Ryan> I think keeping some form of <img as tokens would help my
    Ryan> detection of image-only spam, which seems to slip through
    Ryan> SpamBayes more often than other types of spam. I also think it
    Ryan> would be even better to have a multi-word token, something like
    Ryan> that produced by the CRM-114 token generator, which could find
    Ryan> multi-tag strings like <img*src*http.  These suggestions are just
    Ryan> based on my knowledge of the algorithms involved and the contents
    Ryan> of my corpora. I don't know enough Python to really give them a
    Ryan> try in SpamBayes (although I'm working on that ;-).

Speculating doesn't make it so.  You have to back your "I think"s up with
some tests.
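
As a starting point for such a test, multi-tag tokens of the <img*src*http
kind could be generated with something like the sketch below. It is neither
CRM-114's algorithm nor anything in SpamBayes; it simply joins each run of
three consecutive words with "*". The function name phrase_tokens, the word
pattern and the window width are all assumptions made for illustration.

```python
import re

# A "word" is a run of word characters, optionally led by "<" so that
# tag openers such as "<img" stay distinct from plain words.
WORD_RE = re.compile(r"<?\w+")

def phrase_tokens(html, width=3):
    """Return the set of width-word phrases, joined with '*'."""
    words = WORD_RE.findall(html.lower())
    return {"*".join(words[i:i + width])
            for i in range(len(words) - width + 1)}

sample = '<td><img src="http://www.it5150.net/imgemail_r1_c1.gif"></td>'
print(sorted(phrase_tokens(sample)))
```

Whether tokens like these improve scoring is exactly what would have to be
measured against real ham and spam corpora.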

Skip


