[Spambayes] Run filter and only return a report???

Tony Meyer tameyer at ihug.co.nz
Thu Feb 10 23:32:32 CET 2005


> I'd probably only be interested in the tokens that were used 
> in scoring and the output just needs to be in an easily parseable format.

If you do something like this:

python scripts/sb_filter.py -d hammie.db -o Headers:include_evidence:True < msg.txt

You'll get a header in the message copy output to stdout that looks like
this:

X-Spambayes-Evidence: '*H*': 1.00; '*S*': 0.00; 'shows': 0.06; 'cheers': 0.09;
        'finished': 0.09; 'show.': 0.09; 'channel': 0.09; 'url:nz': 0.10;
        'school,': 0.13; 'enough': 0.13; 'space': 0.13; 'acting': 0.16;
        'there,': 0.16; 'trailers': 0.16; 'xtra': 0.16; 'yeah,': 0.16;
        "year's": 0.16; 'broadband': 0.17; 'year': 0.18; 'done': 0.18;
        'movie': 0.19; 'keen': 0.20; 'url:co': 0.20; "i've": 0.21;
        "you're": 0.21; 'getting': 0.21; 'high': 0.23; 'next': 0.23;
        'find': 0.24; 'just': 0.25; 'first': 0.25; 'let': 0.26;
        'but': 0.27; 'when': 0.27; 'couple': 0.28; 'know': 0.28;
        'really': 0.29; 'like': 0.30; "don't": 0.30; 'online': 0.31;
        'header:Mime-Version:1': 0.32; 'watch': 0.32; 'please': 0.33;
        "i'm": 0.33; 'message-id:@hotmail.com': 0.34; 'with': 0.38;
        'header:Return-path:1': 0.40; "subject:'": 0.66;
        'from:addr:hotmail.com': 0.69; 'header:Received:4': 0.72;
        'to:addr:madsods.gen.nz': 0.83; 'ellis': 0.84;
        'subject:show': 0.84; 'subject:year': 0.84; 'skip:_ 60': 0.91

These are just the tokens that were used in scoring ('*H*' and '*S*' are
special internal tokens that represent the individual ham and spam scores;
you probably want to ignore those).  Parsing that would be reasonably simple.
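
As a rough illustration of that parsing, here is a minimal sketch in Python.
It assumes the header value has already been pulled out of the message as a
single string, and that every entry has the 'token': probability shape shown
above.  The names PAIR and parse_evidence are made up for this example; they
are not part of SpamBayes.

```python
import ast
import re

# One "'token': 0.06" entry.  The token is written as a Python string
# literal (single- or double-quoted), so match the literal as a whole
# rather than splitting on ';' or ':', which can occur inside tokens.
PAIR = re.compile(
    r"""(u?'(?:[^'\\]|\\.)*'|u?"(?:[^"\\]|\\.)*")\s*:\s*([0-9.eE+-]+)"""
)

def parse_evidence(value, skip_internal=True):
    """Return a list of (token, spamprob) pairs from the header value."""
    clues = []
    for literal, prob in PAIR.findall(value):
        token = ast.literal_eval(literal)  # undo the quoting safely
        if skip_internal and token in ("*H*", "*S*"):
            continue  # internal ham/spam scores, not real tokens
        clues.append((token, float(prob)))
    return clues

# A shortened copy of the header above, including a folded line.
header = """'*H*': 1.00; '*S*': 0.00; 'shows': 0.06; "year's": 0.16;
        'skip:_ 60': 0.91"""

for token, prob in parse_evidence(header):
    print("%-12s %.2f" % (token, prob))
```

In practice the value would come from the filtered message that sb_filter.py
writes to stdout, for instance by reading it with the standard email module
and looking up the X-Spambayes-Evidence header.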

> Right, just give me a score, don't make any changes to the 
> database or attempt to deliver the message.

Running the above command follows those rules.

> Thanks.  I need to add Python to the list of programming 
> languages I know.

It only takes a day <wink>.

> Basically, a friend whose company uses SpamBayes with the Outlook 
> plug-in sent me a report he saw, here is a summary:
> 
> Combined Score: 100% (0.999998)
> Internal ham score (*H*): 4.79832e-006
> Internal spam score (*S*): 1
> 
> # ham trained on: 89
> # spam trained on: 1733
> 28 Significant Tokens
> 
> token                               spamprob         #ham  #spam
> 'x-mailer:microsoft office outlook, build 11.0.6353' 0.168914    2      7
> 'url:org'                           0.254701           13     86
> 'url:rec-html40'                    0.277582            3     22
> 'skip:r 10'                         0.284156           28    216
> 'skip:p 10'                         0.321735           31    286
> 'url:tr'                            0.372452            4     46
> 'url:www'                           0.384768           63    767
> 'virus:src="cid:'                   0.72041             3    151
> 'from:addr:level3.net'              0.844828            0      1
> 'subject:\xe4'                      0.844828            0      1
> 	.
> 	.
> 
> That's basically the kind of report I would like to see.

Ok, I've ripped out the code from the Outlook plug-in that does this and
converted it to a command-line script (attached).  Run it something like:

python showclues.py -d hammie.db < msg.txt

It does output in HTML at the moment, because that's what the Outlook
plug-in does (for an Outlook-specific reason).  It would be simple enough to
strip the HTML out of the script, though (I imagine even without knowledge
of Python).  If you'd like that done, I don't mind doing it (this script
seems potentially useful enough for me to check it into the contrib/
directory).  Let me know if there are any other improvements you can think
of.
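
For reference, a plain-text version of the report could look roughly like the
sketch below.  It is not taken from the attached script: format_report and
the counts dictionary are hypothetical, and the function takes plain Python
values (the score, the list of (token, probability) clues with the two
internal scores first, the training totals, and per-token counts) rather than
a SpamBayes classifier, so it can be tried without a database.

```python
def format_report(score, clues, nham, nspam, counts):
    """Build a plain-text clue report.

    counts is a hypothetical mapping of token -> (hamcount, spamcount);
    tokens missing from it are shown with "-" in both columns.
    """
    clues = list(clues)  # copy, so the caller's list is left alone
    lines = ["Combined Score: %d%% (%g)" % (round(score * 100), score)]
    # The first two clues are assumed to be the internal scores.
    lines.append("Internal ham score (%s): %g" % clues.pop(0))
    lines.append("Internal spam score (%s): %g" % clues.pop(0))
    lines.append("")
    lines.append("# ham trained on: %d" % nham)
    lines.append("# spam trained on: %d" % nspam)
    lines.append("%d Significant Tokens" % len(clues))
    lines.append("")
    row = "%-35s %-12s %8s %6s"
    lines.append(row % ("token", "spamprob", "#ham", "#spam"))
    for word, prob in clues:
        ham, spam = counts.get(word, ("-", "-"))
        lines.append(row % (repr(word), "%g" % prob, ham, spam))
    return "\n".join(lines)

# Sample values copied from the report quoted above.
print(format_report(
    0.999998,
    [("*H*", 4.79832e-06), ("*S*", 1.0),
     ("url:org", 0.254701), ("from:addr:level3.net", 0.844828)],
    89, 1733,
    {"url:org": (13, 86), "from:addr:level3.net": (0, 1)},
))
```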

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
-------------- next part --------------
import cgi
import sys
import getopt

from spambayes import storage
from spambayes import mboxutils
from spambayes.classifier import Set
from spambayes.Options import options
from spambayes.tokenizer import tokenize

def ShowClues(bayes, msg):
    score, clues = bayes.spamprob(tokenize(msg), evidence=True)
    body = ["<h2>Combined Score: %d%% (%g)</h2>\n" %
            (round(score*100), score)]
    push = body.append

    # Format internal scores.
    push("Internal ham score (<tt>%s</tt>): %g<br>\n" % clues.pop(0))
    push("Internal spam score (<tt>%s</tt>): %g<br>\n" % clues.pop(0))

    # Format the # ham and spam trained on.
    push("<br>\n")
    push("# ham trained on: %d<br>\n" % bayes.nham)
    push("# spam trained on: %d<br>\n" % bayes.nspam)
    push("<br>\n")

    # Format the clues.
    push("<h2>%s Significant Tokens</h2>\n<PRE>" % len(clues))
    push("<strong>")
    push("token                               spamprob         #ham  #spam\n")
    push("</strong>\n")
    format = " %-12g %8s %6s\n"
    fetchword = bayes.wordinfo.get
    for word, prob in clues:
        record = fetchword(word)
        if record:
            nham = record.hamcount
            nspam = record.spamcount
        else:
            nham = nspam = "-"
        word = repr(word)
        push(cgi.escape(word) + " " * (35-len(word)))
        push(format % (prob, nham, nspam))
    push("</PRE>\n")

    # Now the raw text of the message
    push("<h2>Message Stream</h2>\n")
    push("<PRE>\n")
    push(cgi.escape(msg.as_string()))
    push("</PRE>\n")

    # Show all the tokens in the message
    push("<h2>All Message Tokens</h2>\n")
    # need to re-fetch, as the tokens we see may be different based on
    # header stripping.
    toks = Set(tokenize(msg))
    # create a sorted list
    toks = list(toks)
    toks.sort()
    push("%d unique tokens<br><br>" % len(toks))
    # Use <code> instead of <pre>, as <pre> is not word-wrapped by IE.
    # Text inside <code> still needs escaping, since tokens may contain
    # characters such as '<' or '&'.
    # could use pprint, but not worth it.
    for token in toks:
        push("<code>" + cgi.escape(repr(token)) + "</code><br>\n")

    # Put the body together, then the rest of the message.
    body = ''.join(body)
    body = """\
<HTML>
<HEAD>
<STYLE>
    h2 {color: green}
</STYLE>
</HEAD>
<BODY>""" + body + "</BODY></HTML>"
    return body

if __name__ == "__main__":
    opts, args = getopt.getopt(sys.argv[1:], 'd:p:o:')
    for opt, arg in opts:
        if opt in ('-o', '--option'):
            options.set_from_cmdline(arg, sys.stderr)
    dbname, usedb = storage.database_type(opts)
    bayes = storage.open_storage(dbname, usedb)
    bayes.load()

    if not args:
        args = ["-"]
    for fname in args:
        mbox = mboxutils.getmbox(fname)
        for msg in mbox:
            print ShowClues(bayes, msg)

