[Spambayes] how can i extract the tokens and their spaminess-value?

Thu Apr 7 01:20:09 CEST 2005

> I am working on an art-project for a paper, where spam should 
> be read by a Text-To-Speech engine using the value of spaminess
> applied to the tokens to change the intonation of the voice. So
> I want to use SB in another way - hoping that this is not against
> the will of the inventors - to get the value of spaminess together
> with the tokens out of the program.

The will of the inventors (as expressed in the license) is that people do
what they like with it, so have fun :)

> My questions therefore:
> 
> 1. Is there perhaps already a testing tool that creates an 
> output in form of a table or tagged-txt containing all single
> tokens of an email-body and its value of spaminess?

contrib/showclues.py (in CVS or in the forthcoming 1.1a1) produces a table
that lists the tokens in the message, their ham/spam counts, and their
spamprob.  You could probably extract what you want from there.

(The output is much the same as the Outlook plug-in's "show spam clues for
this message" function, and can also be generated via the web interface.
The tokens can also be included in sb_filter output via the
include_evidence option in the [Headers] section.)

(Note that these give scores only for the tokens that were used in the
calculation, not for all tokens in the message.)
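
To illustrate why only some tokens show up: roughly speaking, the
classifier keeps the tokens whose probability is furthest from 0.5 and
ignores the rest.  The sketch below is not SpamBayes code; select_clues is
a hypothetical helper, and its cut-offs (at least 0.1 away from 0.5, at
most 150 tokens) are assumptions about the default option values.

```python
def select_clues(token_probs, min_strength=0.1, max_clues=150):
    """Keep the strongest clues from a list of (token, prob) pairs.

    Hypothetical helper, not part of SpamBayes.  The default cut-offs
    are assumptions about the stock option values.
    """
    # A token counts only once per message, however often it occurs.
    unique = dict(token_probs)
    # Distance from 0.5 measures how strongly a token points either way.
    strong = [(abs(prob - 0.5), token, prob)
              for token, prob in unique.items()
              if abs(prob - 0.5) >= min_strength]
    # Strongest clues first, then truncate to the allowed number.
    strong.sort(reverse=True)
    return [(token, prob) for _, token, prob in strong[:max_clues]]
```

Tokens close to 0.5 (common words, mostly) are dropped, which is why the
clue tables are shorter than the full token list.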

Or you could use a custom Python script.  The attached script should do what
you want (assuming you have Python installed and spambayes on the
PYTHONPATH) - change the call to main() to main(True) if you want all tokens
rather than just the ones used in the calculation of the score.
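
If the Text-To-Speech side needs a file rather than a printed table, one
possible approach is to write the (token, probability) pairs as
tab-separated lines and read them back in the other program.  This is only
a sketch: format_token_table and parse_token_table are hypothetical names,
and the layout (probability, tab, token) is an arbitrary choice rather than
anything SpamBayes defines.

```python
def format_token_table(token_probs):
    """Render (token, prob) pairs as 'prob<TAB>token' lines.

    Hypothetical helper.  Assumes tokens contain no newlines; the
    probability comes first so a token containing a tab still parses.
    """
    return "\n".join("%.4f\t%s" % (prob, token)
                     for token, prob in token_probs)


def parse_token_table(text):
    """Read the lines written by format_token_table back into pairs."""
    pairs = []
    for line in text.splitlines():
        if not line:
            continue
        # Split on the first tab only; the rest of the line is the token.
        prob_text, token = line.split("\t", 1)
        pairs.append((token, float(prob_text)))
    return pairs
```

The probabilities are rounded to four decimal places, which should be more
than enough precision for driving intonation.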

> 2. or since I suppose that this process is happening in 
> tokenization/ classification of SB, where in the code can i 
> find it and what would i have to change to get the solicited
> output?

Tokenization is handled in tokenizer.py, and probabilities are generated in
classifier.py.  You wouldn't want to change anything, just write a script
like the attached.
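
For a feel of what the per-token number means, here is a simplified sketch
of how a token's probability can be derived from its ham/spam counts.  It
is written from my understanding of the approach, not copied from
classifier.py; token_spamprob is a hypothetical name, and the two
"unknown word" defaults (0.5 and 0.45) are assumptions about the stock
option values.

```python
def token_spamprob(hamcount, spamcount, nham, nspam,
                   unknown_prob=0.5, unknown_strength=0.45):
    """Estimate a token's spam probability from training counts.

    hamcount/spamcount: trained messages of each kind with the token.
    nham/nspam: total trained messages of each kind.
    Hypothetical sketch; the defaults are assumed, not authoritative.
    """
    # Fraction of ham and of spam messages that contain the token.
    hamratio = hamcount / float(nham or 1)
    spamratio = spamcount / float(nspam or 1)
    if hamratio + spamratio == 0:
        # Never seen in training: fall back to the prior.
        return unknown_prob
    prob = spamratio / (hamratio + spamratio)
    # Pull the raw estimate toward the prior; the pull weakens as the
    # token is seen in more messages.
    n = hamcount + spamcount
    return (unknown_strength * unknown_prob + n * prob) / (unknown_strength + n)
```

A token seen only a few times therefore stays nearer 0.5 than one seen
often, even if both have appeared only in spam.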

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
-------------- next part --------------
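# Python 2 script: print the spam probability of the tokens in a message.
# Usage: python <script> <message-file>
import sys
import email

from spambayes.tokenizer import tokenize
from spambayes.storage import open_storage, database_type

def main(all_tokens=False):
    # Find the database from the SpamBayes options (no overrides given).
    db_name, db_type = database_type({})
    bayes = open_storage(db_name, db_type)

    # Parse the message named on the command line, then close the file.
    f = open(sys.argv[1])
    try:
        msg = email.message_from_file(f)
    finally:
        f.close()

    # With the second argument True, spamprob() also returns the
    # (token, probability) pairs used to calculate the score.
    score, tokens = bayes.spamprob(tokenize(msg), True)
    print "Message scored", score

    fetchword = bayes.wordinfo.get
    if all_tokens:
        # Report every token in the message that has a database record,
        # not just the ones used in the score.
        tokens = [(t, bayes.probability(fetchword(t)))
                  for t in tokenize(msg) if fetchword(t) is not None]
    for word, prob in tokens:
        record = fetchword(word)
        if record:
            nham = record.hamcount
            nspam = record.spamcount
        else:
            # No database record for this entry, so there are no counts.
            nham = nspam = "-"
        word = repr(word)
        # Columns: token (padded to 35), probability, ham count, spam count.
        print word, " " * (35-len(word)),
        print " %-12g %8s %6s" % (prob, nham, nspam)

if __name__ == "__main__":
    main()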