[Spambayes] how can i extract the tokens and their
spaminess-value?
Tony Meyer
tameyer at ihug.co.nz
Thu Apr 7 01:20:09 CEST 2005
> I am working on an art-project for a paper, where spam should
> be read by a Text-To-Speech engine using the value of spaminess
> applied to the tokens to change the intonation of the voice. So
> I want to use SB in another way - hoping that this is not against
> the will of the inventors - to get the value of spaminess together
> with the tokens out of the program.
The will of the inventors (as expressed in the license) is that people do
what they like with it, so have fun :)
> My questions therefore:
>
> 1. Is there perhaps already a testing tool that creates an
> output in form of a table or tagged-txt containing all single
> tokens of an email-body and its value of spaminess?
contrib/showclues.py (in CVS or in the forthcoming 1.1a1) produces a table
listing the tokens in the message, their ham/spam counts, and their
spamprob. You could probably extract what you want from that.
(The output is much the same as the Outlook plug-in's "show spam clues for
this message" function, and can also be generated via the web interface.
The tokens can be included in sb_filter output via the
include_evidence_header option.)
Note that these give scores only for the tokens that were used in the
calculation, not for all tokens in the message.
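To illustrate what "used in the calculation" means: as far as I recall, the
classifier ignores tokens whose spamprob is too close to 0.5 and keeps only a
limited number of the strongest ones. Here is a rough sketch of that kind of
selection. The function name select_clues and the default values 0.1 and 150
are assumptions for illustration, not copied from classifier.py, so check the
real options before relying on them.

```python
def select_clues(token_probs, min_strength=0.1, max_clues=150):
    """Return the (token, prob) pairs that would count as clues.

    token_probs: iterable of (token, spamprob) pairs.
    min_strength, max_clues: assumed values, not read from SpamBayes.
    """
    # Keep only tokens whose probability is far enough from neutral (0.5).
    strong = [(tok, p) for tok, p in token_probs
              if abs(p - 0.5) >= min_strength]
    # Strongest evidence first; the sort is stable, so ties keep input order.
    strong.sort(key=lambda pair: abs(pair[1] - 0.5), reverse=True)
    return strong[:max_clues]


sample = [("viagra", 0.99), ("the", 0.52), ("meeting", 0.08), ("free", 0.85)]
clues = select_clues(sample)
```

In this sample "the" is dropped because 0.52 is nearly neutral, which is why
a clue table can be shorter than the message itself.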
Or you could use a custom Python script. The attached script should do what
you want (assuming you have Python installed and spambayes on the
PYTHONPATH). Pass the message file as the first command-line argument, and
change the call to main() to main(True) if you want all tokens rather than
just the ones used in the calculation of the score.
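If a table or tagged text is easier to feed to the Text-To-Speech engine, the
(token, probability) pairs that the script loops over could be reformatted
along these lines. This is only a sketch: the function names to_table and
to_tagged and the <t p="..."> tag are made up for illustration, and it does
not need spambayes to run.

```python
from xml.sax.saxutils import escape


def to_table(token_probs):
    """Return one 'token<TAB>spamprob' line per token."""
    return "\n".join("%s\t%.3f" % (tok, p) for tok, p in token_probs)


def to_tagged(token_probs):
    """Wrap each token in a hypothetical <t p="..."> tag.

    Tokens are escaped so that characters like '<' and '&' do not
    break the tagged output.
    """
    return " ".join('<t p="%.3f">%s</t>' % (p, escape(tok))
                    for tok, p in token_probs)


pairs = [("free", 0.85), ("meeting", 0.08)]
table = to_table(pairs)
tagged = to_tagged(pairs)
```

Passing the script's tokens list to either function in place of the final
print loop would give the same information in the other layout.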
> 2. or since I suppose that this process is happening in
> tokenization/ classification of SB, where in the code can i
> find it and what would i have to change to get the solicited
> output?
Tokenization is handled in tokenizer.py, and probabilities are generated in
classifier.py. You wouldn't want to change anything, just write a script
like the attached.
=Tony.Meyer
--
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
-------------- next part --------------
import sys
import email

from spambayes.tokenizer import tokenize
from spambayes.storage import open_storage, database_type


def main(all_tokens=False):
    # Locate and open the SpamBayes database.
    db_name, db_type = database_type({})
    bayes = open_storage(db_name, db_type)
    # Parse the message named by the first command-line argument.
    msg = email.message_from_file(open(sys.argv[1]))
    # Passing True asks spamprob() to also return the (token, prob) clues.
    score, tokens = bayes.spamprob(tokenize(msg), True)
    print "Message scored", score
    fetchword = bayes.wordinfo.get
    if all_tokens:
        # Every known token in the message, not just the clues used.
        tokens = [(t, bayes.probability(fetchword(t)))
                  for t in tokenize(msg) if fetchword(t) is not None]
    for word, prob in tokens:
        record = fetchword(word)
        if record:
            nham = record.hamcount
            nspam = record.spamcount
        else:
            nham = nspam = "-"
        word = repr(word)
        # Columns: token, spamprob, ham count, spam count.
        print word, " " * (35-len(word)),
        print " %-12g %8s %6s" % (prob, nham, nspam)


if __name__ == "__main__":
    main()