[Spambayes] Dictionary Analysis Tool

Skip Montanaro skip at pobox.com
Thu Dec 11 15:30:09 EST 2003


    Kent> I love the SpamClues feature, but I'd really like to know -- by
    Kent> "word" what "words" have the highest to lowest probability of
    Kent> occurring in my SpamBayes for all messages. Putting it another
    Kent> way, I'd like to know what the top N most "spammy" words for me.
 
    Kent> Is there a tool or other way to do this?

There is a spamcounts.py script in the SpamBayes contrib directory.  It will
accept a regular expression to decide what tokens to display.  Run it like
so:

    spamcounts.py -r '.*'

and it will dump a CSV file to standard output which contains all the tokens
in your current database.  It looks like so:

    token,nspam,nham,spam prob
    $63.01,1,0,0.844827586207
    $1.99,1,0,0.844827586207
    from:addr:detik.com,1,0,0.844827586207
    four,1,2,0.310046433094
    to:addr:ski,1,0,0.844827586207
    "advertisers,",1,0,0.844827586207
    08:06:09,0,1,0.155172413793
    ...

You can just pop that into Excel (or your favorite spreadsheet) and sort by
the "spam prob" column, or feed it into a Python script that uses the csv
module to load it back up, sort it, and display the N rows with the highest
spam prob.
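
A minimal sketch of such a script follows.  It is not part of SpamBayes; the
name topspam.py and the helper top_spammy() are made up for illustration.  It
assumes the CSV has the header row shown above (in particular a column named
"spam prob") and that the output was first saved to a file, for example with
spamcounts.py -r '.*' > tokens.csv.

```python
import csv
import sys


def top_spammy(rows, n=20):
    """Return the n rows with the highest spam probability.

    rows is any iterable of dicts keyed by the CSV header names.
    The "spam prob" column name is assumed from the sample output.
    """
    return sorted(rows, key=lambda row: float(row["spam prob"]),
                  reverse=True)[:n]


if __name__ == "__main__":
    if len(sys.argv) < 2:
        # Hypothetical script name; call it whatever you like.
        print("usage: topspam.py tokens.csv [N]", file=sys.stderr)
    else:
        n = int(sys.argv[2]) if len(sys.argv) > 2 else 20
        # newline="" lets the csv module handle line endings itself.
        with open(sys.argv[1], newline="") as f:
            for row in top_spammy(csv.DictReader(f), n):
                print(row["spam prob"], row["nspam"], row["nham"],
                      row["token"], sep="\t")
```

Running it as "topspam.py tokens.csv 50" would print the 50 spammiest tokens,
one per line.  Many tokens share the same probability, so ties come out in
whatever order they had in the file.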

Skip


