[Spambayes] Dictionary Analysis Tool

Seth Goodman nobody at spamcop.net
Fri Dec 12 14:21:53 EST 2003


[Skip Montanaro]
> There is a spamcounts.py script in the SpamBayes contrib directory.  It
> will accept a regular expression to decide what tokens to display.  Run
> it like so:
>
>     spamcounts.py -r '.*'
>
> and it will dump a CSV file to standard output which contains all the
> tokens in your current database.  It looks like so:
>
>     token,nspam,nham,spam prob
>     $63.01,1,0,0.844827586207
>     $1.99,1,0,0.844827586207
>     from:addr:detik.com,1,0,0.844827586207
>     four,1,2,0.310046433094
>     to:addr:ski,1,0,0.844827586207
>     "advertisers,",1,0,0.844827586207
>     08:06:09,0,1,0.155172413793
>     ...
>
> You can just pop that into Excel (or other favorite spreadsheet) and sort
> by the "spam prob" column or feed it into a Python script which uses the
> csv module to load it back up, sort it, then display the N rows with the
> highest spam prob.
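
For reference, the csv-module approach described above might look roughly
like the following.  This is only a minimal sketch: the function name
top_spam_tokens and the SAMPLE string are made up for illustration (the
sample rows are the ones quoted above), and it assumes the header row is
exactly "token,nspam,nham,spam prob".  It is not part of SpamBayes itself.

```python
import csv
import io


def top_spam_tokens(csv_text, n=10):
    """Return the n rows with the highest 'spam prob' value.

    csv_text is assumed to be the text that spamcounts.py wrote to
    standard output, including its header row.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    # Sort numerically, highest spam probability first.  The sort is
    # stable, so rows with equal probability keep their original order.
    rows.sort(key=lambda row: float(row["spam prob"]), reverse=True)
    return rows[:n]


# Sample rows copied from the output quoted above (illustration only).
SAMPLE = '''token,nspam,nham,spam prob
$63.01,1,0,0.844827586207
$1.99,1,0,0.844827586207
from:addr:detik.com,1,0,0.844827586207
four,1,2,0.310046433094
to:addr:ski,1,0,0.844827586207
"advertisers,",1,0,0.844827586207
08:06:09,0,1,0.155172413793
'''

if __name__ == "__main__":
    for row in top_spam_tokens(SAMPLE, n=3):
        print(row["token"], row["nspam"], row["nham"], row["spam prob"])
```

To run it against a real database, the SAMPLE string could be replaced
with the contents of a file holding the saved spamcounts.py output.  The
csv module is used rather than splitting on commas because tokens such as
"advertisers," contain commas themselves.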

Ooo, I like that.  Since I am running the Outlook plug-in, what do I have to
do to be able to use this?  Won't there be a conflict if I bring in the
source modules from CVS and run the install scripts?  Could you give us a
recipe for Outlook users who would like to mess with (or mess up) the source
code and run it (crash it)?  Also, which CVS version should we work with,
considering we are not developers but would want to contribute working stuff
to you?  Some of the newer CVS forks have a lot of neat stuff implemented
and without them, we might wind up re-inventing the wheel.  Wow, a wheel,
what a great idea!  Think I'll write it up.

A related question: where is the database of message IDs that have already
been trained on?  I know the system keeps track of this, as it won't train
on the same copy of a message twice.

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above



