[Spambayes] statistical comparison of environment?
T. Alexander Popiel
popiel at wolfskeep.com
Wed Mar 5 20:03:36 EST 2003
In message: <3E66B1D6.90308 at parducci.net>
bill parducci <bill at parducci.net> writes:
>> Please feel free to code up something to turn these ideas into
>> tokens... then they can be tested, and if they're useful then
>> they'll likely be incorporated.
>ok. in the interest of time saving (i've not programmed in python
>before), how about i [tabular] list what i find and let the statistas
>in the group decide if there is significance? i have a pile of spam
>and ham that i can wade through (unless there is a standardized sample
>that is preferable).
We've actually got a pretty good testing infrastructure set up;
for tokenization tests, I personally run timcv.py once with each of
the tokenization options and then feed the output of the runs into
table.py. This produces some nice tables, which you may have
noticed in the mailing list archives.
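
For anyone who hasn't seen one of those comparisons, here is a rough
sketch in plain Python of what it boils down to: compute the false
positive and false negative rates for each run, then line them up in
a table. It does not call timcv.py or table.py and does not reproduce
their output format. The function names, the (is_spam, flagged_as_spam)
pair layout, and the toy input are all assumptions made up for
illustration.

```python
def error_rates(results):
    """Return (false positive %, false negative %) for one test run.

    `results` is a list of (is_spam, flagged_as_spam) boolean pairs,
    a hypothetical layout chosen for this sketch.
    """
    ham_flags = [flagged for is_spam, flagged in results if not is_spam]
    spam_flags = [flagged for is_spam, flagged in results if is_spam]
    false_pos = sum(1 for flagged in ham_flags if flagged)
    false_neg = sum(1 for flagged in spam_flags if not flagged)
    fp_rate = 100.0 * false_pos / len(ham_flags) if ham_flags else 0.0
    fn_rate = 100.0 * false_neg / len(spam_flags) if spam_flags else 0.0
    return fp_rate, fn_rate


def format_table(runs):
    """Format {option name: (fp %, fn %)} as a plain-text table."""
    lines = ["%-12s %8s %8s" % ("option", "fp %", "fn %")]
    for name, (fp_rate, fn_rate) in runs.items():
        lines.append("%-12s %8.2f %8.2f" % (name, fp_rate, fn_rate))
    return "\n".join(lines)


# Toy input only, NOT real results: four ham and two spam messages.
toy_run = [
    (False, False), (False, False), (False, False), (False, True),
    (True, True), (True, False),
]
print(format_table({"toy option": error_rates(toy_run)}))
```

With real data you would build one entry per tokenization option from
your own ham and spam and compare the rows.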
Using your own ham and spam is standard procedure here; most people
are touchy about giving their ham away due to privacy concerns.
If some new option looks good, then multiple people try it out on
their different corpora, and if it still looks good after that,
then it gets included.
Don't worry about not having coded in python before. I hadn't
done much in python before this project either, and people haven't
been screaming about how ugly my code is, yet...