[Spambayes] statistical comparison of environment?

bill parducci bill at parducci.net
Wed Mar 5 15:47:49 EST 2003

first off, FWIW i am really amazed at the level of work that has gone into just the consideration of tokenization strategies. having struggled against the spam onslaught for the last 2 years armed solely with procmail, i can really appreciate the work that has been done here! (after 200+ recipes i asked myself if there wasn't a better way... and found you guys... now i *know* there is.) kudos to the group, this is some great work!

obeisance complete, off to the topic at hand :o)

i have been reading through the code/documentation, looking not just at the tokenization process but also at the data that is subject to statistical analysis. i might have missed this, but has anyone considered including environmental factors in the spam vs. ham analysis? a couple of things come to mind right off the bat, but i am sure more could be found:

1. time of day (would require some real granularity tweaking)

2. size of header / size of message / header:message ratio

3. attachment count (MIME count) / MIME count:message size ratio

4. [space|tab|\n]:[visible char] ratio
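
for what it's worth, here is a rough sketch of how the four measurements above could be pulled out of a raw message using only python's standard `email` package. the function name `physical_attributes` and the dict keys are made up for illustration and are not taken from the spambayes code. it also assumes LF line endings and a message that fits in memory as one string.

```python
import email
from email.utils import parsedate_to_datetime


def physical_attributes(raw):
    """Measure the size and shape of one raw message (a str with LF line endings).

    Hypothetical helper for illustration; not part of spambayes.
    """
    # Headers end at the first blank line; everything after it is the body.
    header_text, _, body_text = raw.partition("\n\n")
    msg = email.message_from_string(raw)

    # 1. hour of day from the Date header (None if missing or unparsable)
    hour = None
    if msg["Date"]:
        try:
            hour = parsedate_to_datetime(msg["Date"]).hour
        except (TypeError, ValueError):
            pass

    # 3. leaf MIME parts, and how many of them are declared as attachments
    leaves = [part for part in msg.walk() if not part.is_multipart()]
    attachments = sum(1 for part in leaves
                      if part.get_content_disposition() == "attachment")

    # 4. whitespace (space, tab, newline) vs. visible characters in the raw body
    whitespace = sum(1 for ch in body_text if ch in " \t\n")
    visible = sum(1 for ch in body_text if not ch.isspace())

    def ratio(a, b):
        return a / b if b else None  # None rather than dividing by zero

    return {
        "hour": hour,
        # 2. header size, body size, and their ratio (in characters)
        "header_size": len(header_text),
        "body_size": len(body_text),
        "header_body_ratio": ratio(len(header_text), len(body_text)),
        "mime_parts": len(leaves),
        "attachments": attachments,
        "mime_per_char": ratio(len(leaves), len(raw)),
        "whitespace_ratio": ratio(whitespace, visible),
    }
```

note that the hour here is whatever the sender's Date header claims, in the sender's stated timezone, so it is only as trustworthy as that header.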


i think that if it hasn't already been done, it would be interesting to see if statistically comparing the *physical* attributes of the messages would have an effect on the accuracy of the decision. currently--and i freely admit to being a lamer in undergrad stats--i think that this information is only considered implicitly.
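
one possible way to make such attributes explicit for a token-based classifier would be to turn each number into a coarse synthetic token, so that it gets counted like any other word. the sketch below is only an assumption about how that might look: `bucket_token` and `hour_token` are hypothetical names, the power-of-two buckets and the 6-hour default are arbitrary choices, and whether any of it actually helps accuracy would have to be tested.

```python
import math


def bucket_token(name, value):
    """Map a number to a coarse power-of-two token, e.g. 1500 -> 'size:2**10'."""
    if value is None:
        return f"{name}:none"
    if value <= 0:
        return f"{name}:0"
    # floor(log2(x)) puts every value in [2**k, 2**(k+1)) into the same bucket
    return f"{name}:2**{math.floor(math.log2(value))}"


def hour_token(hour, width=6):
    """Map an hour (0-23) to a block of `width` hours, e.g. 15 -> 'hour:12-17'."""
    if hour is None:
        return "hour:none"
    start = (hour // width) * width
    end = min(start + width - 1, 23)
    return f"hour:{start:02d}-{end:02d}"
```

for example, `bucket_token("size", 1500)` yields `size:2**10` and `hour_token(15)` yields `hour:12-17`. changing `width` is one way to experiment with the time-of-day granularity.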


