[Spambayes] statistical comparison of environment?
bill at parducci.net
Wed Mar 5 15:47:49 EST 2003
first off, FWIW i am really amazed at the level of work that has gone into just the consideration of tokenization strategies. having struggled against the spam onslaught for the last 2 years armed solely with procmail i can really appreciate the work that has been done here! (after 200+ recipes i asked myself if there wasn't a better way... and found you guys... now i *know* there is.) kudos to the group, this is some great work!
obeisance complete, off to the topic at hand :o)
i have been reading through the code/documentation looking at not just the token process, but considering the data that is subject to statistical analysis as well. i might have missed this, but has anyone considered including environmental factors into the spam vs. ham analysis? a couple of things come to mind right off the bat, but i am sure more could be found:
1. time of day (would require some real granularity tweaking)
2. size of header / size of message / header:message ratio
3. attachment count (MIME count) / MIME count:message size ratio
4. [space|tab|\n]:[visible char] ratio
i think that if it hasn't already been done, it would be interesting to see if statistically comparing the *physical* attributes of the messages would have an effect on the accuracy of the decision. currently--and i freely admit to being a lamer in undergrad stats--i think that this information is only considered implicitly.
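to make the four items above concrete, here is a rough sketch in python using only the standard library. it is not taken from the spambayes code: the function names, the dict keys, the "phys:" token prefix and the power-of-two size buckets are all assumptions made up for illustration, and the header/body split assumes plain \n line endings.

```python
import email
import math
from email.utils import parsedate_to_datetime


def physical_features(raw):
    """Return a dict of crude 'physical' measurements for one raw message."""
    msg = email.message_from_string(raw)

    # 2. header/body split at the first blank line (assumes \n line endings)
    head, _, body = raw.partition("\n\n")

    # 1. hour of day from the Date: header, None if missing or unparseable
    try:
        hour = parsedate_to_datetime(msg["Date"]).hour
    except (TypeError, ValueError):
        hour = None

    # 3. leaf MIME parts (a plain non-multipart message counts as 1)
    mime_parts = sum(1 for part in msg.walk() if not part.is_multipart())

    # 4. space/tab/newline vs. visible characters in the raw body
    blank = sum(1 for ch in body if ch in " \t\n")
    visible = sum(1 for ch in body if not ch.isspace())

    return {
        "hour": hour,
        "header_size": len(head),
        "body_size": len(body),
        "header_ratio": len(head) / len(body) if body else None,
        "mime_parts": mime_parts,
        "parts_per_kb": mime_parts / (len(body) / 1024) if body else None,
        "ws_ratio": blank / visible if visible else None,
    }


def as_tokens(features):
    """Bucket the numbers into coarse string tokens (hypothetical names)."""
    for name, value in features.items():
        if value is None:
            continue
        if name in ("header_size", "body_size"):
            # power-of-two buckets, so similar sizes share a token
            value = int(math.log2(value)) if value > 0 else 0
        elif isinstance(value, float):
            value = round(value, 1)
        yield "phys:%s:%s" % (name, value)
```

the idea would be to feed the strings from as_tokens() to the classifier alongside the ordinary word tokens, so the statistics decide whether any of them carry signal. the bucket widths are arbitrary here and are exactly the "granularity tweaking" that would need testing.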