[Spambayes] statistical comparison of environment?
T. Alexander Popiel
popiel at wolfskeep.com
Wed Mar 5 17:59:16 EST 2003
In message: <3E668CA5.3050203 at parducci.net>
bill parducci <bill at parducci.net> writes:
>i might have missed this, but has anyone considered including
>environmental factors into the spam vs. ham analysis? a couple
>of things come to mind right off the bat, but i am sure more
>could be found:
>1. time of day (would require some real granularity tweaking)
This was tried, with 10-minute intervals; testing on two
separate corpora (that of the guy who came up with the
patch and my own) showed that the effect was inconsequential.
The largest result was the observation that both ham and spam
tend to slacken a bit in the middle of the night.
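For anyone who wants to rerun that experiment, here is a minimal sketch of a 10-minute time-of-day token. It is not the patch that was tested; the function name, the token format, and the choice to read the sender's Date header (rather than a Received line) are all my assumptions.

```python
from email.utils import parsedate_to_datetime

def time_of_day_token(date_header, bucket_minutes=10):
    """Map an RFC 2822 Date header to a token such as 'time:17:50'.

    Hypothetical helper: the token names the start of the bucket that
    contains the sender's stated local time.
    """
    try:
        when = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Unparseable dates get their own token; that may itself be a clue.
        return "time:invalid"
    minutes = when.hour * 60 + when.minute
    start = (minutes // bucket_minutes) * bucket_minutes
    return "time:%02d:%02d" % divmod(start, 60)
```

Widening bucket_minutes is the obvious granularity knob if 10 minutes spreads the counts too thin.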
>2. size of header / size message / header:message ratio
>3. attachment count (MIME count) / MIME count:message size ratio
>4. [space|tab|\n]:[visible char] ratio
All of these have been mentioned in the past, but no one to my
knowledge has actually tested them.
Please feel free to code up something to turn these ideas into
tokens... then they can be tested, and if they're useful then
they'll likely be incorporated.
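As a starting point, ideas 2 through 4 might be turned into tokens roughly like this. It is only a sketch: the function names, token names, and bucketing scheme are invented for illustration and are not part of the spambayes tokenizer, and the whitespace measure is the whitespace share of the body rather than a strict whitespace:visible ratio.

```python
import email

def size_bucket(n):
    """Coarse power-of-two bucket, so similar sizes share one token."""
    return n.bit_length() - 1  # floor(log2(n)); -1 when n == 0

def environment_tokens(raw_message):
    """Yield hypothetical tokens for sizes, MIME part count and whitespace."""
    header_text, _, body_text = raw_message.partition("\n\n")

    # Idea 2: header size, body size, and header share of the message.
    yield "hdr-size:2**%d" % size_bucket(len(header_text))
    yield "body-size:2**%d" % size_bucket(len(body_text))
    total = len(header_text) + len(body_text)
    if total:
        yield "hdr-ratio:%d" % (10 * len(header_text) // total)

    # Idea 3: number of leaf MIME parts, capped to keep the token set small.
    msg = email.message_from_string(raw_message)
    parts = sum(1 for part in msg.walk() if not part.is_multipart())
    yield "mime-parts:%d" % min(parts, 10)

    # Idea 4: share of the body made up of space, tab and newline, in tenths.
    white = sum(1 for ch in body_text if ch in " \t\n")
    visible = sum(1 for ch in body_text if not ch.isspace())
    if white + visible:
        yield "ws-share:%d" % (10 * white // (white + visible))
```

The bucketing matters more than the exact features: raw byte counts would give nearly every message a unique token, which the classifier could never learn from.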
Testing of new tokens like this has dropped off since about
last October... spambayes is already good enough for just
about everyone to be happy. My recent tests on training
methods seem to show that accuracy has been dropping off for
the last two months, though, so it may be time to revisit