[Spambayes] New tokenization of the Subject line

T. Alexander Popiel popiel@wolfskeep.com
Fri, 04 Oct 2002 18:30:50 -0700


In message:  <1033781185.1125.7.camel@localhost.localdomain>
             Remi Ricard <papaDoc@videotron.ca> writes:
>
>I try something again.
>
>Since most of the mail from subscribed groups have in their
>subject [spambayes] or [freesco] i.e "[" and "]".
>
>I decided to keep this as a word

Unfortunately, this makes things worse overall.  It's a good
idea, but I think it doesn't help because mailing lists get
spammed, too.  Marking a message as coming from a mailing
list just gives the spam that does show up on the list some
apparent validity.

>total unique fp went from 11 to 11 tied          
>mean fp % went from 1.1 to 1.1 tied          

This is neutral.

>total unique fn went from 9 to 10 lost   +11.11%
>mean fn % went from 0.639419734305 to 0.705436374356 lost   +10.32%

This is a loss, though too small to be significant.
(A change of one message in either direction is too small
to care about.)
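
To see why a single message looks so large here, the percentage in
the quoted line can be recomputed by hand.  This is only a sketch;
pct_change is a name made up for illustration, not something from
the test scripts.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0

# Counts taken from the quoted "total unique fn" line: 9 before, 10 after.
fn_change = pct_change(9, 10)
print(f"{fn_change:+.2f}%")  # prints "+11.11%"
```

With only 9 false negatives to start with, one extra message is
already an 11% swing.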

>ham mean and sdev for all runs
>  25.39   26.26   +3.43%        8.93    8.69   -2.69%

This shows the ham scores moving up (toward the spam scores)
and clustering more tightly.  The first is bad, the second
is good.

>spam mean and sdev for all runs
>  80.13   79.66   -0.59%        8.66    8.60   -0.69%

This shows the spam scores moving down (toward the ham scores)
and getting tighter.  Again, the first is bad, the second is good.

>ham/spam mean difference: 54.74 53.40 -1.34

This shows ham and spam getting closer together overall, which
is bad.  The reduction in the standard deviations is (I think)
too small to overcome this, but I'm just eyeballing it;
can someone with a bit of the theory help here?

- Alex