[Spambayes] New tokenization of the Subject line

Tim Peters tim.one@comcast.net
Sun, 06 Oct 2002 23:12:10 -0400


[Remi Ricard]
> I tried something again.
>
> Since most of the mail from subscribed groups has in its
> subject [spambayes] or [freesco], i.e. "[" and "]",
> I decided to keep this as a word, so the words from a subject line
> like "Re: [Spambayes] Moving closer to Gary's ideal"
> will be
> Re:
> [Spambayes]
> Moving
> closer
> to
> Gary's
> ideal

Two things about that:

1. It's not a precise enough description to know exactly what you
   did.  On a list with programmers, don't be afraid to show code <wink>.
   (One guess at what you meant is sketched right after point 2.)

2. Do you think it's more likely that a spam would have "freesco"
   than "[freesco]" in its Subject line?  Not bloody likely <wink>.
   That is, you couldn't have picked worse examples for selling the
   idea that this *might* help.  Indeed, that may be why it didn't
   help.
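
For concreteness, here's a minimal sketch of one way to read that
description -- split the Subject on whitespace and optionally keep
bracketed list tags whole.  It's only a guess at what you did, and it's
not the real tokenizer:

def subject_tokens(subject, keep_brackets=True):
    """Split a Subject line on whitespace.

    If keep_brackets is false, surrounding '[' and ']' are stripped, so
    '[Spambayes]' degrades to 'Spambayes'.  A guess at the scheme
    described above, for illustration only.
    """
    pieces = subject.split()
    if not keep_brackets:
        pieces = [p.strip('[]') for p in pieces]
    return pieces

>>> subject_tokens("Re: [Spambayes] Moving closer to Gary's ideal")
['Re:', '[Spambayes]', 'Moving', 'closer', 'to', "Gary's", 'ideal']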

It's usually more fruitful to stare at mistakes made by the system, and then
see if they have something in common that the tokenizer isn't presenting in a
usable way (very clear example:  we throw away uuencoded pieces entirely;
very muddy example:  we throw away info about how many times a word appears
in a msg).

> And this is the result.

Alex did a nice job of running thru this, so I'll skip to the end.

> -> <stat> tested 200 hams & 279 spams against 800 hams & 1113 spams
> -> <stat> tested 200 hams & 275 spams against 800 hams & 1117 spams
> -> <stat> tested 200 hams & 298 spams against 800 hams & 1094 spams
> -> <stat> tested 200 hams & 272 spams against 800 hams & 1120 spams
> -> <stat> tested 200 hams & 268 spams against 800 hams & 1124 spams
> -> <stat> tested 200 hams & 279 spams against 800 hams & 1113 spams
> -> <stat> tested 200 hams & 275 spams against 800 hams & 1117 spams
> -> <stat> tested 200 hams & 298 spams against 800 hams & 1094 spams
> -> <stat> tested 200 hams & 272 spams against 800 hams & 1120 spams
> -> <stat> tested 200 hams & 268 spams against 800 hams & 1124 spams
>
> false positive percentages
>     1.000  0.500  won    -50.00%
>     1.500  1.500  tied
>     2.000  2.500  lost   +25.00%
>     1.000  1.000  tied
>     0.000  0.000  tied
>
> won   1 times
> tied  3 times
> lost  1 times
>
> total unique fp went from 11 to 11 tied
> mean fp % went from 1.1 to 1.1 tied
>
> false negative percentages
>     0.717  0.717  tied
>     0.727  0.727  tied
>     1.007  1.342  lost   +33.27%
>     0.000  0.368  lost  +(was 0)
>     0.746  0.373  won    -50.00%
>
> won   1 times
> tied  2 times
> lost  2 times
>
> total unique fn went from 9 to 10 lost   +11.11%
> mean fn % went from 0.639419734305 to 0.705436374356 lost   +10.32%
>
> ham mean                     ham sdev
>   24.51   25.20   +2.82%        9.45    9.09   -3.81%
>   26.14   27.20   +4.06%        8.62    8.32   -3.48%
>   26.04   26.94   +3.46%       10.00    9.68   -3.20%
>   25.15   25.85   +2.78%        8.05    7.93   -1.49%
>   25.12   26.11   +3.94%        8.28    8.16   -1.45%
>
> ham mean and sdev for all runs
>   25.39   26.26   +3.43%        8.93    8.69   -2.69%
>
> spam mean                    spam sdev
>   80.41   79.86   -0.68%        8.80    8.81   +0.11%
>   79.87   79.47   -0.50%        8.20    8.11   -1.10%
>   79.87   79.31   -0.70%        8.79    8.73   -0.68%
>   80.42   80.03   -0.48%        8.13    8.22   +1.11%
>   80.11   79.70   -0.51%        9.32    9.07   -2.68%
>
> spam mean and sdev for all runs
>   80.13   79.66   -0.59%        8.66    8.60   -0.69%
>
> ham/spam mean difference: 54.74 53.40 -1.34

[T. Alexander Popiel]
> This shows ham and spam getting closer together overall, and
> is bad.  The reduction in the standard deviation is (I think)
> too small to overcome this... but I'm just eyeballing it;
> can someone with a bit of the theory help here?

Not much in this case, because it had nothing else going for it:  the
conclusion to give up on this idea should have been reached long before
getting to this point <wink>.

We don't know what this distribution "looks like", exactly.  It appears to
be "kinda normal", but is tighter than normal at the endpoints, and looser
than normal where the tails dribble toward each other.  This limits the
usefulness we can get out of sdevs:  the only thoroughly general result is
that, for *any* distribution, no more than 1/k**2 of the data lives more
than k standard deviations away from the mean.  This is an especially
useless result when k <= 1 <wink>.  There's a one-tailed version that says
something non-trivial for k <= 1:

    http://www.btinternet.com/~se16/hgb/cheb.htm

But we're more interested in the overlap, and that occurs at higher k.
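
For concreteness, here's a tiny sketch of both bounds (this is just
Chebyshev's inequality and its one-tailed Cantelli form; nothing here is
specific to our score distributions):

def chebyshev_bounds(k):
    """Worst-case fraction of *any* distribution lying beyond k sdevs.

    Returns (two_sided, one_tailed):
      two_sided:  Chebyshev -- at most 1/k**2 of the data is more than
                  k sdevs away from the mean (in either direction).
      one_tailed: Cantelli  -- at most 1/(1 + k**2) of the data is more
                  than k sdevs above the mean (same bound below).
    """
    return 1.0 / k**2, 1.0 / (1.0 + k**2)

>>> chebyshev_bounds(1.0)   # the two-sided bound is vacuous at k <= 1
(1.0, 0.5)
>>> chebyshev_bounds(3.0)
(0.1111111111111111, 0.1)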

The rule of thumb I fall back on is that, *whatever* sdev means for this
distribution, I assume it means much the same thing across testers, and that
separating the means by more sdevs is a good thing (justified, although hard
to quantify, here).  So I look for the value of k (assuming mean1 < mean2)
such that:

    mean1 + k * sdev1 = mean2 - k * sdev2

or, rearranging,

    mean2 - mean1
k = -------------
    sdev1 + sdev2

That tells us the score that's "equally far away" from both means in a
standard-deviation sense, and how far away that is from both means (in units
of standard deviations).

A little Python helps:

def findk(mean1, sdev1, mean2, sdev2):
    """Solve mean1 + k*sdev1 = mean2 - k*sdev2 for k.

    Return (k, common value).
    """

    assert mean1 < mean2
    k = (mean2 - mean1) / (sdev1 + sdev2)
    score = mean1 + k * sdev1
    return k, score

Plugging in the "before" means and sdevs gives:

>>> findk(25.39, 8.93, 80.13, 8.66)
(3.1119954519613415, 53.180119386014781)

BTW, if you don't favor one kind of error over another, this suggests
spam_cutoff=0.5318 may well be a good value for this data.  If it isn't, the
direction it errs in is a clue about which distribution is stranger.
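
Here's a minimal sketch of turning that crossover into the option value,
plus the (very loose) one-tailed bound from above applied at that point --
just an illustration, assuming the scores printed by the harness are the
0-100 form of the 0-1 spam_cutoff option:

>>> k, score = findk(25.39, 8.93, 80.13, 8.66)   # the "before" stats above
>>> round(score / 100.0, 4)    # 0-100 score scale -> 0-1 spam_cutoff scale
0.5318
>>> round(1.0 / (1.0 + k**2), 3)  # Cantelli: worst-case mass beyond the crossover
0.094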

Plugging in the "after" values gives:

>>> findk(26.26, 8.69, 79.66, 8.60)
(3.0884904569115093, 53.098982070561014)
>>>

So the means have gotten a tiny bit closer in an sdev sense too (they meet
at 3.09 sdevs from both, instead of at 3.11 before).  The difference is so
small as to be insignificant, though.