[Spambayes] Need testers!

Sjoerd Mullender sjoerd@acm.org
Mon, 16 Sep 2002 22:00:48 +0200


On Mon, Sep 16 2002 Tim Peters wrote:

> [Sjoerd Mullender]
> > I've been saving all my incoming mail for just over 2 weeks now, and
> > tried this test on my data.  I have collected 3117 hams and 633 spams
> > which I divided into 4 sets of 150 messages each (with some left in
> > the reservoirs).
> 
> If these came to the same box, you should be able to improve results via
> 
>     [Tokenizer]
>     count_all_header_lines: True
>     mine_received_headers: True
> 
> They're off by default because they improve results *too* much (i.e., for
> bogus reasons) when the ham and spam come from different sources.  Maybe I
> should enable them by default now?
> 
> OTOH, your error rates are already too low to measure reliably with the
> amount of data you have (0.667% of 150 messages is a single message, so is
> the smallest non-zero error rate you can possibly see).
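
To spell out what the "-> <stat> tested ..." lines below mean: each of my
4 sets of 150 is scored against a classifier trained on the other 3
combined, which is where the "450 hams & 450 spams" comes from.  In rough
pseudo-Python (train(), score() and the 0.5 cutoff here are hypothetical
stand-ins for the real classifier interface, not the actual driver code):

def cross_test(ham_sets, spam_sets, train, score, cutoff=0.5):
    # Leave-one-set-out: train on 3 sets, test on the held-out 4th.
    for i in range(len(ham_sets)):
        train_hams, train_spams = [], []
        for j in range(len(ham_sets)):
            if j != i:
                train_hams.extend(ham_sets[j])
                train_spams.extend(spam_sets[j])
        clf = train(train_hams, train_spams)
        # A ham scoring above the cutoff is a false positive; a spam
        # scoring at or below it is a false negative.
        fp = len([m for m in ham_sets[i] if score(clf, m) > cutoff])
        fn = len([m for m in spam_sets[i] if score(clf, m) <= cutoff])
        print("tested %d hams & %d spams against %d hams & %d spams"
              % (len(ham_sets[i]), len(spam_sets[i]),
                 len(train_hams), len(train_spams)))
        print("fp: %.3f%%  fn: %.3f%%"
              % (100.0 * fp / len(ham_sets[i]),
                 100.0 * fn / len(spam_sets[i])))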

I guess there are four interesting comparisons now:
no bayescustomize.ini vs. Tokenizer options
no bayescustomize.ini vs. Tokenizer + Classifier options
Tokenizer options vs. Tokenizer + Classifier options
Classifier options vs. Tokenizer + Classifier options
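
For reference, the option files are just the snippets from this thread
dropped into a bayescustomize.ini; the "Tokenizer + Classifier" variant
combines the [Tokenizer] lines quoted above with the [Classifier] lines
quoted further down, so roughly:

[Tokenizer]
count_all_header_lines: True
mine_received_headers: True

[Classifier]
adjust_probs_by_evidence_mass: True
min_spamprob: 0.001
max_spamprob: 0.999
hambias: 1.5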

Here they are in that order:

"""
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams

false positive percentages
    0.000  0.000  tied
    0.667  0.000  won   -100.00%
    0.000  0.000  tied
    0.667  0.667  tied

won   1 times
tied  3 times
lost  0 times

total unique fp went from 2 to 1 won    -50.00%
mean fp % went from 0.333333333334 to 0.166666666667 won    -50.00%

false negative percentages
    2.000  2.000  tied
    0.667  2.000  lost  +199.85%
    0.667  0.000  won   -100.00%
    2.000  2.000  tied

won   1 times
tied  2 times
lost  1 times

total unique fn went from 8 to 9 lost   +12.50%
mean fn % went from 1.33333333333 to 1.5 lost   +12.50%
"""

"""
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams

false positive percentages
    0.000  0.000  tied
    0.667  0.667  tied
    0.000  0.000  tied
    0.667  0.667  tied

won   0 times
tied  4 times
lost  0 times

total unique fp went from 2 to 2 tied
mean fp % went from 0.333333333334 to 0.333333333334 tied

false negative percentages
    2.000  1.333  won    -33.35%
    0.667  2.000  lost  +199.85%
    0.667  0.667  tied
    2.000  2.667  lost   +33.35%

won   1 times
tied  1 times
lost  2 times

total unique fn went from 8 to 10 lost   +25.00%
mean fn % went from 1.33333333333 to 1.66666666667 lost   +25.00%
"""

"""
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.667  lost  +(was 0)
    0.000  0.000  tied
    0.667  0.667  tied

won   0 times
tied  3 times
lost  1 times

total unique fp went from 1 to 2 lost  +100.00%
mean fp % went from 0.166666666667 to 0.333333333334 lost  +100.00%

false negative percentages
    2.000  1.333  won    -33.35%
    2.000  2.000  tied
    0.000  0.667  lost  +(was 0)
    2.000  2.667  lost   +33.35%

won   1 times
tied  1 times
lost  2 times

total unique fn went from 9 to 10 lost   +11.11%
mean fn % went from 1.5 to 1.66666666667 lost   +11.11%
"""

"""
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams
-> <stat> tested 150 hams & 150 spams against 450 hams & 450 spams

false positive percentages
    0.000  0.000  tied
    0.667  0.667  tied
    0.000  0.000  tied
    0.667  0.667  tied

won   0 times
tied  4 times
lost  0 times

total unique fp went from 2 to 2 tied
mean fp % went from 0.333333333334 to 0.333333333334 tied

false negative percentages
    1.333  1.333  tied
    0.667  2.000  lost  +199.85%
    0.667  0.667  tied
    1.333  2.667  lost  +100.08%

won   0 times
tied  2 times
lost  2 times

total unique fn went from 6 to 10 lost   +66.67%
mean fn % went from 0.999999999998 to 1.66666666667 lost   +66.67%
"""

> > I didn't include messages to python-list-admin@python.org or
> > postmaster@oratrix.com in my corpus.
> 
> Any particular reason?

The postmaster address gets a lot of bounced spam, and the
python-list-admin address is currently being overwhelmed with warnings
about delayed mail from some French site (328 of 511 messages).

> > And here with the later suggested change
> > [Classifier]
> > adjust_probs_by_evidence_mass: True
> > min_spamprob: 0.001
> > max_spamprob: 0.999
> > hambias: 1.5
> 
> Bingo!  This is the correct experiment to run with the current codebase.

[...]

> Cool!  It's working for you much as it worked for me, although I can be more
> confident because I've got enough data to run 10-fold c-v experiments, and
> across many distinct small random subsets.  It's encouraging that someone
> other than me <wink> is able to get sub-1% error rates with this little
> data.
> 
> Thanks for the report, Sjoerd!  You may want to look at the lists of best
> discriminators to try to guess whether your results are so good for bogus
> reasons.  Staring at the details of the false positives and negatives may
> also suggest weaknesses in the tokenization algorithms (anyone can play!
> trying out changes is fun, although testing them is tedious, time-consuming,
> and unfortunately necessary).

Two messages keep coming back as false positives.  One is an
announcement from Palm (the PDA maker) and the other is from the new
CEO of United Airlines, so both messages are much like UCE, except
that technically they weren't U.

The false negatives are much more varied, and one of them wasn't even
spam at all; it had just been misplaced in the spam folder.

I don't really see anything weird in the best discriminators (the ones
I looked at, that is).
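
In case anyone wants to do the same check on their own database: the
idea is just to sort the trained word info by how far each word's spam
probability sits from the neutral 0.5 and eyeball the strongest ones.
A hypothetical sketch, assuming wordinfo is a plain mapping of
word -> spam probability (the real classifier's structure differs):

def best_discriminators(wordinfo, n=30):
    # Strength = distance from the neutral 0.5; the biggest movers
    # are the words doing most of the classifying.
    words = sorted(wordinfo, key=lambda w: abs(wordinfo[w] - 0.5),
                   reverse=True)
    return [(w, wordinfo[w]) for w in words[:n]]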

-- Sjoerd Mullender <sjoerd@acm.org>