[Spambayes] Some test results

skip at pobox.com
Mon Aug 7 00:50:44 CEST 2006


I put together some test databases today using spam received over roughly
the past week (about 1800 messages) and a reasonable cross-section of my ham
(all my saved Python-related mail plus my regular non-specific mailbox,
about 2300 messages), then ran some 5x5 cross-validation tests (that's the
correct term, right?).  For the control run I set all of these options to
False:

    x-lookup_ip
    x-short_runs
    x-image_size
    x-crack_images

but otherwise used my standard configuration.  I then made four more runs,
each with exactly one of those options set to True, and compared each run
with the control run.  The results are summarized briefly below.
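For anyone unsure what the cross-validation amounts to, here is a minimal
sketch of a 5-fold split, assuming each of the five rows in the tables
corresponds to one held-out set.  It is not the SpamBayes test driver; the
function name k_fold_splits and the stand-in message IDs are hypothetical.

```python
def k_fold_splits(messages, k=5):
    """Yield (train, test) pairs, holding out one of k interleaved folds."""
    # Fold i takes every k-th message starting at offset i.
    folds = [messages[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        # Train on every message that is not in the held-out fold.
        train = [m for i, fold in enumerate(folds) if i != held_out
                 for m in fold]
        yield train, test


# Stand-in message IDs (hypothetical); the real corpus is not reproduced.
ham = [f"ham-{i}" for i in range(10)]
splits = list(k_fold_splits(ham, k=5))
```

Each message is tested exactly once and is never in the training set of the
run that tests it.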

    control v. x-lookup_ip
    ----------------------

    false positive percentages
        0.000  0.000  tied          
        0.217  0.217  tied          
        0.000  0.000  tied          
        0.219  0.219  tied          
        0.000  0.000  tied          

    won   0 times
    tied  5 times
    lost  0 times

    ...

    false negative percentages
        4.199  4.199  tied          
        1.404  1.404  tied          
        4.412  4.412  tied          
        4.533  4.533  tied          
        4.222  4.222  tied          

    won   0 times
    tied  5 times
    lost  0 times

    control v. x-short_runs
    -----------------------

    false positive percentages
        0.000  0.000  tied          
        0.217  0.217  tied          
        0.000  0.000  tied          
        0.219  0.219  tied          
        0.000  0.000  tied          

    won   0 times
    tied  5 times
    lost  0 times

    ...

    false negative percentages
        4.199  4.199  tied          
        1.404  1.404  tied          
        4.412  4.412  tied          
        4.533  4.533  tied          
        4.222  4.222  tied          

    won   0 times
    tied  5 times
    lost  0 times

    control v. x-image_size
    -----------------------

    false positive percentages
        0.000  0.000  tied          
        0.217  0.434  lost  +100.00%
        0.000  0.000  tied          
        0.219  0.219  tied          
        0.000  0.000  tied          

    won   0 times
    tied  4 times
    lost  1 times

    ...

    false negative percentages
        4.199  4.199  tied          
        1.404  1.404  tied          
        4.412  4.118  won     -6.66%
        4.533  4.533  tied          
        4.222  3.958  won     -6.25%

    won   2 times
    tied  3 times
    lost  0 times

    control v. x-crack_images
    -------------------------

    false positive percentages
        0.000  0.000  tied          
        0.217  0.217  tied          
        0.000  0.000  tied          
        0.219  0.219  tied          
        0.000  0.000  tied          

    won   0 times
    tied  5 times
    lost  0 times

    ...

    false negative percentages
        4.199  4.199  tied          
        1.404  1.404  tied          
        4.412  4.118  won     -6.66%
        4.533  3.966  won    -12.51%
        4.222  3.430  won    -18.76%

    won   3 times
    tied  2 times
    lost  0 times
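The won/tied/lost labels and the percentages in the third column can be
checked by hand.  The sketch below assumes the percentage is the relative
change from the control rate to the variant rate, and that a lower error
rate counts as a win; it is not the SpamBayes comparison script itself, and
the name compare is hypothetical.  The input numbers are the false negative
rates from the x-crack_images table.

```python
def compare(control, variant):
    """Return (verdict, percent_change) for one pair of error rates."""
    if variant == control:
        return "tied", 0.0
    # A lower error rate than the control counts as a win.
    verdict = "won" if variant < control else "lost"
    if control == 0:
        return verdict, None  # relative change is undefined from zero
    return verdict, round((variant - control) / control * 100, 2)


# False negative percentages, control v. x-crack_images (from the table).
fn_control = [4.199, 1.404, 4.412, 4.533, 4.222]
fn_crack_images = [4.199, 1.404, 4.118, 3.966, 3.430]

results = [compare(c, v) for c, v in zip(fn_control, fn_crack_images)]
tally = {v: sum(1 for verdict, _ in results if verdict == v)
         for v in ("won", "tied", "lost")}
```

Under that assumption the computed changes are -6.66, -12.51 and -18.76
percent, with a tally of 3 won, 2 tied and 0 lost, matching the table.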

I didn't do anything to verify that my spam and ham are correctly
classified.  I'm doing that now.  Also, the fact that the first two runs
(x-lookup_ip and x-short_runs) were identical to the control seems a bit
suspicious, so I'm going to try them again after picking over my training
database.  Still, the image_size and crack_images runs look promising on
false negatives (although image_size also raised the false positive rate in
one of the five sets), perhaps because my recent spam is so full of these
pump-and-dump spams.

Skip

