[Spambayes] results of mining post time - slight loss

Skip Montanaro skip@pobox.com
Tue, 1 Oct 2002 08:52:55 -0500


(forgot to press the send key yesterday evening...)

Here are the results of using six-minute time buckets gleaned from Date:
headers (executive summary: slight loss).  Buckets were computed as I
suggested in my previous email:

    (h*60+m)//10

that is, six-minute intervals (maybe I should name this option the
lawyer-fee-increment (*)?)
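
For anyone who wants to try the bucketing outside the tokenizer, here is a minimal sketch of the expression above applied to a Date: header. The function name time_bucket and the width parameter are hypothetical and are not taken from the Spambayes source; only the (h*60+m)//10 arithmetic comes from this message. The hour and minute are used as written in the header, with no timezone adjustment.

```python
from email.utils import parsedate_tz

def time_bucket(date_header, width=10):
    """Return the time-of-day bucket for a Date: header, or None if it cannot be parsed.

    width is the divisor applied to minutes since midnight; the default of 10
    matches the expression quoted in the message. (Hypothetical helper.)
    """
    parsed = parsedate_tz(date_header)
    if parsed is None:
        return None
    h, m = parsed[3], parsed[4]   # hour and minute fields of the parsed tuple
    return (h * 60 + m) // width

header = "Tue, 1 Oct 2002 08:52:55 -0500"
print(time_bucket(header))       # (8*60 + 52) // 10 = 532 // 10 = 53
print(time_bucket(header, 6))    # 532 // 6 = 88
```

Note that a divisor of 10 puts ten minutes in each bucket (144 buckets per day), while a divisor of 6 would give six-minute buckets (240 per day), so the width argument makes it easy to compare the two.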

Before:

    [TestDriver]
    spam_cutoff: 0.4

After:

    [Tokenizer]
    mine_date_headers: True

    [TestDriver]
    spam_cutoff: 0.4

Results:

    cutoffs -> times
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    ... yadda yadda yadda

    false positive percentages
        1.000  1.000  tied          
        1.500  1.500  tied          
        1.000  1.000  tied          
        1.000  1.500  lost   +50.00%
        1.000  1.000  tied          
        1.500  1.500  tied          
        3.500  3.500  tied          
        1.500  1.500  tied          
        1.500  1.500  tied          
        1.500  2.000  lost   +33.33%

    won   0 times
    tied  8 times
    lost  2 times

    total unique fp went from 30 to 32 lost    +6.67%
    mean fp % went from 1.5 to 1.6 lost    +6.67%

    false negative percentages
        0.500  0.500  tied          
        1.500  1.500  tied          
        0.500  0.500  tied          
        0.500  0.500  tied          
        2.000  2.000  tied          
        0.000  0.000  tied          
        1.000  1.000  tied          
        1.000  1.000  tied          
        0.000  0.000  tied          
        1.500  1.500  tied          

    won   0 times
    tied 10 times
    lost  0 times

    total unique fn went from 17 to 17 tied          
    mean fn % went from 0.85 to 0.85 tied          

    ham mean                     ham sdev
      20.82   20.98   +0.77%        6.43    6.47   +0.62%
      21.86   21.96   +0.46%        6.63    6.62   -0.15%
      21.38   21.52   +0.65%        6.49    6.56   +1.08%
      21.96   22.09   +0.59%        6.26    6.29   +0.48%
      21.51   21.67   +0.74%        6.72    6.75   +0.45%
      21.66   21.78   +0.55%        6.98    7.00   +0.29%
      21.45   21.59   +0.65%        7.66    7.62   -0.52%
      21.74   21.88   +0.64%        6.69    6.68   -0.15%
      21.71   21.84   +0.60%        7.44    7.43   -0.13%
      21.87   21.96   +0.41%        5.93    5.93   +0.00%

    ham mean and sdev for all runs
      21.60   21.73   +0.60%        6.75    6.76   +0.15%

    spam mean                    spam sdev
      74.10   73.87   -0.31%       12.99   12.80   -1.46%
      72.47   72.28   -0.26%       13.92   13.79   -0.93%
      74.05   73.83   -0.30%       13.00   12.85   -1.15%
      74.00   73.83   -0.23%       12.27   12.11   -1.30%
      72.43   72.18   -0.35%       13.73   13.45   -2.04%
      72.68   72.44   -0.33%       13.27   13.11   -1.21%
      72.57   72.44   -0.18%       13.03   12.94   -0.69%
      71.50   71.34   -0.22%       12.12   12.01   -0.91%
      73.25   73.05   -0.27%       12.67   12.50   -1.34%
      73.02   72.81   -0.29%       12.44   12.29   -1.21%

    spam mean and sdev for all runs
      73.01   72.81   -0.27%       12.98   12.82   -1.23%

    ham/spam mean difference: 51.41 51.08 -0.33
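
The summary lines in the output above can be rechecked from the per-run columns. What follows is a sketch of that arithmetic only, not the comparison script that produced the output; the helper names (tally, mean, pct_change) are made up for illustration, and the numbers are copied from the false positive table and the "all runs" lines.

```python
# Per-run false positive percentages copied from the table (before, after).
fp_before = [1.0, 1.5, 1.0, 1.0, 1.0, 1.5, 3.5, 1.5, 1.5, 1.5]
fp_after = [1.0, 1.5, 1.0, 1.5, 1.0, 1.5, 3.5, 1.5, 1.5, 2.0]

def tally(before, after):
    """Count runs where the error rate fell (won), held (tied), or rose (lost)."""
    won = sum(1 for b, a in zip(before, after) if a < b)
    tied = sum(1 for b, a in zip(before, after) if a == b)
    lost = sum(1 for b, a in zip(before, after) if a > b)
    return won, tied, lost

def mean(values):
    return sum(values) / len(values)

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

print(tally(fp_before, fp_after))                 # (0, 8, 2)
print(mean(fp_before), mean(fp_after))            # 1.5 1.6
print(round(pct_change(mean(fp_before), mean(fp_after)), 2))

# Separation between mean spam and mean ham scores, from the "all runs" lines.
sep_before = 73.01 - 21.60
sep_after = 72.81 - 21.73
print(round(sep_before, 2), round(sep_after, 2), round(sep_after - sep_before, 2))
```

The two lost runs account for the whole change in the false positive mean, and the ham and spam means move toward each other by 0.33, which agrees with the "slight loss" reading.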

Skip

(*) It's a sad commentary on the litigiousness of Americans if someone like
me who's basically never been to a lawyer recognizes the stereotypical
six-minute increment lawyers are supposed to use to bill their clients.  (Or
maybe I watched too much "LA Law" at a crucial period of my life...)