[Spambayes] Outlook plugin - training

Tim Peters tim.one@comcast.net
Sat Nov 9 20:00:42 2002


[Tim]
> I'm never going to get sub-0.1% error rates this way, but if this is the
> best it ever got, I'd be quite happy with it for my personal email.
> Something to ponder?  If so, you can get away with a very small
> database, and while hapaxes must not be removed blindly in this extreme
> scheme, using the atime field could (I suspect) be very effective in
> slashing the already-small database size (lots of hapaxes will never be
> seen again even if you train on everything; the WordInfo atime field
> tells you when a word was last used at all).

BTW, I'm still doing this experiment, and my total training data is up to 45
ham and 38 spam, out of a total of about 1,700 msgs processed so far.  FP
and FN are both rare now, and the Unsure rate is about 5% overall and
visibly falling.  The Unsure spam are more surprising than the Unsure ham,
but that may be more psychological than real.  For example, it took about 24
hours before I got my first Nigerian spam, and it was shocking to see it
score at the low end of the Unsure range.

Looking at the internals is scary.  I have entire folders that are called
ham seemingly because the mailing list they come from has a few lexical
conventions unique to it, and the hapaxes from the single training msg from
that list save almost all of that list's msgs from Unsure status.

In the msg of Rob's I'm replying to, these are all ham hapaxes:

'database'                     0.155172
'database,'                    0.155172
'ever'                         0.155172
'idea'                         0.155172
'quite'                        0.155172
'scheme,'                      0.155172
'seen'                         0.155172
'subject:Outlook'              0.155172
'subject:Spambayes'            0.155172
'subject:plugin'               0.155172
'subject:training'             0.155172
'tells'                        0.155172
'words'                        0.155172

and they slug it out with these spam hapaxes:

'away'                         0.844828
'effective'                    0.844828
'field'                        0.844828
'mean'                         0.844828
'word'                         0.844828

That 'word' is a strong spam clue but 'words' a strong ham clue should tell
us something about how robust this is <wink>.
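FWIW, there's nothing magical about 0.155172 and 0.844828:  they're what
the Robinson smoothing spits out for any word seen exactly once.  A minimal
sketch, assuming the default unknown-word parameters (s=0.45, x=0.5 -- if
you've fiddled the options, your numbers will differ):

    # Robinson-smoothed word prob: f(w) = (s*x + n*r(w)) / (s + n), where
    # n is the number of trained msgs containing the word and r(w) is the
    # raw spam ratio.  Defaults assumed here: s = 0.45, x = 0.5.
    s, x = 0.45, 0.5
    n = 1  # a hapax: seen in exactly one trained msg
    for r in (0.0, 1.0):  # r=0 for a pure-ham hapax, r=1 for pure-spam
        print((s*x + n*r) / (s + n))
    # -> 0.15517... and 0.84482...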

[Rob Hooft]
> This seems to imply that you're still playing with the idea that hapaxes
> could be "slashed" from the database when using the "old" train-on-all
> procedure. I don't see how that can ever work, as all words pass through
> the hapax stage at some point. Or do you mean to slash "old" hapaxes
> only?

Well, training has no effect on scoring until update_probabilities() is
called, and in a batch-training context I mean hapax from
update_probabilities's POV.  Of course hamcounts or spamcounts for new words
start life at 1, but when doing batch training I don't mean to look at the
counts until the probabilities are updated.  At that point, a hapax is a
word that was seen in only one msg from the entire batch of new msgs.
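Concretely, here's a sketch of "hapax from update_probabilities's POV",
assuming a wordinfo mapping from word to an object with hamcount and
spamcount attributes (roughly the shape of spambayes's WordInfo; treat the
function name as illustrative):

    def find_hapaxes(wordinfo):
        """Return words seen in exactly one msg across the whole batch --
        judged only after all the batch training is done."""
        return [word for word, info in wordinfo.items()
                if info.hamcount + info.spamcount <= 1]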

Here's a quick test, based on unpublished general python.org email (we can't
publish the ham because it includes some personal email; GregW was working
on making the spam collection available, but I haven't heard about that in a
week; ditto his very large python.org virus collection).

In each case, it trains on 2,741 ham and 948 spam, then predicts the same
numbers of each.  The "all" column includes hapaxes (wrt counts at the *end*
of training).  The gt1 column threw away words at the end of training where
spamcount+hamcount <= 1; i.e., it retained only words that appeared more
than once, the non-hapaxes.   The gt2 column retained only words that
appeared more than twice; and so on.  ham_cutoff was 0.20 here, and
spam_cutoff 0.90.

filename:      all     gt1     gt2     gt3     gt4     gt5     gt6
ham:spam:  2741:948 2741:948 2741:948 2741:948 2741:948 2741:948 2741:948
fp total:        1       0       1       0       0       0       0
fp %:         0.04    0.00    0.04    0.00    0.00    0.00    0.00
fn total:        2       2       2       1       2       3       4
fn %:         0.21    0.21    0.21    0.11    0.21    0.32    0.42
unsure t:       81      87      89      82      98      96     100
unsure %:     2.20    2.36    2.41    2.22    2.66    2.60    2.71
real cost:  $28.20  $19.40  $29.80  $17.40  $21.60  $22.20  $24.00
best cost:  $22.20  $17.60  $20.00  $15.40  $16.80  $17.40  $22.40
h mean:       0.81    0.86    0.87    0.72    0.67    0.64    0.65
h sdev:       6.05    6.18    6.17    5.42    5.13    4.94    5.11
s mean:      98.00   97.66   97.54   97.38   97.03   96.62   96.52
s sdev:       9.26   10.22   10.37   10.62   11.19   12.49   12.61
mean diff:   97.19   96.80   96.67   96.66   96.36   95.98   95.87
k:            6.35    5.90    5.84    6.03    5.90    5.51    5.41

# retained
     words:  74327   36437   23877   16143   12798   10719    9157

So while hapaxes are vital with very little training data, even with "just"
about 4K training msgs they didn't buy anything in this test; neither did
words that appeared only two or three times.  It doesn't appear to be touchy
either (all of these columns show excellent results!).
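For the record, the gtN columns amount to nothing fancier than a one-pass
prune after training, along these lines (again assuming the WordInfo shape
above, and that update_probabilities() gets rerun afterward -- the exact
calls in your copy of the code may differ):

    def prune_rare_words(wordinfo, min_count):
        """Drop words whose total msg count is <= min_count.  min_count=1
        reproduces the gt1 column (hapax removal), 2 gives gt2, etc."""
        for word in list(wordinfo):
            info = wordinfo[word]
            if info.hamcount + info.spamcount <= min_count:
                del wordinfo[word]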

> And what is "old"?

That remains a good question, and a good answer may differ between personal
email and bulk email applications.  A problem I see coming up in my personal
email is that some correspondents only show up once a year, and the hapaxes
they generate remain valuable clues, but only once a year.  General
python.org email doesn't appear to suffer anything like that (so long as
personal email is kept out of the python.org mix).
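If someone wants to experiment with expiring "old" hapaxes, it might look
like the sketch below.  Heavily hedged:  it assumes WordInfo.atime holds a
time.time()-style stamp of last use, and the once-a-year correspondents
argue for a generous cutoff; the function name and the 180-day default are
made up for illustration:

    import time

    def expire_old_hapaxes(wordinfo, max_age_days=180):
        """Drop hapaxes that haven't been seen (in training or scoring)
        for max_age_days; leave non-hapaxes alone."""
        cutoff = time.time() - max_age_days * 24 * 3600
        for word in list(wordinfo):
            info = wordinfo[word]
            if info.hamcount + info.spamcount <= 1 and info.atime < cutoff:
                del wordinfo[word]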



