[Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org development

Mon Nov 4 20:24:48 2002

[Jeremy Hylton]
> On the other hand, the message you forwarded got scored 0.494 with
> both *H* and *S* > 0.98.  I'm quite puzzled, though, about how my
> training data is getting used.  I looked back at the spam that came
> through (now in my spam training set) and see that it got scored
> 0.000.  It now gets scored 1.000, but for reasons that don't really
> make sense to me.

An endless string of hapaxes.  This is what mistake-based training can be
*expected* to do over time:  swing wildly from near 0 to near 1 (or vice
versa).

> Here's a snippet of the spamish word from the detailed scoring:
>
> >available 0.844827586207

Every word with that spamprob is a hapax (unique to this msg).  The
Bayesian-adjusted spamprob for a word is

     s*x + n*p
     ---------
        s+n

where, for a spam hapax, p=1.0 and n=1.  s and x are taken from Options.py
unless you've overridden them; the defaults are s=0.45 and x=0.5.  Plug
those all in and you get

>>> (.45 * 0.5 + 1 * 1.0) / (.45 + 1)
0.84482758620689669
>>>

for a spam hapax.
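
The same arithmetic as a small function, for anyone who wants to play with
it.  This is only a sketch, not the classifier's actual code:  the name
adjusted_spamprob is made up for illustration, and the defaults just mirror
the s and x values quoted above.

```python
def adjusted_spamprob(p, n, s=0.45, x=0.5):
    """Return (s*x + n*p) / (s + n).

    p: raw spam probability for the word (1.0 if seen only in spam)
    n: how many times the word has been trained on (1 for a hapax)
    s, x: prior strength and prior value; defaults are assumed from the text
    """
    return (s * x + n * p) / (s + n)

# A word trained on once, in spam only.
spam_hapax = adjusted_spamprob(p=1.0, n=1)

# The mirror case: a word trained on once, in ham only.
ham_hapax = adjusted_spamprob(p=0.0, n=1)

print(spam_hapax, ham_hapax)
```

By the same formula a ham hapax comes out at 0.225/1.45, about 0.155, so a
message made of nothing but hapaxes lands near one extreme or the other
depending on which way it was trained.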

> >cc: 0.844827586207
> >corp. 0.844827586207
> >dickenson 0.844827586207
> >persistence 0.844827586207
> >pure 0.844827586207
> >released 0.844827586207
> >source 0.844827586207
> >unexpected 0.844827586207
> >windows 0.844827586207
> >zeo 0.844827586207
> >zodb 0.844827586207
> >zope 0.844827586207
> [zope-annce] 0.844827586207
> approve, 0.844827586207
> area! 0.844827586207
> behavior.) 0.844827586207
> beta 0.844827586207
> btrees 0.844827586207
> compiler, 0.844827586207
> conflict 0.844827586207
> cream? 0.844827586207
> email addr:zope.org, 0.844827586207
> emails, 0.844827586207
> fav. 0.844827586207
> from:"lindsey 0.844827586207
> from:carter" 0.844827586207
> from:email name:<smileylindsey72001 0.844827586207

So they're *all* spam hapaxes, trained on exactly once, in that email.

In Guido's example under my classifier, I get a dozen stronger-than-hapax
spam clues, thanks to training regularly on correctly classified spam too:

'url:asp'                      0.856992
'skin'                         0.88473
'skilled'                      0.908163
'>do'                          0.908163
'url:index'                    0.911483
'shocking'                     0.921667
'ice'                          0.934783
'ads,'                         0.958716
'emails,'                      0.965116
'part-time'                    0.97619
'area!'                        0.987106
'pics'                         0.991159
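
To see how a clue gets stronger than a hapax, here is a rough sketch that
plugs larger counts into the same adjustment.  It assumes the default s=0.45
and x=0.5 and a word seen only in spam (p=1.0); the helper name
spam_only_prob is hypothetical, and the counts tried are just examples.

```python
S, X = 0.45, 0.5  # assumed defaults for the prior strength and prior value

def spam_only_prob(n):
    # Assumption: the word appeared in n spam and no ham, so p = 1.0.
    return (S * X + n * 1.0) / (S + n)

# Adjusted spamprob for a spam-only word trained on n times.
table = {n: round(spam_only_prob(n), 6) for n in (1, 2, 3, 5, 6, 9)}

for n, prob in table.items():
    print(n, prob)
```

Under those assumptions n=2 gives 0.908163 and n=3 gives 0.934783, which is
at least consistent with several of the scores listed above being spam-only
words trained on more than once.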

> It seems like I'm still doing something wrong with pspam and training
> but I don't know what.  The odd thing is that I tend to get good
> results,

Most spam is easy to detect even from hapaxes.  That's what makes
mistake-based training tempting, I'm afraid.

> the osaf lists aside.

What's an osaf list?