[Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org development
Tim Peters
tim.one@comcast.net
Mon Nov 4 20:24:48 2002
[Jeremy Hylton]
> On the other hand, the message you forwarded got scored 0.494 with
> both *H* and *S* > 0.98. I'm quite puzzled, though, about how my
> training data is getting used. I looked back at the spam that came
> through (now in my spam training set) and see that it got scored
> 0.000. It now gets scored 1.000, but for reasons that don't really
> make sense to me.
An endless string of hapaxes. This is what mistake-based training can be
*expected* to do over time: swing wildly from near 0 to near 1 (or vice
versa).
> Here's a snippet of the spamish word from the detailed scoring:
>
> >available 0.844827586207
Every word with that spamprob is a hapax (unique to this msg). The
Bayesian-adjusted spamprob for a word is
    s*x + n*p
    ---------
      s + n
where, for a spam hapax, p=1.0 and n=1. s and x are taken from Options.py
unless you've overridden them; the defaults are s=0.45 and x=0.5. Plug
those all in and you get
>>> (.45 * 0.5 + 1 * 1.0) / (.45 + 1)
0.84482758620689669
>>>
for a spam hapax.
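The same arithmetic as a small function, for anyone who wants to try other values. This is only a sketch: the name adjusted_spamprob is made up for illustration and is not taken from the classifier source, and the defaults simply repeat the s=0.45 and x=0.5 quoted above. The ham case assumes p=0.0 for a word seen once, in ham only.

```python
def adjusted_spamprob(p, n, s=0.45, x=0.5):
    """Blend the raw estimate p (from n trained messages) with the prior x.

    s is the weight given to the prior; the defaults repeat the values
    quoted in the message above. The function name is hypothetical.
    """
    return (s * x + n * p) / (s + n)


spam_hapax = adjusted_spamprob(p=1.0, n=1)  # word seen once, in spam only
ham_hapax = adjusted_spamprob(p=0.0, n=1)   # word seen once, in ham only (assumed p=0.0)

print(round(spam_hapax, 12))  # 0.844827586207
print(round(ham_hapax, 12))   # 0.155172413793
```

With x=0.5 the two hapax values are mirror images around 0.5, so a single training message moves a previously unseen word the same distance in either direction.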
> >cc: 0.844827586207
> >corp. 0.844827586207
> >dickenson 0.844827586207
> >persistence 0.844827586207
> >pure 0.844827586207
> >released 0.844827586207
> >source 0.844827586207
> >unexpected 0.844827586207
> >windows 0.844827586207
> >zeo 0.844827586207
> >zodb 0.844827586207
> >zope 0.844827586207
> [zope-annce] 0.844827586207
> approve, 0.844827586207
> area! 0.844827586207
> behavior.) 0.844827586207
> beta 0.844827586207
> btrees 0.844827586207
> compiler, 0.844827586207
> conflict 0.844827586207
> cream? 0.844827586207
> email addr:zope.org, 0.844827586207
> emails, 0.844827586207
> fav. 0.844827586207
> from:"lindsey 0.844827586207
> from:carter" 0.844827586207
> from:email name:<smileylindsey72001 0.844827586207
So they're *all* spam hapaxes, trained on exactly once, in that email.
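To see why a pile of hapaxes is enough to swing a score from one extreme to the other, here is a rough sketch of chi-squared combining, the scheme that produces the *H* and *S* indicators. It is written from the general description of that scheme, not copied from the classifier source; the names chi2q and combine are hypothetical, and the count of 25 clues is an arbitrary stand-in for "a couple dozen hapaxes", not the number actually used for this message.

```python
import math


def chi2q(x2, v):
    """Return P(chi-squared with v degrees of freedom >= x2); v must be even."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Combine per-word spamprobs into one score in [0, 1] (sketch only)."""
    n = len(probs)
    # s is near 1 when the clues look collectively spammy, h when hammy.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


SPAM_HAPAX = 1.225 / 1.45  # (s*x + n*p) / (s + n) with p=1.0, n=1
HAM_HAPAX = 0.225 / 1.45   # same formula with p=0.0, n=1

N_CLUES = 25  # assumed count, for illustration only
all_spam_hapaxes = combine([SPAM_HAPAX] * N_CLUES)
all_ham_hapaxes = combine([HAM_HAPAX] * N_CLUES)

print(all_spam_hapaxes)  # very close to 1
print(all_ham_hapaxes)   # very close to 0
```

Each clue is individually weak, but two dozen of them all pointing the same way are treated as overwhelming evidence, so training on one message is enough to pin that message's score at one end of the scale.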
In Guido's example under my classifier, I get a dozen stronger-than-hapax
spam clues, thanks to training regularly on correctly classified spam too:
'url:asp' 0.856992
'skin' 0.88473
'skilled' 0.908163
'>do' 0.908163
'url:index' 0.911483
'shocking' 0.921667
'ice' 0.934783
'ads,' 0.958716
'emails,' 0.965116
'part-time' 0.97619
'area!' 0.987106
'pics' 0.991159
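Several of the values in that list are what the adjustment formula gives for a word seen only in spam (p=1.0) but in more than one trained message. The check below is a sketch under that assumption; the function name is hypothetical, the defaults are the s=0.45 and x=0.5 quoted earlier, and the n values are inferred from the arithmetic rather than read from any training database.

```python
def adjusted_spamprob(p, n, s=0.45, x=0.5):
    """(s*x + n*p) / (s + n), with the default s and x quoted earlier."""
    return (s * x + n * p) / (s + n)


# Listed spamprob -> inferred number of spam messages containing the word,
# assuming the word never appeared in ham (p = 1.0).
inferred = {
    0.908163: 2,  # 'skilled', '>do'
    0.934783: 3,  # 'ice'
    0.958716: 5,  # 'ads,'
    0.965116: 6,  # 'emails,'
    0.97619: 9,   # 'part-time'
}

for listed, n in inferred.items():
    computed = adjusted_spamprob(1.0, n)
    print(n, round(computed, 6), listed)
```

The remaining values in the list do not correspond to a whole number of spam-only sightings, which suggests those words also turned up in some ham.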
> It seems like I'm still doing something wrong with pspam and training
> but I don't know what. The odd thing is that I tend to get good
> results,
Most spam is easy to detect even from hapaxes. That's what makes
mistake-based training tempting, I'm afraid.
> the osaf lists aside.
What's an osaf list?