[Spambayes] RE: Re: [Design] Contacts (Michael R. Bernstein)

Tim Peters tim.one@comcast.net
Fri Nov 1 03:20:55 2002


[Jeremy Hylton]
> Interesting effect.  I signed up for a couple of new mailing lists
> (concerning the Kapor PIM project).  The discussions on them seem to
> be very different than the stuff I usually get, and the conclusion is
> that at least some of it is definitely spam.
>
> It's unfortunate that they get marked as spam instead of unsure.  It
> means that when you get mail in a new subject area, the
> classifier will make some wildly wrong guesses until you get enough
> new training data.

If it gets a high spam score, that simply reflects how you've trained; e.g.,
for a tech guy to have the word "computer" as a high-spamprob word is
suspicious all by itself:

> computer 0.851704776271

Other oddities:

> applications 0.81854602306
> installed 0.81854602306
> full 0.829689461024
> once 0.831240651842
> people 0.832090040931


> results 0.844279808423
> ...and 0.844827586207
> hacked 0.844827586207
> techniques, 0.844827586207

Those four look like hapaxes, to judge from the scores.
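
For readers wondering why a hapax gives itself away by its score: the
sketch below assumes a Robinson-style adjustment with an "unknown word
strength" of 0.45 and an "unknown word probability" of 0.5.  Those two
values, and every name in the code, are assumptions for illustration and
are not stated in this message.  Under them, a word seen in exactly one
spam and no ham comes out at about 0.844828, which matches the repeated
score above.

```python
def adjusted_spamprob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Hypothetical helper: spamprob of one word after a Robinson-style adjustment.

    spam_count / ham_count: number of trained spam / ham messages containing the word.
    nspam / nham: total trained spam / ham messages (both must be > 0).
    s, x: assumed prior strength and prior probability for unseen words.
    """
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    # Raw estimate from the counts alone (requires the word to be seen at least once).
    p = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    # Pull the raw estimate toward x; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)


# A hapax seen once in spam, never in ham: about 0.844828.
print(adjusted_spamprob(1, 0, nspam=100, nham=100))
# A hapax seen once in ham, never in spam: about 0.155172.
print(adjusted_spamprob(0, 1, nspam=100, nham=100))
```

With a single sighting the raw estimate is 0 or 1 no matter how large the
corpus is, so under these assumptions every hapax lands on one of those
two values.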

> computer, 0.923563305955
> application 0.935918946825
> security 0.939824371865
> contacts 0.948037345934
> list, 0.970027495049
> data. 0.973372781065

I can only assume you've only trained it on Shakespeare ham <wink>.

Even stranger are the words that *don't* show up with high spamprobs for
you.  I haven't signed up for this list, but my personal classifier scored
the attachment very differently:

Spam Score: 2.85542e-008

'*H*'                          1
'*S*'                          1.40346e-010
'wrote:'                       0.001868
'subject:: ['                  0.0142256
'url:mailman'                  0.0151898
'url:listinfo'                 0.0182172
'url:lists'                    0.0193688

That you didn't have those as low-spamprob words suggests you've trained on
almost no mailing-list ham.

'otoh,'                        0.0215311
'interface'                    0.0310106
'(so'                          0.0348837
'scripting'                    0.0348837
'thu,'                         0.0412844
'url:org'                      0.0474607
'solved'                       0.050232
'it).'                         0.0505618
'x-mailer:ximian evolution 1.0.8' 0.0505618
'header:In-Reply-To:1'         0.0539033
'false'                        0.0564499
'header:Errors-To:1'           0.0634285
'>from:'                       0.0652174
'[snip]'                       0.0652174
'origin'                       0.0652174
'pointless'                    0.0652174
'protocol'                     0.0652174
'share'                        0.076162
'subject:] '                   0.0842127
'tool.'                        0.0918367
'challenge'                    0.0918367
'machine,'                     0.0918367
'api,'                         0.0918367
'techniques,'                  0.0918367
'subset'                       0.0918367
'key,'                         0.0918367
'apps'                         0.0918367
'except'                       0.0982036
'(in'                          0.114502
'copy'                         0.12034
'problem'                      0.12813
'encrypted'                    0.132432
'returned'                     0.138575
'quite'                        0.140119
'foundation'                   0.150981
'compromise'                   0.155172
'automating'                   0.155172
'widely.'                      0.155172
'url:osafoundation'            0.155172
'key.'                         0.155172
'horses'                       0.155172
'header:Received:5'            0.156821
'running'                      0.158385
'obviously'                    0.160346
'user'                         0.168307
'machine.'                     0.181282
'interesting'                  0.186509
'code'                         0.191394
'feature'                      0.196331
'(there'                       0.197397
'book,'                        0.197397
'list'                         0.199755
'also'                         0.202693
'probably'                     0.205907
'environment.'                 0.208559
'mine'                         0.208559
'michael'                      0.213552
'those'                        0.215792
'entry'                        0.218192
'installed'                    0.232266
'requiring'                    0.245609
'belong'                       0.245609
'distribution'                 0.248786
'shared'                       0.253444
'but'                          0.254219
'data'                         0.257012
'>from'                        0.262199
'insecure'                     0.262199
'open'                         0.268785
'think'                        0.276856
'wrong'                        0.283272
'which'                        0.284943
'saying'                       0.287973
'source'                       0.291711
'his'                          0.293522
'application'                  0.306776
'should'                       0.307599
'address'                      0.308855
'machine'                      0.309309
'users'                        0.321312
'e-mail'                       0.32136
'public'                       0.322013
'keys'                         0.334772
"can't"                        0.334823
'were'                         0.338587
'using'                        0.345285
'needs'                        0.34717
'used'                         0.348371
'mailing'                      0.349775
'having'                       0.351608
'header:Message-Id:1'          0.353295
'part'                         0.356067
'bit'                          0.359997
'skip:s 10'                    0.361335
"it's"                         0.362434
'anyone'                       0.365533
'once'                         0.369773
"won't"                        0.370267
'avoid'                        0.370705
'provided'                     0.376177
"isn't"                        0.379705
'joel'                         0.382591
'widely'                       0.382591
'with'                         0.384731
'there'                        0.390536
'that'                         0.394845
'even'                         0.600998
'sharing'                      0.605368
'key'                          0.60538
'header:Return-Path:1'         0.616771
'every'                        0.617388
'easy'                         0.62261
'full'                         0.625898
'large'                        0.628002
'trust'                        0.673698
'capabilities'                 0.674899
'computer,'                    0.682833
'secure'                       0.689415
'results'                      0.693054
'happen'                       0.711906
'contacts'                     0.716121
'here.'                        0.717214
'place.'                       0.720453
'easily'                       0.750357
'further'                      0.766401
'information'                  0.778198
'again.'                       0.779556
'securely,'                    0.844828
'trusting'                     0.844828
'trojan'                       0.844828
'hacked'                       0.844828
'from:"michael'                0.844828
'claiming'                     0.844828
'response'                     0.878172
'list,'                        0.881299
'2..'                          0.886933
'header:Mime-Version:1'        0.889037
'data.'                        0.889756
'url:design'                   0.908163
'...and'                       0.969799
'wealth'                       0.97619

And that you didn't get 'wealth' as a high-spamprob word suggests something
even weirder.
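
A note on the '*H*' and '*S*' entries at the top of the clue list: they
are the ham and spam evidence summaries that get folded into the final
score.  The sketch below is a minimal version of chi-squared (Fisher-style)
combining as I understand it; the function names are hypothetical, the
code is not taken from this message, and it skips the underflow handling
a real implementation would need.

```python
import math


def chi2q(x2, v):
    """Probability that a chi-squared variate with v (even) degrees of freedom exceeds x2."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Guard against rounding pushing the sum slightly above 1.
    return min(total, 1.0)


def combine(probs):
    """Return (S, H, score) for word spamprobs strictly between 0 and 1."""
    n = len(probs)
    if n == 0:
        return 0.5, 0.5, 0.5
    ln_s = sum(math.log(1.0 - p) for p in probs)  # strongly negative when many probs are high
    ln_h = sum(math.log(p) for p in probs)        # strongly negative when many probs are low
    s = 1.0 - chi2q(-2.0 * ln_s, 2 * n)           # spam evidence
    h = 1.0 - chi2q(-2.0 * ln_h, 2 * n)           # ham evidence
    return s, h, (s - h + 1.0) / 2.0


s, h, score = combine([0.02] * 8 + [0.9] * 2)
print(s, h, score)  # H close to 1, S small, score close to 0
```

When H is near 1 and S is near 0, as in the listing above, the combined
score is pushed toward 0; when the two are about equal the score sits
near 0.5, which is where "unsure" comes from.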

> Of the 14 new messages, I see 1 ham, 3 spam, and 10 unsure.  I've
> forwarded one of the high scoring spams.

Train on them; it will learn what you teach it, and nothing else.