[Spambayes] RE: For the bold

Tim Peters tim.one@comcast.net
Sat, 05 Oct 2002 20:35:49 -0400


One more test result here, using Gary's *original* central-limit scheme.
That didn't get a fair trial when it was introduced:  at the time, the
business about "certainty" under these schemes wasn't known, or even
suspected, so it looked poorer by comparison due to a seemingly large
increase in error rates.  Now we know that *most* of that was just the
system very helpfully telling us it's unsure of its decision.  But at the
time, Gary immediately came up with central_limit2, and central_limit has
been neglected ever since.

Same setup as before, but with use_central_limit:

-> <stat> Ham scores for all runs: 7500 items; mean 0.26; sdev 3.67
-> <stat> min 0; median 0; max 100
* = 123 items
  0 7461 *************************************************************
 25   37 *
 50    1 *
 75    1 *

-> <stat> Spam scores for all runs: 7500 items; mean 99.75; sdev 3.59
-> <stat> min 0; median 100; max 100
* = 123 items
  0    1 *
 25    3 *
 50   33 *
 75 7463 *************************************************************
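
For anyone reading these histograms for the first time, here is a minimal
sketch of how the "* = 123 items" legend can arise.  It is not the actual
histogram code from the test driver: the function name render_histogram and
the 61-column maximum bar width are assumptions made for illustration.  The
idea is to pick the smallest whole number of items per star that lets the
tallest bar fit, and to round every bar up so that a non-empty bucket always
shows at least one star.

```python
def render_histogram(buckets, max_width=61):
    """Return (items_per_star, lines) for a list of (label, count) pairs.

    max_width is an assumed maximum bar length, not a value taken from
    the original tool.
    """
    biggest = max(count for _, count in buckets)
    # Ceiling division, so the tallest bar fits within max_width stars.
    items_per_star = max(1, -(-biggest // max_width))
    lines = []
    for label, count in buckets:
        # Round up, so any non-empty bucket gets at least one star.
        stars = -(-count // items_per_star)
        lines.append("%3d %4d %s" % (label, count, "*" * stars))
    return items_per_star, lines


# Bucket counts copied from the ham histogram above.
ham_buckets = [(0, 7461), (25, 37), (50, 1), (75, 1)]
unit, lines = render_histogram(ham_buckets)
print("* = %d items" % unit)
print("\n".join(lines))
```

Under that assumed width, both the ham counts and the spam counts (largest
bucket 7463) give 123 items per star, and the buckets holding a single
message still get one star.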

Overall, it's quite comparable to the two other central limit variations,
just slightly more often uncertain (in absolute terms).  The uncertainty
increase is large in *relative* terms, though, which is why this looked like
a big jump in error rates when it was first tried.

Crunching the raw data via rmspik:

Reading clim.pik ...
Nham= 7500
RmsZham= 2.93763751621
Nspam= 7500
RmsZspam= 3.62374621717
======================================================================
HAM:
Sure/ok       7491
Unsure/ok     8
Unsure/not ok 1
Sure/not ok   0
Unsure rate = 0.12%
Sure fp rate = 0.00%; Unsure fp rate = 11.11%
======================================================================
SPAM:
FALSE NEGATIVE: zham=4.22 zspam=-4.08 Data/Spam/Set4/3434.txt SURE!
FALSE NEGATIVE: zham=4.55 zspam=-3.75 Data/Spam/Set4/635.txt SURE!
FALSE NEGATIVE: zham=4.90 zspam=-3.41 Data/Spam/Set6/12822.txt SURE!
FALSE NEGATIVE: zham=3.18 zspam=-5.12 Data/Spam/Set7/4234.txt SURE!
FALSE NEGATIVE: zham=4.85 zspam=-3.45 Data/Spam/Set8/975.txt SURE!
Sure/ok       0
Unsure/ok     0
Unsure/not ok 7495
Sure/not ok   5
Unsure rate = 99.93%
Sure fn rate = 100.00%; Unsure fn rate = 100.00%
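
The three percentages in each listing appear to follow from the four counts
printed above them.  The sketch below is inferred from the printed numbers,
not taken from rmspik itself: the function name rates is hypothetical, and
returning 0.0 for an empty denominator is an assumption.

```python
def rates(sure_ok, unsure_ok, unsure_bad, sure_bad):
    """Return (unsure rate, sure error rate, unsure error rate) in percent.

    Formulas inferred from the listings above; not the rmspik source.
    """
    total = sure_ok + unsure_ok + unsure_bad + sure_bad
    unsure = unsure_ok + unsure_bad
    sure = sure_ok + sure_bad

    def pct(num, den):
        # Assumption: report 0.0 when there is nothing to divide by.
        return 100.0 * num / den if den else 0.0

    return pct(unsure, total), pct(sure_bad, sure), pct(unsure_bad, unsure)


# Counts copied from the HAM listing above.
ham = rates(sure_ok=7491, unsure_ok=8, unsure_bad=1, sure_bad=0)
print("Unsure rate = %.2f%%" % ham[0])
print("Sure fp rate = %.2f%%; Unsure fp rate = %.2f%%" % ham[1:])
```

Fed the HAM counts, this gives 0.12%, 0.00% and 11.11%, the same figures as
the listing.  Fed the four SPAM counts exactly as printed, it also gives the
percentages shown in the SPAM listing.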

So the RMS business is certain very much more often under the original
central limit scheme:

                 RMS ham unsure    RMS spam unsure
                 --------------    ---------------
central_limit                 9                  0
central_limit2              175                 77
central_limit3              184                227

That suggests to me that, whatever the heck <wink> RMS is doing, it's a much
better fit to the original central_limit scheme, but has a bizarre problem
with spam there.  I don't know whether I care about it, though, as it would
have leaked 5 spam out of 7500, and that's a measly 0.067% total f-n rate.
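
To put the table and the 0.067% figure on the same footing, here is a small
sketch that turns the raw counts into percentages.  It assumes every scheme
was run on the same 7500 ham and 7500 spam ("same setup as before"); the
names TOTAL, rms_unsure and pct are made up for the example.

```python
TOTAL = 7500  # assumed number of messages per class for every scheme

# (ham unsure, spam unsure) counts copied from the table above.
rms_unsure = {
    "central_limit": (9, 0),
    "central_limit2": (175, 77),
    "central_limit3": (184, 227),
}


def pct(n, total=TOTAL):
    """Express a count as a percentage of the assumed class size."""
    return 100.0 * n / total


for scheme, (ham_unsure, spam_unsure) in rms_unsure.items():
    print("%-15s ham unsure %5.2f%%   spam unsure %5.2f%%"
          % (scheme, pct(ham_unsure), pct(spam_unsure)))

# The 5 "sure but wrong" spam out of 7500.
leak_rate = pct(5)
print("total f-n rate = %.3f%%" % leak_rate)
```

The last line works out to 0.067%, matching the figure quoted above.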

Let's look at the "sure but wrong" FN there:

FALSE NEGATIVE: zham=4.22 zspam=-4.08 Data/Spam/Set4/3434.txt SURE!
   The "Hello, my Name is BlackIntrepid" spam, discussed at length
   previously here.  Had no spam indicators at all when
   max_discriminators was 16 under the Graham scheme (highest
   spamprob among the 16 most extreme was about 0.05(!)).

FALSE NEGATIVE: zham=4.55 zspam=-3.75 Data/Spam/Set4/635.txt SURE!
   A short "just folks" spam that has given lots of schemes trouble:

"""
Return-Path: <scottmark1968@hotmail.com>
Delivered-To: em-ca-bruceg@em.ca
Received: (qmail 13437 invoked from network); 16 Aug 2002 02:37:15 -0000
Received: from unknown (HELO pakistan) (203.135.9.174)
  by churchill.factcomp.com with SMTP; 16 Aug 2002 02:37:15 -0000
From: "Scott Mark" <scottmark1968@hotmail.com>
To: <bruceg@em.ca>
Subject: Hello !
Mime-Version: 1.0
Content-Type: text/html; charset="iso-8859-1"
Date: Fri, 9 Aug 2002 08:35:24
Content-Length: 609

<BR>
Hi,      <BR>
<BR>
Just wanted you to check out this cool online website builder. It lets
people create cool websites in minutes and for free. You can create your own
Flash animations and Intro as well. Its really simple and easy to use :) and
its all a matter of minutes, you'll have an impressive website up and
running in no time, i'm impressed ... I bet you'll be impressed as well.<BR>
<BR>
This website gives a nice review and how to get started creating your first
website easily : <a href="http://www.click-free.com">www.click-free.com</a>
<BR>
<BR>
Thanks,      <BR>
Scott Mark.   <BR>
"""


FALSE NEGATIVE: zham=4.90 zspam=-3.41 Data/Spam/Set6/12822.txt SURE!
    "Subject: Website Programmers Available Now"

    Loaded with tech terms related to web design and programming,
    a frequent topic on c.l.py (my ham).  The more-extreme central-
    limit schemes get huge benefit out of extremely large spamprob
    words like "offshore".  Extreme extreme <wink> words don't have
    such extreme effect under the original cl scheme.


FALSE NEGATIVE: zham=3.18 zspam=-5.12 Data/Spam/Set7/4234.txt SURE!
    "Subject: www.NameYork.com / Webmaster link directory"

    I've had lots of trouble with this one before.  It's a long HTML
    msg full of links that would be of actual interest to webmasters.
    It even includes a link for Python.  I've never been entirely
    sure that it's spam, but it "smells more" like spam than ham
    to me.


FALSE NEGATIVE: zham=4.85 zspam=-3.45 Data/Spam/Set8/975.txt SURE!
    Another "just folks" spam that has given lots of schemes
    trouble:

"""
Return-Path: <jarph3@core.com>
Delivered-To: em-ca-bruceg@em.ca
Received: (qmail 27970 invoked from network); 14 Jul 2002 01:43:00 -0000
Received: from agamemnon.bfsmedia.com (204.83.201.2)
  by churchill.factcomp.com with SMTP; 14 Jul 2002 01:43:00 -0000
Received: (qmail 28917 invoked from network); 14 Jul 2002 01:26:53 -0000
Received: from c-24-131-114-96.mw.client2.attbi.com (HELO core.com)
  (24.131.114.96)
  by agamemnon.bfsmedia.com with SMTP; 14 Jul 2002 01:26:53 -0000
From: "jarph3@core.com" <jarph3@core.com>
To: <bruceg@em.ca>
Subject: I want to share with you what I found
Sender: "jarph3@core.com" <jarph3@core.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8bit
Date: Sat, 13 Jul 2002 21:16:04 -0400
Content-Length: 676

My brother asked me to design a web page for his band Tainted Emotions. At
first, his site was nothing more than a few paragraphs describing his
unique psychotic melodies.  Although a good start, mere words failed to
convey the complete Tainted Emotions experience.  For that, I needed
graphics.  Not just any graphics though.  Fast, sleek, and professional
images that only my brother's band deserves.

I found all the free public domain photos I needed at freewebgrafix.com.
They had everything an aspiring graphics designer needs to transform a
texty site into a graphic sensation.  Animated GIFs, backgrounds, banners,
and of course--photos.

http://www.freewebgrafix.com
"""

I can live with spam like that -- the combination of original-cl and RMS
looks very much worth pursuing.