[Spambayes] XTreme Training

Tim Peters tim.one@comcast.net
Tue, 10 Sep 2002 23:46:15 -0400


[Tim]
> ...
> At the other extreme, training on half my ham&spam, and scoring aginst
> the other half
> ...
>     false positive rate: 0.0100%
>     false negative rate: 0.3636%
> ...
> Alas, all 4 of the 0.99 clues there are HTML-related.

That begged to try it again but with Tokenize/retain_pure_html_tags false.
The random halves getting trained on and scored against are different here,
and I repaired the bug that dropped 1 ham and 1 spam on the floor, so this
isn't exactly a 1-change difference between runs.

Ham distribution for all runs:
* = 167 items
  0.00 9999 ************************************************************
 10.00    0
 20.00    0
 30.00    0
 40.00    0
 50.00    0
 60.00    0
 70.00    0
 80.00    0
 90.00    1 *

Spam distribution for all runs:
* = 115 items
  0.00   21 *
 10.00    0
 20.00    0
 30.00    1 *
 40.00    0
 50.00    0
 60.00    1 *
 70.00    0
 80.00    1 *
 90.00 6852 ************************************************************

    false positive rate: 0.0100%
    false negative rate: 0.3490%

Yay!  That may mean that HTML tags aren't really needed in my test data
provided it's trained on enough stuff.  Curiously, the sole false positive
here is the same as the sole false positive on the half&half run reported in
the preceding msg (I assume the Nigerian scam "false positive" just happened
to end up in the training data both times):

************************************************************************
Data/Ham/Set4/107687.txt
prob = 0.999632042904
prob('python.') = 0.01
prob('alteration') = 0.01
prob('edinburgh') = 0.01
prob('subject:Python') = 0.01
prob('header:Errors-To:1') = 0.0216278
prob('thanks,') = 0.0319955
prob('help?') = 0.041806
prob('road,') = 0.0462364
prob('there,') = 0.0722794
prob('us.') = 0.906609
prob('our') = 0.919118
prob('company,') = 0.921852
prob('visit') = 0.930785
prob('sent.') = 0.939882
prob('e-mail') = 0.949765
prob('courses') = 0.954726
prob('received') = 0.955209
prob('analyst') = 0.960756
prob('investment') = 0.975139
prob('regulated') = 0.99
prob('e-mails') = 0.99
prob('mills') = 0.99

Received: from [195.171.5.71] (helo=node401.dmz.standardlife.com)
        by mail.python.org with esmtp (Exim 3.21 #1)
        id 15rDsu-00085k-00
        for python-list@python.org; Wed, 10 Oct 2001 03:34:32 -0400
Received: from slukdcn4.internal.standardlife.com (slukdcn4.standardlife.com
        [10.3.2.72])
        by node401.dmz.standardlife.com (Pro-8.9.3/Pro-8.9.3) with SMTP id
        IAA53660for <python-list@python.org>; Wed, 10 Oct 2001 08:34:00
+0100
Received: from sl079320 ([172.31.88.231]) by
        slukdcn4.internal.standardlife.com (Lotus SMTP MTA v4.6.6  (890.1
7-16-1999))
        with SMTP id 80256AE1.00294B60; Wed, 10 Oct 2001 08:31:02 +0100
Message-ID:
<007e01c1515d$bb255940$e7581fac@sl079320.internal.standardlife.com>
From: "Vickie Mills" <vickie_mills@standardlife.com>
To: <python-list@python.org>
Subject: Training Courses in Python in UK
Date: Wed, 10 Oct 2001 08:32:30 +0100
MIME-Version: 1.0
Content-Type: text/plain;
        charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 4.72.3155.0
X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3155.0
Sender: python-list-admin@python.org
Errors-To: python-list-admin@python.org
X-BeenThere: python-list@python.org
X-Mailman-Version: 2.0.6 (101270)
Precedence: bulk
List-Help: <mailto:python-list-request@python.org?subject=help>
List-Post: <mailto:python-list@python.org>
List-Subscribe: <http://mail.python.org/mailman/listinfo/python-list>,
        <mailto:python-list-request@python.org?subject=subscribe>
List-Id: General discussion list for the Python programming language
        <python-list.python.org>
List-Unsubscribe: <http://mail.python.org/mailman/listinfo/python-list>,
        <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-list/>

Hi there,

I am looking for you recommendations on training courses available in the UK
on Python.  Can you help?

Thanks,

Vickie Mills
IS Training Analyst

Tel:    0131 245 1127
Fax:    0131 245 1550
E-mail:    vickie_mills@standardlife.com

For more information on Standard Life, visit our website
http://www.standardlife.com/   The Standard Life Assurance Company, Standard
Life House, 30 Lothian Road, Edinburgh EH1 2DH, is registered in Scotland
(No SZ4) and regulated by the Personal Investment Authority.  Tel: 0131 225
2552 - calls may be recorded or monitored.  This confidential e-mail is for
the addressee only.  If received in error, do not retain/copy/disclose it
without our consent and please return it to us.  We virus scan all e-mails
but are not responsible for any damage caused by a virus or alteration by a
third party after it is sent.
************************************************************************

The top 30 discriminators are more interesting now:

        'income' 629 0.99
        'http0:python' 643 0.01
        'header:MiME-Version:1' 672 0.99
        'http1:remove' 693 0.99
        'content-type:text/html' 711 0.982345
        'string' 714 0.01
        'http>1:jpg' 776 0.99
        'object' 813 0.01
        'python,' 852 0.01
        'python.' 882 0.01
        'language' 883 0.01
        '>>>' 907 0.01
        'header:Return-Path:2' 907 0.99
        'unsubscribe' 975 0.99
        'header:Received:7' 1113 0.99
        'def' 1142 0.01
        'http>1:gif' 1168 0.99
        'module' 1169 0.01
        'import' 1332 0.01
        'header:Received:8' 1342 0.99
        'header:Errors-To:1' 1377 0.0216278
        'header:In-Reply-To:1' 1402 0.01
        'wrote' 1753 0.01
        '&nbsp;' 2067 0.99
        'subject:Python' 2140 0.01
        'header:User-Agent:1' 2322 0.01
        'header:X-Complaints-To:1' 4351 0.01
        'wrote:' 4370 0.01
        'python' 4972 0.01
        'header:Organization:1' 6921 0.01

There are still two HTML clues remaining there ("&nbsp;" and
"content-type:text/html").  Anthony's trick accounts for almost a third of
these.  "Python" appears in 5 of them ('http0:python' means that 'python'
was found in the 1st field of an embedded http:// URL).  Sticking a .gif or
a .jpg in  a URL both score as 0.99 spam clues.  Note the damning pattern of
capitalization in 'header:MiME-Version:1'!  This counting is case-sensitive,
and nobody ever would have guessed that MiME is more damning than SUBJECT or
DATE.  Why would spam be likely to end up with two instances of Return-Path
in the headers?