[Python-Dev] The first trustworthy <wink> GBayes results

Tim Peters tim.one@comcast.net
Wed, 28 Aug 2002 15:12:53 -0400

FYI.  After cleaning the blatant spam identified by the classifier out of my
ham corpus, and replacing it with new random msgs from Barry's corpus, the
reported false positive rate fell to about 0.2% (averaging 8 per batch
of 4000 ham test messages).  This seems remarkable given that it's ignoring
headers, and just splitting the raw text on whitespace in total ignorance of
HTML & MIME etc.
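For concreteness, that tokenization really is nothing fancier than splitting on whitespace; a minimal sketch (the function name is mine):

```python
def tokenize(text):
    # Split raw message text on runs of whitespace -- no header,
    # HTML, or MIME awareness at all.
    return text.split()
```

So raw HTML survives as "words" like '<META' and 'NAME=3D"GENERATOR"', which is why those show up as clues in the scorings below.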

'FREE' (all caps) moved into the ranks of best spam indicators.  The false
negative rate got reduced by a small amount, but I doubt it's a
statistically significant reduction (I'll compute that stuff later; I'm
looking for Big Things now).

Some of these false positives are almost certainly spam, and at least one is
almost certainly a virus:  these are msgs that are 100% base64-encoded, or
maximally obfuscated quoted-printable.  That could almost certainly be fixed
by, e.g., decoding encoded text.
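Undoing the transfer encodings could be done with the std email package before tokenizing; a rough sketch with deliberately simplistic multipart handling (the helper name is mine):

```python
import email

def decoded_text(raw_msg):
    # Parse the message and undo base64 / quoted-printable transfer
    # encodings on each text part before any tokenization happens.
    msg = email.message_from_string(raw_msg)
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() == 'text':
            payload = part.get_payload(decode=True)  # None for multipart containers
            if payload is not None:
                chunks.append(payload.decode('latin-1', 'replace'))
    return '\n'.join(chunks)
```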

The other false positives seem harder to deal with:

+ Brief HTML msgs from newbies.  I doubt the headers will help these
  get through, as they're generally first-time posters, and aren't
  replies to earlier msgs.  There's little positive content, while
  all elements of raw HTML have high "it's spam" probability.


Content-Description: filename="text1.txt"
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Is there a version of Python with Prolog Extension??
Where can I find it if there is?


P.S. Could you please reply to the sender too.

Content-Description: filename="text1.html"
Content-Type: text/html
Content-Transfer-Encoding: quoted-printable

        <TITLE>Prolog Extension</TITLE>
        <META NAME=3D"GENERATOR" CONTENT=3D"StarOffice/5.1 (Linux)">
        <META NAME=3D"CREATED" CONTENT=3D"19991127;12040200">
        <META NAME=3D"CHANGEDBY" CONTENT=3D"Luis Cortes">
        <META NAME=3D"CHANGED" CONTENT=3D"19991127;12044700">
<PRE>Is there a version of Python with Prolog Extension??
Where can I find it if there is?


P.S. Could you please reply to the sender too.</PRE>


Here's how it got scored:

prob = 0.999958816093
prob('<META') = 0.957529
prob('<META') = 0.957529
prob('<META') = 0.957529
prob('<BODY>') = 0.979284
prob('Prolog') = 0.01
prob('<HEAD>') = 0.97989
prob('Thanks,') = 0.0337316
prob('Prolog') = 0.01
prob('Python') = 0.01
prob('NAME=3D"GENERATOR"') = 0.99
prob('<HTML>') = 0.99
prob('</HTML>') = 0.989494
prob('</BODY>') = 0.987429
prob('Thanks,') = 0.0337316
prob('Python') = 0.01

Note that '<META' gets penalized 3 times.  More on that later.
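For reference, the per-word probabilities get combined Graham-style -- prod(p) over prod(p) + prod(1-p) -- so each repeat of '<META' pushes the total further toward 1.  A sketch of that combining step (not the classifier's actual code):

```python
from functools import reduce
import operator

def combine(probs):
    # Graham-style combining: P = prod(p) / (prod(p) + prod(1-p)).
    prod = reduce(operator.mul, probs, 1.0)
    inv_prod = reduce(operator.mul, [1.0 - p for p in probs], 1.0)
    return prod / (prod + inv_prod)
```

With a single clue, combine([p]) is just p; feed it 0.957529 three times and the result climbs past 0.999 -- the repeated-penalty effect at work.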

+ Msgs talking *about* HTML, and including HTML in examples.  This one
  may be troublesome, but there are mercifully few of them.

+ Brief msgs with obnoxious employer-generated signatures.  Example:

Hi there,

I am looking for your recommendations on training courses available in the UK
on Python.  Can you help?


Vickie Mills
IS Training Analyst

Tel:    0131 245 1127
Fax:    0131 245 1550
E-mail:    vickie_mills@standardlife.com

For more information on Standard Life, visit our website
http://www.standardlife.com/   The Standard Life Assurance Company, Standard
Life House, 30 Lothian Road, Edinburgh EH1 2DH, is registered in Scotland
(No SZ4) and regulated by the Personal Investment Authority.  Tel: 0131 225
2552 - calls may be recorded or monitored.  This confidential e-mail is for
the addressee only.  If received in error, do not retain/copy/disclose it
without our consent and please return it to us.  We virus scan all e-mails
but are not responsible for any damage caused by a virus or alteration by a
third party after it is sent.

The scoring:

prob = 0.98654879055
prob('our') = 0.928936
prob('sent.') = 0.939891
prob('Tel:') = 0.0620155
prob('Thanks,') = 0.0337316
prob('received') = 0.940256
prob('Tel:') = 0.0620155
prob('Hi') = 0.0533333
prob('help?') = 0.01
prob('Personal') = 0.970976
prob('regulated') = 0.99
prob('Road,') = 0.01
prob('Training') = 0.99
prob('e-mails') = 0.987542
prob('Python.') = 0.01
prob('Investment') = 0.99

The brief human-written part is fine, but the longer boilerplate sig is
indistinguishable from spam.

+ The occasional non-Python conference announcement(!).  These are
  long, so I'll skip an example.  In effect, it's automated bulk email
  trying to sell you a conference, so is prone to use the language and
  artifacts of advertising.  Here's typical scoring, for the TOOLS
  Europe '99 conference announcement:

prob = 0.983583974285
prob('THE') = 0.983584
prob('Object') = 0.01
prob('Bell') = 0.01
prob('Object-Oriented') = 0.01
prob('**************************************************************') =
prob('Bertrand') = 0.01
prob('Rational') = 0.01
prob('object-oriented') = 0.01
prob('CONTACT') = 0.99
prob('**************************************************************') =
prob('innovative') = 0.99
prob('**************************************************************') =
prob('Olivier') = 0.01
prob('VISIT') = 0.99
prob('OUR') = 0.99

Note the repeated penalty for the lines of asterisks.  That segues into the
next one:

+ Artifacts of the algorithm counting multiple instances of "a word"
  multiple times.  These are baffling at first sight!  The two clearest
  examples follow.

> > Can you create and use new files with dbhash.open()?
> Yes. But if I run db_dump on these files, it says "unexpected file type
> or format", regardless which db_dump version I use (2.0.77, 3.0.55,
> 3.1.17)

It may be that db_dump isn't compatible with version 1.85 database files.  I
can't remember.  I seem to recall that there was an option to build 1.85
versions of db_dump and db_load.  Check the configure options for
BerkeleyDB to find out.  (Also, while you are there, make sure that
BerkeleyDB was built the same on both of your platforms...)

> >  Try running db_verify (one of the utilities built
> > when you compiled DB) on the file and see what it tells you.
> There is no db_verify among my Berkeley DB utilities.

There should have been a bunch of them built when you compiled DB.  I've got
these:

-r-xr-xr-x  1 rd       users     343108 Dec 11 12:11 db_archive
-r-xr-xr-x  1 rd       users     342580 Dec 11 12:11 db_checkpoint
-r-xr-xr-x  1 rd       users     342388 Dec 11 12:11 db_deadlock
-r-xr-xr-x  1 rd       users     342964 Dec 11 12:11 db_dump
-r-xr-xr-x  1 rd       users     349348 Dec 11 12:11 db_load
-r-xr-xr-x  1 rd       users     340372 Dec 11 12:11 db_printlog
-r-xr-xr-x  1 rd       users     341076 Dec 11 12:11 db_recover
-r-xr-xr-x  1 rd       users     353284 Dec 11 12:11 db_stat
-r-xr-xr-x  1 rd       users     340340 Dec 11 12:11 db_upgrade
-r-xr-xr-x  1 rd       users     340532 Dec 11 12:11 db_verify

Robin Dunn
Software Craftsman
http://wxPython.org     Java give you jitters?
http://wxPROs.com        Relax with wxPython!

Looks utterly on-topic!  So why did Robin's msg get flagged?  It's solely
due to his Unix name in the ls output(!):

prob = 0.999999999895
prob('Berkeley') = 0.01
prob('configure') = 0.01
prob('remember.') = 0.01
prob('these:') = 0.01
prob('recall') = 0.01
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99
prob('rd') = 0.99

Spammers often generate random "word-like" gibberish at the ends of msgs,
and "rd" is one of the random two-letter combos that appears in the spam
corpus.  Perhaps it would be good to ignore "words" with fewer than W
characters (to be determined by experiment).
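That experiment is a one-liner on top of whitespace splitting; the W value here is a made-up placeholder:

```python
MIN_WORD_LEN = 3  # the "W" above -- the right value needs experiment

def filtered_tokens(text):
    # Drop short "words" so random two-letter gibberish like 'rd'
    # can't contribute ten 0.99 clues from a single ls listing.
    return [w for w in text.split() if len(w) >= MIN_WORD_LEN]
```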

The other example is long, an off-topic but delightful exchange between
Peter Hansen and Alex Martelli.  Here's a "typical" paragraph:

    Since it's important to use very abundant amounts of water when
    cooking pasta, the price of what is still a very cheap dish would
    skyrocket if that abundant water had to be costly bottled mineral

The scoring:

prob = 0.99
prob('"Peter') = 0.01
prob(':-)') = 0.01
prob('<peter@engcorp.com>') = 0.01
prob('tasks') = 0.01
prob('drinks') = 0.01
prob('wrote') = 0.01
prob('Hansen"') = 0.01
prob('water') = 0.99
prob('water') = 0.99
prob('skyrocket') = 0.99
prob('water') = 0.99
prob('water') = 0.99
prob('water') = 0.99
prob('water') = 0.99
prob('water') = 0.99

Alex is drowning in his aquatic excess <wink>.

I expect that including the headers would have given these much better
chances of getting through, given Robin and Alex's posting histories.
Still, the idea of counting words multiple times is open to question, and
experiments both ways are in order.
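The "count once" arm of that experiment is trivial to try -- tokenize into a set instead of a list (whether it helps overall is exactly what needs measuring):

```python
def unique_tokens(text):
    # Each distinct word contributes at most one clue per message,
    # so eight occurrences of 'water' count the same as one.
    return set(text.split())
```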

+ Brief put-ons, like the all-caps AUTOCODING msg scored below.

It's not actually things like WAREZ that hurt here, it's more the mere fact
of SHOUTING:

prob = 0.999982095931
prob('AUTOCODING') = 0.2
prob('THING.') = 0.2
prob('DUDEZ') = 0.2
prob('ANYONE') = 0.884211
prob('GET') = 0.847334
prob('GET') = 0.847334
prob('HEY') = 0.2
prob('--') = 0.0974729
prob('KNOW') = 0.969697
prob('THIS') = 0.953191
prob('?') = 0.0490886
prob('WANT') = 0.99
prob('TO') = 0.988829
prob('CAN') = 0.884211
prob('WAREZ') = 0.2

OTOH, a lot of the Python community considered the whole autocoding thread
to be spam, and I personally could have lived without this contribution to
its legacy (alas, the autocoding thread wasn't spam, just badly off-topic).

+ Msgs top-quoting an earlier spam in its entirety.  For example,
  one msg quoted an entire Nigerian scam msg, and added just

    Aw jeez, another one of these Nigerian wire scams.  This one has
    been around for 20 years.

What's an acceptable false positive rate?  What do we get from SpamAssassin?
I expect we can end up below 0.1% here, and with a generous meaning for "not
spam", but I think *some* of these examples show that the only way to get a
0% false-positive rate is to recode spamprob like so:

    def spamprob(self, wordstream, evidence=False):
        return 0.0

That would also allow other simplifications <wink>.