[Python-Dev] The first trustworthy <wink> GBayes results

Tim Peters tim.one@comcast.net
Mon, 02 Sep 2002 02:54:35 -0400

[Delaney, Timothy]
> On second thought - if a word-pair appears, then the separate parts should
> not be checked as separate words.
> So, If I had scores:
>     'free'              0.1
>     'beer'              0.1
>     ('want', 'free',)   0.9
>     ('free', 'beer',)   0.01
>     ('free', '!!!',)    0.99
> then the following phrases would match (case-folding) as:
>     'I want free beer!!!':
>     ('want', 'free',)   0.9
>     ('free', 'beer',)   0.01
>     'Get *** for free!!!'
>     ('free', '!!!',)    0.99
>     'I want free beer. Free the beer!!!'
>     ('want', 'free',)   0.9
>     ('free', 'beer',)   0.01
>     'free'              0.1
>     'beer'              0.1
> Damn I wish I was at home to try this out ... :(

I'm going to say a lot of stuff here, and then shut up <wink>.  I want to
move on to other things, but there's an opportunity to pass on some darned
good advice for those who can hear.

Combining pairs of words is called "word bigrams".  My intuition at the
start was that it would do better.  OTOH, my intuition also was that
character n-grams for a relatively large n would do better still.  The
latter may be so for "foreign" languages, but for this particular task using
Graham's scheme on the c.l.py tests, turns out they sucked.  A comment block
in timtest.py explains why.

I didn't try word bigrams because the f-p rate is already supernaturally
low, so there doesn't seem anything left to be gained there.  This echoes
what Graham sez on his web page:

    One idea that I haven't tried yet is to filter based on word pairs, or
    even triples, rather than individual words.  This should yield a much
    sharper estimate of the probability.

My comment with benefit of hindsight:  it doesn't.  Because the scoring
scheme throws away everything except about a dozen extremes, the
"probabilities" that come out are almost always very near 0 or very near 1;
only very short or (or especially "and") very bland msgs come out in
between.  This outcome is largely independent of the tokenization scheme --
the scoring scheme forces it, provided only that the tokenization scheme
produces stuff *some* of which *does* vary in frequency between spam and

    For example, in my current database, the word "offers" has a probability
    of .96. If you based the probabilities on word pairs, you'd end up with
    "special offers" and "valuable offers" having probabilities of .99 and,
    say, "approach offers" (as in "this approach offers") having a
    of .1 or less.

The theory is indeed appealing <wink>.

    The reason I haven't done this is that filtering based on individual
    words already works so well.

Which is also the reason I didn't pursue it.

    But it does mean that there is room to tighten the filters if spam gets
    harder to detect.

I expect it would also need a different scoring scheme then.

OK, I ran a full test using word bigrams.  It gets one strike against it at
the start because the database size grows by a factor between 2 and 3.
That's only justified if the results are better.  Before-and-after f-p
(false positive) percentages:

   before   bigrams
    0.000   0.025
    0.000   0.025
    0.050   0.050
    0.000   0.025
    0.025   0.050
    0.025   0.100
    0.050   0.075
    0.025   0.025
    0.025   0.050
    0.000   0.025
    0.075   0.050
    0.050   0.000
    0.025   0.050
    0.000   0.025
    0.050   0.075
    0.025   0.025
    0.025   0.025
    0.000   0.000
    0.025   0.050
    0.050   0.025

Lost on 12 runs
Tied on  5 runs
Won  on  3 runs

total # of unique fps across all runs rose from 8 to 17

The f-n percentages on the same runs:

   before   bigrams
    1.236   1.091
    1.164   1.091
    1.454   1.708
    1.599   1.563
    1.527   1.491
    1.236   1.127
    1.163   1.345
    1.309   1.309
    1.891   1.927
    1.418   1.382
    1.745   1.927
    1.708   1.963
    1.491   1.782
    0.836   0.800
    1.091   1.127
    1.309   1.309
    1.491   1.709
    1.127   1.018
    1.309   1.018
    1.636   1.672

Lost on  9 runs
Tied on  2 runs
Won  on  9 runs

total # of unique fns across all runs rose from 336 to 350

This doesn't need deep analysis:  it costs more, and on the face of it
either doesn't help, or helps so little it's not worth the cost.

Now I'll tell in you confidence <wink> that the way to make a scheme like
this excellent is to keep your ego out of it and let the data *tell* you
what works:  getting the best test setup you can is the most important thing
you can possibly do, which must include multiple training and test corpora
(e.g., if I had used only one pair, I would have had a 3/20 chance of
erroneously concluding that bigrams might help the f-p rate, when running
across 20 pairs shows that they almost certainly do it harm; while I would
have had an even chance of drawing a wrong conclusion-- in either
direction --about the effect on the f-n rate).

The second most important thing is to run a fat test all the way to the end
before concluding anything.  A subtler point is that you should never keep a
change that doesn't *prove* itself a winner:  neutral changes bloat your
code with proven irrelevancies that will come back to make your life harder
later, in part because they'll randomly interfere with future changes in
ways that make it harder to recognize a significant change when you stumble
into one.

Most things you try won't help -- indeed, many of them will deliver worse
results.  I dare say my intution for this kind of classification task is
better than most programmers' (in part because I had years of professional
experience in a related field), and most of the things I tried I had to
throw away.  BFD -- then you try something else.  When I find something that
works I can rationalize it, but when I try something that doesn't, no amount
of argument can change that the data said it sucked <wink>.

Two things about *this* task have fooled me repeatedly:

1. The "only look at smoking guns" nature of the scoring step makes many
   of "on average" intuitions worthless:  "on average" almost everything is
   thrown away!  For example, you're not going to find bad results reported
   for n-grams (neither character- nor word-based) in the literature, and
   because most scoring schemes throw much less away.  Graham's scheme
   as brilliant in this specific respect:  it's worth enduring the ego
   humiliation <wink> to get such a spectacularly low f-p rate from such
   simple and fast code.

2. Most mailing-list messages are much shorter than this one.  This
   systematically frustrates "well, averaged over enough words" intuitions

Cute:  In particular, word bigrams systematically hate conference
announcements.  The current word one-gram scheme hated them too, until I
started folding case.  Then their SCREAMING stopped acting against them.
But they're still using the language of advertisement, and word bigrams
can't help but notice that more strongly than individual words do.

Here from the TOOLS Europe '99 announcement:

prob('more information') = 0.916003
prob('web site') = 0.895518
prob('please write') = 0.99
prob('you wish') = 0.984494
prob('our web') = 0.985578
prob('visit our') = 0.99

Here from the XP2001 - FINAL CALL FOR PAPERS:

prob('web site:') = 0.926174
prob('receive this') = 0.945813
prob('you receive') = 0.987542
prob('most exciting') = 0.99
prob('alberta, canada') = 0.99
prob('e-mail to:') = 0.99

Here from the XP2002 - CALL FOR PRACTITIONER'S REPORTS ('BOM' is an
artificial token I made up for "beginning of message", to give something for
the first word in the message to pair up with):

prob('web site:') = 0.926174
prob('this announcement') = 0.94359
prob('receive this') = 0.945813
prob('forward this') = 0.99
prob('e-mail to:') = 0.99
prob('BOM *****') = 0.99
prob('you receive') = 0.987542

Here from the TOOLS Europe 2000 announcement:

prob('visit the') = 0.96
prob('you receive') = 0.967805
prob('accept our') = 0.99
prob('our apologies') = 0.99
prob('quality and') = 0.99
prob('receive more') = 0.99
prob('asia and') = 0.99

A vanilla f-p showing where bigrams can hurt was a short msg about setting
up a Python user's group.  Bigrams gave it large penalties for phrases like
"fully functional" (most often seen in spams for bootleg software, but here
applied to the proposed user group's web site -- and "web site" is also a
strong spam indicator!).  OTOH, the poster also said "Aahz rocks".  As a
bigram, that neither helped nor hurt (that 2-word phrase is unique in the
corpus); but as an individual word, "Aahz" is a strong non-spam indicator on
c.l.py (and will probably remain so until he starts spamming <wink>).

It did find one spam hiding in a ham corpus:

Newsgroups: comp.lang.python,comp.lang.rexx
Date: Thu, 21 Oct 1999 10:18:52 -0700
Message-ID: <67821AB23987D311ADB100A0241979E5396955@news.ykm.com>
From: znblrn@hetronet.com
Subject: Rudolph The Rednose Hooters Here
Lines: 4
Xref: news comp.lang.python:74468 comp.lang.rexx:31946
To: python-list@python.org

THis IS it: The site where they talk about when you are 50 years old.


there's-no-substitute-for-experiment-except-drugs-ly y'rs  - tim