[Spambayes] Seeking a giant idle machine w/ a miserable corpus

Tim Peters <tim.one@comcast.net>
Sat Nov 16 04:14:44 2002


Robert Woodhead mentioned an idea for using both unigrams and bigrams that
might help, with a twist to avoid generating highly correlated clues.

Gary Robinson was independently thinking along the same lines, and offline
sketched a similar but more fleshed-out scheme for doing this with unigrams,
bigrams and trigrams.

I implemented the latter, but in a somewhat "purer" form.  A patch for
classifier.py is attached.
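
For readers without the patch handy, here's a minimal sketch of the
token-generation half of such a scheme (the helper name ngram_tokens is
mine, not from the patch, and the patch's twist for avoiding highly
correlated clues isn't reproduced here):

    def ngram_tokens(words, max_n=3):
        # Yield every unigram, bigram and trigram over a word list.
        # Multi-word tokens are space-joined, which matches the clue
        # spellings shown below (e.g. 'electronic mail').
        for i in range(len(words)):
            for n in range(1, max_n + 1):
                if i + n <= len(words):
                    yield ' '.join(words[i:i + n])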

Note that I don't have any data that can show an improvement, so whether
this might help beats me.  It wasn't a disaster on my data, which is saying
something, since previous ideas along these lines were clear steps backward
(as measured by error rates).

So I need someone who's *not* getting great results now to try it (Anthony?
Skip?).  Big caution:  this is a memory hog.  I don't have enough RAM to run
my full c.l.py test, or even half of it.  Here are results from a
small-subset 10-fold CV run:

filename:   before     tri
ham:spam:  3000:3000   3000:3000
fp total:        0       0
fp %:         0.00    0.00
fn total:        0       0
fn %:         0.00    0.00
unsure t:       26      42
unsure %:     0.43    0.70
real cost:   $5.20   $8.40
best cost:   $0.00   $0.00
h mean:       0.37    0.50
h sdev:       3.07    3.77
s mean:      99.92   99.87
s sdev:       1.49    2.06
mean diff:   99.55   99.37
k:           21.83   17.04

Judging from the error rates, it's got nothing going for it or against it.
Why it *might* help:  while "Python" is a very strong ham word in my tests,
"Python Video" is a porn vendor, and this scheme should reliably know the
difference.  Etc.  My data isn't hard enough for it to matter.
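
For instance, feeding the sketch above a message fragment (restricted to
bigrams for brevity):

    >>> list(ngram_tokens('check out Python Video'.split(), max_n=2))
    ['check', 'check out', 'out', 'out Python', 'Python', 'Python Video', 'Video']

This way 'Python' can keep its strong ham probability while the
'Python Video' bigram independently earns a strong spam probability.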

If this really helps someone, then a number of follow-ups suggest
themselves:  try cutting it off at bigrams instead; try boosting it to
4-grams; and, if more than bigrams are needed for it to help, buy into some
hashing scheme to make the database burden finite again (a sketch of what I
mean follows).
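
By a hashing scheme I mean something along these lines:  fold every token
into a fixed number of buckets, so the database size is capped no matter how
many distinct n-grams the corpus generates.  A minimal sketch (the bucket
count is an arbitrary choice here):

    import zlib

    NUM_BUCKETS = 2 ** 20  # arbitrary; trades collisions against memory

    def hash_token(token):
        # Map a token (possibly a multi-word n-gram) to a bucket id.
        # Distinct tokens can collide and then share one database
        # record, conflating their statistics -- that's the price of
        # keeping the database bounded.
        return zlib.crc32(token.encode('utf-8')) % NUM_BUCKETS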

As I saw before with pure bigrams, conference announcements once again move
into high-scoring territory, though not nearly so badly.  For example, the
OSCON 2000 announcement got penalized for

prob('electronic mail') = 0.969799
prob('and companies') = 0.973373
prob('last name:') = 0.973373
prob('the completion') = 0.973373
prob('individuals who') = 0.976644
prob('cutting edge') = 0.978469
prob('fax the') = 0.978469
prob('target audience') = 0.978469
prob('the subject line') = 0.980893
prob('send all') = 0.981928
prob('not accepted.') = 0.991493
prob('with marketing') = 0.992611
prob('your email') = 0.996391
prob('will receive') = 0.997

but also got helped by

prob('note that the') = 0.0145631
prob('the call') = 0.0167286
prob('the tutorial') = 0.0167286
prob('problems that') = 0.0302013
prob('the open source') = 0.0348837
prob('tutorial and') = 0.0412844
prob('sent via') = 0.0608351
prob('other open') = 0.0652174
prob('proposals for') = 0.0652174
prob('text with') = 0.0652174
prob('the convention') = 0.0652174
prob('with open') = 0.0652174
prob('and open') = 0.0918367
prob('convention the') = 0.0918367
prob('for programmers,') = 0.0918367
prob('itself and the') = 0.0918367
prob('open source software') = 0.0918367
prob('source software') = 0.0918367
prob('that leads') = 0.0918367
prob('wide variety') = 0.0918367

In the end, it was highly ambiguous, with

prob = 0.500000084396
prob('*H*') = 1
prob('*S*') = 1
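
For anyone puzzled by those last two clues:  *S* and *H* are the
chi-squared combined spam and ham indicators, and the final score is
(S - H + 1)/2, so when strong evidence points in both directions and both
indicators saturate at 1, the message lands squarely on the 0.5 fence.  A
condensed sketch of that combining step (classifier.py's underflow
protection is omitted, and the clue probabilities are assumed to lie
strictly inside (0, 1)):

    from math import exp, log

    def chi2Q(x2, v):
        # prob(chisq >= x2) with v degrees of freedom, v even.
        m = x2 / 2.0
        total = term = exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def combine(probs):
        # Chi-squared combining of the individual clue probabilities.
        n = len(probs)
        if not n:
            return 0.5
        S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
        H = 1.0 - chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)
        # S and H each approach 1 as the evidence for (respectively)
        # spam and ham strengthens; strong evidence both ways yields a
        # score near 0.5.
        return (S - H + 1.0) / 2.0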



