[Spambayes] Seeking a giant idle machine w/ a miserable corpus
Tim Peters
tim.one@comcast.net
Sat Nov 16 04:14:44 2002
Robert Woodhead mentioned an idea for using both unigrams and bigrams that
might help, with a twist to avoid generating highly correlated clues.
Gary Robinson was independently thinking along the same lines, and offline
sketched a more fleshed-out similar scheme for doing this with unigrams,
bigrams and trigrams.
I implemented the latter but in a somewhat "purer" form. A patch for
classifier.py is attached.
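For readers wondering what the scheme amounts to: roughly, every contiguous run of one, two, or three tokens becomes its own feature, each accumulating its own ham/spam counts. A minimal sketch (a hypothetical helper, not the attached classifier.py patch):

```python
def ngram_tokens(words, max_n=3):
    """Yield every 1-gram, 2-gram, ..., max_n-gram of a token sequence.

    Each n-gram is treated as an independent feature with its own
    ham/spam statistics in the classifier's database.
    """
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

# "python video now" yields 6 features, including the bigram
# "python video", which can score very differently from "python".
features = list(ngram_tokens("python video now".split()))
```

Note that a message of W tokens generates about 3*W features, which is why the database blows up (see the memory caution below).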
Now I don't have any data that can show improvements, so whether this might
help beats me. It wasn't a disaster for me, which is saying something,
since previous ideas along these lines were clearly steps backward (as
measured by error rates).
So I need someone who's *not* getting great results now to try it (Anthony?
Skip?). Big caution: this is a memory hog. I don't have enough RAM to run
my full c.l.py test, or even half of it. Here are results from a small-subset
10-fold CV run:
filename:     before        tri
ham:spam:  3000:3000  3000:3000
fp total:          0          0
fp %:           0.00       0.00
fn total:          0          0
fn %:           0.00       0.00
unsure t:         26         42
unsure %:       0.43       0.70
real cost:     $5.20      $8.40
best cost:     $0.00      $0.00
h mean:         0.37       0.50
h sdev:         3.07       3.77
s mean:        99.92      99.87
s sdev:         1.49       2.06
mean diff:     99.55      99.37
k:             21.83      17.04
Judging from the error rates, it's got nothing going for it or against it.
Why it *might* help: while "Python" is a very strong ham word in my tests,
"Python Video" is a porn vendor, and this scheme should reliably know the
difference. Etc. My data isn't hard enough for it to matter.
If this really helps someone, a number of follow-ups suggest themselves: cut
it off at bigrams instead; boost it to 4-grams instead; and, if more than
bigrams are needed for it to help, buy into some hashing scheme to make the
database burden finite again.
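The hashing idea in that last point can be sketched like so (the bucket count and the use of crc32 are illustrative assumptions, not anything in the patch): hash each n-gram into a fixed number of buckets, so the database size is bounded no matter how many distinct n-grams the corpus produces.

```python
import zlib

NUM_BUCKETS = 2 ** 20  # fixed database size; a tunable assumption

def hashed_feature(ngram):
    """Map an n-gram string to one of NUM_BUCKETS bucket ids.

    Distinct n-grams may collide in a bucket, trading a little
    accuracy for a hard bound on database growth.
    """
    return zlib.crc32(ngram.encode("utf-8")) % NUM_BUCKETS
```

Collisions mean two unrelated n-grams share one set of ham/spam counts, so this only makes sense if the accuracy win from longer n-grams outweighs the collision noise.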
As I saw before with pure bigrams, conference announcements once again move
into high-scoring territory, but not nearly as badly. For example, the OSCON
2000 announcement got penalized for
prob('electronic mail') = 0.969799
prob('and companies') = 0.973373
prob('last name:') = 0.973373
prob('the completion') = 0.973373
prob('individuals who') = 0.976644
prob('cutting edge') = 0.978469
prob('fax the') = 0.978469
prob('target audience') = 0.978469
prob('the subject line') = 0.980893
prob('send all') = 0.981928
prob('not accepted.') = 0.991493
prob('with marketing') = 0.992611
prob('your email') = 0.996391
prob('will receive') = 0.997
but also got helped by
prob('note that the') = 0.0145631
prob('the call') = 0.0167286
prob('the tutorial') = 0.0167286
prob('problems that') = 0.0302013
prob('the open source') = 0.0348837
prob('tutorial and') = 0.0412844
prob('sent via') = 0.0608351
prob('other open') = 0.0652174
prob('proposals for') = 0.0652174
prob('text with') = 0.0652174
prob('the convention') = 0.0652174
prob('with open') = 0.0652174
prob('and open') = 0.0918367
prob('convention the') = 0.0918367
prob('for programmers,') = 0.0918367
prob('itself and the') = 0.0918367
prob('open source software') = 0.0918367
prob('source software') = 0.0918367
prob('that leads') = 0.0918367
prob('wide variety') = 0.0918367
In the end, it was highly ambiguous, with
prob = 0.500000084396
prob('*H*') = 1
prob('*S*') = 1
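For readers puzzled by those last three lines: *S* and *H* are the independent "spammy" and "hammy" indicators from Gary Robinson's chi-squared combining, and the final score is (S - H + 1)/2, which lands at 0.5 when both indicators fire at full strength. A self-contained sketch of that combining step (the clue probabilities here would come from the database; this is not the patched classifier.py itself):

```python
from math import exp, log

def chi2Q(x2, v):
    """Return prob(chi-squared >= x2) for v degrees of freedom (v even)."""
    m = x2 / 2.0
    term = exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-clue spam probabilities (each strictly in (0, 1)).

    S is the spammy indicator, H the hammy indicator; a message that
    is strongly both comes out near (1 - 1 + 1)/2 = 0.5, i.e. unsure.
    """
    n = len(probs)
    S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)
    return S, H, (S - H + 1.0) / 2.0
```

A message with equally strong spam and ham clues, like the OSCON announcement above, scores right at 0.5 even though both indicators are saturated.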