[Spambayes] Seeking a giant idle machine w/ a miserable corpus
Tim Peters
tim.one@comcast.net
Sat Nov 16 22:17:40 2002
I ran my fat c.l.py test w/ the hash space clamped at 256K buckets. That
was clearly a bad idea for that test, since there are about 330K unique
unigrams in that corpus (let alone bigrams and trigrams).
cv below is the current all-default result on that test data, excepting for
[Tokenizer]
replace_nonascii_chars: True
record_header_absence: True
The # of unsures is lower than I reported before: by staring at the
unsures, I found 10 entirely empty (0 bytes) files in my spam corpus. Those
got replaced with random spam from the reservoir (the empty msgs had scored
as unsure).
All other runs here are on the same data.
tri19 is the hashed trigram gimmick with the hash space boosted to 512K (19
bits of hash code). Contrary to expectations, the Unsure rate actually
increased over the run with 256K buckets. But it still appeared to be due
to unlucky hash collisions.
So tri20 boosted the # of hash buckets to a million. That still didn't
help.
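For readers who haven't seen the hashed n-gram idea, here is a minimal sketch of it. The hash function used in these runs isn't stated above, so zlib.crc32 is only a stand-in, and the name hashed_ngrams is made up for illustration. The point is just that every unigram, bigram and trigram is reduced to a bucket index of `bits` bits, so distinct strings can land in the same bucket.

```python
import zlib

def hashed_ngrams(tokens, bits=19, max_n=3):
    """Yield a bucket index for every unigram, bigram and trigram.

    Hypothetical helper: crc32 stands in for whatever hash the real
    experiment used; bits=19 gives 512K buckets.
    """
    nbuckets = 1 << bits
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            # Two different n-grams can map to the same index; the
            # classifier then can't tell them apart.
            yield zlib.crc32(gram.encode("utf-8")) % nbuckets

tokens = "seeking a giant idle machine".split()
buckets = list(hashed_ngrams(tokens, bits=19))
print(len(buckets), "bucket indices, all below", 1 << 19)
```

Storage is bounded by the number of buckets rather than by the number of distinct n-grams, which is the attraction; the collisions are the price.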
At that point I switched body tokenization strategy: I've long speculated
that split-on-whitespace helped us over alphanumeric-run tokenization
because s-o-w captures a *little* contextual information from the
punctuation, and because it generates highly correlated clues in a way that
*helps* (like "Python" and "Python?" count as distinct words). But if we're
getting context and helpful correlation from bigrams and trigrams too, it
seems plausible that the punctuation context gets in the way. So tri20a is
with a million hash buckets, but tokenizing via re.findall with
[\w$\-\x80-\xff]+
instead of s-o-w. Alas, overall its "best cost" was even worse than
tri19's. s-o-w still rules.
So tri21 went back to s-o-w, but boosted the # of hash buckets to 2 million.
This finally started moving "in the right direction" again, but still loses
to the original unhashed "exact" unigram scheme.
Since I probably have more than a million unique unigrams + bigrams +
trigrams (viewed as text strings) in this data, 2 million hash buckets is
certainly *not* excessive. I expect it would do better with a lot more.
But, even with the hash trickery, at 2M buckets I'm again pushing the limit
of my RAM on the fat test (which trains on more than 30,000 msgs per run).
So pushing this more would require a different database structure. So far
the results aren't good enough to make me keen to pursue it.
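A back-of-the-envelope check on why 2 million buckets is not excessive: assuming an idealized uniform hash, the chance that a given n-gram shares its bucket with at least one other of n distinct n-grams in m buckets is 1 - (1 - 1/m)^(n-1). The sketch below evaluates that for one million distinct n-grams (the lower bound mentioned above) at several bucket counts. Treating "a million" and "2 million" as 2**20 and 2**21 is an assumption, and shared_bucket_fraction is a made-up name.

```python
import math

def shared_bucket_fraction(n_keys, n_buckets):
    """Probability that one key shares its bucket with some other key,
    assuming n_keys distinct keys hashed uniformly and independently."""
    # 1 - (1 - 1/m)**(n - 1), computed stably with log1p/expm1.
    return -math.expm1((n_keys - 1) * math.log1p(-1.0 / n_buckets))

N_KEYS = 1_000_000  # "more than a million" unique n-grams, so a lower bound
for bits in (18, 19, 20, 21, 24):
    frac = shared_bucket_fraction(N_KEYS, 1 << bits)
    print(f"{bits} bits ({1 << bits:>9,} buckets): {frac:.1%} of n-grams collide")
```

Under these assumptions, even at 21 bits more than a third of the n-grams share a bucket with something else, so a lot more buckets would be needed before collisions became rare.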
filename:            cv        tri19        tri20       tri20a        tri21
ham:spam:   20000:14000  20000:14000  20000:14000  20000:14000  20000:14000
fp total:             3            0            0            0            0
fp %:              0.01         0.00         0.00         0.00         0.00
fn total:             0            7            8            3            2
fn %:              0.00         0.05         0.06         0.02         0.01
unsure t:            91          926         1128         1133          854
unsure %:          0.27         2.72         3.32         3.33         2.51
real cost:       $48.20      $192.20      $233.60      $229.60      $172.80
best cost:       $17.80       $36.60       $39.60       $51.00       $38.60
h mean:            0.24         0.30         0.20         0.27         0.38
h sdev:            2.73         2.25         1.93         2.38         3.00
s mean:           99.95        97.44        96.70        96.94        97.89
s sdev:            1.40        10.17        11.61        10.95         9.12
mean diff:        99.71        97.14        96.50        96.67        97.51
k:                24.14         7.82         7.13         7.25         8.05
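The "real cost" row can be reproduced from the fp, fn and unsure counts. The weights below ($10 per false positive, $1 per false negative, $0.20 per unsure) are not stated in this message; they are inferred here because they match all five columns of the table, so treat them as an assumption.

```python
# Assumed per-message costs, inferred from the table rather than quoted.
FP_COST, FN_COST, UNSURE_COST = 10.00, 1.00, 0.20

# (fp total, fn total, unsure total) for each run, copied from the table.
runs = {
    "cv":     (3, 0, 91),
    "tri19":  (0, 7, 926),
    "tri20":  (0, 8, 1128),
    "tri20a": (0, 3, 1133),
    "tri21":  (0, 2, 854),
}

def real_cost(fp, fn, unsure):
    """Total cost of a run under the assumed weights."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST

for name, counts in runs.items():
    print(f"{name:7s} ${real_cost(*counts):.2f}")
```

Under these weights the gap between cv and the hashed runs comes almost entirely from the unsure count.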
The FN under all hashed schemes are mostly long spam in foreign languages,
and *which* of those are judged ham varies across runs (changing the # of
hash buckets, and/or the tokenization strategy, changes the set of
accidental hash collisions). Because they're long they generate lots of
hash codes; because they're foreign languages, the hash codes hit accidental
matches; do that often enough and you're bound to get something that looks
like solid ham. In tri21, the lowest-scoring FN was at 0.01, and happened
to be a long spam in what looks like Polish. Non-hashing schemes are immune
to this (brand new words are ignored, and the header clues dominate the
score, which is usually enough to nail it as spam).
The increase in Unsures appears to be almost entirely due to spam. Here's
the ham score distro (in tri21) near 50:
47.0 0
47.5 0
48.0 1 *
48.5 2 *
49.0 0
49.5 1 *
50.0 1 *
and no ham scored higher than that. The spam score distro near 50:
47.0 2 *
47.5 3 *
48.0 2 *
48.5 3 *
49.0 7 *
49.5 104 *
50.0 343 **
50.5 40 *
51.0 29 *
51.5 20 *
52.0 7 *
52.5 18 *
53.0 18 *
I don't know why that is (well, yes, it's a huge increase in "cancellation
disease" in spams, but I don't know *why* there's a huge cancellation
disease increase for spam but not for ham).
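To show what cancellation disease looks like numerically, here is a sketch of chi-squared combining as I understand it: H and S are computed separately from the clue probabilities and the final score is (S - H + 1) / 2. The code is an illustrative reconstruction, not copied from the classifier, and the clue probabilities are invented. When a message carries many strong clues in both directions, H and S both approach 1 and the score lands at 0.5.

```python
import math

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Return (H, S, score) for a list of per-clue spam probabilities."""
    n = len(probs)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return H, S, (S - H + 1.0) / 2.0

# Invented example: ten strong ham clues and ten strong spam clues.
H, S, score = combine([0.01] * 10 + [0.99] * 10)
print(f"H={H:.4f} S={S:.4f} score={score:.4f}")
```

Accidental hash collisions that hand a spam a batch of strong "ham" clues would push it toward exactly this middle ground.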
The quote of the Nigerian scam spam was the highest-scoring ham, scoring
exactly 0.5, with H=1 and S=1. The H=1 appeared mostly due to extremely
strong ham clues in the headers, the strongest being:
prob('header:Subject:1 noheader:received noheader:x-abuse-info') =
4.88234e-005
Unfortunately, it's impossible to say whether that's "real" or was just a
hash accident. It's pretty clear that this "ham clue" was an accident:
prob('7597133 federal ministry') = 0.0505618
and this even more so:
prob('housing (fmwh) nigeria.') = 0.0238095
The chance of this crap decreases as the # of hash buckets increases, but
increases the more training data you've got too.
better-the-devil-you-know?-ly y'rs - tim