[Spambayes] Introduction to list: Bill Yerazunis

Tim Peters tim_one@email.msn.com
Wed Nov 27 07:20:35 2002


[Bill Yerazunis]
> I should post an introduction for myself.
>
> I'm Bill Yerazunis, and I'm doing spamfiltering.

Hi, Bill -- nice to see you!

> Robert Woodhead and Paul Graham sent me.
>
> I wrote CRM114 (which hashes phrases as "features" and does Bayesian
> chain-rule evaluation), it seems to work well for me but I hear
> that some folks here had big problems with it.

I ran a number of experiments inspired by CRM114 after Gary Robinson asked
me to take a look, but have not used your original software, and don't
recall any other reports about it on this list.

The experimental results weren't competitive with the code we've got now,
but there could be any number of reasons for that.  The chief suspect in my
eyes was that my main test trains on over 30K msgs per run, generating more
than 320K unique tokens, and multiplying that by 16 leaves one hell of a lot
of hash codes to slam into 1M buckets.  I expect, but don't know, that you
must train on a lot less data.
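
As a back-of-envelope illustration of that suspicion, here is a minimal
sketch. It assumes hashing is perfectly uniform and that each of the 320K
unique tokens really spawns 16 distinct features, which is an upper bound and
not a measurement. The function name is made up for the example, and "1M" is
taken as exactly 1,000,000.

```python
import math

def bucket_stats(n_features, n_buckets):
    """Expected load and occupancy when n_features distinct features are
    hashed uniformly at random into n_buckets (an idealized assumption)."""
    load = n_features / n_buckets
    # Chance a given bucket receives no feature at all (Poisson approximation).
    empty = math.exp(-load)
    # Chance a given bucket is shared by two or more features.
    shared = 1.0 - empty - load * empty
    return load, empty, shared

unique_tokens = 320_000       # "more than 320K unique tokens"
features_per_token = 16       # assumed upper bound on the blow-up
for buckets in (1_000_000, 2_000_000):
    load, empty, shared = bucket_stats(unique_tokens * features_per_token, buckets)
    print(f"{buckets:>9} buckets: load {load:.2f}, "
          f"empty {empty:.1%}, shared by 2+ features {shared:.1%}")
```

Under these assumptions the average load is 5.12 features per bucket with 1M
buckets, so nearly every bucket mixes counts from unrelated features.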

Other variants we tried included cutting back from subsets of 5-grams to
subsets of 3-grams; boosting the # of buckets to 2M; doing exact
(non-hashing) accounting for subsets of 3-grams and 2-grams; using less
training data; and using (what's called here) chi-combining of spamprobs
instead of Bayesianish combining.  Of those, the ones that helped most were
avoiding hashing, and using chi-combining.  They didn't help enough to
justify switching directions here (they didn't get as good as what we
already had).
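
For readers who haven't seen where the factor of 16 comes from, here is a
minimal sketch of one way to get it: slide a 5-word window over the text and
emit every subset of the window that keeps the newest word. This is an
assumption about the general shape of the idea, not CRM114's actual code, and
`window_features` is a hypothetical name. Windows shorter than n at the start
of the text are simply skipped here.

```python
from itertools import combinations

def window_features(words, n=5, skip="<skip>"):
    """Yield sparse-phrase features: for each position, every subset of the
    n-word window ending there that keeps the final word. Dropped words are
    replaced by a placeholder so word positions stay distinguishable."""
    for end in range(n - 1, len(words)):
        window = words[end - n + 1 : end + 1]
        earlier = range(n - 1)            # indices of words that may be dropped
        for k in range(n):                # how many earlier words to keep
            for keep in combinations(earlier, k):
                kept = set(keep)
                yield tuple(w if i in kept or i == n - 1 else skip
                            for i, w in enumerate(window))

feats = list(window_features("click here for free money".split()))
print(len(feats))   # 2**(5-1) subsets of the one full window
```

Calling it with n=3 gives 4 features per window, which is the "subsets of
3-grams" cutback mentioned above.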

A variant suggested by Gary appeared to work *as well* as what we do now,
focused on finding high-value non-overlapping multi-word phrases.  Not
enough people tested that to say for sure, and based on the test results we
got there wasn't a good case to be made that it was better or worse than our
default scheme -- it looked like a wash, based on error rates.  But it was
more expensive and required a bigger database, so nobody pursued it.

I expect it's impossible to compare schemes convincingly without a shared
test set.  We've got people here with very easy data, and with
excruciatingly difficult data, but for the most part only the spam is
sharable.  My main test turned out to be on the easy end, and the ham in
that test consists of 20,000 msgs taken at random from a public archive of
comp.lang.python mailing-list traffic.  In theory, anyone could use that
ham, but the spam has to be taken from a different source, and that creates
all sorts of problems of its own (there are too many clues in the headers
about the source of msgs to avoid getting great results for bad reasons,
unless great care is taken to blind the classifier to such clues).

The best things I saw in the CRM114-like approaches are that they learned
very quickly, and that the hashing versions had bounded database size.  The
worst thing I saw is that "naive Bayesian" prob combining relies on an
assumption of word independence, and generating ~16 "words" per input word
violates that assumption massively.  So when the scheme is wrong, it's
spectacularly wrong, giving "a probability" closer to 0 or 1 than the
chance that the universe will vanish within the next nanosecond <wink>.
chi-combining is a good way to sidestep that outcome, but the extreme
cross-word correlation violates its theoretical underpinnings too.  In the
hashing versions, unfortunate collisions caused some false positives that
were simply outrageous to human eyes.
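
The difference between the two combining rules can be sketched as follows.
This is a toy under stated assumptions, not the project's code: per-clue
spamprobs are assumed to lie strictly between 0 and 1, the chi-squared
combining follows Fisher's method for even degrees of freedom, and all
function names are made up for the example.

```python
import math

def naive_combine(probs):
    """Bayesian-style product combining; treats every clue as independent."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)

def chi2q(x2, dof):
    """Upper tail of the chi-squared distribution; dof must be even."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Fisher-style combining: measure the evidence for spam and for ham
    separately, then report where the two measures disagree."""
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0

# 16 strongly spammy clues (say, one word blown up into 16 correlated
# features) against 12 strongly hammy ones.
mixed = [0.99] * 16 + [0.01] * 12
print("naive:", naive_combine(mixed))
print("chi:  ", chi_combine(mixed))
```

With strong clues on both sides, the product rule lands within a hair of 1.0
while the chi-squared rule stays close to 0.5, reflecting that there is heavy
evidence in each direction.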

What we've found so far is that unigrams produced by gonzo tokenization (we
tokenize different things in different ways) learn slower than some other
approaches, but that as the # of training msgs increases, it hasn't yet been
possible to beat them.
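
To give a flavor of what "tokenizing different things in different ways" can
mean, here is a toy sketch. It is purely illustrative and not the project's
tokenizer: the function name, the tag prefixes, and the 12-character cutoff
are all assumptions made for the example.

```python
import re

def toy_tokenize(subject, body):
    """Yield unigram tokens, tagging each with where it came from so that
    'free' in a Subject line and 'free' in the body are separate clues."""
    for word in subject.lower().split():
        yield "subject:" + word
    for word in body.lower().split():
        if re.match(r"https?://", word):
            # Break URLs into pieces instead of keeping one rare token.
            for piece in re.split(r"[/:.?&=]+", word):
                if piece:
                    yield "url:" + piece
        elif len(word) > 12:
            # Summarize very long "words" instead of storing them verbatim.
            yield f"skip:{word[0]} {len(word) // 10 * 10}"
        else:
            yield word

for tok in toy_tokenize("Free money", "visit http://example.com/buy now"):
    print(tok)
```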

On my main test with 20,000 ham and 14,000 spam, our unigram scheme
currently has no FN, 3 FP, and 93 unsure.  The latter are msgs where
chi-combining can't decide whether a thing is ham or spam:  the amount of
evidence in each direction appears about the same.  One of the FP is a quote
of an entire Nigerian scam spam, with a one-line comment at the start like
"Ah, jeez, here's another Nigerian wire scam -- this one has been around for
20 years".  It would be an FP under CRM114 too, unless CRM114 is broken
<wink>.  Another consists of the one-word msg "subscribe", followed by an ad
for the web-based email system the poster used to send the msg.  The third
is a brief on-topic question followed by a long and obnoxious
employer-generated sig, talking about how they're a regulated investment
company, that the info therein is confidential, visit their website for more
info, etc.  All three are indeed ham to human eyes, but statistically
they're indistinguishable from spam (and I don't care what statistical
gimmick is used to analyze them -- the ham content in each is tiny compared
to the advertising/scam content).

"Unsures" are harder to characterize.  Things that often end up there
include:

+ Conference announcements (and it's often hard for people to decide
  which of these are ham and which spam!).

+ Long tech email in mixed languages (e.g., the Russian parts get scored
  as spammy because there's a lot more Russian in the spam than in the
  ham).

+ Long, chatty, "just folks" spam, written as if by a friend.  This is
  still blessedly rare.

Having stared at these for a couple of months now, I'm convinced no
statistical scheme is going to classify them reliably and correctly.  It's a
remarkable property of chi-combining that it's good at getting confused
about the msgs our human testers have found ambiguous.  When a msg gets
scored as unsure, people are usually sympathetic ("hmm, ya, that *is* an odd
msg!").  Porn spam and Korean spam never score unsure <wink>.

Rhetorical question:  are you able to share your test data?  There are a
number of sub-1% (error rate) schemes kicking around now, and no clear way
to compare them.  Indeed, even for a single scheme, when the error rates get
so low it's darned hard to say for sure whether a change is an improvement
or just a statistical glitch.  One thing that's helped this project a lot is
having multiple testers with different data, and a shared testing framework.
We can't share our data, but people aren't shy about sharing bad results
<wink>.
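
To see how hard it is to tell an improvement from a glitch at these error
rates, here is a small sketch using the 3 FP in 20,000 ham figure from above.
It assumes each ham msg is an independent trial with the same FP chance, which
real mail certainly violates to some degree; the function names are made up
for the example.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k errors in n independent trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

def prob_at_most(k, n, p):
    """Probability of k or fewer errors in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

n_ham = 20_000
fp_rate = 3 / n_ham      # suppose the observed 3 FP reflects the true rate
for k in (0, 1, 3, 5):
    print(f"P(at most {k} FP in {n_ham} ham) = "
          f"{prob_at_most(k, n_ham, fp_rate):.3f}")
```

Under these assumptions, a rerun on fresh data of the same kind would show 1
or fewer FP roughly one time in five with no change to the classifier at all,
so a drop from 3 FP to 1 FP is weak evidence by itself.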



