[Spambayes] Matt Sergeant: Introduction

Matt Sergeant msergeant@startechgroup.co.uk
Wed, 02 Oct 2002 09:26:41 +0100


Tim Peters wrote:
> [Matt Sergeant]
> 
>>...
>>And to give back I'll tell you that one of my biggest wins was parsing
>>HTML (with HTML::Parser - a C implementation so it's very fast) and
>>tokenising all attributes, so I get:
>>
>>   colspan=2
>>   face=Arial, Helvetica, sans-serif
>>
>>as tokens. Plus using a proper HTML parser I get to parse HTML comments
>>too (which is a win).
> 
> 
> Matt, what are you using as test data?  The experience here has been that
> HTML is sooooo strongly correlated with spam that we've added gimmick after
> gimmick to remove evidence that HTML ever existed; else the rare ham that
> uses HTML-- or even *discusses* HTML with a few examples! --had an extremely
> hard time avoiding getting classified as spam.

We have a live feed from one of our towers. You have to be careful to
classify only HTML that is actually going to be rendered as HTML by the
client (i.e. content-type: text/html, or the whole message body is
HTML -- a heuristic Outlook seems to use, which is infuriating).
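
For the curious, the check amounts to something like this (a sketch in
Python rather than the Perl we actually run; the "body is all markup"
test is just my guess at what Outlook does):

    import email

    def renders_as_html(raw_message):
        # Guess whether a mail client will render this as HTML.
        msg = email.message_from_string(raw_message)
        # The declared case: any text/html part.
        if any(part.get_content_type() == 'text/html'
               for part in msg.walk()):
            return True
        # The Outlook-ish case: a non-multipart body that is
        # nothing but markup gets rendered as HTML anyway.
        body = msg.get_payload()
        if isinstance(body, str):
            stripped = body.lstrip().lower()
            return (stripped.startswith('<html') or
                    stripped.startswith('<!doctype'))
        return False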

Because it's a live feed, we get all sorts of HTML newsletters in
there, so only genuinely spammy indicators get noticed, rather than
HTML being a generic catch-all. I guess the point is that we see more
HTML newsletters than we see HTML spam ;-)
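
The tokenisation itself maps directly onto Python's standard HTMLParser
if you want to experiment with it (a sketch, not our production code,
which is Perl with HTML::Parser):

    from html.parser import HTMLParser

    class AttributeTokenizer(HTMLParser):
        def __init__(self):
            super().__init__()
            self.tokens = []

        def handle_starttag(self, tag, attrs):
            # Every attribute becomes a "name=value" token.
            for name, value in attrs:
                self.tokens.append(
                    '%s=%s' % (name, value) if value is not None
                    else name)

        def handle_comment(self, data):
            # Comments get tokenised too (the win mentioned above).
            self.tokens.append('comment:%s' % data.strip())

        def handle_data(self, data):
            self.tokens.extend(data.split())

    p = AttributeTokenizer()
    p.feed('<td colspan=2><font face="Arial, Helvetica, '
           'sans-serif">hi</font></td>')
    print(p.tokens)
    # ['colspan=2', 'face=Arial, Helvetica, sans-serif', 'hi']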

> Do you find, for example, that
> 
>     colspan=2
> 
> is common in HTML ham but rare in HTML spam, or vice versa?

select * from words where word = 'colspan=2';
    word    | goodcount | badcount
------------+-----------+----------
  colspan=2 |      3950 |     4197

Hmm, I guess colspan=2 wasn't a good example <grin>.

> I'm wondering what's sparing
> you from that fate.

I suspect it's just the corpus.

>>but increases the database size and number of tokens you have to
>>pull from the database enormously.
> 
> 
> That was also our experience with word bigrams, but less than "enormously";
> about a factor of 2; character 5-grams were snuggling up to enormously.

I think for me it was more a matter of hitting the limits of the
performance I could expect from PostgreSQL. Expecting 10,000 selects
to come back in anything like a reasonable timeframe was a bit much to
ask ;-)
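
(The obvious mitigation, for anyone hitting the same wall, is to batch
the lookups into one round trip. A sketch against the "words" table
from the example above, assuming a DB-API driver like psycopg that
uses %s placeholders:)

    def fetch_counts(conn, tokens):
        # One select for all tokens instead of one select per token.
        tokens = list(tokens)
        cur = conn.cursor()
        placeholders = ', '.join(['%s'] * len(tokens))
        cur.execute('select word, goodcount, badcount from words'
                    ' where word in (%s)' % placeholders, tokens)
        return dict((word, (good, bad))
                    for word, good, bad in cur.fetchall())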

>>Well I very quickly found out that most of the academic research into
>>this has been pretty bogus. For example everyone seems (seemed?) to
>>think that stemming was a big win, but I found it to lose every time.
> 
> We haven't tried that.  OTOH, the academic research has been on Bayesian
> classifiers, and this isn't one (despite that Paul called it one).

True, but my original classifier was a (naive) Bayesian one.

>>The one thing that still bothers me still about Gary's method is that
>>the threshold value varies depending on corpus. Though I expect there's
>>some mileage in being able to say that the middle ground is "unknown".
> 
> 
> It does allow for an easy, gradual, and effective way to favor f-n at the
> expense of f-p, or vice versa.  There was no such possibility under Paul's
> scheme, as the more training data we fed in, the rarer it was for *any*
> score not to be extremely close to 0 or extremely close to 1, and regardless
> of whether the classification was right or wrong.  Gary's method hasn't been
> caught in such extreme embarrassment yet.
> 
> OTOH, it *is*, as you say, corpus dependent, and it seems hard to get that
> across to people.  Gary has said he knows of ways to make the distinction
> sharper, but we haven't yet been able to provoke him into revealing them
> <wink>.  The central limit variations, and especially the logarithmic one,
> are much more extreme this way.

Is that central_limit_2 as you call it?
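
(For anyone following along, the corpus-dependent thresholding we're
discussing boils down to the sketch below; the cutoff values here are
invented, and moving them is how you trade FPs against FNs:)

    HAM_CUTOFF = 0.20     # made-up values -- tune per corpus
    SPAM_CUTOFF = 0.90

    def classify(score):
        if score < HAM_CUTOFF:
            return 'ham'
        if score > SPAM_CUTOFF:
            return 'spam'
        return 'unsure'   # the middle ground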

>>On my personal email I was seeing about 5 FP's in 4000, and about 20
>>FN's in about the same number (can't find the exact figures right now).
> 
> So to match the units and order of the next sentence, about 0.5% FN rate and
> 0.13% FP rate.
> 
>>On a live feed of customer email we're seeing about 4% FN's and 2% FP's.
> 
> Is that across hundreds of thousands of users?

It's just on one particular email tower, so a few thousand users, I think.

> Do you know the
> corresponding statistics for SpamAssassin?  For python.org use, I've thought
> that as long as we could keep this scheme fast, it may be a good way to
> reduce the SpamAssassin load.

I don't keep stats for SpamAssassin - we don't use it "pure", so it
wouldn't be worth it. FWIW, I'm working on making SpamAssassin 3
significantly faster (about 50x) by using a decision tree rather than
a linear scan of all the rules. I think for your purposes (the
python.org mailing lists) there's probably a lot of mileage in running
spambayes first, then, if spambayes is unsure (say between .40 and
.60), running the email through spamassassin (but with the threshold
set to 7).
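
Concretely, something like this (a sketch; the two scorer callables
stand in for the real spambayes and SpamAssassin interfaces):

    UNSURE_LO, UNSURE_HI = 0.40, 0.60
    SA_THRESHOLD = 7      # vs. SpamAssassin's default of 5

    def is_spam(message, spambayes_score, spamassassin_score):
        # spambayes_score: callable giving a 0.0-1.0 spam probability
        # spamassassin_score: callable giving a SpamAssassin score
        score = spambayes_score(message)
        if score < UNSURE_LO:
            return False              # spambayes is sure it's ham
        if score > UNSURE_HI:
            return True               # spambayes is sure it's spam
        # Only the unsure middle pays for a full SpamAssassin run.
        return spamassassin_score(message) >= SA_THRESHOLD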

Matt.