[Spambayes] Matt Sergeant: Introduction
Matt Sergeant
msergeant@startechgroup.co.uk
Wed, 02 Oct 2002 09:26:41 +0100
Tim Peters wrote:
> [Matt Sergeant]
>
>>...
>>And to give back I'll tell you that one of my biggest wins was parsing
>>HTML (with HTML::Parser - a C implementation so it's very fast) and
>>tokenising all attributes, so I get:
>>
>> colspan=2
>> face=Arial, Helvetica, sans-serif
>>
>>as tokens. Plus using a proper HTML parser I get to parse HTML comments
>>too (which is a win).
>
>
> Matt, what are you using as test data? The experience here has been that
> HTML is sooooo strongly correlated with spam that we've added gimmick after
> gimmick to remove evidence that HTML ever existed; else the rare ham that
> uses HTML-- or even *discusses* HTML with a few examples! --had an extremely
> hard time avoiding getting classified as spam.
We have a live feed from one of our towers. You have to be careful to
classify only HTML that is actually going to be rendered as HTML by the
client (i.e. Content-Type: text/html, or the whole message body is HTML,
a heuristic Outlook seems to use, which is infuriating).
Because it's a live feed, we get all sorts of HTML newsletters in
there, so only genuinely spammy indicators get noticed, rather than HTML
being a generic catch-all. The point is that we see more HTML
newsletters than we see HTML spam ;-)
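The "will the client actually render this as HTML" check above might look
something like the following sketch (Python stand-in; the helper names and
the tag list are my own, not what our production code does):

```python
# Sketch of the rendering check described above: treat a message as HTML
# only if it declares text/html, or if the whole body is HTML (the
# Outlook-style heuristic). looks_like_html is an assumed helper.

import re

def looks_like_html(body: str) -> bool:
    """Guess whether the entire body is HTML, i.e. it opens with a tag."""
    return bool(re.match(r"\s*<(!doctype|html|body|div|table|p)\b", body, re.I))

def will_render_as_html(content_type: str, body: str) -> bool:
    # An explicit text/html declaration wins...
    if content_type.strip().lower().startswith("text/html"):
        return True
    # ...otherwise fall back to the "whole thing is HTML" guess.
    return looks_like_html(body)
```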
> Do you find, for example, that
>
> colspan=2
>
> is common in HTML ham but rare in HTML spam, or vice versa?
select * from words where word = 'colspan=2';
word | goodcount | badcount
-----------+-----------+----------
colspan=2 | 3950 | 4197
Hmm, I guess colspan=2 wasn't a good example <grin>.
> I'm wondering what's sparing
> you from that fate.
I suspect it's just the corpus.
>>but increases the database size and number of tokens you have to
>>pull from the database enormously.
>
>
> That was also our experience with word bigrams, but less than "enormously";
> about a factor of 2; character 5-grams were snuggling up to enormously.
I think for me it was more a matter of hitting the limits of the
performance I could expect from PostgreSQL. Expecting 10,000 SELECTs to
come back in anything like a reasonable timeframe was a bit much to ask ;-)
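One obvious way around per-token SELECT overhead (a sketch of the general
technique, not what my code actually does) is to fetch all of a message's
token counts in a single round trip with an IN clause. sqlite3 stands in
for PostgreSQL here; the `words` table layout matches the snippet above:

```python
# Sketch: one batched query for all tokens instead of one SELECT each.

import sqlite3

def fetch_counts(conn, tokens):
    placeholders = ",".join("?" * len(tokens))
    rows = conn.execute(
        f"SELECT word, goodcount, badcount FROM words"
        f" WHERE word IN ({placeholders})",
        list(tokens),
    )
    counts = {word: (good, bad) for word, good, bad in rows}
    # Tokens the database has never seen come back as (0, 0).
    return {t: counts.get(t, (0, 0)) for t in tokens}
```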
>>Well I very quickly found out that most of the academic research into
>>this has been pretty bogus. For example everyone seems (seemed?) to
>>think that stemming was a big win, but I found it to lose every time.
>
> We haven't tried that. OTOH, the academic research has been on Bayesian
> classifiers, and this isn't one (despite that Paul called it one).
True, but my original classifier was a naive Bayesian one.
>>The one thing that still bothers me about Gary's method is that
>>the threshold value varies depending on corpus. Though I expect there's
>>some mileage in being able to say that the middle ground is "unknown".
>
>
> It does allow for an easy, gradual, and effective way to favor f-n at the
> expense of f-p, or vice versa. There was no such possibility under Paul's
> scheme, as the more training data we fed in, the rarer it was for *any*
> score not to be extremely close to 0 or extremely close to 1, and regardless
> of whether the classification was right or wrong. Gary's method hasn't been
> caught in such extreme embarrassment yet.
>
> OTOH, it *is*, as you say, corpus dependent, and it seems hard to get that
> across to people. Gary has said he knows of ways to make the distinction
> sharper, but we haven't yet been able to provoke him into revealing them
> <wink>. The central limit variations, and especially the logarithmic one,
> are much more extreme this way.
Is that central_limit_2 as you call it?
>>On my personal email I was seeing about 5 FP's in 4000, and about 20
>>FN's in about the same number (can't find the exact figures right now).
>
> So to match the units and order of the next sentence, about 0.5% FN rate and
> 0.13% FP rate.
>
>>On a live feed of customer email we're seeing about 4% FN's and 2% FP's.
>
> Is that across hundreds of thousands of users?
It's just on one particular email tower, so around a few thousand users, I think.
> Do you know the
> corresponding statistics for SpamAssassin? For python.org use, I've thought
> that as long as we could keep this scheme fast, it may be a good way to
> reduce the SpamAssassin load.
I don't keep stats for SpamAssassin - we don't use it "pure", so it
wouldn't be worth it. FWIW, I'm working on making SpamAssassin 3
significantly faster (roughly 50x) by using a decision tree rather
than a linear scan of all the rules. I think for your purposes (the
python.org mailing lists) there's probably a lot of mileage in running
spambayes first, then, if spambayes is unsure (say between .40 and .60),
running the email through SpamAssassin (but with the threshold set to 7).
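The staged setup I'm suggesting could be sketched like this (the two
scorer callables are stand-ins for the real classifiers; the band and
threshold values are the ones mentioned above):

```python
# Sketch of the two-stage filter: trust spambayes when it's confident,
# and only pay SpamAssassin's cost for the unsure middle band.

def classify(msg, score_with_spambayes, score_with_spamassassin,
             unsure_lo=0.40, unsure_hi=0.60, sa_threshold=7.0):
    score = score_with_spambayes(msg)
    if score < unsure_lo:
        return "ham"
    if score > unsure_hi:
        return "spam"
    # Spambayes is unsure: fall through to the slower rule-based scorer,
    # with the threshold raised to 7 (SpamAssassin's default is 5).
    return "spam" if score_with_spamassassin(msg) >= sa_threshold else "ham"
```

The point of raising the SpamAssassin threshold is that everything
reaching it has already been flagged as ambiguous, so you want stronger
evidence before calling it spam.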
Matt.