[Spambayes] Matt Sergeant: Introduction

Matt Sergeant msergeant@startechgroup.co.uk
Tue, 01 Oct 2002 10:18:13 +0100


Tim Peters wrote:
> [Matt Sergeant]
> 
> Thanks for the introduction, Matt!  Welcome.
> 
> 
>>...
>>Like you all, I discovered very quickly that it's the tokenisation
>>techniques that are the biggest "win" when it comes down to it.
> 
> The first thing I tried after implementing Graham's scheme was special
> tokenization and tagging of embedded http/https/ftp thingies.

Consider that adopted ;-)
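
For concreteness, here's roughly what I take that to mean, as a Python 
sketch (the token names and the regex are made up, not your actual 
scheme):

    import re

    URL_RE = re.compile(r'\b(https?|ftp)://([^\s/<>"]+)(/\S*)?', re.I)

    def url_tokens(text):
        # Pull embedded http/https/ftp thingies out of the body and
        # tag their pieces as dedicated tokens.
        toks = []
        for m in URL_RE.finditer(text):
            toks.append('url:scheme:' + m.group(1).lower())
            toks.append('url:host:' + m.group(2).lower())
        return toks

    # url_tokens('click http://WWW.Example.COM/buy now')
    # -> ['url:scheme:http', 'url:host:www.example.com']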

And to give back, I'll tell you that one of my biggest wins was parsing 
HTML (with HTML::Parser - a C implementation, so it's very fast) and 
tokenising all attributes, so I get:

   colspan=2
   face=Arial, Helvetica, sans-serif

as tokens. Plus, using a proper HTML parser, I get to parse HTML 
comments too (which is a win).
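
In Python terms the equivalent would look something like this (a 
sketch built on the stdlib HTMLParser; my real code is Perl on top of 
HTML::Parser):

    from html.parser import HTMLParser

    class AttrTokenizer(HTMLParser):
        # Sketch: one token per tag attribute, plus comment bodies.
        def __init__(self):
            super().__init__()
            self.tokens = []

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                # e.g. 'colspan=2', 'face=Arial, Helvetica, sans-serif'
                self.tokens.append('%s=%s' % (name, value))

        def handle_comment(self, data):
            # A proper parser sees comment text too, so spammers
            # can't hide tricks inside <!-- ... -->.
            self.tokens.append('comment:' + data.strip())

    p = AttrTokenizer()
    p.feed('<td colspan=2><font face="Arial, Helvetica, sans-serif">'
           'Buy now</font><!-- xyzzy --></td>')
    # p.tokens -> ['colspan=2', 'face=Arial, Helvetica, sans-serif',
    #              'comment:xyzzy']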

Using word tuples is also a small win, but it enormously increases the 
database size and the number of tokens you have to pull from the 
database. That's an issue for me because I'm not using an in-memory 
database (one implementation uses CDB, another uses SQL; the SQL one is 
really nice because you can do data mining so easily, and the code to 
extract the token probabilities is just a view).
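
By "word tuples" I mean adjacent words glued together into one token, 
along these lines (a sketch; n=2 is just an example):

    def word_tuples(words, n=2):
        # e.g. ['guaranteed', 'free', 'money'] with n=2 gives
        # ['guaranteed free', 'free money'].  Every distinct tuple
        # becomes its own database row, hence the blow-up in size
        # and in lookups per message.
        return [' '.join(words[i:i + n])
                for i in range(len(words) - n + 1)]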

> That
> instantly cut the false negative rate in half.  It remains the single
> biggest win we ever got.

Well, I very quickly found out that most of the academic research into 
this has been pretty bogus. For example, everyone seems (seemed?) to 
think that stemming was a big win, but I found it to lose every time.

> The rest has been an aggregation of many smaller
> wins, and the benefit gotten over time from finding and removing the biases
> in Paul's formulation has been highly significant.  That eventually hit a
> wall, where this set of 3 artificialities was stubborn:
> 
>     artificially clamping spamprobs into [0.01, 0.99]
>     artificially boosting ham counts
>     looking at only the 16 most-extreme words
> 
> Changing any one, or any two, of those gave at best mixed results.  It took
> wholesale adoption of all of Gary Robinson's ideas at once (some of which
> aren't really explained (yet?) on his webpage) to nuke them all.  The fewer
> the number of "mystery knobs", the better the results have gotten, but the
> original biases sometimes acted to cancel each other out in the areas they
> hurt most, so you can't get here from there by removing just one at a time.

(I've followed this all so far in read-only mode, but thanks for 
rounding it up into 2 paragraphs <grin>).
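
For my own notes, here's how I read those three knobs in Paul's 
original scheme (a sketch of "A Plan for Spam", not the spambayes 
code):

    def graham_spamprob(spam_count, ham_count, nspam, nham):
        # Knob 2: ham counts are artificially doubled.
        g = 2.0 * ham_count
        if spam_count + g < 5:       # too rare to trust; Graham
            return 0.4               # gives unknown words 0.4
        b = min(1.0, spam_count / nspam)
        h = min(1.0, g / nham)
        # Knob 1: spamprobs artificially clamped into [0.01, 0.99].
        return max(0.01, min(0.99, b / (b + h)))

    # Knob 3 lives in the scoring step: only the ~16 words whose
    # probabilities sit furthest from 0.5 get combined.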

The one thing that still bothers me about Gary's method is that the 
threshold value varies depending on the corpus. Though I expect there's 
some mileage in being able to say that the middle ground is "unknown".
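
For reference, my reading of Gary's combining is roughly this (a 
sketch; I may well have some of his details wrong):

    import math

    def robinson_score(probs):
        # P measures spamminess and Q hamminess, each through a
        # geometric mean; S lands in [0, 1] with 0.5 as the
        # indifferent middle.
        n = len(probs)
        P = 1.0 - math.prod(1.0 - p for p in probs) ** (1.0 / n)
        Q = 1.0 - math.prod(probs) ** (1.0 / n)
        return (1.0 + (P - Q) / (P + Q)) / 2.0

With that shape, "unknown" is just a band around 0.5; it's the edges of 
the band that seem to move from corpus to corpus.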

>>so I'm hopefully going to get CLT done this week and see how it fares.
>>Unfortunately I find python incredibly difficult to read, so it takes
>>me a while!
> 
> 
> Hmm.  I could tell you to mentally translate
> 
>     a.b
> 
> to
> 
>     $a->{b}
> 
> but I doubt your problem is at that level <wink>.  Post a snippet of Python
> you find "incredibly difficult to read", and someone will be happy to walk
> you thru it.  I really can't guess, as this particular criticism of Python
> is one I've never heard before!

OK, I'll go over it again this week, and next time I get stuck I'll mail 
out for some help ;-) The hardest part really is getting from how my 
code is structured (i.e. where I get my data from, how I store it, etc.) 
to your version. A simple example: where you use a priority queue for 
the probabilities so you can extract the top N indicators, I just use an 
array and sort it to get the top N. So mostly it's just the details of 
storage that confuse me.
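
In other words (a sketch; N=16 and the distance-from-0.5 key are just 
how I picture it):

    import heapq

    def top_n_sorted(probs, n=16):
        # My way: sort everything by extremity and slice.
        return sorted(probs, key=lambda p: abs(p - 0.5),
                      reverse=True)[:n]

    def top_n_heap(probs, n=16):
        # The priority-queue way: same answer, bounded memory.
        return heapq.nlargest(n, probs, key=lambda p: abs(p - 0.5))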

Oh, and not being able to figure out where a block ends :-P

Off the top of my head, what does frexp() do?

And where is compute_population_stats used?

>>...
>>such as how the probability stuff works so much better on individuals'
>>corpora (or on a particular mailing list's corpus) than it does for
>>hundreds of thousands of users.
> 
> That's been my suspicion, but we haven't tested it here yet.  So save us the
> effort and tell us the bottom line from your tests <wink>.

On my personal email I was seeing about 5 FPs (false positives) in 
4000, and about 20 FNs (false negatives) in about the same number (I 
can't find the exact figures right now). On a live feed of customer 
email we're seeing about 4% FNs and 2% FPs.

I don't yet have your fancy histograms, mostly because the code works 
on one email in isolation right now and knows nothing about what result 
it should have given - I still need to write wrappers to do that stuff.