[Spambayes] Matt Sergeant: Introduction
Matt Sergeant
msergeant@startechgroup.co.uk
Tue, 01 Oct 2002 10:18:13 +0100
Tim Peters wrote:
> [Matt Sergeant]
>
> Thanks for the introduction, Matt! Welcome.
>
>
>>...
>>Like you all, I discovered very quickly that it's the tokenisation
>>techniques that are the biggest "win" when it comes down to it.
>
> The first thing I tried after implementing Graham's scheme was special
> tokenization and tagging of embedded http/https/ftp thingies.
Consider that adopted ;-)
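(Something like this, I imagine - a quick Python sketch, not the spambayes
code, and the token names are invented:)

    import re

    # Crude pattern for embedded http/https/ftp thingies; the real
    # tokenizer is surely more careful, this is just the flavour of it.
    URL_RE = re.compile(r"(https?|ftp)://([^\s/]+)(\S*)")

    def url_tokens(text):
        """Yield special tokens for each URL found in the text."""
        for scheme, host, rest in URL_RE.findall(text):
            yield "url scheme:" + scheme
            yield "url host:" + host
            # Tag each path component separately as well.
            for part in rest.split("/"):
                if part:
                    yield "url:" + part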
And to give something back, I'll tell you that one of my biggest wins was
parsing the HTML (with HTML::Parser - a C implementation, so it's very
fast) and tokenising all the attributes, so I get tokens like:

    colspan=2
    face=Arial, Helvetica, sans-serif

Plus, using a proper HTML parser, I get to parse HTML comments too (which
is a win).
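(In Python the equivalent would be roughly this - a sketch with the stdlib
parser standing in for HTML::Parser, purely illustrative:)

    from html.parser import HTMLParser

    class AttrTokenizer(HTMLParser):
        """Emit one token per tag attribute, plus comment words."""

        def __init__(self):
            super().__init__()
            self.tokens = []

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                # e.g. "colspan=2", "face=Arial, Helvetica, sans-serif"
                self.tokens.append("%s=%s" % (name, value or ""))

        def handle_comment(self, data):
            # A proper parser sees the comments too, so their words
            # can be tokenized rather than thrown away.
            self.tokens.extend(data.split())

Feed it a message body with p = AttrTokenizer(); p.feed(html), and
p.tokens has the lot.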
Using word tuples is also a small win, but it increases the database size
and the number of tokens you have to pull from the database enormously.
That's an issue for me because I'm not using an in-memory database (one
implementation uses CDB, another uses SQL - the SQL one is really nice
because you can do data mining so easily, and the code to extract the
token probabilities is just a view).
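(By word tuples I mean overlapping n-grams of adjacent tokens, roughly:)

    def word_tuples(words, n=2):
        """Turn a token stream into overlapping n-word tuples.

        A message of L words yields about L tuples, and the
        vocabulary explodes - hence the database blow-up.
        """
        return [" ".join(words[i:i + n])
                for i in range(len(words) - n + 1)]

    # ["cheap", "viagra", "online"] -> ["cheap viagra", "viagra online"]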
> That
> instantly cut the false negative rate in half. It remains the single
> biggest win we ever got.
Well, I very quickly found out that most of the academic research into
this has been pretty bogus. For example, everyone seems (seemed?) to
think that stemming was a big win, but I found it to lose every time.
> The rest has been an aggregation of many smaller
> wins, and the benefit gotten over time from finding and removing the biases
> in Paul's formulation has been highly significant. That eventually hit a
> wall, where this set of 3 artificialities was stubborn:
>
> artificially clamping spamprobs into [0.01, 0.99]
> artificially boosting ham counts
> looking at only the 16 most-extreme words
>
> Changing any one, or any two, of those, gave at best mixed results. It took
> wholesale adoption of all of Gary Robinson's ideas at once (some of which
> aren't really explained (yet?) on his webpage) to nuke them all. The fewer
> the number of "mystery knobs", the better results have gotten, but the
> original biases sometimes acted to cancel each other out in the areas they
> hurt most, so you can't get here from there removing just one at a time.
(I've followed this all so far in read-only mode, but thanks for
rounding it up into 2 paragraphs <grin>).
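(For my own notes, those three artificialities in Paul's formulation look
roughly like this - a Python sketch from his published description, not
from the spambayes source, with made-up names:)

    def graham_spamprob(b, g, nspam, nham):
        # b = times the token was seen in spam, g = times in ham.
        # Artificiality 2: ham counts are artificially boosted (doubled).
        g = 2 * g
        if b + g < 5:
            return None  # too rare to trust
        prob = min(1.0, b / nspam) / (min(1.0, g / nham) +
                                      min(1.0, b / nspam))
        # Artificiality 1: clamp spamprobs into [0.01, 0.99].
        return min(0.99, max(0.01, prob))

    # Artificiality 3: only the 16 most-extreme spamprobs (those
    # furthest from 0.5) get combined into the final score.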
The one thing that still bothers me about Gary's method is that the
threshold value varies depending on the corpus. Though I expect there's
some mileage in being able to say that the middle ground is "unknown".
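(i.e. something like this, where the cutoffs are corpus-dependent and the
numbers here are invented:)

    def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
        """Map a combined score in [0, 1] to a decision.

        The cutoffs vary per corpus (the bit that bothers me), but
        the gap between them gives you an honest "unsure" band.
        """
        if score <= ham_cutoff:
            return "ham"
        if score >= spam_cutoff:
            return "spam"
        return "unsure"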
>>so I'm hopefully going to get CLT done this week and see how it fares.
>>Unfortunately I find python incredibly difficult to read, so it takes
>>me a while!
>
>
> Hmm. I could tell you to mentally translate
>
> a.b
>
> to
>
> $a->{b}
>
> but I doubt your problem is at that level <wink>. Post a snippet of Python
> you find "incredibly difficult to read", and someone will be happy to walk
> you thru it. I really can't guess, as this particular criticism of Python
> is one I've never heard before!
OK, I'll go over it again this week, and next time I get stuck I'll mail
out for some help ;-) The hardest part really is getting from how my
code is structured (i.e. where I get my data from, how I store it, etc.)
to your version. A simple example: where you use a priority queue for
the probabilities so you can extract the top N indicators, I just use an
array and sort it to get the top N (see the sketch below). So mostly
it's just the details of storage that confuse me.
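(The two are equivalent for my purposes - in Python, roughly:)

    import heapq

    def strongest(spamprobs, n=16):
        """Return the n probabilities furthest from the 0.5 midpoint."""
        def extremeness(p):
            return abs(p - 0.5)
        # Full sort, like my array version ...
        top_by_sort = sorted(spamprobs, key=extremeness, reverse=True)[:n]
        # ... or a priority queue, which skips the full sort and
        # picks out the same n values.
        top_by_heap = heapq.nlargest(n, spamprobs, key=extremeness)
        return top_by_heap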
Oh, and not being able to figure out where a block ends :-P
Off the top of my head, what does frexp() do?
And where is compute_population_stats used?
>>...
>>such as how the probability stuff works so much better on individuals'
>>corpora (or on a particular mailing list's corpus) than it does for
>>hundreds of thousands of users.
>
> That's been my suspicion, but we haven't tested it here yet. So save us the
> effort and tell us the bottom line from your tests <wink>.
On my personal email I was seeing about 5 FPs in 4000, and about 20
FNs in about the same number (I can't find the exact figures right now) -
so roughly 0.1% and 0.5%. On a live feed of customer email we're seeing
about 4% FNs and 2% FPs.
I don't yet have your fancy histograms, mostly because the code works on
one email in isolation right now and knows nothing about what result it
should have given - I still need to write wrappers to do that stuff.