[Spambayes] Matt Sergeant: Introduction

Tim Peters <tim.one@comcast.net>
Mon, 30 Sep 2002 14:43:52 -0400


[Matt Sergeant]

Thanks for the introduction, Matt!  Welcome.

> ...
> Like you all, I discovered very quickly that it's the tokenisation
> techniques that are the biggest "win" when it comes down to it.

The first thing I tried after implementing Graham's scheme was special
tokenization and tagging of embedded http/https/ftp thingies.  That
instantly cut the false negative rate in half, and it remains the single
biggest win we ever got.  The rest has been an aggregation of many smaller
wins, and the cumulative benefit from finding and removing the biases in
Paul's formulation has been highly significant.  That eventually hit a
wall, where this set of three artificialities proved stubborn:

    artificially clamping spamprobs into [0.01, 0.99]
    artificially boosting ham counts
    looking at only the 16 most-extreme words
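For concreteness, here's roughly where those three gimmicks sit in a
Graham-style scorer.  This is a minimal sketch, not the actual spambayes
code -- the names and constants are illustrative:

    def spamprob(hamcount, spamcount, nham, nspam):
        # Per-word spam probability, Graham-style.
        hamcount *= 2                        # artificially boost ham counts
        hamratio = min(1.0, hamcount / nham)
        spamratio = min(1.0, spamcount / nspam)
        prob = spamratio / (hamratio + spamratio)
        return min(0.99, max(0.01, prob))    # artificially clamp into [0.01, 0.99]

    def score(word_probs):
        # Combine per-word probabilities into one message score,
        # looking at only the 16 words farthest from a neutral 0.5.
        extremes = sorted(word_probs, key=lambda p: abs(p - 0.5))[-16:]
        spamminess = hamminess = 1.0
        for p in extremes:
            spamminess *= p
            hamminess *= 1.0 - p
        return spamminess / (spamminess + hamminess)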

Changing any one, or any two, of those gave at best mixed results.  It took
wholesale adoption of all of Gary Robinson's ideas at once (some of which
aren't really explained (yet?) on his webpage) to nuke them all.  The fewer
the "mystery knobs", the better the results have gotten, but the original
biases sometimes acted to cancel each other out in the areas where they
hurt most, so you can't get here from there by removing just one at a time.
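For reference, the core of Gary's scheme as I read it -- again just a
sketch, with my own parameter names; see his webpage for the real story.
The s/x smoothing shrinks rare-word probabilities toward an assumed prior
instead of clamping or faking counts, and the geometric-mean combining
uses every word instead of only the 16 most extreme:

    from math import prod  # multiplies an iterable of numbers

    def robinson_prob(hamcount, spamcount, nham, nspam, s=1.0, x=0.5):
        # f(w) = (s*x + n*p(w)) / (s + n): shrink the raw ratio p(w)
        # toward the prior x; rare words (small n) stay near x.
        hamratio = hamcount / nham
        spamratio = spamcount / nspam
        p = spamratio / (hamratio + spamratio)
        n = hamcount + spamcount
        return (s * x + n * p) / (s + n)

    def robinson_combine(probs):
        # Geometric-mean combining over *all* words in the message.
        n = len(probs)
        P = 1.0 - prod(1.0 - p for p in probs) ** (1.0 / n)  # spam indicator
        Q = 1.0 - prod(probs) ** (1.0 / n)                   # ham indicator
        return (1.0 + (P - Q) / (P + Q)) / 2.0               # score in [0, 1]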

> ...
> and I've done the Robinson stuff without the central limit theorem and
> it didn't work quite as well,

It won't unless you also remove the biases from Paul's scheme.  One of the
biggest wins we got was removing the gimmick that said "well, unless
hamcount*2 + spamcount >= 5, let's pretend we've never seen the word".  That
in particular doesn't seem to play well with Gary's combining formula.
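In code, the gimmick is just a hard cutoff in front of the lookup --
illustrative names again:

    MIN_EVIDENCE = 5

    def graham_knows(hamcount, spamcount):
        # Below the threshold, pretend the word was never seen.
        return hamcount * 2 + spamcount >= MIN_EVIDENCE

Gary's f(w) smoothing already discounts low-count words gradually, so, as
far as I can tell, stacking this cliff on top of it throws away exactly the
weak evidence his combining formula is designed to damp gracefully.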

> so I'm hopefully going to get CLT done this week and see how it fares.
> Unfortunately I find python incredibly difficult to read, so it takes
> me a while!

Hmm.  I could tell you to mentally translate

    a.b

to

    $a->{b}

but I doubt your problem is at that level <wink>.  Post a snippet of Python
you find "incredibly difficult to read", and someone will be happy to walk
you thru it.  I really can't guess, as this particular criticism of Python
is one I've never heard before!

> ...
> such as how the probability stuff works so much better on individuals'
> corpora (or on a particular mailing list's corpus) than it does for
> hundreds of thousands of users.

That's been my suspicion, but we haven't tested it here yet.  So save us the
effort and tell us the bottom line from your tests <wink>.