[spambayes-dev] RE: [Spambayes] How low can you go?

Mon Dec 22 15:10:38 EST 2003

[T. Alexander Popiel]
> ...
> Again, if we're trying to get reproducible results, then I think that
> the main DB and such is the wrong place to be starting.

Right!

> We shouldn't be treating just anecdotal evidence from running changed
> code with our ongoing live mail feeds as the best we can do.

We're really not, Alex.  It's just a source of ideas to try, and nothing has
changed as a result of it (some experimental, non-default options have been
added, but that's it).

> While the Outlook plugin has done wonders for our popularity, it seems
> to have utterly destroyed our rigor.

I'm still comfortable with what's been checked in.  While there's been
massive refactoring of the code, very little has changed in how messages get
tokenized and scored.  Nothing material has changed in classifier.py, except
for removing experimental_ham_spam_imbalance_adjustment support, and there
was plenty of evidence that that gimmick hurt more than it helped, and more
so the more unbalanced training got.  It was a proven loser (since I wrote
it to begin with, I'm biased in its favor <wink>).

I did check in a few material changes to tokenizer.py over the last year
without full-scale testing.  These were all in the nature of untangling HTML
obfuscations, so that the classifier got a better idea of what the human
email reader sees, instead of tokenizing mountains of raw numeric character
entities, nonsense tags, and other coding tricks unique to HTML.  That was
driven by staring at low-scoring unsures, and identifying tricks that had no
purpose beyond disguising the rendered content.  Tests (on my own email and
on my original large test data) showed that de-obfuscating that stuff was a
pure win, so I was willing to risk that much.
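
To make the idea concrete, here is a minimal sketch of that kind of
de-obfuscation.  It is not the code in tokenizer.py:  the names TAG_RE,
deobfuscate and tokenize are hypothetical, and a regexp is only a crude
stand-in for real HTML handling.  It just shows why decoding entities and
dropping tags hands the classifier the words a human reader would see.

```python
import html
import re

# Hypothetical illustration only; this is NOT the spambayes tokenizer.
# Crude tag matcher: anything between "<" and the next ">".
TAG_RE = re.compile(r"<[^>]*>")

def deobfuscate(text):
    """Return text closer to what a mail reader would render."""
    # Drop tags first, so nonsense tags splitting a word ("vi<zzz>agra")
    # vanish, and an escaped "&lt;" is never mistaken for a real tag.
    text = TAG_RE.sub("", text)
    # Then decode numeric and named character entities ("&#105;" -> "i").
    return html.unescape(text)

def tokenize(text):
    """Split the de-obfuscated text into lowercase whitespace tokens."""
    return deobfuscate(text).lower().split()

if __name__ == "__main__":
    print(tokenize("Buy V&#105;<zzz>agr&#97; n&#111;w"))
```

Without the de-obfuscation step, the same input would produce tokens full
of "&#105;" and "<zzz>" debris instead of the word the spammer was hiding.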

I'm hard pressed to think of other default behavior that's changed.

> People now typically don't have the slightest clue how to go from
> their normal usage to a testing deployment...  or at least don't know
> how to extract their mail from Outlook's clutches so that they have
> data to work _on_.

That's for sure, and is one reason nothing else material *has* been checked
in.  Mark knows how to extract email from Outlook for usable testing, and
wrote some code to help do that, but I haven't yet had time to figure out
how it's done myself.  I'm sure very few Outlook users have.  I agree that
needs to change.  I've been speculating about lots of stuff lately, but I
have no intention of checking in any of that as default behavior without
full-blown, multi-corpus rigorous testing.

> As I don't use Outlook in any environment where I see spam, I don't
> know how to write the newbie guide to fix this... if indeed it is
> fixable.  I'm spoiled by doing most of my mail handling in an
> environment which encourages treating mail as data to be arbitrarily
> processed, instead of just viewed through a GUI.

OTOH, Outlook users are spoiled by that GUI, deeply integrated with
spambayes.  It's truly a joy to use, day-to-day.  Training spambayes
effectively via the Outlook UI remains more than a bit of a puzzle, though,
and that extends in part to everyone who isn't prepared to retrain from
scratch at the drop of a hat.  There's a growing disconnect that way between
what developers are happy to do, and what "real users" are able to tolerate.
That's worth some thought too.