[spambayes-dev] RE: [Spambayes] How low can you go?

Mon Dec 22 16:54:35 EST 2003

In message:  <LNBBLJKPBEHFEDALKOLCIEHFHPAB.tim.one at comcast.net>
             "Tim Peters" <tim.one at comcast.net> writes:
>[T. Alexander Popiel]
>
>> We shouldn't be treating just anecdotal evidence from running changed
>> code with our ongoing live mail feeds as the best we can do.
>
>We're really not, Alex.  It's just a source of ideas to try, and nothing has
>changed as a result of it (some experimental, non-default options have been
>added, but that's it).

You're right, and I'm being overly emphatic.  The significant work
over the last year has almost entirely been with the Outlook integration;
the original core of the project has gone fairly dormant.  For UI stuff,
you don't need rigor (unless you're Don Norman), and I've been letting
some of that bleed over into my perception of all of the recent progress.

>I did check in a few material changes to tokenizer.py over the last year
>without full-scale testing.  These were all in the nature of untangling HTML
>obfuscations, so that the classifier got a better idea of what the human
>email reader sees, instead of tokenizing mountains of raw numeric character
>entities, nonsense tags, and other coding tricks unique to HTML.

These are definitely good changes.  The header whitespace
normalization that's been suggested in a separate thread may
also be, though I'm less certain of that one; since the vast
majority of people don't look at the headers, I suspect there's
a greater chance of something quirky but useful there that'd be
obscured by the normalization.  (I suppose it depends on whether
intermediate mailservers unwrap and rewrap the headers...)
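
A minimal sketch of what that normalization could look like, assuming it
means unfolding continuation lines and collapsing each run of whitespace
to a single space.  The function name is hypothetical, and this is not
the code in tokenizer.py.

```python
import re

def normalize_header_whitespace(value):
    """Unfold a header value and collapse whitespace runs to one space.

    Hypothetical helper for illustration only.
    """
    # \s+ matches the CRLF of a folded line plus any leading tabs/spaces
    # on the continuation line, so unfolding and collapsing happen at once.
    return re.sub(r"\s+", " ", value).strip()

# The same Received: value, wrapped two different ways (made-up hosts).
wrapped_with_tab = "from mail.example.com\r\n\tby mx.example.org"
wrapped_with_spaces = "from mail.example.com\r\n        by mx.example.org"

a = normalize_header_whitespace(wrapped_with_tab)
b = normalize_header_whitespace(wrapped_with_spaces)
print(a)
print(a == b)
```

Both wrappings come out identical after normalization, which is exactly
the information that would be lost if the wrapping style turned out to
be a useful clue.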

>> I'm spoiled by doing most of my mail handling in an
>> environment which encourages treating mail as data to be arbitrarily
>> processed, instead of just viewed through a gui.
>
>OTOH, Outlook users are spoiled by that GUI, deeply integrated with
>spambayes.  It's truly a joy to use, day-to-day.  Training spambayes
>effectively via the Outlook UI remains more than a bit of a puzzle, though,
>and that extends in part to everyone who isn't prepared to retrain from
>scratch at the drop of a pin.  There's a growing disconnect that way between
>what developers are happy to do, and what "real users" are able to tolerate.
>That's worth some thought too.

I don't use a gui at all for my normal mail, so I really don't
know what it would be like to have spambayes 'tightly integrated'.
As it is, I've got a couple folders set up for spambayes use, and
some procmail stuff... but any retraining or corrections only take
effect in my nightly rebuild-the-database-from-scratch, unless I
go out of my way to kick off a rebuild early.

I think the biggest disconnect by far is whether or not people are
willing to keep every single piece of mail they get for months or
years at a time.  That's what I'm doing now... but I think I can
count the number of people who do that on one hand.

The next test that I'm actually interested in doing is a comparison
between training on everything and training only on messages whose
score doesn't round to 1.00 or 0.00.  I may post a regime for that shortly.
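
A rough sketch of the selection rule being compared, assuming each
message already has a score between 0.0 and 1.0 and that "rounded"
means rounded to two decimal places.  The function names and the sample
scores are hypothetical; this is not the regime itself.

```python
def is_extreme(score, ndigits=2):
    """True if the score rounds to exactly 0.00 or 1.00."""
    rounded = round(score, ndigits)
    return rounded == 0.0 or rounded == 1.0

def select_for_training(scored, skip_extremes):
    """Return the messages to train on.

    scored is a list of (message_id, score) pairs.  With skip_extremes
    False this is train-on-everything; with it True, messages whose
    score rounds to 0.00 or 1.00 are left out.
    """
    return [msg for msg, score in scored
            if not (skip_extremes and is_extreme(score))]

# Made-up scores for illustration.
scored = [("msg1", 0.9991), ("msg2", 0.62), ("msg3", 0.0003), ("msg4", 0.97)]

print(select_for_training(scored, skip_extremes=False))
print(select_for_training(scored, skip_extremes=True))
```

The comparison would then be a matter of running the same test stream
through both settings and looking at the resulting error rates.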

- Alex