[Spambayes] Introducing myself

Tim Peters tim.one@comcast.net
Tue Nov 12 01:27:03 2002


[Robert Woodhead]
> ...
> It seems to me that you're at the point where testing the effects of
> data reduction techniques would be fruitful.

Bootstrapping a classifier, connecting to a gazillion quirky email clients,
and testing training strategies are all current high priorities.  Saving
memory wouldn't buy me anything in the Outlook client I'm using, or in the
high-volume python.org application.  But, as I said, other people are keener
on that, and I expect that reducing the sheer number of tokens is a more
effective approach (in part because it ties into effective training
strategies over time -- without active pruning, the database will just keep
growing (albeit at a slackening pace), whether a token takes one byte
or 50).

> Once I get up and running on the code (just paid the tithe to O'Reilly)
> I'll test it out.

It's all yours <wink>.

> One thing that occurred to me: now that you have something that seems
> to work pretty well, have you considered backtracking on particular
> features to see how much they contribute; for example, going to a
> trivial state machine parser to spit out tokens?

In theory, all prior decisions should be revisited after every change.  I
haven't done anything like that lately, though, in part because no previous
"let's revisit this!" experiment ever paid off.

Note that the bulk of the body tokenizer couldn't be simpler:

1. Convert to lowercase.
2. Split on whitespace.

Well, we *could* skip #1, but previous experiments found that it didn't give
better error rates but did increase the database size.  It did change the
*kinds* of errors, though, and in particular conference announcements had a
hard time getting thru when case was preserved (they're trying to sell you a
conference, and often SCREAM ABOUT IT).
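The two steps above amount to something like this (a minimal sketch of the
steps as described, not the actual spambayes tokenizer, which does more
around the edges):

```python
def tokenize_body(text):
    """Core body tokenization as described above:
    1. Convert to lowercase.
    2. Split on whitespace.
    """
    return text.lower().split()

# Case-folding means SCREAMING conference announcements produce the
# same tokens as polite ones:
tokenize_body("REGISTER NOW for the Conference")
# → ['register', 'now', 'for', 'the', 'conference']
```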

> ...
> Yeah, we old farts ("When I was a lad, the bytes only had 6 bits!")

They had 6 or 9 when I was a lad, depending on how you set the control bit
for the Univac 1108's 36-bit words.

> have lots of tricks.  We don't so much write code as remember it and
> retype it.

You don't want to bet on who's older here <wink>.

> ...
> Not really; it doesn't really matter what the format of a token
> coming out of the parser is, does it?

The classifier is happy with any immutable and hashable Python object, i.e.
anything that can be used as a Python dict key.  But people grafting various
databases onto this have stronger requirements, and they're not always
clear.  As I mentioned last time, most "lightweight" databases require
string keys, so any switch away from strings would break those systems.
It's pre-alpha code, but still I'm not keen to rock anyone's boat unless
there's a clear win in return.
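To make the constraint concrete, here's a toy sketch (not the spambayes
classifier or any real backend) contrasting a plain dict with a
dbm-style store:

```python
import dbm.dumb
import os
import tempfile

# The classifier side: a plain dict keys on any hashable object, so
# tuple tokens (say, a tagged bigram) would be perfectly fine there.
counts = {}
counts["money"] = 5
counts[("subject:", "free")] = 2

# A dbm-style "lightweight" database, though, insists on string (or
# bytes) keys, so a switch away from string tokens breaks it outright.
path = os.path.join(tempfile.mkdtemp(), "counts")
db = dbm.dumb.open(path, "c")
db["money"] = b"5"                     # string key: fine
try:
    db[("subject:", "free")] = b"2"    # tuple key: rejected
except TypeError:
    print("non-string keys rejected")
db.close()
```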

> ...
> True; then it becomes a game of finding generic messages that are
> likely to evaluate as hammy enough to the average recognizer.  And
> the meta-response is to send out multiple emails with differently
> tuned slices of ham.

They can try.  Spam doesn't need to be stopped, though, it merely has to be
made more costly to send than it brings back.

Last week Jeremy and Guido here both reported a *very* effective technique:
spam was sent to them as replies to mailing-list postings (not this mailing
list <wink>) they had made, including a full quote of the msg they had
posted.  That was guaranteed to have lots of ham words for them, and the
Subject line was the expected "Re:" followed by their own subject line.

I doubt they're going to get a response rate high enough to be able to
afford this scheme over time, at least not on tech mailing lists.  We'll
see; if they can, it's going to be hard to beat.

> I hereby, btw, coin the term "Dagwood" (or perhaps it should be
> Wooddag?) to mean an email containing artfully sliced amounts of ham,
> spam, and html condiments.  ;^)

Cool!  Dagwood it is.

> ...
> Well, what you'd need is a hacked HTML renderer that output sets that
> look like (token,size,color,background) and ignored words that were
> too small or hard to read.

Sure.  I expect the quickest path would be to feed the source thru a
text-only browser and stare at its output.  That seems mondo expensive,
though.
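For what it's worth, a crude stand-in for that text-only-browser pass is
available from Python's stdlib HTML parser.  This is a sketch only -- it
drops tags and script/style contents, but knows nothing about font size or
color, which the scheme quoted above would also need:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect visible text, skipping <script>/<style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)

p = TextOnly()
p.feed("<html><body><h1>BIG SALE</h1><p>tiny print</p></body></html>")
text = " ".join(" ".join(p.chunks).split())
# → 'BIG SALE tiny print'
```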

>> For goodness sake, this is email we're talking about -- anyone
>> trusting a truly critical msg to email is dreaming to begin with.

> Unfortunately, in the real world, this happens all too often.  Keep
> in mind that the readers of this list are not the typical users of
> the resulting software techniques.

I do, but it's still not my problem <0.5 wink>.  All non-trivial systems
have non-zero FP rates, and that's a fact of life.  You're keen on
whitelists, but they wouldn't do a thing to stop any of the false positives
I've seen, and so on; a multitude of schemes may reduce the overall error
rates if they're combined intelligently, but they're not going to reach an
error rate of 0.  Not even with human review (as has become obvious to
everyone who's run a good system over their supposedly clean ham and spam
collections).  At some point, learning that Santa Claus isn't actually a
white man is a part of growing up <wink>.

show-me-an-isp-that-guarantees-email-delivery-and-we'll-get-
    rich-shorting-its-stock-ly y'rs  - tim
