[Spambayes] Introducing myself
Tim Stone - Four Stones Expressions
Tue Nov 12 01:32:18 2002
11/11/2002 7:27:03 PM, Tim Peters <firstname.lastname@example.org> wrote:
>> It seems to me that you're at the point where testing the effects of
>> data reduction techniques would be fruitful.
>Bootstrapping a classifier, connecting to a gazillion quirky email clients,
>and testing training strategies are all current high priorities. Saving
>memory wouldn't buy me anything in the Outlook client I'm using, or in the
>high-volume python.org application. But, as I said, other people are keener
>on that, and I expect that reducing the sheer number of tokens is a more
>effective approach (in part because it ties into effective training
>strategies over time -- the database will just keep growing (albeit at a
>slackening pace) without active pruning, and whether a token takes one byte
>> Once I get up and running on the code (just paid the tithe to O'Reilly)
>> I'll test it out.
>It's all yours <wink>.
>> One thing that occurred to me: now that you have something that seems
>> to work pretty well, have you considered backtracking on particular
>> features to see how much they contribute; for example, going to a
>> trivial state machine parser to spit out tokens?
>In theory, all prior decisions should be revisited after every change. I
>haven't done anything like that lately, though, in part because no previous
>"let's revisit this!" experiment ever paid off.
>Note that the bulk of the body tokenizer couldn't be simpler:
>1. Convert to lowercase.
>2. Split on whitespace.
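For readers following along, the two steps quoted above can be sketched in a few lines of Python. This is only an illustration of "lowercase, then split on whitespace"; the function name tokenize_body is hypothetical and is not taken from the Spambayes source.

```python
def tokenize_body(text):
    """Hypothetical helper: lowercase the text, then split on whitespace."""
    # str.split() with no argument splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty strings.
    return text.lower().split()


tokens = tokenize_body("Buy NOW\tand  SAVE big")
print(tokens)
```

Because every run of whitespace is a separator, text with letters spaced apart comes out as a stream of one-character tokens rather than as the original word.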
This makes me wonder what happens if someone spams you with various devices
like c o n v e r t i n g wor ds into var ious c.o.m.b in a.tions of
>Well, we *could* skip #1, but previous experiments found that it didn't give
>better error rates but did increase the database size. It did change the
>*kinds* of errors, though, and in particular conference announcements had a
>hard time getting thru when case was preserved (they're trying to sell you a
>conference, and often SCREAM ABOUT IT).
>> Yeah, we old farts ("When I was a lad, the bytes only had 6 bits!")
>They had 6 or 9 when I was a lad, depending on how you set the control bit
>for the Univac 1108's 36-bit words.
>> have lots of tricks. We don't so much write code as remember it and
>> retype it.
>You don't want to bet on who's older here <wink>.
>> Not really; it doesn't really matter what the format of a token
>> coming out of the parser is, does it?
>The classifier is happy with any immutable and hashable Python object, i.e.
>anything that can be used as a Python dict key. But people grafting various
>databases onto this have stronger requirements, and they're not always
>clear. As I mentioned last time, most "lightweight" databases require
>string keys, so any switch away from strings would break those systems.
>It's pre-alpha code, but still I'm not keen to rock anyone's boat unless
>there's a clear win in return.
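A small sketch of the point about token types, assuming a plain in-memory dict on one side and a store that only accepts string keys on the other. The adapter to_string_key is a hypothetical name invented for this example, not part of Spambayes or of any database library.

```python
from collections import Counter

# With a plain dict (here a Counter), any immutable, hashable object
# can serve as a token: strings, tuples, integers.
counts = Counter()
for token in ["free", ("subject", "free"), 42]:
    counts[token] += 1

# A mutable object such as a list is not hashable and is rejected.
try:
    counts[["not", "hashable"]] += 1
except TypeError:
    print("lists cannot be dict keys")


def to_string_key(token):
    """Hypothetical adapter for a store that only accepts string keys."""
    # Strings pass through; everything else is stored by its repr().
    return token if isinstance(token, str) else repr(token)


string_keyed = {to_string_key(tok): n for tok, n in counts.items()}
print(sorted(string_keyed))
```

One caveat with this kind of adapter: the string token "42" and the integer 42 would map to the same key, so a real scheme would need an unambiguous encoding.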
>> True; then it becomes a game of finding generic messages that are
>> likely to evaluate as hammy enough to the average recognizer. And
>> the meta-response is to send out multiple emails with differently
>> tuned slices of ham.
>They can try. Spam doesn't need to be stopped, though, it merely has to be
>made more costly to send than it brings back.
>Last week Jeremy and Guido here both reported a *very* effective technique:
>spam was sent to them as replies to mailing-list postings (not this mailing
>list <wink>) they had made, including a full quote of the msg they had
>posted. That was guaranteed to have lots of ham words for them, and the
>Subject line was the expected "Re:" followed by their own subject line.
>I doubt they're going to get a response rate high enough to be able to
>afford this scheme over time, at least not on tech mailing lists. We'll
>see; if they can, it's going to be hard to beat.
>> I hereby, btw, coin the term "Dagwood" (or perhaps it should be
>> Wooddag?) to mean an email containing artfully sliced amounts of ham,
>> spam, and html condiments. ;^)
>Cool! Dagwood it is.
>> Well, what you'd need is a hacked HTML renderer that output sets that
>> look like (token,size,color,background) and ignored words that were
>> too small or hard to read.
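A very rough sketch of that idea, using only the standard library's html.parser. It is an assumption-laden toy: it looks only at <font size=... color=...> tags, emits (token, size, color) without the background element, assumes a default size of 3 and a default color of "black", and treats any size below a hypothetical min_size threshold as unreadable. A real renderer would also have to handle CSS, tables, and text-versus-background contrast.

```python
from html.parser import HTMLParser


class StyledTokenizer(HTMLParser):
    """Hypothetical sketch: emit (token, size, color), skipping tiny text."""

    def __init__(self, min_size=2):
        super().__init__()
        self.min_size = min_size
        self.stack = [(3, "black")]  # assumed default font size and color
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            size, color = self.stack[-1]  # inherit from the enclosing font
            attr_map = dict(attrs)
            if (attr_map.get("size") or "").isdigit():
                size = int(attr_map["size"])
            color = attr_map.get("color") or color
            self.stack.append((size, color))

    def handle_endtag(self, tag):
        if tag == "font" and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        size, color = self.stack[-1]
        if size >= self.min_size:  # ignore words too small to read
            self.tokens.extend((w, size, color) for w in data.lower().split())


parser = StyledTokenizer()
parser.feed('Buy <font size="1" color="white">python guido</font> NOW')
parser.close()
print(parser.tokens)
```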
>Sure. I expect the quickest path would be to feed the source thru a
>text-only browser, and stare at its output. That seems mondo expensive,
>>> For goodness sake, this is email we're talking about -- anyone
>>> trusting a truly critical msg to email is dreaming to begin with.
>> Unfortunately, in the real world, this happens all too often. Keep
>> in mind that the readers of this list are not the typical users of
>> the resulting software techniques.
>I do, but it's still not my problem <0.5 wink>. All non-trivial systems
>have non-zero FP rates, and that's a fact of life. You're keen on
>whitelists, but they wouldn't do a thing to stop any of the false positives
>I've seen, and so on; a multitude of schemes may reduce the overall error
>rates if they're combined intelligently, but they're not going to reach an
>error rate of 0. Not even with human review (as has become obvious to
>everyone who's run a good system over their supposedly clean ham and spam
>collections). At some point, learning that Santa Claus isn't actually a
>white man is a part of growing up <wink>.
> rich-shorting-its-stock-ly y'rs - tim