[Spambayes] how spambayes handles image-only spams
tim.one at comcast.net
Tue Sep 9 23:26:44 EDT 2003
>> spambayes was developed against many peoples' test corpora,
>> although, as I said before, I don't think any of them had a
>> significant quantity of HTML ham (and I don't think Bill's did)
> Not true - my mungo-testset of 30K or so items had a significant
> amount of HTML - this is why I jumped up and down on the HTML stuff.
Oh, now you want to be treated like a human too <wink>? OK, you're right.
> Tim's original python-list archive had almost no HTML ham.
That's true. The only HTML ham it had was mailing-list administrivia
requests ("subscribe", "unsubscribe") sent to a wrong address, and a few
newbie questions. Drops in the ocean.
>> In spambayes I'd be more inclined to write special code to identify
>> the img-src-http dance, and synthesize a token for that. It's only
>> one token, though, and all tokens carry the same weight here -- it
>> may still not be enough to give "a typical" short message of this
>> ilk a strong enough score to nail it. The only way to know is to
>> try it.
> Note also that synthesising a bunch of (highly correlated) clues to
> attempt to fix the problem of the single URL spams often leads to
> unexpected (bad) consequences with scoring other email. In general,
> highly correlated clues are bad.
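[The harm from correlated clues can be seen in a toy combiner. This is illustrative only -- spambayes' real scoring uses chi-squared combining, not this naive product rule -- but the double-counting effect is the same: clues that always fire together get counted as if they were independent evidence.]

```python
import math

def combine(probs):
    """Naive-independence combination of per-token spam probabilities."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)

base = [0.6, 0.5, 0.9]            # three roughly independent clues
print(round(combine(base), 3))    # 0.931
# the same strong clue, synthesized twice, pushes the score
# further than the evidence warrants:
print(round(combine(base + [0.9]), 3))  # 0.992
```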
Well, correlation actually appears to help us more often than it hurts us,
and stripping HTML in the tokenizer was a hack to blind the classifier to
the strongest source of harmful correlation I know of. We've added a few
pieces of HTML evidence back since then, and it's helped. I don't know
whether an img-source-http token would help or hurt (hence "the only way to
know is to try it"), since it's very likely correlated with existing
synthesized url:jpg and url:gif tokens. I get the latter in ham too, but
mostly in (wanted) HTML marketing collateral of various kinds with *tons* of
hammy clues -- they score so hammy now that one new contradictory token
won't hurt them. But that's just me.
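[For concreteness, here is a sketch of the kind of synthesis being discussed -- a single img-src-http token plus url:<extension> tokens like the url:jpg / url:gif ones above. The regex and token names are illustrative, not spambayes' actual tokenizer code.]

```python
import re

# crude match for <img ... src="http://...">; real HTML needs a parser
IMG_SRC_RE = re.compile(r'<img[^>]+src=["\']?(https?://[^"\'\s>]+)',
                        re.IGNORECASE)

def synthesize_tokens(html):
    tokens = []
    for url in IMG_SRC_RE.findall(html):
        tokens.append("img-src-http")    # the one synthetic clue
        ext = url.rsplit(".", 1)[-1].lower()
        if ext.isalnum() and len(ext) <= 4:
            tokens.append("url:" + ext)  # e.g. url:jpg, url:gif
    return tokens

print(synthesize_tokens('<body><img src="http://x.example/pic.jpg"></body>'))
# ['img-src-http', 'url:jpg']
```

[Note how img-src-http fires on exactly the messages that already produce url:jpg or url:gif -- which is the correlation worry.]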
> This is why it's necessary to test additions to make sure that in fixing
> one problem you're not adding 4 others.
> Someone should go through the list archives and work out the ratio of
> attempted tokeniser tricks that made things worse to ones that actually
> improved the situation. I'd guess that it's something like 4 rejected for
> every one that went in...
Whatever you came up with would be an underestimate! In the very early
days, I tried new stuff 7 days a week whenever I was awake. Performance was
much worse then and it was much easier to find strong improvements. The
only evidence of those experiments anyone saw was about the ones that
worked, since those are the ones that got checked in. Words about *some*
others survived in the comments and in scattered email. I think I threw
away about 9 changes then for each that got checked in. Goodness, I'm *still*
personally insulted that folding case turned out to work as well as
preserving it <0.5 wink>.