[Spambayes] how spambayes handles image-only spams
tim.one at comcast.net
Thu Sep 4 21:44:53 EDT 2003
> Some "normal" users get the vast majority of their ham as html. I
> know lots personally.
Cool! I don't know any. How's the addin working for them? Everyone who
has endured my comments in tokenizer.py knows I was obsessively determined
not to penalize HTML msgs, but since I don't know anyone who gets a ton of
HTML ham, the effectiveness of all that hasn't been tested by me.
>>> Statistically speaking, HTML mail is
>>> either from a spammer or from a clueless
>>> git, and in either case can usually be
>>> delayed without penalty or discarded outright.
>> As indicated above, I do not think this analysis is true anymore. And
>> characterizing someone as a clueless git because they don't change
>> their mail client's default message format or "love" plain text...
>> Well, let us know when you get back to the real world.
> I agree 100%. I fear we are starting to sound condescending - such
> labels and telling users "trust us, you really don't want that
> feature" doesn't help anyone.
Note that the "clueless git" comment was from Bill (CRM114's developer), not
one of the loving spambayes folks <wink>.
>>> Similarly, base-64 encodes are almost _always_ trash.
>> I agree, except for in-line images sent with email newsletters and
>> the like.
> The problem is that we simply don't know. One man's trash is
> another's treasure. We clearly need more research, but we have to
> be careful not to base too many assumptions on testing the mail of
> us geeks <wink>
spambayes decodes base64 sections -- so long as they have a text/* type -- and
judges based on the decoded contents. Merely using base64 doesn't count for
or against a msg in our code. If it's ill-formed base64, we synthesize a

    control: couldn't decode

token and try harder to decode it anyway. Apart from that, the classifier
has no way to know whether base64 was involved.
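
For readers who want to see the shape of that, here is a minimal sketch using only the Python standard library. It is not the code in tokenizer.py: the function name decode_text_parts, the cleanup regex, and the padding repair are assumptions made for illustration. Only the token text comes from the description above.

```python
import base64
import binascii
import re
from email import message_from_string

# Token text taken from the post; everything else here is illustrative.
COULDNT_DECODE = "control: couldn't decode"


def decode_text_parts(raw_message):
    """Return (text, extra_tokens) for every text/* part of a message."""
    results = []
    for part in message_from_string(raw_message).walk():
        if part.get_content_maintype() != "text":
            continue  # images and other non-text parts are not decoded here
        payload = part.get_payload()
        extra_tokens = []
        encoding = (part.get("Content-Transfer-Encoding") or "").strip().lower()
        if encoding == "base64":
            compact = "".join(payload.split())  # drop line breaks and spaces
            try:
                data = base64.b64decode(compact, validate=True)
            except binascii.Error:
                # Ill-formed base64: record that fact, then try harder.
                extra_tokens.append(COULDNT_DECODE)
                # Assumed recovery strategy: keep only base64 alphabet
                # characters and repair the padding before decoding again.
                cleaned = re.sub(r"[^A-Za-z0-9+/]", "", compact)
                if len(cleaned) % 4 == 1:
                    cleaned = cleaned[:-1]
                cleaned += "=" * (-len(cleaned) % 4)
                data = base64.b64decode(cleaned)
            charset = part.get_content_charset() or "latin-1"
            payload = data.decode(charset, errors="replace")
        results.append((payload, extra_tokens))
    return results
```

Note that a well-formed base64 part comes back with an empty token list, so nothing downstream can tell it was ever encoded.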
The principle here (which may never have been formulated clearly) is that
spambayes wants to score what the end user *sees*, not necessarily how it
was coded. We don't stray from the principle often; the chief exceptions
are the presence of HTML obfuscation tricks that have no conceivable
justification except an intent to deceive spam filters. Even then, we
*usually* just un-obfuscate and score what the end user sees anyway. The
biggest trick we're missing here is not accounting for the fact that
"foreground color approximately equal to background color" hides text the
end user *doesn't* see -- it would be spambayesian not to produce any tokens
for that kind of invisible text.
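
To make "approximately equal" concrete, one possible test is sketched below. This is hypothetical: spambayes does not do this, and the helper names, the restriction to #RRGGBB values, and the per-channel threshold of 16 are all assumptions chosen for illustration.

```python
def parse_hex_color(value):
    """Turn '#RRGGBB' (or 'RRGGBB') into an (r, g, b) tuple of ints."""
    value = value.strip().lstrip("#")
    if len(value) != 6:
        raise ValueError("expected a 6-digit hex color, got %r" % value)
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def is_effectively_invisible(foreground, background, max_channel_diff=16):
    """True if every color channel differs by at most max_channel_diff."""
    fg = parse_hex_color(foreground)
    bg = parse_hex_color(background)
    return all(abs(f - b) <= max_channel_diff for f, b in zip(fg, bg))
```

A tokenizer that tracked the current foreground and background colors could skip token generation for any text where this check returns True, for example is_effectively_invisible("#FEFEFE", "#FFFFFF"). Named colors and CSS would need more work than this sketch attempts.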