[Spambayes] how spambayes handles image-only spams
tim.one at comcast.net
Thu Sep 4 19:43:32 EDT 2003
The damage to email services done by the Sobig worm is making discussion
just about impossible; I'll try to paste together several msgs I've seen in
this thread, although I may well be missing some.
>>>> The other side to this is that *any* evidence of HTML
>>>> is a strong spam indicator in most corpora... virtually
>>>> nothing using HTML could avoid being classified as spam...
>>> This doesn't seem right to me, at least on an intuitive level.
>>> We're an Outlook 2003 shop, and we've used Windows Group Policies
>>> to force all internal users to create HTML messages instead of
>>> Microsoft RTF format.
That's why I said "most corpora" -- you're definitely an exception on this
mailing list. In fact, you're the only person to date who has owned up
<wink> to having a high number of HTML ham!
That's OK, though -- spambayes doesn't penalize HTML just for being HTML.
Many other systems do.
>>> So a great big heaping pile of my non-spam corpus would be messages
>>> that contain <P> <BR> and other "innocent" HTML tags. Shouldn't the
>>> statistical nature of SpamBayes give these tokens something near
>>> 0.5 as a score, since they appear frequently in both corpora?
Yes, *if* spambayes produced such tokens. But it doesn't -- spambayes
strips almost all evidence of HTML. Else for the (so far) vast majority of
people who get very little HTML ham, their HTML ham would have virtually no
chance of getting correctly classified as ham.
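
To make the arithmetic concrete, here is a minimal sketch. It is not spambayes's actual tokenizer or scoring code; the function names, the `keep_html` switch, and the plain frequency-ratio formula are all assumptions made for illustration. It shows that a tag seen equally often in ham and spam would score 0.5, that the same tag scores 1.0 for someone whose ham is all plain text, and that stripping tags removes the clue entirely.

```python
import re

# Hypothetical helpers for illustration only; not the spambayes API.
TAG_RE = re.compile(r"<[^>]*>")
TOKEN_RE = re.compile(r"<[^>]*>|[^\s<>]+")

def tokenize(text, keep_html=False):
    """Return the set of tokens in a message, optionally keeping HTML tags."""
    if not keep_html:
        text = TAG_RE.sub(" ", text)  # drop the evidence of HTML
    return set(TOKEN_RE.findall(text.lower()))

def token_spamprob(token, ham_msgs, spam_msgs, keep_html=False):
    """Simplified ratio: spam frequency / (spam frequency + ham frequency)."""
    ham_ratio = sum(token in tokenize(m, keep_html) for m in ham_msgs) / len(ham_msgs)
    spam_ratio = sum(token in tokenize(m, keep_html) for m in spam_msgs) / len(spam_msgs)
    if ham_ratio + spam_ratio == 0:
        return 0.5  # never seen: treat as neutral
    return spam_ratio / (ham_ratio + spam_ratio)

spam = ["<p>buy now</p>", "<p>cheap pills</p>"]
html_shop_ham = ["<p>meeting at 3</p>", "<p>budget attached</p>"]
plain_text_ham = ["meeting at 3", "budget attached"]

print(token_spamprob("<p>", html_shop_ham, spam, keep_html=True))   # tag in both corpora
print(token_spamprob("<p>", plain_text_ham, spam, keep_html=True))  # tag only in spam
print(token_spamprob("<p>", plain_text_ham, spam))                  # tags stripped
```

With tags kept, the rare HTML ham in a mostly plain-text mailbox would start with a strong spam clue against it; with tags stripped, that clue never gets generated.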
>> No, my corpora agree with Tim Peters - spammers use HTML far more
>> often than "normal" users.
> In my opinion, this is simply untrue, at least for the American
> corporate user.
Your opinion about Bill's corpora is irrelevant <wink> -- I have no doubt
that Bill characterizes the data he's seen correctly. Or that you
characterize yours correctly. You two simply have very different kinds of
data (and I suspect, but don't know, that the kinds of data Bill sees are
much closer to the kinds most spambayes users to date see).
>> Statistically speaking, HTML mail is either from a spammer or from
>> a clueless git, and in either case can usually be delayed without
>> penalty or discarded outright.
The spambayes project has never agreed with that, largely because both my
sisters delight in creating elaborate HTML email, and I really enjoy it
<wink>. I'm also active (more off than on lately, alas) trying to help
newbies with various tech issues, and, like my sisters, they naturally
gravitate toward fancy email features. I've come to believe that we
7-bit-ASCII-slinging, plain-text, Courier-loving, "email is for the exchange
of information" types are destined to become an insignificant minority of
email users. The only thing holding back our extinction is that, so far, we
still run everything <wink>.
>> Similarly, base-64 encodes are almost _always_ trash.
In my data too, although I'd switch the emphasis to _almost_.
> At most of the U.S.-based companies I've dealt with, HTML mail is
> widely used.
What percentage of them have you dealt with? I haven't bumped into my first
> I remember a Gartner survey that said something like 90% of corporate
> desktops have either Outlook/Exchange or Notes/Domino for messaging.
> Recent versions of both of these mail clients create HTML messages by
> default rather than plain text or some proprietary rich text format.
Cool! So things are changing.
> I'd say 80% or more of business email I get - most of it from
> technical people - is HTML mail, simply because most people leave
> their Outlook or Notes mail client in its default configuration.
I guess our technical contacts don't overlap, then. The only HTML email I
ever get is from spammers, non-personal marketing collateral from companies
I do business with, corporate newsletters, relatives, and newbies. I get
about 600 emails a day, more of it from non-newbies asking tech questions
than I could possibly reply to even if that were my full-time job, and
they're never in HTML format.
Anyway, spambayes does a ton of work already to avoid penalizing HTML email
just for using HTML. If you've got reason to suspect that isn't working as
hoped for in your data, that's something we should dig into. The kinds of
HTML ham I do get are classified as Unsure more often than plain-text
read-alikes, but not really because they use HTML -- it's almost always
because they embed URLs pointing to .jpg or .gif (etc) files out on the net,
and those file extensions are tokenized, and turn out to be spam clues in my
data. It takes "more hammy content than usual" to overcome that.
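
A rough sketch of how an embedded image URL can turn into a clue follows. This is not the real spambayes tokenizer: the `url-ext:` token prefix, the function name, and the URL regex are assumptions for illustration only.

```python
import posixpath
import re
from urllib.parse import urlparse

# Hypothetical URL matcher; real-world URL extraction is messier than this.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

def url_extension_tokens(text):
    """Emit one token per URL that ends in a file extension (e.g. .jpg, .gif)."""
    tokens = []
    for url in URL_RE.findall(text):
        path = urlparse(url).path
        ext = posixpath.splitext(path)[1].lower().lstrip(".")
        if ext:
            tokens.append("url-ext:" + ext)  # made-up token format
    return tokens

msg = '<img src="http://example.com/pics/cat.JPG"> see http://example.com/'
print(url_extension_tokens(msg))
```

If tokens like these show up mostly in trained spam, an HTML ham that links to remote images inherits them, and needs enough hammy tokens elsewhere to outweigh them.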