[Spambayes] how spambayes handles image-only spams
rmalayter at bai.org
Thu Sep 4 20:29:05 EDT 2003
From: Tim Peters [mailto:tim.one at comcast.net]
>> At most of the U.S.-based companies I've dealt with, HTML mail is
>> widely used.
> What percentage of them have you dealt with? I haven't bumped into any
> such yet.
I've dealt with a very small percentage, of course. But enough for my
opinion to be statistically valid, I think. Our company exchanges mail
with the largest US banks and technology vendors for the financial
services industry (HP, IBM, NCR, etc.). A very large percentage of their
mail seems to originate with Outlook/Exchange, and the rest seems to
come from Notes/Domino shops. The only "pure SMTP" mail I get seems to
come from academic institutions or staff at our ISP.
>> I'd say 80% or more of business email I get - most of it from
>> technical people - is HTML mail, simply because most people leave
>> their Outlook or Notes mail client in its default configuration.
> I guess our technical contacts don't overlap,
> then. The only HTML email I ever get is from
The vast majority of "technical" people I correspond with are
implementation consultants and programmers. These people don't often
subscribe to mailing lists, don't think much about spam or system
administration, nor do they seem to care much about security. They just
want to be able to use italics and bold text in their email.
> Anyway, spambayes does a ton of work already to
> avoid penalizing HTML email just for using HTML.
> If you've got reason to suspect that isn't working
> as hoped for in your data, that's something we should
> dig into.
No, I'm seeing a 96% capture rate, with no false positives, which isn't
too shabby. But I've heard of better. I was just guessing that if all HTML
tags were tokenized, I would see better performance against image-only
spams (which seem to get through my filter more than anything else).
But as you mentioned, the "everything is a token" approach hasn't worked
all that well with SpamBayes and HTML. Perhaps it would fare much better
with a "sliding window" tokenizer such as that used by CRM-114. I'm
betting CRM-114 would hammer "img src http" and other image-related tag
sequences with my corpus, while leaving <P>, <BR> and the like alone.
Has anyone looked at implementing a sparse windowing system similar to
CRM-114's in SpamBayes? Intuitively, it seems like this would do much
better with HTML tags as well as mail header information, at the expense
of CPU and DB storage.
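
To make the question concrete, here is a rough Python sketch of what I
mean by sparse windowing. It is only in the spirit of CRM-114, not its
actual algorithm or hashing, and nothing here is SpamBayes code: the
names simple_tokens and sparse_window_features, the "<skip>" placeholder,
and the crude regex tokenizer are all made up for illustration. For each
token it emits every combination of the preceding window-1 tokens, with
dropped positions replaced by the placeholder.

```python
import re
from itertools import product


def simple_tokens(text):
    # Hypothetical stand-in tokenizer: lowercase runs of letters/digits.
    # This is NOT how SpamBayes tokenizes mail.
    return re.findall(r"[a-z0-9]+", text.lower())


def sparse_window_features(tokens, window=5, skip="<skip>"):
    """Yield sparse phrase features ending at each token.

    For every position, each of the up to window-1 preceding tokens is
    either kept or replaced by the skip placeholder, so one token can
    produce up to 2 ** (window - 1) features.
    """
    for i, current in enumerate(tokens):
        older = tokens[max(0, i - window + 1):i]
        for mask in product((False, True), repeat=len(older)):
            parts = [tok if keep else skip for tok, keep in zip(older, mask)]
            parts.append(current)
            yield " ".join(parts)


if __name__ == "__main__":
    toks = simple_tokens('<img src="http://example.com/a.gif">')
    for feature in sparse_window_features(toks, window=3):
        print(feature)
```

With window=3, the features ending at "http" include "img src http" as
well as "img <skip> http", which is the kind of image-tag sequence I'd
expect to score strongly. The cost is visible too: a window of 5 means
up to 16 features per token instead of 1, hence the extra CPU and DB
storage.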