[Spambayes] More on 'Spammer Attempts to Circumvent Bayesian Filter'

Sat Jul 17 04:33:44 CEST 2004

Well, I might as well ask about one more issue, while I'm thinking about
it.

I more or less understand how SpamBayes can work with text, but I don't
understand how it works with messages that are primarily html, and I
don't understand why it isn't easily fooled by text narratives in spam
emails.

In May, this group had a discussion of the topic, "Spammer Attempts to
Circumvent Bayesian Filter."  Len Hartley said, "Since February of
2004, more and more spammers are including what you might call short
stories or lists of innocent sounding words at the bottom of their
Email. Because of this method, I have recently found that more than half
of Spam is rated not higher than suspect."

Experienced users pooh-poohed his experience as unrepresentative and
said his database probably contained mistakes and that he probably
needed to re-train.

Well, I receive 1400 or 1500 email messages a day, and I'm having a
similar problem.  It's not causing half my Spam to misclassify, but it
is the cause of the vast majority of my "unsures," which primarily are
from spam messages of two types:

- messages that have a lot of logical, readable -- often well-written --
text or jokes that are much longer than the commercial portion, and

- messages that have a small amount of random or narrative text and that
are primarily html.

- A subset of this is html ads that contain many random words or
gibberish instead of a coherent sentence or paragraph.

I know this has been discussed, but I get the sense that the answer
usually is something like, "Train on them, and SpamBayes will get the
idea."

I don't think it's that simple.

I've trained on dozens of the second type, in particular, yet many
almost identical spams continue to show spam probabilities of 6-20
percent.

I get the sense that "legitimate-appearing" text isn't easily caught by
SpamBayes, and that primarily html messages don't have enough consistent
clues to raise their spam probability.  I've appended an example at the
end of this message.
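
To make the concern concrete, here is a toy sketch of how a pile of
innocent-looking words could pull a combined score down.  It assumes the
chi-squared (Fisher-style) combining that SpamBayes is generally
described as using; the function names and the per-word probabilities
(0.99 for "spammy" words, 0.01 for "hammy" ones) are made up for
illustration and are not taken from SpamBayes itself or from any real
training database.

```python
import math

def chi2q(x2, v):
    """Chance that a chi-squared value with v (even) degrees of
    freedom is at least x2."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combined_score(probs):
    """Combine per-word spam probabilities into one score in [0, 1].
    Illustrative only; the inputs are hypothetical word probabilities."""
    n = len(probs)
    if n == 0:
        return 0.5
    # Evidence for spam: unlikely under "all ham" if many words are spammy.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Evidence for ham: unlikely under "all spam" if many words are hammy.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0

# Hypothetical messages: 10 strongly spammy words, then the same
# words padded with more and more innocent-sounding ones.
pitch_only = [0.99] * 10
pitch_plus_story = [0.99] * 10 + [0.01] * 10
pitch_plus_long_story = [0.99] * 10 + [0.01] * 60

for name, words in [("pitch only", pitch_only),
                    ("pitch + short story", pitch_plus_story),
                    ("pitch + long story", pitch_plus_long_story)]:
    print(name, round(combined_score(words), 3))
```

Under these made-up numbers the pitch alone scores as clear spam, an
equal amount of innocent text lands in the middle ("unsure"), and a
long story drags the score toward ham.  A real classifier only counts
words it has actually seen in training, so whether this happens in
practice depends on how the padding words score in a given database.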

Perhaps someone could explain to me (in layman's terms!) how SpamBayes
classifies in these two instances.  I've looked at the message
classification screen on several of these messages, but I still don't
understand.

I am concerned that training on spam messages with a small commercial
portion and a relatively large amount of well-written text will "mess up" the
training on other types of messages; yet, I get too many of these to
ignore.

Like Len Hartley, I get the impression from my own large mail stream
that SpamBayes simply is less effective with these newer types of
solicitations.

Thoughts?

Thanks.

Rich Barger
Kansas City

Text like this, methinks, causes SpamBayes to misclassify spam messages
as ham:

do a lot of management training each year for the Circle K Corporation,
a national chain of convenience stores Among the topics we address in
our seminars is the retention of quality employees -- a real challenge
to managers when you consider the pay scale in the service industry.
During these discussions, I ask the participants, "What has caused you
to stay long enough to become a manager?" Some time back a new manager
took the question and slowly, with her voice almost breaking, said, "It
was a $19 baseball glove."
And who let me down? Me -- I am the one that rationalized why they never
called me, or sent me flowers, or sent me love notes, or just plain put
in as much effort as I did. I settled and that hurt me in the end.
sonoridad1sulfhi`drico03especial,incitamiento lebrasta.