[Spambayes] Training on unusual ham - revisited

Sat Feb 11 05:22:44 CET 2006

Seth,

Having recently been a victim of DNSBListing I strongly advise against 
doing this as many of the lists are bogus and even the good ones make 
too many mistakes or are too broad so netblocks get denied access to 
e-mail when it is one rogue hijacked domain that is the source of the 
problem.

What happened is that someone used a stolen credit card to activate a 
domain on the same hosting service I am on. The entire hosting service 
was knocked off the net for 24 hours, and I had e-mail rejected over a 
week later because of a lack of attention to what was happening on the 
part of the receiving domain.

A much better approach would be to look for forged headers in the spam. 
Roughly 80% of spam has a forged HELO line or two. If they were stopped 
at any router with a little checking script we'd all be better off and 
not suffer from being denied access because of some third party f%^&ups 
that we have no control of.
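
To give a rough idea of what such a checking script might look like, here is a minimal sketch. It flags a HELO name that is syntactically implausible or that does not resolve to the connecting address. The function names and the injectable `resolve` parameter are assumptions made for this example and are not part of any MTA. A mismatch is only a hint, since plenty of legitimate hosts announce sloppy HELO names.

```python
import ipaddress
import socket


def default_resolve(hostname):
    """Return the set of IP addresses that hostname resolves to (may be empty)."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except (socket.gaierror, UnicodeError):
        return set()
    return {info[4][0] for info in infos}


def helo_looks_forged(helo_name, peer_ip, resolve=default_resolve):
    """Heuristic: True if the HELO argument is implausible for peer_ip."""
    name = helo_name.strip().rstrip(".")
    # A bracketed address literal such as [192.0.2.7] must match the peer.
    if name.startswith("[") and name.endswith("]"):
        return name[1:-1] != peer_ip
    # A bare IP address is not a valid HELO argument.
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        pass
    # An unqualified name (no dot) cannot be verified, so treat it as suspect.
    if "." not in name:
        return True
    # Otherwise the name should resolve to the address that connected.
    return peer_ip not in resolve(name)
```

Passing a fake `resolve` function makes the heuristic easy to try out without touching live DNS.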

Thanks,

Allen Schaaf
Information Security Analyst
Training & Instructional Designer
Sr. Writer & Documentation Developer
Certified Network Security Analyst and
Intrusion Forensics Investigator - CEH, CHFI
Certified EC-Council Instructor

Security is a lot like democracy - everyone's for it but
few understand that you have to work at it constantly.

Seth Goodman wrote:
> On Thursday, February 09, 2006 7:32 AM -0600, Bob Coe wrote:
> 
>> The difficulty is that there's no way to prune the database, either to
>> adjust the imbalance or to simply decrease the database's size. You
>> have to start again from scratch. The Spambayes establishment doesn't
>> consider this to be much of an issue, since (as Seth points out)
>> Spambayes does a good job of starting from scratch and building an
>> acceptable scoring system after seeing surprisingly little data.
>>
>> This is all fine if you can limit your spam flow to a trickle during
>> this startup period. But if you can't, things can be very unpleasant
>> for a while. As part of an upgrade of my home system, I recently had
>> occasion to install Spambayes from scratch on two accounts (mine and
>> my wife's) that receive a LOT of spam. (My home domain name is a
>> catchy one that attracts spammers and forgers like flowers attract
>> bees.)
> 
> Your cup runneth over.  I feel your pain.
> 
> <OT plea>
> This is an off-topic plea to the few people who have some control over
> the MTA software running at their site.  Please consider using DNSBL's,
> both from responsible third parties and built from local heuristics, in
> addition to existing authentication mechanisms, to cut down on the
> volume of spam that must be processed post-acceptance.  Accepting spam
> for delivery wastes your bandwidth and consumes CPU cycles.  Rejecting
> it, preferably during or right after the envelope phase of SMTP, can
> really reduce your load.  An ounce of prevention ...
> </OT plea>
> 
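For readers who have not seen how a DNSBL is queried, here is a minimal sketch of the usual convention: reverse the IPv4 octets, append the list's zone, and look the name up. An answer in 127.0.0.0/8 means "listed", and a failed lookup means "not listed". The zone `dnsbl.example.org` and the function names are placeholders for illustration only, and a real MTA does this check with its own built-in support rather than a script like this.

```python
import socket


def dnsbl_query_name(ipv4, zone):
    """Build the DNS name to look up: reversed octets followed by the zone."""
    octets = ipv4.split(".")
    valid = len(octets) == 4 and all(
        o.isascii() and o.isdigit() and int(o) < 256 for o in octets
    )
    if not valid:
        raise ValueError("not a dotted-quad IPv4 address: %r" % ipv4)
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ipv4, zone, lookup=socket.gethostbyname):
    """True if the DNSBL zone returns a 127.x.x.x answer for this address."""
    try:
        answer = lookup(dnsbl_query_name(ipv4, zone))
    except socket.gaierror:
        return False  # no such record: the address is not listed
    return answer.startswith("127.")
```

The `lookup` parameter exists only so the logic can be exercised with a stub instead of a live DNS query.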
> 
>> So while Spambayes was in its learning curve, hundreds of spam
>> messages were pouring in and getting sent to our "possible spam"
>> folders. And because all I had to train on was ham, anything that
>> didn't go there went to our inboxes. For two or three days, until
>> Spambayes got its mind right, I had to dig through this chaff and
>> send it to the spam folders manually - not a fun task.
> 
> I'm not a Spambayes developer, so I am speaking only for myself here.  I
> think the problem is more that Spambayes doesn't do anything to
> encourage sensible training schemes.  And for a very good reason: the
> jury is still out on what is the "best" training scheme, or even one
> that is universally acceptable.  It wouldn't be responsible for the
> developers to force one scheme or another on the users, since there is
> no proof that any one particular scheme would work for the majority of
> users.
> 
> That being said, there are some things you can do manually to avoid some
> of this pain.  The first thing to note is that train on all unsures,
> forever, is usually not the best approach.  Aside from building huge
> databases, it tends to produce a trained ham/spam ratio very far from
> unity, as you describe very well below.
> 
> Here's one approach that works for me and builds a smaller database with
> a relatively equal number of ham and spam.
> 
> 
> 1) Initial training.
> 
> a) If you are just starting out from scratch, manually sort the spam in
> your inbox into a separate spam folder.  Make sure you have roughly
> equal numbers of ham and spam for training, even if this means training
> on a relatively small number of messages.  It is not necessary to use a
> large number of messages.  Anywhere from around ten to a few hundred of
> each type is sufficient.  In addition to messages in the Inbox, most
> people have a large amount of saved ham.  Resist the temptation to train
> on a large folder of saved ham without the same number of spam
> available.
> 
> b) When you have around 25 spam in your Spam folder, make sure you have
> the same number of your most recent ham in the Inbox (or ham training
> folder).  If you have more ham messages than spam, temporarily move some of
> the ham to a new folder, then move it back when you're done.  If you
> have fewer ham than spam, temporarily move some saved ham into the Inbox
> (or ham training folder).
> 
> c) I don't recall what the default thresholds are, but I personally use
> 0.80 for spam and 0.05 for ham.
> 
> d) Train on the two folders with equal numbers of ham and spam.
> 
> e) In the Spambayes Manager under the Training tab, uncheck the
> incremental training checkboxes so that moving a message does not
> automatically train on it.
> 
> f) Move all messages in the Unsure folder into either the Inbox or Spam
> folders, as appropriate.  These messages are already trained, so there
> is no purpose keeping them in the Unsure folder.
> 
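The balancing in step 1 amounts to trimming both piles to the size of the smaller one while keeping the most recent messages. A minimal sketch, assuming the messages are held in two plain lists ordered oldest first (the function name and the `limit` parameter are made up for this example and have nothing to do with the Outlook plugin's own code):

```python
def balanced_training_sets(ham, spam, limit=None):
    """Return (ham_subset, spam_subset) of equal size, keeping the newest messages.

    ham and spam are lists ordered oldest first.
    """
    n = min(len(ham), len(spam))
    if limit is not None:
        n = min(n, limit)
    if n == 0:
        # Guard against ham[-0:], which would return the whole list.
        return [], []
    return ham[-n:], spam[-n:]
```

With 100 saved ham and 25 spam, this returns the 25 newest ham and all 25 spam, which is the situation described in step 1b.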
> 
> 2) Until Spambayes is working well (i.e. 5% Unsures and 0.1% false
> positives), try this procedure instead of simply training on all
> messages in the Unsure folder.  When new messages appear in your Unsure
> folder:
> 
> a) Make sure the spam score is displayed in the Inbox, Unsure and Spam
> folders.
> 
> b) Hit the Spambayes button on the Outlook toolbar, select "Filter
> messages" and the "Filter Now" dialog box opens.  Make sure the "Filter
> the following messages" field contains your Inbox (or other ham
> folders), Unsure and Spam folders.  Under "Filter action" select
> "Perform all filter actions".  Under "Restrict the filter to", uncheck
> both boxes.  Hit the "Close" button.
> 
> c) Select the lowest scoring spam message, if any, in the Unsure folder
> and hit the "Delete as Spam" button.  Select the highest scoring ham
> message, if any, in the Unsure folder and hit the "Recover from Spam"
> button.  This trains on the messages as well as moving them.  For this
> to have any effect, these should be messages that are not already
> trained.  Don't worry, if you select a message that is already trained,
> Spambayes will move it but it won't train on it again.  If you
> accidentally train a message in the wrong category, it is very important
> that you select it and train it in the correct category.
> 
> d) Hit the Spambayes button on the Outlook toolbar, select "Filter
> messages" and the "Filter Now" dialog box opens.  Hit the "Start
> Filtering" button.  When it is finished filtering, hit the "Close"
> button.  Some messages may disappear from the Unsure folder and others
> may move into it.
> 
> e) Glance at your Inbox and Spam folders for false positives (ham in the
> spam folder) and false negatives (spam in the ham folder).  Train on any
> of these using the "Delete as Spam" and "Recover from Spam" buttons.
> This should not happen very often.
> 
> f) Go back to step c and repeat this process on each message in the
> Unsure folder that you haven't already trained on.  Occasionally, a
> message will still score as unsure even after you've trained on it, or
> subsequent training may cause a message that previously classified
> correctly to now classify as Unsure.  Don't worry, it will eventually
> classify correctly when you train on other similar messages.
> 
> g) Once in a while, hit the Spambayes button on the Outlook toolbar,
> select "Spambayes Manager".  In the "General" tab, there is a field that
> tells you how many ham and spam are trained.  If the numbers are
> somewhat unequal (more than 2:1 in either direction), train on some
> messages of the type that has too few in the training set.  The easiest
> way to do this is to move the message into the Unsure folder, then hit
> either the "Delete as Spam" or "Recover from Spam" button.  When
> picking additional messages to train on, try to use the lowest scoring
> spam or the highest scoring ham, i.e. the messages that came closest to
> being Unsures.
> 
> h) When Spambayes is working well, go on to step 3.
> 
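The selection rule in step 2c can be written down compactly: among the Unsure messages not yet trained, take the spam with the lowest score and the ham with the highest score. The sketch below assumes a hypothetical `Message` record in which `is_spam` is your own judgment of the message, not the filter's. It only illustrates the choice and does not drive Outlook or Spambayes.

```python
from collections import namedtuple

# Hypothetical record for this example; is_spam is the user's own verdict.
Message = namedtuple("Message", "msg_id score is_spam trained")


def next_training_pair(unsure):
    """Pick the lowest-scoring untrained spam and highest-scoring untrained ham."""
    candidates = [m for m in unsure if not m.trained]
    spam = [m for m in candidates if m.is_spam]
    ham = [m for m in candidates if not m.is_spam]
    worst_spam = min(spam, key=lambda m: m.score, default=None)
    worst_ham = max(ham, key=lambda m: m.score, default=None)
    return worst_spam, worst_ham
```

Either element of the returned pair is `None` when there is no untrained message of that type, which corresponds to the "if any" in step 2c.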
> 
> 3) You know that Spambayes is working well when Unsures are not more
> than 5-10% of your incoming mail flow and you rarely (0.1%) have a ham
> classify as Spam.  Since spam is much more likely to classify as unsure
> than ham, the percentage of your messages that are unsures depends on
> your incoming ham/spam ratio.  Once you are at this level of
> performance, it is probably reasonable to train on any ham that ends up
> in the Unsure folder, but generally not train on Spam in the Unsure
> folder.  That is, only train on Unsure spam that you think the filter
> _really_ should have caught.
> 
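To see why the Unsure percentage depends on the incoming mix, here is a small worked calculation. The per-class rates used below (10% of spam and 1% of ham scoring as Unsure) are invented numbers for illustration, not measurements from Spambayes.

```python
def unsure_fraction(spam_share, unsure_rate_spam, unsure_rate_ham):
    """Expected fraction of all incoming mail that lands in Unsure."""
    if not 0.0 <= spam_share <= 1.0:
        raise ValueError("spam_share must be between 0 and 1")
    return spam_share * unsure_rate_spam + (1.0 - spam_share) * unsure_rate_ham


# Assumed per-class rates: 10% of spam and 1% of ham score as Unsure.
for share in (0.5, 0.9):
    fraction = unsure_fraction(share, 0.10, 0.01)
    print("spam share %.0f%%: %.1f%% unsure" % (share * 100, fraction * 100))
```

Under those assumed rates, a mailbox that is half spam sees about 5.5% Unsures, while one that is 90% spam sees about 9.1%, with no change in how well the filter itself performs.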
> For example, a lot of spam has "word salad" added as hidden text to
> confuse Bayesian filters like Spambayes.  These are either random words
> from a dictionary or passages from news articles or books.  The net
> result of including a lot of random words in a spam is to have it score
> somewhere around 50%.  Spambayes already ignores any words that score
> between 0.4 and 0.6, so a message's score is only the result of words
> that are considered ham or spam words.  It's debatable if you want to
> train on enough messages to have Spambayes correctly ignore "word salad"
> words.  You can train on word salad spam, but if you do, the databases
> will get larger and you will start to see more ham wind up in the Unsure
> folder, thus requiring further training to correct it.
> 
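The effect of ignoring mid-range words can be shown with a few lines of code. This is only a sketch of the idea, assuming a plain dictionary that maps each word to its spam probability. The function name is made up, the 0.1 cutoff is taken from the 0.4 to 0.6 range mentioned above, and the real classifier does considerably more than this.

```python
def significant_tokens(token_probs, min_strength=0.1):
    """Keep only tokens whose spam probability is at least min_strength from 0.5."""
    return {
        token: prob
        for token, prob in token_probs.items()
        if abs(prob - 0.5) >= min_strength
    }


# Made-up probabilities: two strong clues plus two "word salad" words near 0.5.
probs = {"viagra": 0.99, "meeting": 0.03, "harbor": 0.52, "lantern": 0.47}
print(sorted(significant_tokens(probs)))
```

The salad words fall out before scoring, so they only matter once training has pushed them away from 0.5 in one direction or the other.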
> In short, once you get acceptable performance, there is nothing wrong
> with just deleting spam that ends up in the Unsure folder.  No matter
> how much training you do, there will always be some spam that classifies
> as Unsure.  You only need to do further training if overall performance
> declines.  Playing too much with the thresholds is dangerous and risks
> getting false positives, which is the worst possible outcome for a spam
> filter.
> 
> 
>> Another point (I've made it before, but I guess it bears repeating) is
>> that the database imbalance is absolutely inherent in the current
>> implementation of the Spambayes algorithm, at least in the Outlook
>> plugin. Because users set the cutoffs to avoid false positives (you
>> have to if the program is going to be useful), virtually all of
>> Spambayes's mistakes are false negatives. Since mistakes are all you
>> train on after the initial startup, virtually all new entries into the
>> database are spam.
> 
> That is inherent, as you say.  See step 2g, above, to correct this
> problem, and step 3, above, to avoid it.
> 
> This property is actually one of the more desirable features of
> Spambayes.  Once trained, it has an extremely low false positive
> rate.  That is, it is rare that a ham is classified as spam.  This is
> the result of the two thresholds being a different distance from 0.5 and
> a number of heuristics that proved to help.  The natural result of this
> is that virtually all the unsures are spam.  That is an advantage for
> the user, as you don't have to be as vigilant about looking at the
> Unsures, and even less so with the Spam folder, since they rarely
> contain ham.
> 
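The three-way split produced by the two thresholds looks like this in code. It is a minimal sketch using the 0.80 and 0.05 cutoffs from step 1c as defaults. How a score exactly equal to a cutoff is treated is an assumption of this example, not something taken from the Spambayes source.

```python
def classify(score, ham_cutoff=0.05, spam_cutoff=0.80):
    """Map a score in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if not ham_cutoff < spam_cutoff:
        raise ValueError("ham_cutoff must be below spam_cutoff")
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:  # boundary handling is an assumption
        return "ham"
    return "unsure"
```

Everything between the two cutoffs is Unsure, so moving either cutoff trades Unsures against misclassifications on that side.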
> 
>> The better job Spambayes does, the worse the imbalance becomes.
>> Note that the ham/spam ratio of incoming messages affects only the
>> speed with which this effect takes hold, not the eventual outcome.
>> If you use Spambayes correctly, and use it long enough, your
>> database *will* achieve a highly distorted ham/spam balance.
> 
> That's only if you define training on every unsure as using Spambayes
> correctly.  I disagree on that particular point, though the operating
> instructions don't say this.  Once Spambayes is operating well, you
> should probably not train on all the spam in the Unsure folder.
> 
> 
>> If that degrades performance, and many believe that it does, then
>> it's a problem that has yet to be solved.
> 
> I agree with this.  I think it would be a good idea, for example, if the
> initial training tab of Spambayes had an experimental option for
> training on an equal number of ham and spam, using the smaller of the
> number of messages in the indicated ham and spam folders.  I also think
> it would be a good idea if Spambayes had an experimental option to pop
> up a warning if the numbers of trained ham and spam were different by
> more than a user-defined ratio (which could be 1.5:1 as default) and
> suggest what the user should do to correct it.  Finally, unless
> Spambayes implements some form of pruning old messages from the
> database, there should be something in the instructions telling users
> not to keep training on all Unsures, once satisfactory performance is
> achieved.  This will cause the database to grow without bound and
> probably without improvement to performance.  A step in this direction
> might be to have the "train on move to folder" option off by default,
> with a warning text box appearing if you turn it on.  The warning would
> explain what will happen if you keep training on unsure spam
> indefinitely, and how to avoid that.
> 
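The proposed imbalance warning is simple enough to sketch. The function below is hypothetical and not an existing Spambayes feature. It takes the trained ham and spam counts and returns a message when they differ by more than a user-defined ratio, using the 1.5:1 default suggested above.

```python
def imbalance_warning(n_ham, n_spam, max_ratio=1.5):
    """Return a warning string if the trained counts are too unequal, else None."""
    if n_ham == 0 and n_spam == 0:
        return None  # nothing trained yet, nothing to compare
    low, high = sorted((n_ham, n_spam))
    if low > 0 and high / low <= max_ratio:
        return None
    short = "ham" if n_ham < n_spam else "spam"
    return "Trained %d ham and %d spam; consider training on more %s." % (
        n_ham,
        n_spam,
        short,
    )
```

A ratio of exactly 1.5:1 is treated as acceptable here. A database with only one type trained always triggers the warning.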
> --
> Seth Goodman