[spambayes-bugs] [ spambayes-Feature Requests-943116 ] White list for domains/email addresses

Fri Apr 30 10:04:15 EDT 2004

Feature Requests item #943116, was opened at 2004-04-27 09:04
Message generated for change (Comment added) made by darklaser
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=943116&group_id=61702

Category: pop3proxy
Group: None
Status: Open
Priority: 5
Submitted By: DarkLaser (darklaser)
Assigned to: Nobody/Anonymous (nobody)
Summary: White list for domains/email addresses

Initial Comment:
      A nice feature would be a domain/email 
address white list where you could specify email 
addresses that should be marked as ham without 
regard to content.  It would also be nice to be able to 
say that anything from the domain belonging to the company 
I work for should be marked as ham regardless of 
content.
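
As a rough illustration of the request, a sender white list check could look something like the Python sketch below. This is not SpamBayes code: `is_whitelisted`, the two sets, and the example.com/example.org addresses are hypothetical names chosen for the example. It only inspects the From header, which is easy to forge.

```python
from email.utils import parseaddr

def is_whitelisted(from_header, addresses, domains):
    """Return True if the sender address or its domain is on the white list.

    `addresses` and `domains` are sets of lowercase strings (hypothetical
    configuration, not an existing SpamBayes option).
    """
    _, addr = parseaddr(from_header)      # "Name <a@b.c>" -> "a@b.c"
    addr = addr.lower()
    if "@" not in addr:
        return False                      # unparsable or empty sender
    domain = addr.rsplit("@", 1)[1]
    return addr in addresses or domain in domains

# Hypothetical white list entries
ADDRESSES = {"friend@example.org"}
DOMAINS = {"example.com"}

print(is_whitelisted("Brian <brian@example.com>", ADDRESSES, DOMAINS))
```

A message that passes such a check would be treated as ham before the classifier is consulted at all.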

Anyway, my 2bits.

Thanks,
David

----------------------------------------------------------------------

>Comment By: DarkLaser (darklaser)
Date: 2004-04-30 07:04

Message:
Logged In: YES 
user_id=1030399

Tony, of course there is something wrong with the training; 
we have already established that.  That’s why in my last post I 
indicated I was going to start it over (and not use any initial 
training).

I read and understood what it said in the FAQ.  We can close 
this in another day or so, after I've gotten Kenny the info he 
wants so that he can discover why SpamBayes is 
providing such poor results.  The reason I thought this 
feature request should remain open was that I figured 
someone else might see it, decide it's a good idea, build it, 
and present it to the project leader.  

Yes, I already have message rules in Outlook Express to put 
email from specific individuals into specific folders, but the 
problem a false positive creates is that I have to drag each 
email out of Outlook Express, edit it to remove 'spam' from the 
subject, then put it back in Outlook Express.  Quite time 
consuming.  You are right: under those circumstances I would 
be better off without SpamBayes.  In the process of 
requesting the white list feature (which would fix the 
majority of my problem), I learned from Kenny that 
SpamBayes was not performing as it should (based on my 
training), so I should be able to improve my results 
after starting over on training.

Now on to what I’ve got.  I asked a coworker to send me an 
email.  I expected it would be a false positive and it was.  To 
provide some privacy for the company I work for, I replaced 
the domain and IP info with text enclosed by [] explaining 
what was there.  Now for the data:
--------------------------------
Original clues for: spam,quarterly review today? (44) 
Word                                 Probability  Times in ham  Times in spam
*H*                                  0.01         -             -
*S*                                  1.0          -             -
to:name:david [mylastname]           0.0          156           0
from:addr:[ourdomainname]            0.02         239           11
received:[ouripaddress]              0.09         7             1
from:addr:brian                      0.09         2             0
from:name:brian [coworkerslastname]  0.09         2             0
brian                                0.27         10            7
would                                0.65         71            258
have                                 0.65         215           791
you.                                 0.67         38            155
going                                0.71         19            91
time                                 0.72         73            365
received:[ourgatewayipclassa]        0.74         26            144
received:[ourgatewayipclassa+b]      0.74         26            144
received:[ourgatewayipclassa+b+c]    0.74         26            144
received:[ourgatewayipclassa+b+c+d]  0.74         26            144
received:unknown                     0.74         26            144
your                                 0.77         269           1745
like                                 0.77         56            371
some                                 0.78         29            199
header:Message-ID:1                  0.78         375           2705
subject:                             0.79         384           2909
to:2**0                              0.8          369           3003
with                                 0.81         116           948
header:Date:1                        0.81         393           3238
header:Return-Path:1                 0.81         393           3292
header:To:1                          0.81         393           3306
header:From:1                        0.81         393           3321
header:Subject:1                     0.81         393           3336
header:MIME-Version:1                0.82         344           3047
review                               0.85         3             34
to:addr:[ourdomainname]              0.89         193           3248
quarterly                            0.91         0             2
subject:review                       0.91         0             2
today                                0.92         10            244
to:addr:david                        0.94         84            2596
done,                                0.95         0             4
content-type:multipart/alternative   0.95         53            2211
content-type:text/html               0.96         56            2444
over                                 0.97         6             350
subject:today                        0.98         0             9
spend                                0.99         0             42
subject:?                            1.0          0             114
--------------------------------
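
For readers of the clue list: the *H* and *S* lines are the overall ham and spam indicators that the per-word probabilities are combined into. A minimal sketch of chi-squared combining is given below, on the assumption that this is the combining scheme in use. `chi2q` and `combine` are names made up for this example, and the clamp `eps` is an added assumption so that clues displayed as 0.0 or 1.0 do not cause log(0). It is a simplified illustration, not the project's actual code.

```python
import math

def chi2q(x2, v):
    """Probability that a chi-squared variable with even v degrees of
    freedom exceeds x2."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs, eps=1e-6):
    """Combine per-word spam probabilities into (H, S, score)."""
    clamped = [min(max(p, eps), 1.0 - eps) for p in probs]
    n = len(clamped)
    ln_ham = sum(math.log(p) for p in clamped)         # evidence against ham
    ln_spam = sum(math.log(1.0 - p) for p in clamped)  # evidence against spam
    S = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    H = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)
    return H, S, (S - H + 1.0) / 2.0

# A handful of probabilities as displayed in the list above
h, s, score = combine([0.02, 0.09, 0.94, 0.98, 0.99])
print(h, s, score)
```

Because every clue contributes, a long run of moderately spammy words (0.7 to 0.8) can outweigh a few strongly hammy ones.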
Kenny, let me know if you would like to see any more before 
I wipe out my training and start over.

David

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2004-04-29 15:04

Message:
Logged In: YES 
user_id=552329

Oh, one other thing - is there any reason that you can't
just use the rules in your mail client (Outlook Express?) to
implement whitelisting yourself?  It's certainly the simple
solution and works for most people.

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2004-04-29 15:03

Message:
Logged In: YES 
user_id=552329

As the FAQ says, we realise that some people want
whitelisting.  However, none of the developers do, and none
have any interest in putting in the considerable effort of
developing whitelisting capabilities.  As such, this simply
won't be added until someone comes along with code in hand.
The developers aren't refusing to add it (as long as it
is off by default), but there just isn't any incentive for us
to add it.  Anyone desperate for the functionality always
has the option to (a) write code or (b) pay for a product
that does have whitelisting (InBoxer, for example).

In any case, as Kenny said, 85% false positives is
unbelievably poor.  You'd be better off not using spambayes
at all!  The fp rate should be less than 5% - typically
around 1% or lower.  Something is clearly wrong with your
training.

I'd still rather have this closed.  If I thought that people
would actually see it and not open a new request, that would
be different, but that doesn't happen.  If it stays open then
we have two places (here and the FAQ) where information
collects.  Whitelisting is brought up so often that a
tracker really isn't necessary, IMO.

----------------------------------------------------------------------

Comment By: DarkLaser (darklaser)
Date: 2004-04-29 12:31

Message:
Logged In: YES 
user_id=1030399

Yes, I had stored up spam from the last 8 months in case I 
came across a Bayesian filter I wanted to train with it, but it 
all came from this one account.  The nearly 5,000 valid emails 
are all my valid email for this account over the last 5 years.  
So I would have thought it should have worked very well.

Perhaps SpamBayes learns from initial training sets differently 
than from email it processes as it comes in.  So perhaps the 
solution is to remove the past training and just start training 
from scratch.  

Yes, I'll wait for a few more false positives, and I'll post them 
before wiping it out and starting over.

David

----------------------------------------------------------------------

Comment By: Kenny Pitt (kpitt)
Date: 2004-04-29 11:49

Message:
Logged In: YES 
user_id=859086

It's very unusual to hear from someone who is getting 
accuracy this poor.  Could you upload a copy of the spam 
clues for a false positive message (before training on it)?  
Seeing why SpamBayes thought the way it did when it first 
processed the message would help a lot.

I notice that you have far more training data than you have 
messages that have been processed by SpamBayes, so I 
assume you had a large initial training set.  Is it possible that 
your training data was not representative of the messages 
that you are currently receiving?

Although there is no proven best training strategy, in general 
SpamBayes seems to perform best if you initially train with 
only 5 or 10 of each type of message and then train it up on 
your current message stream instead of training it on lots of 
outdated messages.  You'll also find that SpamBayes is more 
responsive to training of new messages when you have fewer 
messages in the training database.  With the large number of 
messages that you have, it will take a *LOT* of training to 
overcome existing clues.
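
To make that last point concrete, here is a small sketch of how a single word's probability might be computed from the training counts. It assumes a Robinson-style estimate with an unknown-word strength of 0.45 and an unknown-word probability of 0.5 (assumed values, not confirmed in this thread). `token_prob` is a hypothetical helper, not the project's API. The counts are the ones reported elsewhere in this thread: 4939 ham and 9728 spam trained, with the word "today" seen in 10 ham and 244 spam.

```python
def token_prob(ham_count, spam_count, nham, nspam, s=0.45, x=0.5):
    """Estimate the spam probability of one word from training counts.

    s and x are the assumed unknown-word strength and probability.
    """
    n = ham_count + spam_count
    if n == 0:
        return x                      # never seen: fall back to the prior
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    p = spam_ratio / (ham_ratio + spam_ratio)
    return (s * x + n * p) / (s + n)  # pull p towards x when n is small

before = token_prob(10, 244, 4939, 9728)
after = token_prob(11, 244, 4940, 9728)   # one more ham containing "today"
print(round(before, 4), round(after, 4))
```

With these counts one extra ham message moves the estimate by less than 0.01, which is the sense in which a large database is slow to respond to new training.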

----------------------------------------------------------------------

Comment By: DarkLaser (darklaser)
Date: 2004-04-29 06:46

Message:
Logged In: YES 
user_id=1030399

Anadelonbrin, thanks for the url.  I had looked for something 
about white lists, but couldn’t find it.  

I maintain that a white list would be useful.  Perhaps not to 
some, but very much so for others.  I have received maybe 1 
or 2 spam claiming to be from someone on the domain for the 
company I work for in the last year, and never have received 
any claiming to be from any of my 4 personal domains.  
However, the current false positive ratio is horrible: 85% 
of my good email is falsely being marked as spam.  Look at 
the number of emails I have trained; with that many trained, I 
should be getting near perfect results.
-------------------------------------------------
Total emails trained: Spam: 9728 Ham: 4939
SpamBayes has processed 546 messages - 4 (1%) good, 538 
(99%) spam and 4 (0%) unsure.
29 messages were manually classified as good (23 were false 
positives).
517 messages were manually classified as spam (0 were false 
negatives).
2 unsure messages were manually identified as good, and 2 as 
spam.
-------------------------------------------------
Ignoring the unsure messages, out of 27 good emails, 4 were 
actually marked as good and 23 as spam.  That is ridiculous.  
Perhaps I need to change something in my settings, but the 
majority of those good emails are from this one domain, so in 
my case a white list would make a world of difference.  If one 
or two spam a year get through because of a white list, no 
biggie, I can handle that.  That’s a lot easier than having to 
go manually remove the word 'spam,' from the subject of 85% 
of my email.  
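
The 85% figure follows directly from the counts quoted above. A quick check in Python (the variable names are illustrative only):

```python
marked_good = 4    # good messages SpamBayes marked as good
marked_spam = 23   # good messages SpamBayes marked as spam (false positives)

good_total = marked_good + marked_spam    # unsure messages are ignored
fp_rate = marked_spam / good_total

print(f"{marked_spam} of {good_total} good messages marked as spam: {fp_rate:.0%}")
```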

I don’t have any experience with Python (I’m a Perl man 
myself), otherwise I would look at building a white list to send 
to the project manager.  Anyway, I still think this item should 
remain on the wish list.

Thanks,
David

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2004-04-28 18:45

Message:
Logged In: YES 
user_id=552329

Please see FAQ 6.6:

<http://spambayes.org/faq.html#why-don-t-you-add-whitelisting-blacklisting-to-spambayes>

----------------------------------------------------------------------
