[spambayes-bugs] [ spambayes-Feature Requests-943116 ] White list
for domains/email addresses
SourceForge.net
noreply at sourceforge.net
Fri Apr 30 10:04:15 EDT 2004
Feature Requests item #943116, was opened at 2004-04-27 09:04
Message generated for change (Comment added) made by darklaser
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=943116&group_id=61702
Category: pop3proxy
Group: None
Status: Open
Priority: 5
Submitted By: DarkLaser (darklaser)
Assigned to: Nobody/Anonymous (nobody)
Summary: White list for domains/email addresses
Initial Comment:
A nice feature would be to have a domain/email
address white list where you could specify email
addresses which should be marked as ham without
regard to content. It would also be nice to be able to
say anything from the domain belonging to the company
I work for should also be marked as ham regardless of
content.
Anyway, my 2bits.
Thanks,
David
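
For illustration, the requested check could be sketched as below. This is a minimal sketch, not SpamBayes code: the function name `is_whitelisted` and the `example.com` entries are hypothetical, and it assumes the whitelist is matched against the message's From header only.

```python
from email.utils import parseaddr

# Hypothetical whitelist entries, for illustration only.
WHITELISTED_ADDRESSES = {"brian@example.com"}
WHITELISTED_DOMAINS = {"example.com"}


def is_whitelisted(from_header,
                   addresses=WHITELISTED_ADDRESSES,
                   domains=WHITELISTED_DOMAINS):
    """Return True if the From header's address or domain is whitelisted."""
    # parseaddr splits 'Name <user@host>' into (name, address).
    _name, address = parseaddr(from_header)
    address = address.lower()
    if "@" not in address:
        # Missing or malformed address: never treat it as whitelisted.
        return False
    domain = address.rsplit("@", 1)[1]
    return address in addresses or domain in domains
```

A message for which this returns True would be marked as ham without scoring its content. The From header is easy to forge, so any spam that claims a whitelisted sender would get through.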
----------------------------------------------------------------------
>Comment By: DarkLaser (darklaser)
Date: 2004-04-30 07:04
Message:
Logged In: YES
user_id=1030399
Tony, of course there is something wrong with the training;
we have already established that. That's why in my last post I
indicated I was going to start over (and not use any initial
training).
I read and understood what it said in the FAQ. We can close
this in another day or so, after I've gotten Kenny the info he
wants so that he can discover why SpamBayes is
providing such poor results. The reason I thought this
feature request should remain open was that I figured
someone else might see it, decide it's a good idea, build it,
and present it to the project leader.
Yes, I already have message rules in Outlook Express to put
email from specific individuals into specific folders, but the
problem a false positive creates is that I have to drag each
email out of Outlook Express, edit it to remove 'spam' from the
subject, then put it back in Outlook Express. Quite time
consuming. You are right: under those circumstances I would
be better off without SpamBayes. In the process of
requesting the white list feature (which would fix the
majority of my problem), I learned from Kenny that
SpamBayes was not performing as it should (based on my
training), so I should be able to improve my results
after starting over on training.
Now on to what I've got. I asked a coworker to send me an
email. I expected it would be a false positive and it was. To
provide some privacy for the company I work for, I replaced
the domain and IP info with text enclosed by [] explaining
what was there. Now for the data:
--------------------------------
Original clues for: spam,quarterly review today? (44)
Word                                  Probability  Times in ham  Times in spam
*H*                                   0.01         -             -
*S*                                   1.0          -             -
to:name:david [mylastname]            0.0          156           0
from:addr:[ourdomainname]             0.02         239           11
received:[ouripaddress]               0.09         7             1
from:addr:brian                       0.09         2             0
from:name:brian [coworkerslastname]   0.09         2             0
brian                                 0.27         10            7
would                                 0.65         71            258
have                                  0.65         215           791
you.                                  0.67         38            155
going                                 0.71         19            91
time                                  0.72         73            365
received:[ourgatewayipclassa]         0.74         26            144
received:[ourgatewayipclassa+b]       0.74         26            144
received:[ourgatewayipclassa+b+c]     0.74         26            144
received:[ourgatewayipclassa+b+c+d]   0.74         26            144
received:unknown                      0.74         26            144
your                                  0.77         269           1745
like                                  0.77         56            371
some                                  0.78         29            199
header:Message-ID:1                   0.78         375           2705
subject:                              0.79         384           2909
to:2**0                               0.8          369           3003
with                                  0.81         116           948
header:Date:1                         0.81         393           3238
header:Return-Path:1                  0.81         393           3292
header:To:1                           0.81         393           3306
header:From:1                         0.81         393           3321
header:Subject:1                      0.81         393           3336
header:MIME-Version:1                 0.82         344           3047
review                                0.85         3             34
to:addr:[ourdomainname]               0.89         193           3248
quarterly                             0.91         0             2
subject:review                        0.91         0             2
today                                 0.92         10            244
to:addr:david                         0.94         84            2596
done,                                 0.95         0             4
content-type:multipart/alternative    0.95         53            2211
content-type:text/html                0.96         56            2444
over                                  0.97         6             350
subject:today                         0.98         0             9
spend                                 0.99         0             42
subject:?                             1.0          0             114
--------------------------------
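
The *H* and *S* rows in the clue listing summarise how strongly the per-word probabilities, taken together, point to ham and to spam. As far as I understand the SpamBayes classifier, it combines the clues with a chi-squared (Fisher-style) test. The following is a rough sketch of that idea, not the project's actual code: the names `chi2q` and `combine` and the clamping constant `eps` are assumptions made for this example.

```python
import math


def chi2q(x2, v):
    """Chi-squared survival function P(X >= x2) for an even number v of
    degrees of freedom, using the closed-form series."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.0.
    return min(total, 1.0)


def combine(probs, eps=1e-6):
    """Combine per-clue spam probabilities into (S, H, score).

    S is the evidence for spam, H the evidence for ham, and
    score = (S - H + 1) / 2 lies between 0 (ham) and 1 (spam).
    """
    n = len(probs)
    if n == 0:
        return 0.0, 0.0, 0.5
    # Clamp away from 0 and 1 so the logarithms stay finite (assumption:
    # the listing shows rounded values such as 0.0 and 1.0).
    clamped = [min(max(p, eps), 1.0 - eps) for p in probs]
    ln_not_spam = sum(math.log(1.0 - p) for p in clamped)
    ln_spam = sum(math.log(p) for p in clamped)
    s = 1.0 - chi2q(-2.0 * ln_not_spam, 2 * n)
    h = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    return s, h, (s - h + 1.0) / 2.0
```

With a listing like the one above, where most clues sit well above 0.5 and only a handful sit near 0, a combination of this kind yields S near 1 and H near 0, which is what the first two rows report.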
Kenny, let me know if you would like to see any more before
I wipe out my training and start over.
David
----------------------------------------------------------------------
Comment By: Tony Meyer (anadelonbrin)
Date: 2004-04-29 15:04
Message:
Logged In: YES
user_id=552329
Oh, one other thing - is there any reason that you can't
just use the rules in your mail client (Outlook Express?) to
implement whitelisting yourself? It's certainly the simple
solution and works for most people.
----------------------------------------------------------------------
Comment By: Tony Meyer (anadelonbrin)
Date: 2004-04-29 15:03
Message:
Logged In: YES
user_id=552329
As the FAQ says, we realise that some people want
whitelisting. However, none of the developers do, and none
has any interest in putting considerable effort into
developing whitelisting capabilities. As such, this simply
won't be added until someone comes along with code in hand.
The developers aren't refusing to add it in (as long as it
is off by default) but there just isn't any incentive for us
to add it. Anyone desperate for the functionality always
has the options to (a) write code or (b) pay for a product
that does have whitelisting (InBoxer, for example).
In any case, as Kenny said, 85% false positives is
unbelievably poor. You'd be better off not using spambayes
at all! The fp rate should be less than 5% - typically
around 1% or lower. Something is clearly wrong with your
training.
I'd still rather have this closed. If I thought that people
would actually see it and not open a new request, that would
be different, but that doesn't happen. If it stays open then
we have two places (here and the FAQ) where information
collects. Whitelisting is brought up so often that a
tracker really isn't necessary, IMO.
----------------------------------------------------------------------
Comment By: DarkLaser (darklaser)
Date: 2004-04-29 12:31
Message:
Logged In: YES
user_id=1030399
Yes, I had stored up spam from the last 8 months in case I
came across a Bayesian filter I wanted to train with it, but it
all came from this one account. The nearly 5,000 valid emails
are all my valid email for this account over the last 5 years.
So I would have thought it should have worked very well.
Perhaps SpamBayes learns from initial training sets differently
than from email it processes as it comes in. So perhaps the
solution is to remove the past training and just start training
from scratch.
Yes, I'll wait for a few more false positives, and I'll post them
before wiping it out and starting over.
David
----------------------------------------------------------------------
Comment By: Kenny Pitt (kpitt)
Date: 2004-04-29 11:49
Message:
Logged In: YES
user_id=859086
It's very unusual to hear from someone who is getting
accuracy this poor. Could you upload a copy of the spam
clues for a false positive message (before training on it)?
Seeing why SpamBayes thought the way it did when it first
processed the message would help a lot.
I notice that you have far more training data than you have
messages that have been processed by SpamBayes, so I
assume you had a large initial training set. Is it possible that
your training data was not representative of the messages
that you are currently receiving?
Although there is no proven best training strategy, in general
SpamBayes seems to perform best if you initially train with
only 5 or 10 of each type of message and then train it up on
your current message stream instead of training it on lots of
outdated messages. You'll also find that SpamBayes is more
responsive to training of new messages when you have fewer
messages in the training database. With the large number of
messages that you have, it will take a *LOT* of training to
overcome existing clues.
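
To illustrate why a large training database responds slowly, here is a sketch of how a per-word probability can be derived from training counts. It assumes the Robinson-style adjustment that SpamBayes is understood to use, with a prior strength `s = 0.45` and a prior probability `x = 0.5` (these defaults, and the names `word_prob` and `hams_needed`, are assumptions for the example). The totals of 4939 ham and 9728 spam are the ones David reported; the token counts come from a clue listing he posted a day later, so the match is approximate.

```python
def word_prob(ham_count, spam_count, nham, nspam, s=0.45, x=0.5):
    """Spam probability of one token from its training counts."""
    n = ham_count + spam_count
    if n == 0:
        return x  # unseen token: fall back to the prior
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    p = spam_ratio / (ham_ratio + spam_ratio)
    # Pull p toward the prior x; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)


def hams_needed(ham_count, spam_count, nham, nspam, limit=100000):
    """Count how many extra ham messages containing the token must be
    trained before its probability drops below 0.5."""
    extra = 0
    while word_prob(ham_count + extra, spam_count,
                    nham + extra, nspam) >= 0.5:
        extra += 1
        if extra > limit:
            raise RuntimeError("limit reached without crossing 0.5")
    return extra


# Token 'to:addr:david': seen in 84 ham and 2596 spam messages.
print(round(word_prob(84, 2596, 4939, 9728), 2))
print(hams_needed(84, 2596, 4939, 9728))
```

Under these assumptions, a token such as `to:addr:david` needs well over a thousand additional ham messages before it stops counting as a spam clue, whereas in a database of only a few dozen messages a handful of corrections would be enough.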
----------------------------------------------------------------------
Comment By: DarkLaser (darklaser)
Date: 2004-04-29 06:46
Message:
Logged In: YES
user_id=1030399
Anadelonbrin, thanks for the URL. I had looked for something
about white lists, but couldn't find it.
I maintain that a white list would be useful. Perhaps not to
some, but very much so for others. I have received maybe 1
or 2 spam claiming to be from someone on the domain for the
company I work for in the last year, and never have received
any claiming to be from any of my 4 personal domains.
However, the current false positive ratio is horrible: 85%
of my good email is falsely being marked as spam. Look at
the number of emails I have trained on; with that many, I
should be getting near-perfect results.
-------------------------------------------------
Total emails trained: Spam: 9728 Ham: 4939
SpamBayes has processed 546 messages - 4 (1%) good, 538
(99%) spam and 4 (0%) unsure.
29 messages were manually classified as good (23 were false
positives).
517 messages were manually classified as spam (0 were false
negatives).
2 unsure messages were manually identified as good, and 2 as
spam.
-------------------------------------------------
Ignoring the unsure messages, out of 27 good emails, 4 were
actually marked as good and 23 as spam. That is ridiculous.
Perhaps I need to change something in my settings, but the
majority of those good emails are from this one domain, so in
my case a white list would make a world of difference. If one
or two spam a year get through because of a white list, no
biggie, I can handle that. That's a lot easier than having to
go manually remove the word 'spam,' from the subject of 85%
of my email.
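
The 85% figure follows directly from the statistics quoted above, as this small check shows (the function name `false_positive_rate` is only illustrative):

```python
def false_positive_rate(false_positives, correctly_good):
    """Fraction of good messages that were wrongly marked as spam."""
    total_good = false_positives + correctly_good
    if total_good == 0:
        raise ValueError("no good messages to measure against")
    return false_positives / total_good


# 23 good messages were marked as spam and 4 were marked as good;
# the 2 good messages classified as unsure are left out.
rate = false_positive_rate(23, 4)
print(f"{rate:.1%}")  # prints 85.2%
```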
I don't have any experience with Python (I'm a Perl man
myself), otherwise I would look at building a white list to send
to the project manager. Anyway, I still think this item should
remain on the wish list.
Thanks,
David
----------------------------------------------------------------------
Comment By: Tony Meyer (anadelonbrin)
Date: 2004-04-28 18:45
Message:
Logged In: YES
user_id=552329
Please see FAQ 6.6:
<http://spambayes.org/faq.html#why-don-t-you-add-whitelisting-blacklisting-to-spambayes>
----------------------------------------------------------------------