[Spambayes] More on Training Disparity Issues
Richard B Barger ABC APR
Rich at RBarger.com
Mon Jul 19 05:55:21 CEST 2004
Thanks, Tony. That's excellent, very helpful information; it would be
particularly valuable at the beginning of training or on smaller volumes of
mail.
But I rarely use Review now. Let me explain:
- (Reminder: I use Netscape 4.79, POP3 Proxy Version 1.0rc2, and the Web-based
interface. And I receive a huge volume of email.)
- When I started using SpamBayes, I was having much trouble with ZoneAlarm,
which I have previously documented on the SpamBayes discussion list. I was
unable to load the "Review messages" page -- every time I tried, it would lock
up my computer.
- So, I trained manually, using the "Train on a message, mbox file or dbx file"
section of the Web Interface. That worked fine; however, because I didn't
understand how to use the mbox training, I copied and pasted and trained the
messages one by one.
- By the time I had the ZoneAlarm problem solved, I already had manually trained
hundreds of messages.
- I occasionally load the Review messages page -- it works fine with this
version of ZoneAlarm -- but it already shows mostly spam.
In today's listing of untrained messages (for me, Sunday -- here -- isn't a
"typical" day; most days, I get more hams and unsures), the Review Messages page
has 34 messages classified as unsure, 15 hams, and nearly 1200 classified as
spam. During the day, I manually classified several messages as they came in.
And yes, I understand how to click the column header and, for instance, not
train the classifier on the already-correctly-identified spam messages.
- For me -- and I understand that my circumstances are unusual -- there are two
disincentives to using the review page:
1 - Message Subject lines and senders are not always sufficient for me to
tell whether a ham is properly classified, or, more often, how an unsure message
should be classified.
I have to click on some of the messages and look at the View Message screen. No
problem, of course, except that it's extra steps -- remember the volume of mail
that I'm dealing with -- that I don't have to take if, instead, I'm looking at
the hard copy of the actual message in my inbox and can see everything at a
glance.
Thus, once I'm in the situation I'm in now, having already done a lot of
training, it would seem that using the Review page is a slower process than
either 1) cutting and pasting a message into the "Train on a message ..." window
or 2) using the mbox method (now that I sort of understand it).
2 - Here is an item maybe you or someone could fix:
Remember, my lists of untrained messages are very long. If I am on the Review
screen and click on the message and look at it in the View Message screen, your
program doesn't register a placeholder. So, rather than returning to that
message in the list, so that I can keep scanning subject lines, when I return
from the View Message screen to the Review page, I am sent back to the top of
the list.
Each time I look at an individual message and return to the Review screen, I
must scroll down the list again and again, searching for where I was when I went
to the View Message screen. Again, that slows the process down. With a large
volume of mail, it slows it down a bunch.
For future versions, it might be worth considering a change that has the
classifier hold its place when the user departs from the Review screen to look
at the View Message screen.
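One way the place-holding idea above could work in a web page is to give each row of the listing its own anchor and have the "back" link return to that anchor. The following is only a rough sketch of that idea, not SpamBayes code: the /review and /view URLs, the "return" parameter, and the function names review_row and back_link are all hypothetical.

```python
from html import escape
from urllib.parse import quote


def review_row(msg_id, subject):
    """Build one table row of a hypothetical review listing.

    The row carries an id so the browser can scroll back to it, and the
    link to the message view passes that id along as a return fragment.
    """
    anchor = "msg-" + quote(msg_id, safe="")
    view_url = "/view?id=%s&return=%s" % (
        quote(msg_id, safe=""),
        quote("#" + anchor, safe=""),
    )
    return '<tr id="%s"><td><a href="%s">%s</a></td></tr>' % (
        anchor,
        view_url,
        escape(subject),
    )


def back_link(return_fragment):
    """Build the link on the message view that returns to the same row."""
    return '<a href="/review%s">Back to review</a>' % escape(
        return_fragment, quote=True
    )
```

With links built this way, returning from a message would land on the row that was clicked rather than at the top of a 1200-line list.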
Thus, for those two reasons, I haven't used the Review pages much at all.
Tony, I certainly can see how using the Review screen would be useful for
initial training, and that it probably works extremely well with smaller volumes
of mail, but, because of my odd circumstances, it is more cumbersome than
Obviously, if I have to re-train from scratch sometime, I certainly will use the
Review page method, and I appreciate your suggesting it.
Three quick questions, pls:
1 - Adam Walker has given me some very helpful guidance on mbox (Thank you,
Adam!), but I wanted to make sure I understood him correctly:
Let's assume that, for example, all the messages in my Unsure folder are spam.
Using the "Train on a message, mbox file or dbx file" section of the Web
Interface page, can I simply browse to and upload the Unsure folder and click
the "Train as Spam" button, and it will be as effective as if a) I had trained
on those messages from the Review messages page, and as good as if b) I had
displayed the page source for those messages and cut-and-pasted them in their
entirety into the "Train on a message ..." box?
2 - There's no way to sort POP3 messages by ascending/descending order of spam
score using the Web Interface, is there? I believe I've noticed some
conversation about this capability in Outlook, but I can't figure out any way to
do it with the Web Interface.
3 - I'm not sure which folders are which. I may have missed this in the help
files and FAQs, but it would be helpful to have a list of file names and
functions and the typical folders in which they reside -- a sort of "typical"
SpamBayes hierarchy. It might be necessary to do different versions for the
different flavors of SpamBayes, but you'll know where and what and how.
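On question 1, it may help to see what an mbox file contains: complete messages, headers and body, stored one after another, each introduced by a line beginning with "From ". The sketch below uses Python's standard mailbox module, not anything from SpamBayes, to split such a file back into the raw sources one would otherwise paste in by hand. It assumes the mail client keeps the folder in mbox format; the names raw_messages and demo and the sample addresses are invented for illustration, and it does not show how SpamBayes itself reads the file.

```python
import mailbox
import os
import tempfile


def raw_messages(mbox_path):
    """Yield (subject, raw_source) for each message in an mbox file."""
    # create=False: fail if the file is missing instead of creating it.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            # as_string() returns the full headers plus the body.
            yield msg.get("Subject", ""), msg.as_string()
    finally:
        box.close()


def demo():
    """Write a two-message sample mbox and read it back."""
    sample = (
        "From sender@example.com Mon Jul 19 05:55:21 2004\n"
        "From: sender@example.com\n"
        "Subject: first\n"
        "\n"
        "body one\n"
        "\n"
        "From other@example.com Mon Jul 19 06:00:00 2004\n"
        "From: other@example.com\n"
        "Subject: second\n"
        "\n"
        "body two\n"
    )
    folder = tempfile.mkdtemp()
    path = os.path.join(folder, "Unsure")
    with open(path, "w", newline="\n") as handle:
        handle.write(sample)
    return list(raw_messages(path))
```

Calling demo() returns one (subject, source) pair per message, and each source includes the headers as well as the body, which is the same material as a message's page source.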
Tony, I love the product, and I certainly appreciate all the work you guys have
put into it. This list is exceptionally helpful; I see I have some more answers
from you in a separate message. I'll turn to that now.
Tony Meyer wrote:
> > Of course, training on raw message source involves yet more
> > steps for each message I want to train
> In what way are you doing your training at the moment? If you're using the
> web interface's review pages, then sb_server will automatically use the
> correct source for the messages (it saves a copy of the message as it arrives
> - this is what fills up the *-cache directories). In general, using the
> review pages is the best & simplest way to train.
> =Tony Meyer