[spambayes-dev] Training options on Configuration Page

Meyer, Tony T.A.Meyer at massey.ac.nz
Mon Sep 22 23:02:42 EDT 2003


> On the Configuration page of the proxy, there is
> this paragraph which I found to be unclear: 
> Suppress caching of bulk ham: [...]

What was unclear?  IOW, what did you *think* it meant?

> I'd like to suggest that the following set of buttons
> replace the current "Cache messages" and "Suppress caching
> of bulk ham" section: 
>
> 1A Train only on Unsure, hide Ham and Spam on Review page 
> 1B Train only on Unsure, show Ham and Spam on Review page 
> 1C   1B plus default to Discard for Ham 
> 1D   1B plus default to Discard for Spam 
> 2A Train on all messages except Unsure 
> 2B Train on all messages except Unsure, hide List Ham 
> 2C Train on all messages except Unsure and List Ham, hide List Ham

One issue with this is that it means *major* changes to the way the
configuration page works.  What it currently does is present several
groups of Option objects to the user, allowing them to change one or
more of their values (and then saves that to the file).  Your suggestion
means that it would have to have an additional layer above, where some
options presented mean changes to multiple Option objects.
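
To illustrate that extra layer, here is a rough sketch in Python.  The
Option class, the option names and the PRESETS table are all invented
for the example; this is not the actual SpamBayes options code, just
the general shape of "one button changes several Option objects".

```python
class Option:
    """Hypothetical stand-in for one configurable value (not the real class)."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


# Hypothetical option names, invented for this sketch.
options = {
    name: Option(name, True)
    for name in ("show_ham", "show_spam", "show_unsure")
}


def set_option(name, value):
    """What the page does now: one form field changes one Option."""
    options[name].value = value


# The extra layer: one button stands for values of several Options.
PRESETS = {
    "1A": {"show_ham": False, "show_spam": False, "show_unsure": True},
    "1B": {"show_ham": True, "show_spam": True, "show_unsure": True},
    "2A": {"show_ham": True, "show_spam": True, "show_unsure": False},
}


def apply_preset(label):
    """Set every Option that the chosen button covers."""
    for name, value in PRESETS[label].items():
        set_option(name, value)


def current_preset():
    """Return the label matching the current values, or None if none does."""
    current = {name: opt.value for name, opt in options.items()}
    for label, values in PRESETS.items():
        if values == current:
            return label
    return None
```

Note that the page would also have to work backwards from the stored
values to decide which button to show as selected, and cope with a
combination that matches none of them (current_preset() returning None
in this sketch).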

I think that the wording above is unclear.  You're *not* setting
SpamBayes to do any training - the training is all (at the moment, and
excluding the plug-in) done manually.  What you're changing is which
messages are displayed in the review page.  This would be more accurate:

  1A Hide Ham and Spam, show Unsure
  1B Show all.
  1C Show all, default to Discard for Ham 
  1D Show all, default to Discard for Spam 
  2A Show Ham and Spam, hide Unsure
  2B Show Ham and Spam, hide Unsure, don't cache bulk-ham
  2C Show Ham and Spam, hide Unsure, don't cache bulk-ham

You'll see that there is no difference between 2B and 2C.  You can't
train on messages that you don't display, so 2B can't be done (unless
you use the 'find message' query).

At the moment 1B is the default.  You can't currently stop any category
from being displayed, so 1A and 2A-C are not possible.  If options were
added to do so, then I think:

  In review page, show messages classified as:
    [x] Ham
    [x] Spam
    [x] Unsure

would be much clearer (and should be an advanced option).  This means
you retain control over what is displayed - you could, for example, have
'show ham and unsure, hide spam' (if the corpus was tilted towards spam,
this would be a good choice).  Presenting combinations of all the
different options ends up looking very confusing when the number of
combinations rises, and makes it hard to make a single change.
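
As a minimal sketch of how three independent checkboxes could drive the
review page (the function name, the keyword arguments and the
(id, classification) tuples are assumptions made up for this example,
not existing SpamBayes code):

```python
def messages_to_review(messages, show_ham=True, show_spam=True,
                       show_unsure=True):
    """Return only the messages whose classification is ticked.

    `messages` is a list of (msg_id, classification) tuples, where the
    classification is "ham", "spam" or "unsure" (hypothetical format).
    """
    ticked = {
        "ham": show_ham,
        "spam": show_spam,
        "unsure": show_unsure,
    }
    return [(msg_id, cls) for msg_id, cls in messages if ticked[cls]]
```

With this, 'show ham and unsure, hide spam' is just
messages_to_review(msgs, show_spam=False), and each checkbox can be
changed without touching the other two.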

There's also quite a difference between not showing the messages on the
review page, and not caching them.  Not caching them saves disk space,
but means that you can't correct a misclassified message.  If you cached
but didn't display, you still could, with the 'find message' query or
with the smtpproxy.
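
The difference can be shown with a toy model.  Everything here
(ProxySketch, cache_ham, show_ham, find_message) is a hypothetical
stand-in written for this example, not the proxy's real classes or
option names:

```python
class ProxySketch:
    """Toy model separating 'cached' from 'shown on the review page'."""

    def __init__(self, cache_ham=True, show_ham=True):
        self.cache_ham = cache_ham
        self.show_ham = show_ham
        self.cache = {}        # msg_id -> (text, classification)
        self.review_page = []  # msg_ids listed for review

    def receive(self, msg_id, text, classification):
        if classification == "ham" and not self.cache_ham:
            return  # never stored, so it cannot be corrected later
        self.cache[msg_id] = (text, classification)
        if classification != "ham" or self.show_ham:
            self.review_page.append(msg_id)

    def find_message(self, msg_id):
        """Look a message up in the cache; None if it was never cached."""
        return self.cache.get(msg_id)
```

A message that is cached but hidden is absent from review_page yet can
still be retrieved with find_message(); one that was never cached is
gone for good.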

> 2C is the effect of the current system with both caching
> options turned on. It leads to huge spam imbalances, at least
> on the system my wife filters through.

Why then, enable the option that is off by default?  There's a reason
that that's the default <wink>.

> If I had that array of choices, I would recommend 2A
> is the default for a new database, 1A as the default
> once there were adequate trained messages, and 1C and
> 1D would be used for a short period to address any imbalance. 

As Tim would say, test it!  There hasn't been enough testing to show
what the 'ideal' training method is.  For example, I would never
recommend 2A, 2B, or 2C as training regimes - to me, it's always worth
training on an unsure.  I'm not sure I like changing options on a user's
behalf, either.  What do you do for advanced users, who know what they
are doing?  I'm sure they wouldn't like the options just changing
themselves.  *If* a training regime is ever identified as being 'the
best', then I think the best move would be to set the defaults to match
that, and if there are (as in your suggestion) extra things to do in
certain circumstances, present a warning/suggestion to the user.  (Like
the plug-in warns the user if there is an imbalance).

> Also, from the user perspective, nobody cares at all about caching. 
> To the user these are Training Options, I would recommend the section
> be renamed,

But doesn't that give the impression that training will be done,
depending on what options you set?  No training will be done unless you
manually indicate a message to train.

IMO, the solution to this would be to move the 3 'no_cache' options to
the advanced page (perhaps also clarifying the wording, if suitable
suggestions are made).  If the user doesn't understand the options, then
they shouldn't be setting them, just like the other advanced ones.

Again IMO, I don't think that the web interface will ever be simple
enough to use for a certain class of users.  Right at one end are people
who will never get that things have to be trained, and will have to use
either a pre-trained database, or a non-training system.  Beside them
is a group who I think will only be able to manage drag'n'drop style
training, or where clicking a button indicates that the currently
selected message (in the mail client) is good/bad (people who manage to
use the plug-in).  Once it's finished (and once twisted is stable) the
pop3dnd script will provide the former; the latter is very difficult
without an actual integrated plug-in.  Those that can manage a bit more
can use the web interface - and those people are probably clued enough
to understand that there is a difference between messages displayed to
review, and messages that are trained on.

=Tony Meyer


