[Spambayes] Corpus module (was: Upgrade problem)

Fri Nov 8 04:48:54 2002

> Laughing and pointing should be directed towards me rather than Tim.

None of that, but some thoughts <wink>.

I think that the classes I posted a while ago suffer from exactly the
opposite problem to your idea.  My idea was to make a "message store" that is
largely independent of training.  I believe the problem with your design is
that it deals with training at the expense of the message store.

It is obvious, but worth mentioning, that there are competing interests here.
My focus is on clients, and specifically the Outlook one (if there were
more clients I would be happy to think of them too <wink>).  A lot of the
focus of this group is on admins rather than individuals (which is just
fine!).  But the current thinking seems to treat a corpus as a fairly
static, well-controlled set of messages used almost purely for training
purposes.

For client programs, this may not be practical.  The corpus is a more
dynamic set of messages - and worse, actually *is* the user's set of
messages rather than a collection of message copies.

For example, "moving" a message in a corpus may actually mean moving the
message in the user's real inbox.  This may or may not be what is intended -
a corpus "move" operation is more about changing a message's classification
than it is about physically moving pieces of mail around.

> A Corpus wouldn't know how to create Message objects, nor would a Message
> object know how to create itself - classes *derived from* them would know
> how to do that.  For instance (totally untested code, probably full of
> typos) -
>
> class Message:

Jeremy and I both posted real code, so starting with something that takes
that into consideration would be good.

> I may be putting too much
> into the base class by demanding that the text of the message be given to
> the constructor - that precludes making FileMessage lazy, and
> only read the
> file when it needs to.]

It also defeats the abstract nature of the class.

> 'Corpus' works the same way; again, the details may be naive, but this is
> the general idea:

I'm hoping I don't sound grumpy, but again, the few systems that already
exist for this engine are the best ones to use to discover the naivety early
<wink>

> You can then envisage a MailboxCorpus, and OutlookFolderCorpus, an
> IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.

I can't quite imagine that at the moment, as per my comments at the top.

Off the top of my head, I believe we need:
* An abstract "message id"
* A message classification database, as discussed before - basically just a
dictionary, keyed by ID, holding either "spam" or "ham".
* A "corpus" becomes just an enumerator of message IDs for bulk/batch
training.  It has no "move" or similar operations.
* A "message store" is capable of returning a message object given its ID.
* The training API simply takes message objects and updates the probability
and message databases.
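
As a rough illustration of the list above, here is a minimal Python sketch of
how the five pieces might fit together.  Every name in it (Message,
DictMessageStore, ListCorpus, ToyTrainer) is made up for this example and is
not existing project code, and the "probability database" is just a word
count standing in for the real classifier.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)
class Message:
    message_id: str  # the abstract "message id"; a plain string here
    text: str


class DictMessageStore:
    """Hypothetical message store: returns a message object given its ID."""

    def __init__(self, messages: Iterable[Message]) -> None:
        self._by_id = {m.message_id: m for m in messages}

    def get_message(self, message_id: str) -> Message:
        return self._by_id[message_id]


class ListCorpus:
    """Hypothetical corpus: only enumerates IDs, with no move operations."""

    def __init__(self, message_ids: Iterable[str]) -> None:
        self._ids = list(message_ids)

    def __iter__(self) -> Iterator[str]:
        return iter(self._ids)


class ToyTrainer:
    """Stand-in training API: updates a word-count table (in place of the
    real probability database) and the classification dictionary."""

    def __init__(self) -> None:
        self.classifications: Dict[str, str] = {}  # ID -> "spam" or "ham"
        self.word_counts = {"spam": Counter(), "ham": Counter()}

    def train(self, message: Message, is_spam: bool) -> None:
        label = "spam" if is_spam else "ham"
        previous = self.classifications.get(message.message_id)
        if previous == label:
            return  # already trained this way; nothing to do
        words = set(message.text.lower().split())
        if previous is not None:
            # Undo the earlier training before applying the new label.
            self.word_counts[previous].subtract(words)
        self.word_counts[label].update(words)
        self.classifications[message.message_id] = label


store = DictMessageStore([
    Message("id-1", "cheap pills now"),
    Message("id-2", "lunch on friday"),
])
trainer = ToyTrainer()

# Bulk training: each corpus just yields IDs; the store supplies the messages.
for corpus, is_spam in ((ListCorpus(["id-1"]), True),
                        (ListCorpus(["id-2"]), False)):
    for message_id in corpus:
        trainer.train(store.get_message(message_id), is_spam)

# Reclassifying a message changes only the databases; no mail is moved.
trainer.train(store.get_message("id-2"), is_spam=True)
```

Note that nothing here knows about folders: changing a message's
classification is a dictionary update plus a retrain, which is the point of
keeping "move" out of the corpus.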

At that level, we really don't need much else - no folders or any other
grouping of messages.  I'm really not too sure there is much value in adding
higher-level concepts such as folders or message store "move" operations -
certainly not at the outset, where there are too many competing
requirements.

> Yes - this could work using observer objects registered with Corpus
> objects:

This could work, but the job may be too simple for observers to be
necessary.  If the process of re-training a message in the Outlook GUI
becomes:

def RetrainMessageAsSpam():
    # Outlook-specific code to obtain message_id goes here.
    message = message_store.GetMessage(message_id)
    if not classifier.IsSpam(message):
        classifier.train(message, is_spam=True)

and not a whole lot else, it doesn't seem worth it.  Unfortunately, the
decision to perform the retrain is the complex, but client-specific, part.
Is this a newly delivered message?  Did the user manually move the message
somewhere?  Did the user click one of our buttons?  Is the user deleting old
ham that we want to train on before it dies forever?

The Outlook client does this by examining which Outlook event we are seeing,
and by looking at metadata we may previously have attached to the message.
I'm not sure this can be encapsulated well at the moment without adding all
our metadata baggage to the base classes.
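
To make the shape of that client-side decision concrete, here is a small
Python sketch.  It is an assumption-laden illustration, not what the Outlook
client actually does: the event names, the "trained_as" metadata value and
the policy itself are all invented for the example.

```python
from enum import Enum, auto
from typing import Optional


class ClientEvent(Enum):
    # Hypothetical events, one per question asked above.
    NEW_DELIVERY = auto()
    USER_MOVED = auto()
    BUTTON_CLICKED = auto()
    DELETING = auto()


def should_train(event: ClientEvent,
                 trained_as: Optional[str],
                 implied: str) -> bool:
    """Decide whether the client should call the training API.

    trained_as: label recorded in metadata attached to the message earlier
                ("spam", "ham", or None if never trained); an assumption.
    implied:    label the current event suggests ("spam" or "ham").
    """
    if event is ClientEvent.NEW_DELIVERY:
        # Invented policy: score new mail, but never train on our own guess.
        return False
    if event is ClientEvent.DELETING:
        # Last chance to train on a message before it is gone for good.
        return trained_as is None
    # A manual move or a button click: retrain only if the label changes.
    return trained_as != implied
```

The function takes only the event and the stored label, so the core classes
would not need to carry any of this client-specific state themselves.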

> Most of the *new* code that's needed is defining the abstract concepts and
> their interfaces, rather than writing code that actually *does* anything -
> it's building a framework.

*cough* ummm...  This is doomed to failure.  Code *must* do something to be
taken seriously.  At the very least, I would expect to see the existing test
driver framework running against these "abstract concepts" <wink>

> Once the framework is there, most of the code needed to implement the
> functionality should already be in the project - code to hook
> into Outlook,
> to train on a message, to parse mbox files, and so on.  It just needs
> hooking into the framework.

See above <wink>.

Mark.