Laughing and pointing should be directed towards me rather than Tim.
None of that, but some thoughts <wink>. I think that the classes I posted a while ago suffer from the exact reverse of the problem with your idea. My idea was to make a "message store" that is largely independent of training. I believe the problem with your design is that it deals with the training at the expense of the message store.

It is obvious, but worth mentioning, that there are competing interests here. My focus is towards clients, and specifically the Outlook one (if there were more clients I would be happy to think of them too <wink>). A lot of the focus of this group is towards admins rather than individuals (which is just fine!). But it seems the current thinking is of a corpus as being a fairly static, well-controlled set of messages used almost purely for training purposes.

For client programs, this may not be practical. The corpus is a more dynamic set of messages - and worse, actually *is* the user's set of messages rather than a collection of message copies. For example, "moving" a message in a corpus may actually mean moving the message in the user's real inbox. This may or may not be what is intended - a corpus "move" operation is more about changing a message's classification than it is about physically moving pieces of mail around.
A Corpus wouldn't know how to create Message objects, nor would a Message object know how to create itself - classes *derived from* them would know how to do that. For instance (totally untested code, probably full of typos) -
class Message:
Jeremy and I both posted real code, so starting with something that takes that into consideration would be good.
[I may be putting too much into the base class by demanding that the text of the message be given to the constructor - that precludes making FileMessage lazy, and only read the file when it needs to.]
It also defeats the abstract nature of the class.
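A minimal sketch of the lazy alternative, for illustration only. It assumes a hypothetical abstract `Message` base that holds just an id, and a hypothetical `get_text()` method that subclasses implement; neither name comes from the code posted in this thread.

```python
class Message:
    """Abstract base: knows its id, but not where its text comes from."""

    def __init__(self, msg_id):
        self.id = msg_id

    def get_text(self):
        # Hypothetical method name; subclasses decide how to fetch the text.
        raise NotImplementedError("subclasses supply the text")


class FileMessage(Message):
    """Hypothetical subclass that reads its file only on first use."""

    def __init__(self, path):
        super().__init__(path)  # assumption: the path doubles as the id
        self.path = path
        self._text = None       # nothing has been read yet

    def get_text(self):
        if self._text is None:
            with open(self.path, encoding="utf-8", errors="replace") as f:
                self._text = f.read()
        return self._text
```

Because the base class never sees the text, it stays abstract, and a subclass backed by Outlook or IMAP could fetch its text in whatever way suits that store.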
'Corpus' works the same way; again, the details may be naive, but this is the general idea:
I'm hoping I don't sound grumpy, but again, the few systems that already exist for this engine are the best ones to use to discover the naivety early <wink>
You can then envisage a MailboxCorpus, an OutlookFolderCorpus, an IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.
I can't quite imagine that at the moment, as per my comments at the top. Off the top of my head, I believe we need:

* An abstract "message id"
* A message classification database, as discussed before - basically just a dictionary, keyed by ID, holding either "spam" or "ham".
* A "corpus" becomes just an enumerator of message IDs for bulk/batch training. It has no move etc operations.
* A "message store" is capable of returning a message object given its ID.
* The training API simply takes message objects and updates the probability and message databases.

At that level, we really don't need much else - no folders or any other grouping of messages. I'm really not too sure there is much value in adding higher-level concepts such as folders or message store "move" operations - certainly not at the outset, where there are too many competing requirements.
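The pieces listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: every class and function name here (`DictMessageStore`, `ListCorpus`, `CountingClassifier`, `train`, `train_corpus`) is hypothetical, messages are plain strings, and the token-counting classifier is a stand-in for the real probability database, not the project's classifier.

```python
class DictMessageStore:
    """Hypothetical message store: returns a message given its ID."""

    def __init__(self, messages):
        self._messages = dict(messages)  # id -> message text

    def get_message(self, msg_id):
        return self._messages[msg_id]


class ListCorpus:
    """A corpus is only an enumerator of message IDs; no move operations."""

    def __init__(self, ids):
        self._ids = list(ids)

    def __iter__(self):
        return iter(self._ids)


class CountingClassifier:
    """Stand-in for the real classifier: counts tokens per class."""

    def __init__(self):
        self.counts = {"spam": {}, "ham": {}}

    def learn(self, text, label):
        bucket = self.counts[label]
        for token in text.split():
            bucket[token] = bucket.get(token, 0) + 1

    def unlearn(self, text, label):
        bucket = self.counts[label]
        for token in text.split():
            bucket[token] -= 1
            if bucket[token] == 0:
                del bucket[token]


def train(classifier, classifications, msg_id, message, label):
    """Update the probability data and the classification database."""
    previous = classifications.get(msg_id)
    if previous == label:
        return  # already trained this way; nothing to do
    if previous is not None:
        classifier.unlearn(message, previous)  # undo the old training
    classifier.learn(message, label)
    classifications[msg_id] = label


def train_corpus(classifier, classifications, store, corpus, label):
    """Bulk training: walk the IDs and fetch each message from the store."""
    for msg_id in corpus:
        train(classifier, classifications, msg_id,
              store.get_message(msg_id), label)
```

Here the classification database is an ordinary dictionary keyed by ID, so reclassifying a message is one call to `train` with the new label; nothing in the sketch needs folders or a "move" operation.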
Yes - this could work using observer objects registered with Corpus objects:
This could work, but may be too simple to be necessary. If the process of re-training a message in the Outlook GUI becomes:

    def RetrainMessageAsSpam():
        # Outlook specific code to get an ID.
        message = message_store.GetMessage(id)
        if not classifier.IsSpam(message):
            classifier.train(message, is_spam=True)

And not a whole lot else, it doesn't seem worth it. Unfortunately, the decision to perform the retrain is the complex, but client specific part. Is this a newly delivered message? Did the user manually move the message somewhere? Did the user click one of our buttons? Is the user deleting old ham that we want to train on before it dies forever? Outlook does this via examining what Outlook event we are seeing, and looking at meta-data we possibly previously attached to the message. I'm not sure this can be encapsulated well at the moment without adding all our meta-data etc baggage to the base classes.
Most of the *new* code that's needed is defining the abstract concepts and their interfaces, rather than writing code that actually *does* anything - it's building a framework.
*cough* ummm... This is doomed to failure. Code *must* do something to be taken seriously. At the very least, I would expect to see the existing test driver framework running against these "abstract concepts" <wink>
Once the framework is there, most of the code needed to implement the functionality should already be in the project - code to hook into Outlook, to train on a message, to parse mbox files, and so on. It just needs hooking into the framework.
See above <wink>.

Mark.