[Spambayes] Corpus module (was: Upgrade problem)

Tim Stone - Four Stones Expressions tim at fourstonesExpressions.com
Tue Dec 3 03:01:06 2002


Ok, so I found the message, and here are my thoughts. I freely admit that the 
abstraction was done completely from a single concrete example, that being the 
pop3proxy.  It seems that the competing interests here can be successfully 
resolved by further abstraction.  The 'Corpus' that Mark describes is 
essentially an iterator, which doesn't work well for the pop3proxy, but works 
well for the outlook plugin.

I've spent some time looking at the Hammie/Hammiebulk/mboxutils stuff, along 
with the rfc822/Mailbox/email.* stuff over the last week, and I think that we 
(I) have managed to somewhat reinvent the wheel.  It sounded like a good idea 
to me and Tim1 at the time...

I certainly don't view Corpora as being particularly static.  I view any 
collection of messages that are somehow related as a Corpus.  Perhaps a better  
(more portable) term would have been Folder.  Beats me.  At any rate, I don't 
think anybody is locked in to the classes as they exist right now.  Neale and 
Richie have added/removed stuff they need/don't need from them.  I *would* 
like to see a single abstraction that works for the whole project.  Should we 
start over?  I'm ok with that... - TimS


11/7/2002 10:48:54 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:

>> Laughing and pointing should be directed towards me rather than Tim.
>
>None of that, but some thoughts <wink>.
>
>I think that the classes I posted a while ago suffer from the exact reverse
>problem as your idea.  My idea was to make a "message store" that is largely
>independent of training.  I believe the problem with your design is that it
>deals with the training at the expense of the message store.
>
>Obviously, but worth mentioning, is that there are competing interests here.
>My focus is towards clients, and specifically the outlook one (if there were
>more clients I would be happy to think of them too <wink>).  A lot of the
>focus of this group is towards admins rather than individuals (which is just
>fine!)  But it seems the current thinking is of a corpus as being a fairly
>static, well-controlled set of messages used almost purely for training
>purposes.
>
>For client programs, this may not be practical.  The corpus is a more
>dynamic set of messages - and worse, actually *is* the user's set of
>messages rather than a collection of message copies.
>
>For example, "moving" a message in a corpus may actually mean moving the
>message in the user's real inbox.  This may or may not be what is intended -
>a corpus "move" operation is more about changing a message's classification
>than it is about physically moving pieces of mail around.
>
>> A Corpus wouldn't know how to create Message objects, nor would a Message
>> object know how to create itself - classes *derived from* them would know
>> how to do that.  For instance (totally untested code, probably full of
>> typos) -
>>
>> class Message:
>
>Jeremy and I both posted real code, so starting with something that takes
>that into consideration would be good.
>
>> I may be putting too much into the base class by demanding that the text
>> of the message be given to the constructor - that precludes making
>> FileMessage lazy, and only read the file when it needs to.]
>
>It also defeats the abstract nature of the class.
>
>> 'Corpus' works the same way; again, the details may be naive, but this is
>> the general idea:
>
>I'm hoping I don't sound grumpy, but again, the few systems that already
>exist for this engine are the best ones to use to discover the naivety early
><wink>
>
>> You can then envisage a MailboxCorpus, and OutlookFolderCorpus, an
>> IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.
>
>I can't quite imagine that at the moment, as per my comments at the top.
>
>Off the top of my head, I believe we need:
>* An abstract "message id"
>* A message classification database, as discussed before - basically just a
>dictionary, keyed by ID, holding either "spam" or "ham".
>* A "corpus" becomes just an enumerator of message IDs for bulk/batch
>training.  It has no move etc. operations.
>* A "message store" is capable of returning a message object given its ID.
>* The training API simply takes message objects and updates the probability
>and message databases.
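
To make the five pieces in that list concrete, here is a rough sketch of how
they might fit together.  Every name in it (Message, DictMessageStore,
ListCorpus, CountingTrainer, train_corpus) is made up for illustration and is
not existing Spambayes code; the "trainer" only counts tokens and stands in
for the real classifier.

```python
class Message:
    """A message object: just an abstract ID plus its text (assumed shape)."""

    def __init__(self, msg_id, text):
        self.id = msg_id
        self.text = text


class DictMessageStore:
    """Toy message store: returns a message object given its ID.

    A real store would wrap Outlook, an mbox file, and so on.
    """

    def __init__(self, texts):
        self._texts = dict(texts)

    def get_message(self, msg_id):
        return Message(msg_id, self._texts[msg_id])


class ListCorpus:
    """A corpus is only an enumerator of message IDs; no move operations."""

    def __init__(self, ids):
        self._ids = list(ids)

    def __iter__(self):
        return iter(self._ids)


class CountingTrainer:
    """Stand-in for the training API (hypothetical, not the real classifier)."""

    def __init__(self):
        # Classification database: a dictionary keyed by ID.
        self.classification = {}
        # Placeholder for the probability database: token counts per class.
        self.token_counts = {"spam": {}, "ham": {}}

    def train(self, message, is_spam):
        label = "spam" if is_spam else "ham"
        self.classification[message.id] = label
        counts = self.token_counts[label]
        for token in message.text.lower().split():
            counts[token] = counts.get(token, 0) + 1


def train_corpus(corpus, store, trainer, is_spam):
    """Bulk training: enumerate IDs, fetch each message, train on it."""
    for msg_id in corpus:
        trainer.train(store.get_message(msg_id), is_spam)


store = DictMessageStore({"m1": "cheap pills", "m2": "lunch tomorrow"})
trainer = CountingTrainer()
train_corpus(ListCorpus(["m1"]), store, trainer, is_spam=True)
train_corpus(ListCorpus(["m2"]), store, trainer, is_spam=False)
```

Note that nothing here knows about folders or moving mail; the corpus and the
store only meet inside train_corpus.
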
>
>At that level, we really don't need much else - no folders or any other
>grouping of messages.  I'm really not too sure there is much value in adding
>higher-level concepts such as folders or message store "move" operations -
>certainly not at the outset, where there are too many competing
>requirements.
>
>> Yes - this could work using observer objects registered with Corpus
>> objects:
>
>This could work, but may be too simple to be necessary.  If the process of
>re-training a message in the Outlook GUI becomes:
>
>def RetrainMessageAsSpam():
>	# Outlook specific code to get an ID.
>	message = message_store.GetMessage(id)
>	if not classifier.IsSpam(message):
>		classifier.train(message, is_spam=True)
>
>And not a whole lot else, it doesn't seem worth it.  Unfortunately, the
>decision to perform the retrain is the complex, but client specific part.
>Is this a newly delivered message?  Did the user manually move the message
>somewhere?  Did the user click one of our buttons?  Is the user deleting old
>ham that we want to train on before it dies forever?
>
>Outlook does this via examining what Outlook event we are seeing, and
>looking at meta-data we possibly previously attached to the message.  I'm
>not sure this can be encapsulated well at the moment without adding all our
>meta-data etc baggage to the base classes.
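
For comparison, the observer idea quoted above could look roughly like the
sketch below.  The class and method names (ObservableCorpus, add_observer,
on_add, on_remove, RecordingObserver) are assumptions for illustration, not
the classes currently in the project, and the observer only records events
where a real one would decide whether to train.

```python
class ObservableCorpus:
    """A set of message IDs that notifies registered observers of changes."""

    def __init__(self):
        self._ids = set()
        self._observers = []

    def add_observer(self, observer):
        self._observers.append(observer)

    def add_message(self, msg_id):
        if msg_id in self._ids:
            return  # already present, nothing to report
        self._ids.add(msg_id)
        for observer in self._observers:
            observer.on_add(self, msg_id)

    def remove_message(self, msg_id):
        if msg_id not in self._ids:
            return  # unknown ID, nothing to report
        self._ids.remove(msg_id)
        for observer in self._observers:
            observer.on_remove(self, msg_id)


class RecordingObserver:
    """Hypothetical observer; a trainer would train/untrain here instead."""

    def __init__(self):
        self.events = []

    def on_add(self, corpus, msg_id):
        self.events.append(("add", msg_id))

    def on_remove(self, corpus, msg_id):
        self.events.append(("remove", msg_id))


spam_corpus = ObservableCorpus()
recorder = RecordingObserver()
spam_corpus.add_observer(recorder)
spam_corpus.add_message("m1")
spam_corpus.add_message("m1")   # duplicate, no second notification
spam_corpus.remove_message("m1")
```

The client-specific question of *why* a message was added or removed is still
outside this sketch, which is the part Mark describes as the hard one.
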
>
>> Most of the *new* code that's needed is defining the abstract concepts and
>> their interfaces, rather than writing code that actually *does* anything -
>> it's building a framework.
>
>*cough* ummm...  This is doomed to failure.  Code *must* do something to be
>taken seriously.  At the very least, I would expect to see the existing test
>driver framework running against these "abstract concepts" <wink>
>
>> Once the framework is there, most of the code needed to implement the
>> functionality should already be in the project - code to hook into
>> Outlook, to train on a message, to parse mbox files, and so on.  It just
>> needs hooking into the framework.
>
>See above <wink>.
>
>Mark.


c'est moi - TimS
www.fourstonesExpressions.com 




