[Spambayes] Rethinking Corpus, mboxutils, life, the world, everything

Tim Stone - Four Stones Expressions tim at fourstonesExpressions.com
Tue Dec 3 13:40:05 2002


12/2/2002 2:27:11 PM, Neale Pickett <neale@woozle.org> wrote:

>So then, Richie Hindle <richie@entrian.com> is all like:
>
>> Only pop3proxy.py uses Corpus to my knowledge - hammiebulk.py imports it,
>> but doesn't seem to use it (?)
>> 
>> I'd like to see more of the existing code using it, but then again I'm not
>> in a hurry to implement the idea myself...
>
>I have to confess that I haven't even looked at Corpus.py yet.
>hammiebulk imports it because it needed it for some verbose variable at
>one point.  But I'm going to read up before I take it out, maybe there's
>something there I can use :)

The Corpus stuff was created primarily in response to the needs of the 
pop3proxy.  That process manages sets of mail for 'the other' clients, like 
Netscape, Opera, OE, etc., for which we don't have any hooks into their 
internals.  The only 'interface' we have from them is their pop3 socket 
datastream.  We can't tell when a message moves around in one of their 
folders, and so we have to keep caches of the mail we receive and give them a 
user interface they can use to train the classifier with the cached mail.  
Corpus and its subclass FileCorpus manage that cache for the pop3proxy.  
Message and its subclass FileMessage wrap each message, giving it an interface 
that is particularly suited for the pop3proxy.  ExpiryCorpus and 
ExpiryFileCorpus allow the cache contents to be purged by age, so the cache 
doesn't grow indefinitely.  All of this is quite suitable for the pop3proxy, 
but not at all suitable for the Outlook client, which has plenty of hooks into 
the mail persistence mechanism.
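
A minimal sketch of the age-purging idea described above, not the actual ExpiryCorpus code. The class name ExpiringCache, its methods, and the injectable clock are all hypothetical, chosen only to illustrate the technique:

```python
import time


class ExpiringCache:
    """Hypothetical stand-in for an age-purged message cache."""

    def __init__(self, max_age_seconds, clock=time.time):
        self.max_age_seconds = max_age_seconds
        self._clock = clock      # injectable so the demo can fake time
        self._entries = {}       # key -> (arrival time, message text)

    def add(self, key, text):
        self._entries[key] = (self._clock(), text)

    def keys(self):
        return list(self._entries)

    def purge_expired(self):
        """Drop entries older than max_age_seconds; return their keys."""
        cutoff = self._clock() - self.max_age_seconds
        expired = [key for key, (arrived, _) in self._entries.items()
                   if arrived < cutoff]
        for key in expired:
            del self._entries[key]
        return expired


# Demo with a fake clock: "old" arrives, two minutes pass, "new" arrives.
now = [1000.0]
cache = ExpiringCache(max_age_seconds=60, clock=lambda: now[0])
cache.add("old", "first message")
now[0] += 120
cache.add("new", "second message")
purged = cache.purge_expired()
remaining = cache.keys()
```

Passing the clock in as a parameter is only a convenience for exercising the purge without sleeping; a real cache would more likely look at file modification times.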

The Corpus is observable, and sends notification of two events: a message 
addition and a message removal.  The Trainer class is an observer, and trains 
a classifier appropriately, based on the kind of trainer it is and whether a 
message is being added to or removed from the corpus it's observing.
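
The observer relationship described above can be sketched roughly as follows. This is an illustration under assumptions, not the real Corpus.py: the method names (add_observer, on_add_message, on_remove_message, learn, unlearn) and the CountingClassifier stand-in are hypothetical.

```python
class Corpus:
    """Observable set of messages, keyed by an arbitrary message id."""

    def __init__(self):
        self._messages = {}
        self._observers = []

    def add_observer(self, observer):
        self._observers.append(observer)

    def add_message(self, key, text):
        self._messages[key] = text
        for observer in self._observers:
            observer.on_add_message(text)

    def remove_message(self, key):
        text = self._messages.pop(key)
        for observer in self._observers:
            observer.on_remove_message(text)


class CountingClassifier:
    """Stand-in for the real classifier: it only counts trained messages."""

    def __init__(self):
        self.counts = {True: 0, False: 0}

    def learn(self, text, is_spam):
        self.counts[is_spam] += 1

    def unlearn(self, text, is_spam):
        self.counts[is_spam] -= 1


class Trainer:
    """Observer that trains on addition and untrains on removal."""

    def __init__(self, classifier, is_spam):
        self.classifier = classifier
        self.is_spam = is_spam   # the "kind of trainer"

    def on_add_message(self, text):
        self.classifier.learn(text, self.is_spam)

    def on_remove_message(self, text):
        self.classifier.unlearn(text, self.is_spam)


classifier = CountingClassifier()
spam_corpus = Corpus()
spam_corpus.add_observer(Trainer(classifier, is_spam=True))
spam_corpus.add_message("msg-1", "buy now")
spam_corpus.add_message("msg-2", "cheap pills")
spam_corpus.remove_message("msg-1")
print(classifier.counts)  # {True: 1, False: 0}
```

The corpus itself knows nothing about spam or ham here; which way a message trains is decided entirely by the trainer attached to it.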

In the Outlook client (as nearly as I can tell) the idea of a cached corpus is 
nonsense.  Mark can tell when a message moves from one folder to another, and 
can do the training based on the kind of folder, so this 'third party' user 
interface to an observable cache of messages is not a paradigm that works for 
Outlook.

The other thing involved is the mboxutils and msgs 'legacy'.  This appears to 
be primarily directed at unix-style mailboxes, with the message classes being 
kinda force-fit into some other use-cases.  Clearly unix-style mailboxes 
represent a third message persistence paradigm, a single file with all the 
messages in it, with a recognizable boundary line between them.  (btw, it 
seems like it would be fairly easy to screw up this kind of mailbox...)  
Hammie* uses this stuff, even when it's not training on unix mailboxes, and 
there's code rambling around in there that says "if I'm looking at an mbox, do 
(a), if I'm looking at a directory, do (b), if I'm looking at a ..."  There are 
clearly some valid candidates for abstraction in this arena.
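
One way to picture the abstraction candidate: confine the "mbox or directory?" test to a single place so callers just iterate. The sketch below uses today's standard-library mailbox and email modules rather than anything in mboxutils; the function name iter_messages and the sample files are hypothetical.

```python
import email
import mailbox
import os
import tempfile


def iter_messages(path):
    """Yield message objects from an mbox file or a directory of
    one-message-per-file messages, hiding the difference from callers."""
    if os.path.isdir(path):
        for name in sorted(os.listdir(path)):
            full = os.path.join(path, name)
            if os.path.isfile(full):
                with open(full, "rb") as fp:
                    yield email.message_from_binary_file(fp)
    else:
        box = mailbox.mbox(path, create=False)
        try:
            for msg in box:
                yield msg
        finally:
            box.close()


# Demo: build a tiny mbox and a tiny message directory, then read both
# through the same call.
with tempfile.TemporaryDirectory() as tmp:
    mbox_path = os.path.join(tmp, "inbox.mbox")
    with open(mbox_path, "w", newline="\n") as fp:
        fp.write("From a@example.com Tue Dec  3 13:40:05 2002\n"
                 "Subject: one\n\nfirst body\n\n"
                 "From b@example.com Tue Dec  3 13:41:05 2002\n"
                 "Subject: two\n\nsecond body\n")

    msg_dir = os.path.join(tmp, "cache")
    os.mkdir(msg_dir)
    with open(os.path.join(msg_dir, "0001.txt"), "w", newline="\n") as fp:
        fp.write("Subject: three\n\nthird body\n")

    mbox_subjects = [m["Subject"] for m in iter_messages(mbox_path)]
    dir_subjects = [m["Subject"] for m in iter_messages(msg_dir)]
```

The branch still exists, but only once; the calling code no longer needs to know which persistence paradigm it is reading from.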

So when I look at Corpus, I think that some further abstraction is necessary.   
Mark saw this instantly; it took me longer.  Specifically, the concept of a 
'corpus' carries some definitional baggage that has to do with training and 
such.  The Corpus class is abstract in definition, but it makes too many 
assumptions about its environment to be abstract *enough*.  I think we should 
refactor and introduce another level of abstraction, perhaps called 'Folder'.  
Here's a strawman:

class Folder:
    """Basic iteration, maybe not much else here"""

    def __getitem__(self, key):
    def keys(self):
    def __iter__(self):
    def makeMessage(self, key):

class Directory(Folder):
    def __init__(self, directory):

class Mbox(Folder):
    def __init__(self, mbox):

class Outlook(Folder):
    def __init__(self, ???):

class FileCorpus(Directory):
    """Observable set of messages"""

class FileCache(FileCorpus):
    """Expirable set of messages"""


class Message:
    """Message wrapper, maybe even is just email.Message"""

class MessageFactory:
    """Abstract factory for Message"""

class FileMessageFactory:
    """Wraps a file system message"""

class OutlookMessageFactory:
    """Wraps an Outlook message, probably only has a key and delegator methods
    to the Outlook API (?)"""

class SomeOtherMessageFactory:
    """wraps some other kind of message... you get the idea"""
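
To make the strawman a little more concrete, here is a runnable sketch of just the Folder, Directory, and FileMessageFactory pieces. It is one possible reading of the outline, not a proposal for the final design. Assumptions: iteration yields messages rather than keys, keys are file names, the factory exposes a hypothetical create(path) method, and messages are plain email.message.Message objects.

```python
import email
import os
import tempfile


class FileMessageFactory:
    """Builds a message object from a file on disk (hypothetical API)."""

    def create(self, path):
        with open(path, "rb") as fp:
            return email.message_from_binary_file(fp)


class Folder:
    """Keyed, iterable collection of messages; subclasses say where
    the messages live by supplying keys() and makeMessage()."""

    def keys(self):
        raise NotImplementedError

    def makeMessage(self, key):
        raise NotImplementedError

    def __getitem__(self, key):
        if key not in self.keys():
            raise KeyError(key)
        return self.makeMessage(key)

    def __iter__(self):
        for key in self.keys():
            yield self.makeMessage(key)


class Directory(Folder):
    """Folder backed by a directory holding one message per file."""

    def __init__(self, directory, factory=None):
        self.directory = directory
        self.factory = factory or FileMessageFactory()

    def keys(self):
        return sorted(
            name for name in os.listdir(self.directory)
            if os.path.isfile(os.path.join(self.directory, name))
        )

    def makeMessage(self, key):
        return self.factory.create(os.path.join(self.directory, key))


# Demo: two message files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, subject in [("1.txt", "hello"), ("2.txt", "world")]:
        with open(os.path.join(tmp, name), "w", newline="\n") as fp:
            fp.write(f"Subject: {subject}\n\nbody\n")

    folder = Directory(tmp)
    folder_keys = folder.keys()
    folder_subjects = [msg["Subject"] for msg in folder]
    first_subject = folder["1.txt"]["Subject"]
```

An Mbox or Outlook subclass would only need its own keys() and makeMessage(); the observable FileCorpus and expirable FileCache layers from the outline would then sit on top of Directory without caring how messages are stored.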



c'est moi - TimS
www.fourstonesExpressions.com 




