[Spambayes] Rethinking Corpus, mboxutils, life, the world, everything
Tim Stone - Four Stones Expressions
tim at fourstonesExpressions.com
Tue Dec 3 13:40:05 2002
12/2/2002 2:27:11 PM, Neale Pickett <neale@woozle.org> wrote:
>So then, Richie Hindle <richie@entrian.com> is all like:
>
>> Only pop3proxy.py uses Corpus to my knowledge - hammiebulk.py imports it,
>> but doesn't seem to use it (?)
>>
>> I'd like to see more of the existing code using it, but then again I'm not
>> in a hurry to implement the idea myself...
>
>I have to confess that I haven't even looked at Corpus.py yet.
>hammiebulk imports it because it needed it for some verbose variable at
>one point. But I'm going to read up before I take it out, maybe there's
>something there I can use :)
The Corpus stuff was created primarily in response to the needs of the
pop3proxy. That process manages sets of mail for 'the other' clients, like
Netscape, Opera, OE, etc., for which we don't have any hooks into their
internals. The only 'interface' we have from them is their pop3 socket
datastream. We can't tell when a message moves around in one of their
folders, and so we have to keep caches of the mail we receive and give them a
user interface they can use to train the classifier with the cached mail.
Corpus and its subclass FileCorpus manage that cache for the pop3proxy.
Message and its subclass FileMessage wrap each message, giving it an interface
that is particularly suited for the pop3proxy. ExpiryCorpus and
ExpiryFileCorpus allow the cache contents to be age purged, so the cache
doesn't grow indefinitely. All of this is quite suitable for the pop3proxy,
but not at all suitable for the Outlook client, which has plenty of hooks into
the mail persistence mechanism.
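
To make the age-purging idea concrete, here is a minimal sketch. It is not the actual ExpiryCorpus code: the class name ExpiringCache, its methods, and the explicit `now` argument are all made up for illustration, and a plain dict stands in for the file-backed cache.

```python
import time


class ExpiringCache:
    """Hypothetical stand-in for an age-purged message cache."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self.entries = {}  # key -> (arrival time, message text)

    def add(self, key, text, now=None):
        # Record when the message arrived so it can be aged out later.
        arrived = time.time() if now is None else now
        self.entries[key] = (arrived, text)

    def purge(self, now=None):
        """Drop entries older than max_age and return their keys."""
        now = time.time() if now is None else now
        expired = [key for key, (arrived, _) in self.entries.items()
                   if now - arrived > self.max_age]
        for key in expired:
            del self.entries[key]
        return expired
```

Passing `now` explicitly is only there so the behaviour can be checked without waiting for real time to pass.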
The Corpus is observable, and sends notification of two events: a message
addition and a message removal. The Trainer class is an observer, and trains
a classifier appropriately, based on the kind of trainer it is and whether a
message is being added to or removed from the corpus it's observing.
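
Roughly, the observer arrangement looks like the sketch below. This is a simplified illustration, not a copy of Corpus.py: the method names (addObserver, onAddMessage, onRemoveMessage), the learn/unlearn calls, and the CountingClassifier stand-in are assumptions chosen to show the shape of the idea.

```python
class Corpus:
    """Observable set of messages, keyed by some message id."""

    def __init__(self):
        self.msgs = {}
        self.observers = []

    def addObserver(self, observer):
        self.observers.append(observer)

    def addMessage(self, key, text):
        self.msgs[key] = text
        for observer in self.observers:
            observer.onAddMessage(text)

    def removeMessage(self, key):
        text = self.msgs.pop(key)
        for observer in self.observers:
            observer.onRemoveMessage(text)


class CountingClassifier:
    """Stand-in classifier that only counts what it was trained on."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0

    def learn(self, text, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, text, is_spam):
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1


class Trainer:
    """Observer: trains on additions, untrains on removals."""

    def __init__(self, classifier, is_spam):
        self.classifier = classifier
        self.is_spam = is_spam

    def onAddMessage(self, text):
        self.classifier.learn(text, self.is_spam)

    def onRemoveMessage(self, text):
        self.classifier.unlearn(text, self.is_spam)
```

A spam corpus would get a Trainer built with is_spam=True and a ham corpus one with is_spam=False; moving a message from one corpus to the other then untrains and retrains it without the corpus knowing anything about classifiers.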
In the Outlook client (as nearly as I can tell) the idea of a cached corpus is
nonsense. Mark can tell when a message moves from one folder to another, and
can do the training based on the kind of folder, so this 'third party' user
interface to an observable cache of messages is not a paradigm that works for
Outlook.
The other thing involved is the mboxutils and msgs 'legacy'. This appears to
be primarily directed at unix-style mailboxes, with the message classes being
kinda force-fit into some other use-cases. Clearly unix-style mailboxes
represent a third message persistence paradigm, a single file with all the
messages in it, with a recognizable boundary line between them. (btw, it seems
like it would be fairly easy to screw up this kind of mailbox...) Hammie*
uses this stuff, even when it's not training on unix mailboxes, and there's
code rambling around in there that says "if I'm looking at an mbox, do (a), if
I'm looking at a directory, do (b), if I'm looking at a ..." There are
clearly some valid candidates for abstraction in this arena.
So when I look at Corpus, I think that some further abstraction is necessary.
Mark saw this instantly; it took me longer. Specifically, the concept of a
'corpus' carries some definitional baggage that has to do with training and
such. The Corpus class is abstract in definition, but it makes too many
assumptions about its environment to be abstract *enough*. I think we should
refactor and introduce another level of abstraction, perhaps called 'Folder'.
Here's a strawman:
class Folder:
    """Basic iteration, maybe not much else here"""
    def __getitem__(self, key): pass
    def keys(self): pass
    def __iter__(self): pass
    def makeMessage(self, key): pass

class Directory(Folder):
    def __init__(self, directory): pass

class Mbox(Folder):
    def __init__(self, mbox): pass

class Outlook(Folder):
    def __init__(self, ???): pass

class FileCorpus(Directory):
    """Observable set of messages"""

class FileCache(FileCorpus):
    """Expirable set of messages"""

class Message:
    """Message wrapper, maybe even is just email.Message"""

class MessageFactory:
    """Abstract factory for Message"""

class FileMessageFactory:
    """Wraps a file system message"""

class OutlookMessageFactory:
    """Wraps an outlook message, probably only has a key and delegator
    methods to outlook api (?)"""

class SomeOtherMessageFactory:
    """wraps some other kind of message... you get the idea"""
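
As a rough illustration of what the Folder layer in the strawman above buys, here is a small runnable sketch with just the Directory and Mbox cases. It leans on the standard library's `mailbox` and `email` modules rather than mboxutils; the file-name-as-key choice, the method bodies, and the `subjects` helper are assumptions for the example, not existing spambayes code.

```python
import email
import mailbox
import os


class Folder:
    """Read-only view of a set of messages, whatever the storage is."""

    def keys(self):
        raise NotImplementedError

    def __getitem__(self, key):
        raise NotImplementedError

    def __iter__(self):
        for key in self.keys():
            yield self[key]


class Directory(Folder):
    """One message per file; the key is the file name."""

    def __init__(self, directory):
        self.directory = directory

    def keys(self):
        return sorted(os.listdir(self.directory))

    def __getitem__(self, key):
        with open(os.path.join(self.directory, key), "rb") as f:
            return email.message_from_binary_file(f)


class Mbox(Folder):
    """All messages in a single unix-style mailbox file."""

    def __init__(self, path):
        self.box = mailbox.mbox(path, create=False)

    def keys(self):
        return self.box.keys()

    def __getitem__(self, key):
        return self.box[key]


def subjects(folder):
    # The caller never asks "is this an mbox or a directory?"
    return [msg["Subject"] for msg in folder]
```

The point is that code like `subjects` only sees the Folder interface, so the "if mbox do (a), if directory do (b)" branching lives in one place, in the choice of subclass.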
>
>Neale
c'est moi - TimS
www.fourstonesExpressions.com