Hi there,

First off: I started playing with spambayes last sunday, and it's been a blast so far. I'm using pop3proxy.py, love the brand new web interface.

However, I did a cvs up today, and unpickling the database stopped working, as classifier.Bayes became a classic class. After some twiddling I managed to repair it, but now I get AssertionErrors during training:

    [python:~/code/spambayes] just% ./hammie.py -g mymail/good.mbox.fix
    Training ham (mymail/good.mbox.fix): 4
    Traceback (most recent call last):
      File "./hammie.py", line 483, in ?
        main()
      File "./hammie.py", line 460, in main
        h.update_probabilities()
      File "./hammie.py", line 336, in update_probabilities
        self.bayes.update_probabilities()
      File "classifier.py", line 327, in update_probabilities
        assert hamcount <= nham
    AssertionError

Is my db screwed or is it repairable?

Just
Lemme answer before Tim gets to ya... This is why you keep a corpus. This is pre-alpha code, and anything that anyone does at any time can screw the world up. You should simply delete your database and retrain it. If you don't have a corpus, go ahead and make one now... <wink> - TimS 11/6/2002 1:55:28 PM, Just van Rossum <just@letterror.com> wrote:
Hi there,
First off: I started playing with spambayes last sunday, and it's been a blast so far. I'm using pop3proxy.py, love the brand new web interface.
However, I did a cvs up today, and unpickling the database stopped working, as classifier.Bayes became a classic class. After some twiddling I managed to repair it, but now I get AssertionErrors during training:
    [python:~/code/spambayes] just% ./hammie.py -g mymail/good.mbox.fix
    Training ham (mymail/good.mbox.fix): 4
    Traceback (most recent call last):
      File "./hammie.py", line 483, in ?
        main()
      File "./hammie.py", line 460, in main
        h.update_probabilities()
      File "./hammie.py", line 336, in update_probabilities
        self.bayes.update_probabilities()
      File "classifier.py", line 327, in update_probabilities
        assert hamcount <= nham
    AssertionError
Is my db screwed or is it repairable?
Just
_______________________________________________ Spambayes mailing list Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes
- Tim www.fourstonesExpressions.com
Tim Stone - Four Stones Expressions wrote:
Lemme answer before Tim gets to ya...
This is why you keep a corpus. This is pre-alpha code, and anything that anyone does at any time can screw the world up. You should simply delete your database and retrain it. If you don't have a corpus, go ahead and make one now... <wink>
Okelydokely! Hey, it already works so well, why not call it "beta"? <wink> Just
Tim Stone - Four Stones Expressions wrote:
This is why you keep a corpus. This is pre-alpha code, and anything that anyone does at any time can screw the world up. You should simply delete your database and retrain it. If you don't have a corpus, go ahead and make one now... <wink>
Alright, this triggered a feature request in me, which resulted in some hacking activity <wink>. The patch below appends training messages to one of two mbox files ('_pop3proxyspam.mbox' or '_pop3proxyham.mbox' respectively), making it easier to later rebuild the database from scratch, while still being able to train ad hoc with the web interface of pop3proxy.py. Good idea?

Just

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.10
diff -c -r1.10 pop3proxy.py
*** pop3proxy.py  5 Nov 2002 22:18:56 -0000  1.10
--- pop3proxy.py  6 Nov 2002 21:37:03 -0000
***************
*** 608,615 ****
          raise SystemExit

      def onUpload(self, params):
!         message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
          self.push("""<p>Trained on your message. Saving database...</p>""")
          self.push(" ")  # Flush... must find out how to do this properly...
--- 608,626 ----
          raise SystemExit

      def onUpload(self, params):
!         message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
+         # Append the message to a file, to make it easier to rebuild
+         # the database later.
+         message = message.replace('\r\n', '\n').replace('\r', '\n')
+         if isSpam:
+             f = open("_pop3proxyspam.mbox", "a")
+         else:
+             f = open("_pop3proxyham.mbox", "a")
+         f.write("From ???@???\n")  # fake From line (XXX good enough?)
+         f.write(message)
+         f.write("\n")
+         f.close()
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
          self.push("""<p>Trained on your message. Saving database...</p>""")
          self.push(" ")  # Flush... must find out how to do this properly...
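For readers who want to try the idea outside pop3proxy.py, here is a minimal standalone sketch in present-day Python. It is not part of the patch: the function name `append_to_mbox`, the `>From ` quoting of body lines and the temporary file used in the demo are assumptions added for illustration. It keeps the patch's fake "From ???@???" separator and its line-ending normalization.

```python
import os
import tempfile


def append_to_mbox(path, message_text):
    """Append one message to an mbox-style file (hypothetical helper)."""
    # Normalize line endings the same way the patch does.
    text = message_text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption beyond the patch: quote lines starting with "From " so a
    # reader does not mistake them for message separators.
    lines = [(">" + line) if line.startswith("From ") else line
             for line in text.split("\n")]
    with open(path, "a", newline="\n") as f:
        f.write("From ???@???\n")  # fake separator line, as in the patch
        f.write("\n".join(lines))
        f.write("\n\n")            # blank line before the next separator


# Demo: append two training messages to a scratch file and read it back.
mbox_path = os.path.join(tempfile.mkdtemp(), "_pop3proxyham.mbox")
append_to_mbox(mbox_path, "Subject: a\r\n\r\nFrom here on it is body text\r\n")
append_to_mbox(mbox_path, "Subject: b\n\nbody")
with open(mbox_path) as f:
    mbox_data = f.read()
```

Whether a fake separator line is "good enough" depends on the tool that later reads the file, which is the open question flagged by the XXX comment in the patch.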
[Just van Rossum]
Alright, this triggered a feature request in me, which resulted in some hacking activity <wink>. The patch below appends training messages to one of two mbox files ('_pop3proxyspam.mbox' or '_pop3proxyham.mbox' respectively), making it easier to later rebuild the database from scratch, while still being able to train ad hoc with the web interface of pop3proxy.py. Good idea?
Yes, and it's another reason to create a dedicated "training class" module, so that various clients can at least share an *interface* for doing such stuff (and so that new clients don't have to reinvent these concepts from scratch each time around).
[Tim Peters]
it's another reason to create a dedicated "training class" module, so that various clients can at least share an *interface* for doing such stuff
Tim Stone and I have made a start on this (or rather Tim has and I've poked my nose in) - I mention it because he's away until the weekend and we wouldn't want anyone to duplicate the work. It may be too early to talk details (and slightly rude in Tim's absence - my apologies!) but here's the email I sent to Tim outlining how I thought it might work. I was thinking more about generic Message and Corpus classes than specifically about training. Laughing and pointing should be directed towards me rather than Tim. ------------------------------------------------------------------------- [Tim S]
We would include methods in Corpus to add a message to, remove a message from, move from one to another, with the appropriate untraining/retraining built in. We *could* have a method that, given a message substance (headers and body) would find an existing message in a corpus that matched it (somehow). We would include metadata with the corpus that tells us whether it's a spam/ham/untrained corpus, so the retraining can be done. We could even include a fourth type of corpus (cache) with methods to use expiry data in the message metadata to remove old cache messages...
This is excellent stuff. A Corpus contains Messages. CacheCorpus is a subclass of Corpus that adds the concept of expiry, and contains CachedMessages (CachedMessage being a subclass of Message) that know about their own expiry details (time of creation, size, time of last use, whatever it depends on). That's very neat.

A Corpus wouldn't know how to create Message objects, nor would a Message object know how to create itself - classes *derived from* them would know how to do that. For instance (totally untested code, probably full of typos) -

    class Message:
        def __init__(self, messageText):
            """Pass in the text of the message, headers and body."""
            # etc.

        def name(self):
            """Returns a name for this message which is unique within its corpus."""
            raise NotImplementedError

    class FileMessage(Message):
        """A Message representing an email stored in a file on disk."""
        def __init__(self, pathname):
            self.pathname = pathname
            messageFile = open(self.pathname)
            messageText = messageFile.read()
            Message.__init__(self, messageText)
            messageFile.close()

        def name(self):
            return self.pathname

...so the Message class dictates that all Messages must have a name unique to their corpus, but doesn't dictate how that name is determined. Concrete Message-derived classes fill in that detail. [I may be putting too much into the base class by demanding that the text of the message be given to the constructor - that precludes making FileMessage lazy, and only read the file when it needs to.]

'Corpus' works the same way; again, the details may be naive, but this is the general idea:

    import fnmatch
    import os

    class Corpus:
        """A collection of Message objects."""
        def __getitem__(self, messageName):
            """Makes Corpus act like a dictionary: a la corpus[messageName]"""
            raise NotImplementedError

    class DirectoryCorpus(Corpus):
        """Represents a corpus of messages stored as individual files in a directory.

        Example: corpus = DirectoryCorpus('mydir', '*.msg')"""
        def __init__(self, directoryPathname, globPattern):
            self.directoryPathname = directoryPathname
            self.globPattern = globPattern
            self.messageCache = {}  # The messages we've read from disk so far.

        def __getitem__(self, messageName):
            try:
                return self.messageCache[messageName]
            except KeyError:
                if not fnmatch.fnmatch(messageName, self.globPattern):
                    raise KeyError("Message name doesn't match naming pattern")
                pathname = os.path.join(self.directoryPathname, messageName)
                message = FileMessage(pathname)  # May raise IOError - let it.
                self.messageCache[messageName] = message
                return message

Here I've implemented the laziness idea by only reading the file when it's asked for. Maybe the message cache should go in Corpus - that would be useful for *all* Corpus implementations. You can then envisage a MailboxCorpus, an OutlookFolderCorpus, an IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.
move [Messages] from one [Corpus] to another, with the appropriate untraining/retraining built in.
Yes - this could work using observer objects registered with Corpus objects:

    class CorpusObserver:
        """Derive your class from this and call corpus.addObserver to be
        informed when things happen to a corpus."""
        def onAddMessage(self, corpus, message):
            """Called when a message is added to a corpus."""
            pass  # Not NotImplementedError, so that people don't have to
                  # implement *all* the event functions of CorpusObserver.

    class Corpus:
        def __init__(self):
            self.observers = []     # List of CorpusObservers to inform of events
            self.messageCache = {}  # Messages currently held, keyed by name

        def addObserver(self, observer):
            self.observers.append(observer)

        def addMessage(self, message):
            """External code adds messages by calling this - for example, in
            an OutlookCorpus it would be called as a result of the user
            dragging a message into the folder."""
            self.messageCache[message.name()] = message
            for observer in self.observers:
                observer.onAddMessage(self, message)

    class AutoTrainer(CorpusObserver):
        """Trains the given classifier when messages are added or removed
        from the given Ham/Spam corpuses."""
        def __init__(self, bayes, hamCorpus, spamCorpus):
            self.bayes = bayes
            self.hamCorpus = hamCorpus
            self.spamCorpus = spamCorpus
            hamCorpus.addObserver(self)
            spamCorpus.addObserver(self)

        def onAddMessage(self, corpus, message):
            if corpus == self.spamCorpus:
                self.bayes.learn(tokenize(message), True)
            else:
                assert corpus == self.hamCorpus, "Unknown corpus"
                self.bayes.learn(tokenize(message), False)

...and likewise for removeMessage, onRemoveMessage and unlearn.
I'm going to be travelling for the rest of the week, and may not be able to connect, so you may not hear from me till Friday sometime...
OK. Hopefully this will get to you before you leave, and give you plenty to think about. You might want to run it past Tim Peters, 'cos he's *far* better at this kind of thing than I am (though he's also busy). I think this is the sort of thing he has in mind. Most of the *new* code that's needed is defining the abstract concepts and their interfaces, rather than writing code that actually *does* anything - it's building a framework. Once the framework is there, most of the code needed to implement the functionality should already be in the project - code to hook into Outlook, to train on a message, to parse mbox files, and so on. It just needs hooking into the framework. The mark of a good framework is when you write a tiny little class (like AutoTrainer above for instance) that contains hardly any code but adds a major new feature (in this case, automatic training when moving messages around in Outlook). ------------------------------------------------------------------------- -- Richie Hindle richie@entrian.com
Laughing and pointing should be directed towards me rather than Tim.
None of that, but some thoughts <wink>.

I think that the classes I posted a while ago suffer from the exact reverse problem as your idea. My idea was to make a "message store" that is largely independent of training. I believe the problem with your design is that it deals with the training at the expense of the message store.

Obviously, but worth mentioning, is that there are competing interests here. My focus is towards clients, and specifically the outlook one (if there were more clients I would be happy to think of them too <wink>). A lot of the focus of this group is towards admins rather than individuals (which is just fine!) But it seems the current thinking is of a corpus as being a fairly static, well-controlled set of messages used almost purely for training purposes.

For client programs, this may not be practical. The corpus is a more dynamic set of messages - and worse, actually *is* the user's set of messages rather than a collection of message copies.

For example, "moving" a message in a corpus may actually mean moving the message in the user's real inbox. This may or may not be what is intended - a corpus "move" operation is more about changing a message's classification than it is about physically moving pieces of mail around.
A Corpus wouldn't know how to create Message objects, nor would a Message object know how to create itself - classes *derived from* them would know how to do that. For instance (totally untested code, probably full of typos) -
class Message:
Jeremy and I both posted real code, so starting with something that takes that into consideration would be good.
[I may be putting too much into the base class by demanding that the text of the message be given to the constructor - that precludes making FileMessage lazy, and only read the file when it needs to.]
It also defeats the abstract nature of the class.
'Corpus' works the same way; again, the details may be naive, but this is the general idea:
I'm hoping I don't sound grumpy, but again, the few systems that already exist for this engine are the best ones to use to discover the naivety early <wink>
You can then envisage a MailboxCorpus, and OutlookFolderCorpus, an IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.
I can't quite imagine that at the moment, as per my comments at the top.

Off the top of my head, I believe we need:

* An abstract "message id"
* A message classification database, as discussed before - basically just a dictionary, keyed by ID, holding either "spam" or "ham".
* A "corpus" becomes just an enumerator of message IDs for bulk/batch training. It has no move etc operations.
* A "message store" is capable of returning a message object given its ID.
* The training API simply takes message objects and updates the probability and message databases.

At that level, we really don't need much else - no folders or any other grouping of messages. I'm really not too sure there is much value in adding higher-level concepts such as folders or message store "move" operations - certainly not at the outset, where there are too many competing requirements.
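The pieces listed above could fit together roughly as in the following sketch, written in present-day Python. Everything in it is hypothetical: `DictMessageStore`, `CountingClassifier`, `train` and `train_corpus` are names made up for illustration, messages are plain strings, and the stand-in classifier only counts tokens. It is not code from the project or from the Outlook client.

```python
class DictMessageStore:
    """A message store: returns a message (here, plain text) given its ID."""

    def __init__(self, messages):
        self._messages = dict(messages)

    def get_message(self, message_id):
        return self._messages[message_id]


class CountingClassifier:
    """Stand-in for the real classifier; it only counts tokens per class."""

    def __init__(self):
        self.counts = {"spam": {}, "ham": {}}

    def learn(self, tokens, is_spam):
        table = self.counts["spam" if is_spam else "ham"]
        for token in tokens:
            table[token] = table.get(token, 0) + 1


def train(classifier, store, classification_db, message_id, is_spam):
    """Training API: update the probability data and the message database."""
    message = store.get_message(message_id)
    classifier.learn(set(message.split()), is_spam)
    classification_db[message_id] = "spam" if is_spam else "ham"


def train_corpus(classifier, store, classification_db, corpus, is_spam):
    """A corpus is just an iterable of message IDs for bulk training."""
    for message_id in corpus:
        train(classifier, store, classification_db, message_id, is_spam)


store = DictMessageStore({
    "m1": "cheap pills now",
    "m2": "lunch on friday",
    "m3": "cheap watches",
})
classifier = CountingClassifier()
classification_db = {}  # message ID -> "spam" or "ham"
train_corpus(classifier, store, classification_db, ["m1", "m3"], is_spam=True)
train_corpus(classifier, store, classification_db, ["m2"], is_spam=False)
```

Note that nothing here knows about folders or moving mail; a client would decide when to call `train` and with which ID.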
Yes - this could work using observer objects registered with Corpus objects:
This could work, but may be too simple to be necessary. If the process of re-training a message in the Outlook GUI becomes:

    def RetrainMessageAsSpam():
        # Outlook specific code to get an ID.
        message = message_store.GetMessage(id)
        if not classifier.IsSpam(message):
            classifier.train(message, is_spam=True)

And not a whole lot else, it doesn't seem worth it. Unfortunately, the decision to perform the retrain is the complex, but client specific part. Is this a newly delivered message? Did the user manually move the message somewhere? Did the user click one of our buttons? Is the user deleting old ham that we want to train on before it dies forever?

Outlook does this via examining what Outlook event we are seeing, and looking at meta-data we possibly previously attached to the message. I'm not sure this can be encapsulated well at the moment without adding all our meta-data etc baggage to the base classes.
Most of the *new* code that's needed is defining the abstract concepts and their interfaces, rather than writing code that actually *does* anything - it's building a framework.
*cough* ummm... This is doomed to failure. Code *must* do something to be taken seriously. At the very least, I would expect to see the existing test driver framework running against these "abstract concepts" <wink>
Once the framework is there, most of the code needed to implement the functionality should already be in the project - code to hook into Outlook, to train on a message, to parse mbox files, and so on. It just needs hooking into the framework.
See above <wink>. Mark.
Ok, so I found the message, and here are my thoughts. I freely admit that the abstraction was done completely from a single concrete example, that being the pop3proxy. It seems that the competing interests here can be successfully resolved by further abstraction. The 'Corpus' that Mark describes is essentially an iterator, which doesn't work well for the pop3proxy, but works well for the outlook plugin. I've spent some time looking at the Hammie/Hammiebulk/mboxutils stuff, along with the rfc822/Mailbox/email.* stuff over the last week, and I think that we (I) have managed to somewhat reinvent the wheel. It sounded like a good idea to me and Tim1 at the time... I certainly don't view Corpora as being particularly static. I view any collection of messages that are somehow related as a Corpus. Perhaps a better (more portable) term would have been Folder. Beats me. At any rate, I don't think anybody is locked in to the classes as they exist right now. Neale and Richie have added/removed stuff they need/don't need from them. I *would* like to see a single abstraction that works for the whole project. Should we start over? I'm ok with that... - TimS 11/7/2002 10:48:54 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:
Laughing and pointing should be directed towards me rather than Tim.
None of that, but some thoughts <wink>.
I think that the classes I posted a while ago suffer from the exact reverse problem as your idea. My idea was to make a "message store" that is largely independent of training. I believe the problem with your design is that it deals with the training at the expense of the message store.
Obviously, but worth mentioning, is that there are competing interests here. My focus is towards clients, and specifically the outlook one (if there were more clients I would be happy to think of them too <wink>). A lot of the focus of this group is towards admins rather than individuals (which is just fine!) But it seems the current thinking is of a corpus as being a fairly static, well-controlled set of messages used almost purely for training purposes.
For client programs, this may not be practical. The corpus is a more dynamic set of messages - and worse, actually *is* the user's set of messages rather than a collection of message copies.
For example, "moving" a message in a corpus may actually mean moving the message in the user's real inbox. This may or may not be what is intended - a corpus "move" operation is more about changing a message's classification than it is about physically moving pieces of mail around.
A Corpus wouldn't know how to create Message objects, nor would a Message object know how to create itself - classes *derived from* them would know how to do that. For instance (totally untested code, probably full of typos) -
class Message:
Jeremy and I both posted real code, so starting with something that takes that into consideration would be good.
[I may be putting too much into the base class by demanding that the text of the message be given to the constructor - that precludes making FileMessage lazy, and only read the file when it needs to.]
It also defeats the abstract nature of the class.
'Corpus' works the same way; again, the details may be naive, but this is the general idea:
I'm hoping I don't sound grumpy, but again, the few systems that already exist for this engine are the best ones to use to discover the naivety early <wink>
You can then envisage a MailboxCorpus, and OutlookFolderCorpus, an IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.
I can't quite imagine that at the moment, as per my comments at the top.
Off the top of my head, I believe we need:

* An abstract "message id"
* A message classification database, as discussed before - basically just a dictionary, keyed by ID, holding either "spam" or "ham".
* A "corpus" becomes just an enumerator of message IDs for bulk/batch training. It has no move etc operations.
* A "message store" is capable of returning a message object given its ID.
* The training API simply takes message objects and updates the probability and message databases.
At that level, we really don't need much else - no folders or any other grouping of messages. I'm really not too sure there is much value in adding higher-level concepts such as folders or message store "move" operations - certainly not at the outset, where there are too many competing requirements.
Yes - this could work using observer objects registered with Corpus objects:
This could work, but may be too simple to be necessary. If the process of re-training a message in the Outlook GUI becomes:
    def RetrainMessageAsSpam():
        # Outlook specific code to get an ID.
        message = message_store.GetMessage(id)
        if not classifier.IsSpam(message):
            classifier.train(message, is_spam=True)
And not a whole lot else, it doesn't seem worth it. Unfortunately, the decision to perform the retrain is the complex, but client specific part. Is this a newly delivered message? Did the user manually move the message somewhere? Did the user click one of our buttons? Is the user deleting old ham that we want to train on before it dies forever?
Outlook does this via examining what Outlook event we are seeing, and looking at meta-data we possibly previously attached to the message. I'm not sure this can be encapsulated well at the moment without adding all our meta-data etc baggage to the base classes.
Most of the *new* code that's needed is defining the abstract concepts and their interfaces, rather than writing code that actually *does* anything - it's building a framework.
*cough* ummm... This is doomed to failure. Code *must* do something to be taken seriously. At the very least, I would expect to see the existing test driver framework running against these "abstract concepts" <wink>
Once the framework is there, most of the code needed to implement the functionality should already be in the project - code to hook into Outlook, to train on a message, to parse mbox files, and so on. It just needs hooking into the framework.
See above <wink>.
Mark.
c'est moi - TimS www.fourstonesExpressions.com
[Richie Hindle, cogitates about Messages and their Corpus(ora)] That's the ticket! Backing off to a more fundamental level looks useful to me too. We never even straightened that much out for testing purposes (msgs.py isn't general enough; for some custom test drivers (never checked in), I couldn't even reuse the MsgStream class for my *own* directory structures). I disagree with Mark's
If the process of re-training a message in the Outlook GUI becomes:
    def RetrainMessageAsSpam():
        # Outlook specific code to get an ID.
        message = message_store.GetMessage(id)
        if not classifier.IsSpam(message):
            classifier.train(message, is_spam=True)
And not a whole lot else, it doesn't seem worth it.
because it illustrates the point <wink>: it doesn't look like a correct re-training method (although it may be, depending on assumptions about where "id" comes from, and what assorted classifier methods do), and while a correct method shouldn't be hard, in the absence of a class dedicated to doing the simple common things that *can* be done in a common way, everyone will keep screwing it up in their own client code.
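One possible reading of that concern, offered only as an interpretation: a retrain needs to know how a message was trained before (so the old training can be undone and nothing is counted twice), which is a different question from how the message currently scores. The sketch below, in present-day Python, shows a small shared helper along those lines. `ToyClassifier`, `Trainer` and `retrain` are hypothetical names, and the learn/unlearn pair is a toy stand-in, not the project's classifier.

```python
class ToyClassifier:
    """Toy stand-in with learn/unlearn; it tracks per-class token counts."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.spamcount = {}
        self.hamcount = {}

    def _update(self, tokens, is_spam, delta):
        counts = self.spamcount if is_spam else self.hamcount
        for token in set(tokens):
            counts[token] = counts.get(token, 0) + delta
            if counts[token] == 0:
                del counts[token]
        if is_spam:
            self.nspam += delta
        else:
            self.nham += delta

    def learn(self, tokens, is_spam):
        self._update(tokens, is_spam, +1)

    def unlearn(self, tokens, is_spam):
        self._update(tokens, is_spam, -1)


class Trainer:
    """Remembers how each message ID was trained, so retraining is safe."""

    def __init__(self, classifier):
        self.classifier = classifier
        self.trained_as = {}  # message ID -> True (spam) or False (ham)

    def retrain(self, message_id, tokens, is_spam):
        previous = self.trained_as.get(message_id)
        if previous is not None and previous == is_spam:
            return False  # already trained this way; avoid double counting
        if previous is not None:
            self.classifier.unlearn(tokens, previous)  # undo old training
        self.classifier.learn(tokens, is_spam)
        self.trained_as[message_id] = is_spam
        return True


toy = ToyClassifier()
trainer = Trainer(toy)
tokens = ["cheap", "pills"]
trainer.retrain("m1", tokens, False)          # first trained as ham
moved = trainer.retrain("m1", tokens, True)   # user says it is spam
repeat = trainer.retrain("m1", tokens, True)  # same request again: no-op
```

With a helper like this, client code only has to supply the message ID and the new classification.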
... You might want to run it past Tim Peters, 'cos he's *far* better at this kind of thing than I am (though he's also busy).
I have to do more Python and Zope work now, so have to guard my time on *this* project more jealously than I have. MarkH and SeanT and JeremyH all have ideas here too, and I trust you'll sort them out as a harmonious family bent on world domination. As a general strategy, the first person to check code in usually wins <wink -- but take a clue from Mark, and from the earlier days of this project, and from the pop3 proxy, and sling code more than talk about it -- refactoring in Python is easy when the need becomes apparent from real life>.
... The mark of a good framework is when you write a tiny little class (like AutoTrainer above for instance) that contains hardly any code but adds a major new feature (in this case, automatic training when moving messages around in Outlook).
The client-specific code to hook and track msg movement in Outlook is relatively massive, so everything else appears a drop in the bucket to Mark. Nevertheless, if a usable framework for capturing the *common* part of this stuff were available, removing the 5 lines of code quoted above would help (the Outlook client, and all others).
on 6/11/02 20:55, Just van Rossum at just@letterror.com wrote:
First off: I started playing with spambayes last sunday, and it's been a blast so far. I'm using pop3proxy.py, love the brand new web interface.
Did you install it on MacOS9 or MacOSX? -- Mail is a means of communication. People should ask themselves about the political implications of their choices (or non-choices) of tools and technologies. For clean mail: <http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>
François Granger wrote:
Did you install it on MacOS9 or MacOSX?
OSX, with unix Python 2.3a. In a way it's too bad spambayes doesn't work with 2.2, so you can't use the Python shipped with 10.2. (In theory it might work under OS9, but I've never had much luck with sockets in MacPython 2.x, but you could try. It uses asyncore and not threading, so that's hopeful for 9.) Just PS: the web interface of pop3proxy.py is pretty good and useful, the only downside is that it saves the database after each training, which makes it hard to train with a few messages: after each message you have to wait (up to 10 seconds on my machine with my database) before you can continue. Maybe an explicit "Save database" button is an idea?
on 7/11/02 10:05, Just van Rossum at just@letterror.com wrote:
François Granger wrote:
Did you install it on MacOS9 or MacOSX?
OSX, with unix Python 2.3a. In a way it's too bad spambayes doesn't work with 2.2, so you can't use the Python shipped with 10.2. (In theory it might work under OS9, but I've never had much luck with sockets in MacPython 2.x, but you could try. It uses asyncore and not threading, so that's hopeful for 9.)
I got as far as having it running with MacOS 9.1 and Python 2.2.1. The web server works, and the proxy answers a telnet connection on 127.0.0.1:110. I don't think I understand how the proxy is supposed to be set up. I give spambayes my POP3 server name, then change the account in my mail reader so that it connects to 127.0.0.1 as its POP3 server. And nothing happens.
after each message you have to wait (up to 10 seconds on my machine with my database) before you can continue. Maybe an explicit "Save database" button is an idea?
With the -d parameter, you can use an anydbm database instead of a pickle. With some hacking it can probably use gdbm as the anydbm db. -- Mail is a means of communication. People should ask themselves about the political implications of their choices (or non-choices) of tools and technologies. For clean mail: <http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>
François Granger wrote:
after each message you have to wait (up to 10 seconds on my machine with my database) before you can continue. Maybe an explicit "Save database" button is an idea?
With the -d parameter, you can use an anydbm database instead of a pickle. With some hacking it can probably use gdbm as the anydbm db.
Right, that's the obvious solution. Thanks. Just
François Granger wrote:
after each message you have to wait (up to 10 seconds on my machine with my database) before you can continue. Maybe an explicit "Save database" button is an idea?
With the -d parameter, you can use an anydbm database instead of a pickle. With some hacking it can probably use gdbm as the anydbm db.
Ok, so I did it. With my current setup anydbm uses dbhash/bsddb, and training (on a single message) performance seems _worse_ than with the pickle (about 20 seconds now, around 10 with pickle). Don't know whether the training itself is slower or updating the database. Training with my entire corpus took many times longer as well. Not to mention that the database is now 20 megs instead of 5... Would gdbm be expected to work faster? (I currently don't even have it.) Just
On Thu, Nov 7 2002 Just van Rossum wrote:
François Granger wrote:
after each message you have to wait (up to 10 seconds on my machine with my database) before you can continue. Maybe an explicit "Save database" button is an idea?
With the -d parameter, you can use an anydbm database instead of a pickle. With some hacking it can probably use gdbm as the anydbm db.
Ok, so I did it. With my current setup anydbm uses dbhash/bsddb, and training (on a single message) performance seems _worse_ than with the pickle (about 20 seconds now, around 10 with pickle). Don't know whether the training itself is slower or updating the database. Training with my entire corpus took many times longer as well. Not to mention that the database is now 20 megs instead of 5... Would gdbm be expected to work faster? (I currently don't even have it.)
The problem with training is that the update_probabilities() method which is called at the end goes through the whole database and updates just about every word. So the whole database is touched and needs to be written to disk. -- Sjoerd Mullender <sjoerd@acm.org>
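To see why one extra training message can touch nearly every record, consider a simplified model in which each word's probability depends on the global message counts. The sketch below is an illustration in present-day Python, not the actual classifier.py: the formula is a deliberately reduced ratio, and only the names `hamcount`, `nham` and `update_probabilities` echo the traceback earlier in the thread.

```python
def update_probabilities(wordinfo, nham, nspam):
    """Recompute every word's spam probability; return how many changed."""
    changed = 0
    for info in wordinfo.values():
        hamratio = info["hamcount"] / nham if nham else 0.0
        spamratio = info["spamcount"] / nspam if nspam else 0.0
        total = hamratio + spamratio
        prob = spamratio / total if total else 0.5
        if prob != info.get("prob"):
            info["prob"] = prob  # this record would need rewriting on disk
            changed += 1
    return changed


wordinfo = {
    "cheap": {"hamcount": 1, "spamcount": 3},
    "lunch": {"hamcount": 2, "spamcount": 1},
    "pills": {"hamcount": 1, "spamcount": 2},
}
nham, nspam = 4, 4
first_pass = update_probabilities(wordinfo, nham, nspam)

# Train on one more ham message that contains only the word "lunch".
wordinfo["lunch"]["hamcount"] += 1
nham += 1
second_pass = update_probabilities(wordinfo, nham, nspam)
```

In this toy model the second pass changes all three words even though the new message mentioned only one of them, because `nham` appears in every word's ratio. With a dbm-style store, each changed record is a separate write.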
OSX, with unix Python 2.3a. In a way it's too bad spambayes doesn't work with 2.2, so you can't use the Python shipped with 10.2.
Long ago, we settled for Python 2.2 (some people wanted 2.1, but that was unbearable). If you see violations of 2.2 compatibility, please supply patches (we'll also gladly give you checkin permission). (If it makes a difference, I'd prefer aiming for 2.2 compatibility over 2.2.2 compatibility, since 2.2 is probably what comes with MacOS 10.2. Unless it gets too ugly.) --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
Long ago, we settled for Python 2.2 (some people wanted 2.1, but that was unbearable). If you see violations of 2.2 compatibility, please supply patches (we'll also gladly give you checkin permission).
(If it makes a difference, I'd prefer aiming for 2.2 compatibility over 2.2.2 compatibility, since 2.2 is probably what comes with MacOS 10.2. Unless it gets too ugly.)
The docs say 2.2.1 and that's correct: the code is littered with True and False. Those are the only 2.2.1-isms I've seen. But a patch would nevertheless be quite big. Just
Just van Rossum wrote:
Guido van Rossum wrote:
Long ago, we settled for Python 2.2 (some people wanted 2.1, but that was unbearable). If you see violations of 2.2 compatibility, please supply patches (we'll also gladly give you checkin permission).
(If it makes a difference, I'd prefer aiming for 2.2 compatibility over 2.2.2 compatibility, since 2.2 is probably what comes with MacOS 10.2. Unless it gets too ugly.)
The docs say 2.2.1 and that's correct: the code is littered with True and False. Those are the only 2.2.1-isms I've seen. But a patch would nevertheless be quite big.
I just did a quick test with 2.2 (adding True and False to __builtins__ ;-), and the only other 2.2.1-ism is bool(), which is only used in Options.py. After fixing that everything seems to work just fine. I'd be happy to add this

    try:
        True, False
    except NameError:
        True, False = 1, 0

to a bunch of files, and patch the docs. Your call. My sf username is "jvr" ;-)

Just
[Just]
the web interface of pop3proxy.py is pretty good and useful, the only downside is that it saves the database after each training
That's now fixed (at least partly) along with some other bits:

 o The database is now saved (optionally) on exit, rather than after each message you train with. There should be explicit save/reload commands, but they can come later.
 o It now keeps two mbox files of all the messages that have been used to train via the web interface - thanks to Just for the patch.
 o All the sockets now use async - the web interface used to freeze whenever the proxy was awaiting a response from the POP3 server. That's now fixed.
 o It now copes with POP3 servers that don't issue a welcome command.
 o The training form now appears in the training results, so you can train on another message without having to go back to the Home page.

-- Richie Hindle richie@entrian.com
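The save-on-exit behaviour in the first item can be pictured with a small sketch in present-day Python. It is not the pop3proxy.py code: `DeferredSaveDatabase`, its `dirty` flag and the scratch file name are assumptions made up for illustration.

```python
import os
import pickle
import tempfile


class DeferredSaveDatabase:
    """Toy training database that is written to disk only on request."""

    def __init__(self, path):
        self.path = path
        self.counts = {}
        self.dirty = False  # True while memory holds unsaved training
        self.saves = 0      # number of times the file was written

    def train(self, token):
        self.counts[token] = self.counts.get(token, 0) + 1
        self.dirty = True   # remember the change, but do not write yet

    def save(self):
        """Write the database if anything changed; return True if written."""
        if not self.dirty:
            return False
        with open(self.path, "wb") as f:
            pickle.dump(self.counts, f)
        self.dirty = False
        self.saves += 1
        return True


db_path = os.path.join(tempfile.mkdtemp(), "toy.pik")
db = DeferredSaveDatabase(db_path)
for token in ["cheap", "pills", "cheap"]:
    db.train(token)        # no disk write per training step
db.save()                  # one write for the whole session, e.g. on exit
with open(db_path, "rb") as f:
    saved_counts = pickle.load(f)
```

The trade-off is the usual one: training is fast, but anything trained since the last save is lost if the process dies before saving.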
[Just van Rossum]
... However, I did a cvs up today, and unpickling the database stopped working, as classifier.Bayes became a classic class. After some twiddling I managed to repair it, but now I get AssertionErrors during training:
I suppose it would have worked to restore the inheritance from object long enough to open the old pickle, then copy the contents into an instance of the changed class and pickle that.
    [python:~/code/spambayes] just% ./hammie.py -g mymail/good.mbox.fix
    Training ham (mymail/good.mbox.fix): 4
    Traceback (most recent call last):
      File "./hammie.py", line 483, in ?
        main()
      File "./hammie.py", line 460, in main
        h.update_probabilities()
      File "./hammie.py", line 336, in update_probabilities
        self.bayes.update_probabilities()
      File "classifier.py", line 327, in update_probabilities
        assert hamcount <= nham
    AssertionError
Is my db screwed or is it repairable?
It's obviously screwed, and whether it's repairable depends on exactly what "some twiddling" meant. I'm sure you've built a new one from scratch by now, though!
participants (8)
- François Granger
- Guido van Rossum
- Just van Rossum
- Mark Hammond
- Richie Hindle
- Sjoerd Mullender
- Tim Peters
- Tim Stone - Four Stones Expressions