[Spambayes-checkins] spambayes/Outlook2000 train.py,1.17,1.18

Wed Nov 13 19:26:30 2002

Update of /cvsroot/spambayes/spambayes/Outlook2000
In directory usw-pr-cvs1:/tmp/cvs-serv31686/Outlook2000

Modified Files:
	train.py 
Log Message:
train_message():

Bugfix:  If a msg had been incorrectly classified, untraining from the
wrong category worked fine, but training for the new category had no
effect.  That's because tokenize() returns an iterator rather than a
sequence, and once you've run through the iterator to the end (as
unlearning does), running through it again simply yields nothing.
So tokenize() is now called anew whenever needed.  Transforming the
result into a sequence via list() or tuple() would also have worked,
but the case in which the token stream *can* be reused is too rare to
worry about.
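
The exhaustion behavior described above can be reproduced in isolation.
This is only a minimal sketch: the tokenize() below is a hypothetical
stand-in that splits on whitespace, not the real spambayes tokenizer,
and count_tokens() is an illustrative helper.

```python
def tokenize(text):
    """Hypothetical stand-in for the real tokenizer: yields tokens lazily."""
    for word in text.split():
        yield word.lower()


def count_tokens(tokens):
    """Consume an iterable of tokens and report how many it produced."""
    return sum(1 for _ in tokens)


msg = "Buy cheap pills now"

stream = tokenize(msg)
first_pass = count_tokens(stream)    # runs the generator to the end
second_pass = count_tokens(stream)   # same generator again: nothing left

# Calling tokenize() anew gives a fresh generator with all the tokens.
fresh_pass = count_tokens(tokenize(msg))

print(first_pass, second_pass, fresh_pass)
```

The second pass sees no tokens because a generator cannot be rewound,
which is why learning after unlearning had no effect on the classifier.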

Optimization:  Don't bother tokenizing, or even materializing a msg
object, if the msg has already been trained with the correct
classification.  Incremental training is much faster now, since such
msgs are skipped right after the message_db lookup.
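
The control flow behind both changes can be sketched as follows.  This
is an illustration under assumptions, not the code in train.py:
FakeBayes, load_text and the simplified train_message() signature are
all hypothetical, and tokenize() is again a whitespace-splitting
stand-in.

```python
class FakeBayes:
    """Hypothetical classifier stub that just records what it was asked to do."""

    def __init__(self):
        self.calls = []

    def learn(self, tokens, is_spam):
        self.calls.append(("learn", list(tokens), is_spam))

    def unlearn(self, tokens, is_spam):
        self.calls.append(("unlearn", list(tokens), is_spam))


def tokenize(text):
    """Hypothetical stand-in tokenizer: a generator, so single-use."""
    for word in text.split():
        yield word.lower()


def train_message(key, load_text, is_spam, message_db, bayes):
    """Train one message; return True if the classifier was changed."""
    was_spam = message_db.get(key)
    if was_spam == is_spam:
        return False  # already correctly trained: skip loading entirely

    # Only now pay the cost of materializing the message text.
    text = load_text()
    if was_spam is not None:
        # Classification changed: undo the old training with fresh tokens.
        bayes.unlearn(tokenize(text), was_spam)

    # Tokenize again, because the previous generator is exhausted.
    bayes.learn(tokenize(text), is_spam)
    message_db[key] = is_spam
    return True
```

Passing the loader as a callable stands in for deferring
msg.GetEmailPackageObject() until after the message_db check.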

Index: train.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Outlook2000/train.py,v
retrieving revision 1.17
retrieving revision 1.18
diff -C2 -d -r1.17 -r1.18
*** train.py	13 Nov 2002 05:29:10 -0000	1.17
--- train.py	13 Nov 2002 19:26:27 -0000	1.18
***************
*** 34,54 ****
      # be written to the message (so the user can see some effects)
      from tokenizer import tokenize
!     stream = msg.GetEmailPackageObject()
!     tokens = tokenize(stream)
!     # Handle we may have already been trained.
      was_spam = mgr.message_db.get(msg.searchkey)
!     if was_spam is None:
!         # never previously trained.
!         pass
!     elif was_spam == is_spam:
!         # Already in DB - do nothing (full retrain will wipe msg db)
!         # leave now.
!         return False
!     else:
!         mgr.bayes.unlearn(tokens, was_spam, False)
!     # OK - setup the new data.
!     mgr.bayes.learn(tokens, is_spam, False)
      mgr.message_db[msg.searchkey] = is_spam
      mgr.bayes_dirty = True
      # Simplest way to rescore is to re-filter with all_actions = False
      if rescore:
--- 34,53 ----
      # be written to the message (so the user can see some effects)
      from tokenizer import tokenize
! 
      was_spam = mgr.message_db.get(msg.searchkey)
!     if was_spam == is_spam:
!         return False    # already correctly classified
! 
!     # Brand new (was_spam is None), or incorrectly classified.
!     stream = msg.GetEmailPackageObject()
!     if was_spam is not None:
!         # The classification has changed; unlearn the old classification.
!         mgr.bayes.unlearn(tokenize(stream), was_spam, False)
! 
!     # Learn the correct classification.
!     mgr.bayes.learn(tokenize(stream), is_spam, False)
      mgr.message_db[msg.searchkey] = is_spam
      mgr.bayes_dirty = True
+ 
      # Simplest way to rescore is to re-filter with all_actions = False
      if rescore:
***************
*** 59,63 ****
      return True

! def train_folder( f, isspam, mgr, progress):
      num = num_added = 0
      for message in f.GetMessageGenerator():
--- 58,62 ----
      return True

! def train_folder(f, isspam, mgr, progress):
      num = num_added = 0
      for message in f.GetMessageGenerator():