[Spambayes-checkins] spambayes notesfilter.py,1.1,1.2

Tim Stone timstone4 at users.sourceforge.net
Fri Mar 28 05:57:55 EST 2003


Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv28054

Modified Files:
	notesfilter.py 
Log Message:
A unicode print error fix

Index: notesfilter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/notesfilter.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** notesfilter.py	25 Feb 2003 18:25:12 -0000	1.1
--- notesfilter.py	28 Mar 2003 13:57:52 -0000	1.2
***************
*** 17,26 ****
          Train as Spam
          Train as Ham
          
!     It classifies mail that is in the inbox.  Mail that is classified
!     as spam is moved to the Spam folder.  Mail that is to be trained
!     as spam should be manually moved to that folder by the user.
!     Likewise mail that is to be trained as ham.  After training, spam
!     is moved to the Spam folder and ham is moved to the Ham folder.
      
      Because there is no programmatic way to determine if a particular
--- 17,68 ----
          Train as Spam
          Train as Ham
+ 
+     Depending on the execution parameters, it will do any or all of the
+     following steps, in the order given.
+ 
+     1. Train Spam from the Train as Spam folder (-t option)
+     2. Train Ham from the Train as Ham folder (-t option)
+     3. Replicate (-r option)
+     4. Classify the inbox (-c option)
          
!     Mail that is to be trained as spam should be manually moved to
!     that folder by the user. Likewise mail that is to be trained as
!     ham.  After training, spam is moved to the Spam folder and ham is
!     moved to the Ham folder.
! 
!     Replication takes place if a remote server has been specified.
!     This step may take a long time, depending on replication
!     parameters and how much information there is to download, as well
!     as line speed and server load.  Please be patient if you run with
!     replication.  There is currently no progress bar or anything like
!     that to tell you that it's working, but it is and will complete
!     eventually.  There is also no mechanism for notifying you that the
!     replication failed.  If it did, there is no harm done, and the program
!     will continue execution.
! 
!     Mail that is classified as Spam is moved from the inbox to the
!     Train as Spam folder.  You should occasionally review your Spam
!     folder for Ham that has mistakenly been classified as Spam.  If
!     there is any there, move it to the Train as Ham folder, so
!     Spambayes will be less likely to make this mistake again.
! 
!     Mail that is classified as Ham or Unsure is left in the inbox.
!     There is currently no means of telling if a mail was classified as
!     Ham or Unsure.
! 
!     You should occasionally select some Ham and move it to the Train
!     as Ham folder, so Spambayes can tell the difference between Spam
!     and Ham. The goal is to maintain a relative balance between the
!     number of Spam and the number of Ham that have been trained into
!     the database. These numbers are reported every time this program
!     executes.  However, if the amount of Spam you receive far exceeds
!     the amount of Ham you receive, it may be very difficult to
!     maintain this balance.  This is not a matter of great concern.
!     Spambayes will still make very few mistakes in this circumstance.
!     But, if this is the case, you should review your Spam folder for
!     falsely classified Ham, and retrain those that you find, on a
!     regular basis.  This will prevent statistical error accumulation,
!     which if allowed to continue, would cause Spambayes to tend to
!     classify everything as Spam.
      
      Because there is no programmatic way to determine if a particular
***************
*** 28,37 ****
      it keeps a pickled dictionary of notes mail ids, so that once a
      mail has been classified, it will not be classified again.  The
!     non-existence of this index file, named <local database>.'sbindex',
!     indicates to the system that this is the first time it has been
!     run.  Rather than classify the inbox in this case, the contents of
!     the inbox are placed in the index to note the 'starting point' of
!     the system.  After that, any new messages in the inbox are
!     eligible for classification.
  
  Usage:
--- 70,79 ----
      it keeps a pickled dictionary of notes mail ids, so that once a
      mail has been classified, it will not be classified again.  The
!     non-existence of this index file, named <local database>.sbindex,
!     indicates to the system that this is an initialization execution.
!     Rather than classify the inbox in this case, the contents of the
!     inbox are placed in the index to note the 'starting point' of the
!     system.  After that, any new messages in the inbox are eligible
!     for classification.
  
  Usage:
***************
*** 40,44 ****
  	note: option values with spaces in them must be enclosed
  	      in double quotes
! 	      
          options:
              -d  dbname  : pickled training database filename
--- 82,86 ----
  	note: option values with spaces in them must be enclosed
  	      in double quotes
! 
          options:
              -d  dbname  : pickled training database filename
***************
*** 57,60 ****
--- 99,106 ----
              -c          : classify inbox
              -h          : help
+             -p          : prompt "Press Enter to end" before ending
+                           This is useful for automated executions where the
+                           statistics output would otherwise be lost when the
+                           window closes.
  
  Examples:
***************
*** 71,78 ****
  To Do:
      o Dump/purge notesindex file
-     o Show h:s ratio, make recommendations
      o Create correct folders if they do not exist
      o Options for some of this stuff?
      o pop3proxy style training/configuration interface?
      o Suggestions?
      '''
--- 117,124 ----
  To Do:
      o Dump/purge notesindex file
      o Create correct folders if they do not exist
      o Options for some of this stuff?
      o pop3proxy style training/configuration interface?
+     o parameter to retrain?
      o Suggestions?
      '''
***************
*** 83,87 ****
  
  __author__ = "Tim Stone <tim at fourstonesExpressions.com>"
! __credits__ = "Mark Hammond, for his remarkable win32 module."
  
  from __future__ import generators
--- 129,133 ----
  
  __author__ = "Tim Stone <tim at fourstonesExpressions.com>"
! __credits__ = "Mark Hammond, for his remarkable win32 modules."
  
  from __future__ import generators
***************
*** 101,124 ****
  import errno
  import win32com.client
  import getopt
  
  
! def classifyInbox(v, vmoveto, bayes, ldbname):
  
      # the notesindex hash ensures that a message is looked at only once
  
!     try:
!         fp = open("%s.sbindex" % (ldbname), 'rb')
!     except IOError, e:
!         if e.errno != errno.ENOENT: raise
!         notesindex = {}
!         print "notesindex file not found, this is a first time run"
!         print "No classification will be performed"
          firsttime = 1
      else:
-         notesindex = pickle.load(fp)
-         fp.close()
          firsttime = 0
! 
      docstomove = []
      numham = 0
--- 147,163 ----
  import errno
  import win32com.client
+ import pywintypes
  import getopt
  
  
! def classifyInbox(v, vmoveto, bayes, ldbname, notesindex):
  
      # the notesindex hash ensures that a message is looked at only once
  
!     if len(notesindex.keys()) == 0:
          firsttime = 1
      else:
          firsttime = 0
!         
      docstomove = []
      numham = 0
***************
*** 126,129 ****
--- 165,169 ----
      numuns = 0
      numdocs = 0
+     
      doc = v.GetFirstDocument()
      while doc:
***************
*** 135,138 ****
--- 175,185 ----
  
                  numdocs += 1
+ 
+                 # Notes returns strings in unicode, and the Python
+                 # uni-decoder has trouble with these strings when
+                 # you try to print them.  So don't...
+ 
+                 # The com interface returns basic data types as tuples
+                 # only, thus the subscript on GetItemValue
                  
                  try:
***************
*** 146,152 ****
                      body = 'No Body'
  
!                 message = "Subject: %s\r\n%s" % (subj, body)
  
!                 # generate_long_skips = True blows up on occ.
                  options.generate_long_skips = False
                  tokens = tokenizer.tokenize(message)
--- 193,200 ----
                      body = 'No Body'
  
!                 message = "Subject: %s\r\n\r\n%s" % (subj, body)
  
!                 # generate_long_skips = True blows up on occasion,
!                 # probably due to this unicode problem.
                  options.generate_long_skips = False
                  tokens = tokenizer.tokenize(message)
***************
*** 164,171 ****
                      numuns += 1
  
!                 notesindex[nid] = disposition
  
          doc = v.GetNextDocument(doc)
  
      for doc in docstomove:
          doc.RemoveFromFolder(v.Name)
--- 212,225 ----
                      numuns += 1
  
!                 notesindex[nid] = 'classified'
!                 try:
!                     print "%s spamprob is %s" % (subj[:30], prob)
!                 except UnicodeError:
!                     print "<subject not printed> spamprob is %s" % (prob)
  
          doc = v.GetNextDocument(doc)
  
+     # docstomove list is built because moving documents in the middle of
+     # the classification loop loses the iterator position
      for doc in docstomove:
          doc.RemoveFromFolder(v.Name)
***************
*** 177,190 ****
      print "   %s classified as unsure" % (numuns)
      
-     fp = open("timstone.nsf.sbindex", 'wb')
-     pickle.dump(notesindex, fp)
-     fp.close()
  
! def processAndTrain(v, vmoveto, bayes, is_spam):
  
      if is_spam:
!         str = "spam"
      else:
!         str = "ham"
  
      print "Training %s" % (str)
--- 231,241 ----
      print "   %s classified as unsure" % (numuns)
      
  
! def processAndTrain(v, vmoveto, bayes, is_spam, notesindex):
  
      if is_spam:
!         str = options.header_spam_string
      else:
!         str = options.header_ham_string
  
      print "Training %s" % (str)
***************
*** 207,214 ****
          options.generate_long_skips = False
          tokens = tokenizer.tokenize(message)
          bayes.learn(tokens, is_spam)
  
          docstomove += [doc]
- 
          doc = v.GetNextDocument(doc)
  
--- 258,276 ----
          options.generate_long_skips = False
          tokens = tokenizer.tokenize(message)
+ 
+         nid = doc.NOTEID
+         if notesindex.has_key(nid):
+             trainedas = notesindex[nid]
+             if trainedas == options.header_spam_string and not is_spam:
+                 # msg is trained as spam, is to be retrained as ham
+                 bayes.unlearn(tokens, True)
+             elif trainedas == options.header_ham_string and is_spam:
+                 # msg is trained as ham, is to be retrained as spam
+                 bayes.unlearn(tokens, False)
+   
          bayes.learn(tokens, is_spam)
  
+         notesindex[nid] = str
          docstomove += [doc]
          doc = v.GetNextDocument(doc)
  
***************
*** 218,221 ****
--- 280,284 ----
  
      print "%s documents trained" % (len(docstomove))
+     
  
  def run(bdbname, useDBM, ldbname, rdbname, foldname, doTrain, doClassify):
***************
*** 225,231 ****
      else:
          bayes = storage.PickledClassifier(bdbname)
!     
      sess = win32com.client.Dispatch("Lotus.NotesSession")
!     sess.initialize()
      db = sess.GetDatabase("",ldbname)
      
--- 288,311 ----
      else:
          bayes = storage.PickledClassifier(bdbname)
! 
!     try:
!         fp = open("%s.sbindex" % (ldbname), 'rb')
!     except IOError, e:
!         if e.errno != errno.ENOENT: raise
!         notesindex = {}
!         print "%s.sbindex file not found, this is a first time run" \
!               % (ldbname)
!         print "No classification will be performed"
!     else:
!         notesindex = pickle.load(fp)
!         fp.close()
!      
      sess = win32com.client.Dispatch("Lotus.NotesSession")
!     try:
!         sess.initialize()
!     except pywintypes.com_error:
!         print "Session aborted"
!         sys.exit()
!         
      db = sess.GetDatabase("",ldbname)
      
***************
*** 236,239 ****
--- 316,324 ----
      vtrainham = db.getView("%s\Train as Ham" % (foldname))
      
+     if doTrain:
+         processAndTrain(vtrainspam, vspam, bayes, True, notesindex)
+         # for some reason, using inbox as a target here loses the mail
+         processAndTrain(vtrainham, vham, bayes, False, notesindex)
+         
      if rdbname:
          print "Replicating..."
***************
*** 241,258 ****
          print "Done"
          
-     if doTrain:
-         processAndTrain(vtrainspam, vspam, bayes, True)
-         # for some reason, using inbox as a target here loses the mail
-         processAndTrain(vtrainham, vham, bayes, False)
-         
      if doClassify:
!         classifyInbox(vinbox, vspam, bayes, ldbname)
!     
      bayes.store()
  
  if __name__ == '__main__':
  
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'htcd:D:l:r:f:')
      except getopt.error, msg:
          print >>sys.stderr, str(msg) + '\n\n' + __doc__
--- 326,346 ----
          print "Done"
          
      if doClassify:
!         classifyInbox(vinbox, vtrainspam, bayes, ldbname, notesindex)
! 
!     print "The Spambayes database currently has %s Spam and %s Ham" \
!         % (bayes.nspam, bayes.nham)
! 
      bayes.store()
  
+     fp = open("%s.sbindex" % (ldbname), 'wb')
+     pickle.dump(notesindex, fp)
+     fp.close()
+     
+ 
  if __name__ == '__main__':
  
      try:
!         opts, args = getopt.getopt(sys.argv[1:], 'htcpd:D:l:r:f:')
      except getopt.error, msg:
          print >>sys.stderr, str(msg) + '\n\n' + __doc__
***************
*** 265,268 ****
--- 353,357 ----
      doTrain = False
      doClassify = False
+     doPrompt = False
  
      for opt, arg in opts:
***************
*** 286,293 ****
--- 375,390 ----
          elif opt == '-c':
              doClassify = True
+         elif opt == '-p':
+             doPrompt = True
  
      if (bdbname and ldbname and sbfname and (doTrain or doClassify)):
          run(bdbname, useDBM, ldbname, rdbname, \
              sbfname, doTrain, doClassify)
+ 
+         if doPrompt:
+             try:
+                 key = input("Press Enter to end")
+             except SyntaxError:
+                 pass
      else:
          print >>sys.stderr, __doc__
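
Note on the unicode print fix: the change wraps the per-message status
print in try/except UnicodeError and prints a placeholder when the subject
cannot be printed. The sketch below restates that fallback in Python 3 as
a pure function, so it can be exercised without Notes or a console. The
name status_line and its encoding parameter are hypothetical and are not
part of notesfilter.py; the explicit encode() call stands in for the
implicit encoding that print performs on the output stream.

```python
def status_line(subj, prob, encoding="ascii"):
    """Build the per-message status line.

    Falls back to a placeholder when the (truncated) subject cannot be
    represented in the target encoding, mirroring the try/except
    UnicodeError in the diff above.
    """
    line = "%s spamprob is %s" % (subj[:30], prob)
    try:
        # Assumption: encoding the line up front approximates what would
        # happen when print writes to a stream with this encoding.
        line.encode(encoding)
    except UnicodeError:
        line = "<subject not printed> spamprob is %s" % (prob,)
    return line


if __name__ == "__main__":
    import sys

    # Use the console encoding when known, else assume plain ASCII.
    enc = sys.stdout.encoding or "ascii"
    print(status_line("Caf\u00e9 offer", 0.99, enc))
```

On an ASCII-only console the subject is replaced by the placeholder and
the probability is still reported; on a UTF-8 console the subject is
printed as usual.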

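Note on retraining: processAndTrain now looks up each message's note id
in notesindex and, if the message was previously trained as the opposite
class, calls bayes.unlearn before bayes.learn. The sketch below isolates
that decision in Python 3. CountingClassifier is a hypothetical stand-in
that only counts messages, not the real Spambayes classifier, and the
labels "spam" and "ham" are assumed values for
options.header_spam_string and options.header_ham_string.

```python
SPAM, HAM = "spam", "ham"  # assumed values of the header strings


class CountingClassifier:
    """Hypothetical stand-in for the classifier: counts messages only."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0

    def learn(self, tokens, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, tokens, is_spam):
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1


def train(bayes, notesindex, nid, tokens, is_spam):
    """Train one message, undoing an earlier opposite training first."""
    trainedas = notesindex.get(nid)
    if trainedas == SPAM and not is_spam:
        # was trained as spam, is to be retrained as ham
        bayes.unlearn(tokens, True)
    elif trainedas == HAM and is_spam:
        # was trained as ham, is to be retrained as spam
        bayes.unlearn(tokens, False)
    # Any other index value (for example 'classified') needs no unlearn.
    bayes.learn(tokens, is_spam)
    notesindex[nid] = SPAM if is_spam else HAM


if __name__ == "__main__":
    bayes, index = CountingClassifier(), {}
    train(bayes, index, "N1", ["cheap", "pills"], True)   # first seen as spam
    train(bayes, index, "N1", ["cheap", "pills"], False)  # corrected to ham
    print(bayes.nspam, bayes.nham, index["N1"])
```

Recording the training label in the index is what makes the correction
possible: without it there would be no way to tell that the earlier
training has to be undone.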



