[Spambayes-checkins] spambayes/spambayes smtpproxy.py, NONE, 1.1 Corpus.py, 1.7, 1.8 FileCorpus.py, 1.6, 1.7 ImapUI.py, 1.18, 1.19 Options.py, 1.79, 1.80 ProxyUI.py, 1.23, 1.24 UserInterface.py, 1.24, 1.25 mboxutils.py, 1.2, 1.3 message.py, 1.37, 1.38

Tony Meyer anadelonbrin at users.sourceforge.net
Fri Sep 19 19:38:12 EDT 2003


Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv12134/spambayes

Modified Files:
	Corpus.py FileCorpus.py ImapUI.py Options.py ProxyUI.py 
	UserInterface.py mboxutils.py message.py 
Added Files:
	smtpproxy.py 
Log Message:
If anyone wants to use the smtp proxy, then they can do so via sb_server, with or
without using the pop3 proxy as well.  This means that sb_smtpproxy doesn't really
need to exist anymore, and the smtpproxy stuff would be better as a module.  Do this.

We had too many message classes!  As discussed (by me and Mark, mostly) a long
time back on spambayes-dev, start consolidating these (I was waiting for 1.1).

Add the various interface improvements discussed on spambayes-dev.  In particular,
an advanced 'find token' query is available, the 'find message' query is improved,
and the review messages page is more customisable.

--- NEW FILE: smtpproxy.py ---
#!/usr/bin/env python

"""A SMTP proxy to train a Spambayes database.

You point SMTP Proxy at your SMTP server(s) and configure your email
client(s) to send mail through the proxy (i.e. usually this means you use
localhost as the outgoing server).

To set up, enter appropriate values in your Spambayes configuration file in
the "SMTP Proxy" section (in particular: "remote_servers", "listen_ports",
and "use_cached_message").  This configuration can also be carried out via
the web user interface offered by POP3 Proxy and IMAP Filter.

To use, simply forward/bounce mail that you wish to train to the
appropriate address (defaults to spambayes_spam at localhost and
spambayes_ham at localhost).  All other mail is sent normally.
(Note that IMAP Filter and POP3 Proxy users should not execute this script;
launching of SMTP Proxy will be taken care of by those applications.)

There are two main forms of operation.  With both, mail to two
(user-configurable) email addresses is intercepted by the proxy (and is
*not* sent to the SMTP server) and used as training data for a Spambayes
database.  All other mail is simply relayed to the SMTP server.
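
As a rough sketch of the interception rule just described (not this
module's actual code): the names strip_address and route_recipient are
hypothetical, and the two addresses are assumed to be the defaults
mentioned above, written with '@'.

```python
# Hypothetical defaults, following the addresses mentioned above.
SPAM_ADDRESS = 'spambayes_spam@localhost'
HAM_ADDRESS = 'spambayes_ham@localhost'

def strip_address(address):
    # Return the text between '<' and '>' if both are present,
    # otherwise the address with surrounding whitespace removed.
    if '<' in address and '>' in address:
        return address[address.index('<') + 1:address.index('>')]
    return address.strip()

def route_recipient(rcpt_arg):
    # Decide what to do with the argument of one RCPT TO command.
    # 'spam' or 'ham' means the message is kept for training and is
    # not sent to the SMTP server; 'relay' means ordinary mail.
    address = strip_address(rcpt_arg)
    if address == SPAM_ADDRESS:
        return 'spam'
    if address == HAM_ADDRESS:
        return 'ham'
    return 'relay'
```

For example, route_recipient('<spambayes_ham@localhost>') gives 'ham',
while any other recipient gives 'relay'.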

If the "use_cached_message" option is False, the proxy uses the message
sent as training data.  This option is suitable for those not using
POP3 Proxy or IMAP Filter, or for those that are confident that their
mailer will forward/bounce messages in an unaltered form.

If the "use_cached_message" option is True, the proxy examines the message
for a unique spambayes identification number.  It then tries to find this
message in the pop3proxy caches and on the imap servers.  It then retrieves
the message from the cache/server and uses *this* as the training data.
This method is suitable for those using POP3 Proxy and/or IMAP Filter, and
avoids any potential problems with the mailer altering messages before
forwarding/bouncing them.
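
A minimal sketch of how the identification number might be pulled out
of forwarded text, assuming the id is a run of digits and hyphens
following the id header name.  The header name used here is only a
placeholder (the real one comes from the "mailid_header_name" option),
and find_id_in_text is an illustrative stand-alone function.

```python
import re

# Placeholder header name for illustration only; the real name is
# read from the "mailid_header_name" option.
MAILID_HEADER = 'X-Spambayes-MailId'

# Some mailers wrap the header in an HTML table when forwarding
# inline, so an optional '</th> <td>' may sit between the colon
# and the id itself.
ID_RE = re.compile(re.escape(MAILID_HEADER) +
                   r':\s*(</th>\s*<td>\s*)?([\d\-]+)')

def find_id_in_text(text):
    # Return the first id found in the text, or None if there is none.
    mo = ID_RE.search(text)
    if mo is None:
        return None
    return mo.group(2)
```

The same search can be applied first to the raw message and then to
each decoded text part, stopping at the first part that yields an id.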

To use, enter the required SMTP server data in your configuration file and
run sb_server.py
"""

# This module is part of the spambayes project, which is Copyright 2002-3
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

__author__ = "Tony Meyer <ta-meyer at ihug.co.nz>"
__credits__ = "Tim Stone, all the Spambayes folk."

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0


todo = """
 o It would be nice if spam/ham could be bulk forwarded to the proxy,
   rather than one by one.  This would require separating the different
   messages and extracting the correct ids.  Simply changing to find
   *all* the ids in a message, rather than stopping after one *might*
   work, but I don't really know.  Richie Hindle suggested something along
   these lines back in September '02.
   
 o Suggestions?

Testing:

 o Test with as many clients as possible to check that the
   id is correctly extracted from the forwarded/bounced message.

MUA information:
A '*' in the Header column signifies that the smtpproxy can extract
the id from the headers only.  A '*' in the Body column signifies that
the smtpproxy can extract the id from the body of the message, if it
is there.
                                                        Header  Body
*** Windows 2000 MUAs ***
Eudora 5.2 Forward                                         *     *
Eudora 5.2 Redirect                                              *
Netscape Messenger (4.7) Forward (inline)                  *     *
Netscape Messenger (4.7) Forward (quoted) Plain                  *
Netscape Messenger (4.7) Forward (quoted) HTML                   *
Netscape Messenger (4.7) Forward (quoted) Plain & HTML           *
Netscape Messenger (4.7) Forward (attachment) Plain        *     *
Netscape Messenger (4.7) Forward (attachment) HTML         *     *
Netscape Messenger (4.7) Forward (attachment) Plain & HTML *     *
Outlook Express 6 Forward HTML (Base64)                          *
Outlook Express 6 Forward HTML (None)                            *
Outlook Express 6 Forward HTML (QP)                              *
Outlook Express 6 Forward Plain (Base64)                         *
Outlook Express 6 Forward Plain (None)                           *
Outlook Express 6 Forward Plain (QP)                             *
Outlook Express 6 Forward Plain (uuencoded)                      *
http://www.endymion.com/products/mailman Forward                 *
M2 (Opera Mailer 7.01) Forward                                   *
M2 (Opera Mailer 7.01) Redirect                            *     *
The Bat! 1.62i Forward (RFC Headers not visible)                 *
The Bat! 1.62i Forward (RFC Headers visible)               *     *
The Bat! 1.62i Redirect                                          *
The Bat! 1.62i Alternative Forward                         *     *
The Bat! 1.62i Custom Template                             *     *
AllegroMail 2.5.0.2 Forward                                      *
AllegroMail 2.5.0.2 Redirect                                     *
PocoMail 2.6.3 Bounce                                            *
PocoMail 2.6.3 Bounce                                            *
Pegasus Mail 4.02 Forward (all headers option set)         *     *
Pegasus Mail 4.02 Forward (all headers option not set)           *
Calypso 3 Forward                                                *
Calypso 3 Redirect                                         *     *
Becky! 2.05.10 Forward                                           *
Becky! 2.05.10 Redirect                                          *
Becky! 2.05.10 Redirect as attachment                      *     *
Mozilla Mail 1.2.1 Forward (attachment)                    *     *
Mozilla Mail 1.2.1 Forward (inline, plain)                 *1    *
Mozilla Mail 1.2.1 Forward (inline, plain & html)          *1    *
Mozilla Mail 1.2.1 Forward (inline, html)                  *1    *

*1 The header method will only work if auto-include original message
is set, and if view all headers is true.
"""

import string
import re
import socket
import asyncore
import asynchat
import getopt
import sys
import os

from spambayes import Dibbler
from spambayes import storage
from spambayes.message import sbheadermessage_from_string
from spambayes.tokenizer import textparts
from spambayes.tokenizer import try_to_repair_damaged_base64
from spambayes.Options import options
from sb_server import _addressPortStr, ServerLineReader
from sb_server import _addressAndPort

class SMTPProxyBase(Dibbler.BrighterAsyncChat):
    """An async dispatcher that understands SMTP and proxies to a SMTP
    server, calling `self.onTransaction(command, args)` for each
    transaction.

    self.onTransaction() should return the command to pass to
    the proxied server - the command can be the verbatim command or a
    processed version of it.  The special command 'KILL' kills it (passing
    a 'QUIT' command to the server).
    """

    def __init__(self, clientSocket, serverName, serverPort):
        Dibbler.BrighterAsyncChat.__init__(self, clientSocket)
        self.request = ''
        self.set_terminator('\r\n')
        self.command = ''           # The SMTP command being processed...
        self.args = ''              # ...and its arguments
        self.isClosing = False      # Has the server closed the socket?
        self.inData = False
        self.data = ""
        self.blockData = False
        self.serverSocket = ServerLineReader(serverName, serverPort,
                                             self.onServerLine)

    def onTransaction(self, command, args):
        """Override this.  Takes the raw command and returns the (possibly
        processed) command to pass to the proxied server."""
        raise NotImplementedError

    def onProcessData(self, data):
        """Override this.  Takes the raw data and returns the (possibly
        processed) data to pass to the proxied server."""
        raise NotImplementedError

    def onServerLine(self, line):
        """A line of response has been received from the SMTP server."""
        # Has the server closed its end of the socket?
        if not line:
            self.isClosing = True

        # We don't process the return, just echo the response.
        self.push(line)
        self.onResponse()

    def collect_incoming_data(self, data):
        """Asynchat override."""
        self.request = self.request + data

    def found_terminator(self):
        """Asynchat override."""
        verb = self.request.strip().upper()
        if verb == 'KILL':
            self.socket.shutdown(2)
            self.close()
            raise SystemExit

        if self.request.strip() == '':
            # Someone just hit the Enter key.
            self.command = self.args = ''
        else:
            # A proper command.
            if self.request[:10].upper() == "MAIL FROM:":
                splitCommand = self.request.split(":", 1)
            elif self.request[:8].upper() == "RCPT TO:":
                splitCommand = self.request.split(":", 1)
            else:
                splitCommand = self.request.strip().split(None, 1)
            self.command = splitCommand[0]
            self.args = splitCommand[1:]

        if self.inData == True:
            self.data += self.request + '\r\n'
            if self.request == ".":
                self.inData = False
                cooked = self.onProcessData(self.data)
                self.data = ""
                if self.blockData == False:
                    self.serverSocket.push(cooked)
                else:
                    self.push("250 OK\r\n")
        else:
            cooked = self.onTransaction(self.command, self.args)
            if cooked is not None:
                self.serverSocket.push(cooked + '\r\n')
        self.command = self.args = self.request = ''

    def onResponse(self):
        # If onServerLine() decided that the server has closed its
        # socket, close this one when the response has been sent.
        if self.isClosing:
            self.close_when_done()

        # Reset.
        self.command = ''
        self.args = ''
        self.isClosing = False


class BayesSMTPProxyListener(Dibbler.Listener):
    """Listens for incoming email client connections and spins off
    BayesSMTPProxy objects to serve them."""

    def __init__(self, serverName, serverPort, proxyPort, trainer):
        proxyArgs = (serverName, serverPort, trainer)
        Dibbler.Listener.__init__(self, proxyPort, BayesSMTPProxy,
                                  proxyArgs)
        print 'SMTP Listener on port %s is proxying %s:%d' % \
               (_addressPortStr(proxyPort), serverName, serverPort)


class BayesSMTPProxy(SMTPProxyBase):
    """Proxies between an email client and a SMTP server, intercepting
    messages addressed to the ham/spam training addresses.  It acts on
    the following SMTP commands:

    o RCPT TO:
        o Checks if the recipient address matches the key ham or spam
          addresses, and if so notes this and does not forward a command to
          the proxied server.  In all other cases simply passes on the
          verbatim command.

    o DATA:
        o Notes that we are in the data section.  If (from the RCPT TO
          information) we are receiving a ham/spam message to train on,
          then do not forward the command on.  Otherwise forward verbatim.

    Any other commands are merely passed on verbatim to the server.          
    """

    def __init__(self, clientSocket, serverName, serverPort, trainer):
        SMTPProxyBase.__init__(self, clientSocket, serverName, serverPort)
        self.handlers = {'RCPT TO': self.onRcptTo, 'DATA': self.onData,
                         'MAIL FROM': self.onMailFrom}
        self.trainer = trainer
        self.isClosed = False
        self.train_as_ham = False
        self.train_as_spam = False

    def send(self, data):
        try:
            return SMTPProxyBase.send(self, data)
        except socket.error:
            # The email client has closed the connection - 40tude Dialog
            # does this immediately after issuing a QUIT command,
            # without waiting for the response.
            self.close()

    def close(self):
        # This can be called multiple times by async.
        if not self.isClosed:
            self.isClosed = True
            SMTPProxyBase.close(self)

    def stripAddress(self, address):
        """
        Strip the leading & trailing <> from an address.  Handy for
        getting FROM: addresses.
        """
        if '<' in address:
            start = string.index(address, '<') + 1
            end = string.index(address, '>')
            return address[start:end]
        else:
            return address

    def onTransaction(self, command, args):
        handler = self.handlers.get(command.upper(), self.onUnknown)
        return handler(command, args)

    def onProcessData(self, data):
        if self.train_as_spam:
            self.trainer.train(data, True)
            self.train_as_spam = False
            return ""
        elif self.train_as_ham:
            self.trainer.train(data, False)
            self.train_as_ham = False
            return ""
        return data

    def onRcptTo(self, command, args):
        toFull = self.stripAddress(args[0])
        if toFull == options["smtpproxy", "spam_address"]:
            self.train_as_spam = True
            self.train_as_ham = False
            self.blockData = True
            self.push("250 OK\r\n")
            return None
        elif toFull == options["smtpproxy", "ham_address"]:
            self.train_as_ham = True
            self.train_as_spam = False
            self.blockData = True
            self.push("250 OK\r\n")
            return None
        else:
            self.blockData = False
        return "%s:%s" % (command, ' '.join(args))
        
    def onData(self, command, args):
        self.inData = True
        if self.train_as_ham == True or self.train_as_spam == True:
            self.push("250 OK\r\n")
            return None
        rv = command
        for arg in args:
            rv += ' ' + arg
        return rv

    def onMailFrom(self, command, args):
        """Just like the default handler, but has the necessary colon."""
        rv = "%s:%s" % (command, ' '.join(args))
        return rv

    def onUnknown(self, command, args):
        """Default handler."""
        return self.request


class SMTPTrainer(object):
    def __init__(self, classifier, state=None, imap=None):
        self.classifier = classifier
        self.state = state
        self.imap = imap
    
    def extractSpambayesID(self, data):
        msg = sbheadermessage_from_string(data)

        # The nicest MUA is one that forwards the header intact.
        id = msg.get(options["Headers", "mailid_header_name"])
        if id is not None:
            return id

        # Some MUAs will put it in the body somewhere, while others will
        # put it in an attached MIME message.
        id = self._find_id_in_text(msg.as_string())
        if id is not None:
            return id

        # the message might be encoded
        for part in textparts(msg):
            # Decode, or take it as-is if decoding fails.
            try:
                text = part.get_payload(decode=True)
            except:
                text = part.get_payload(decode=False)
                if text is not None:
                    text = try_to_repair_damaged_base64(text)
            if text is not None:
                id = self._find_id_in_text(text)
                if id is not None:
                    return id
        return None

    header_pattern = re.escape(options["Headers", "mailid_header_name"])
    # A MUA might enclose the id in a table, thus the convoluted re pattern
    # (Mozilla Mail does this with inline html)
    header_pattern += r":\s*(\</th\>\s*\<td\>\s*)?([\d\-]+)"
    header_re = re.compile(header_pattern)

    def _find_id_in_text(self, text):
        mo = self.header_re.search(text)
        if mo is None:
            return None
        return mo.group(2)

    def train(self, msg, isSpam):
        try:
            use_cached = options["smtpproxy", "use_cached_message"]
        except KeyError:
            use_cached = True
        if use_cached:
            id = self.extractSpambayesID(msg)
            if id is None:
                print "Could not extract id"
                return
            self.train_cached_message(id, isSpam)
            return
        # Otherwise, train on the forwarded/bounced message.
        msg = sbheadermessage_from_string(msg)
        id = msg.setIdFromPayload()
        msg.delSBHeaders()
        if id is None:
            # No id, so we don't have any reliable method of remembering
            # information about this message, so we just assume that it
            # hasn't been trained before.  We could generate some sort of
            # checksum for the message and use that as an id (this would
            # mean that we didn't need to store the id with the message)
            # but that might be a little unreliable.
            self.classifier.learn(msg.asTokens(), isSpam)
        else:
            if msg.GetTrained() == (not isSpam):
                self.classifier.unlearn(msg.asTokens(), not isSpam)
                msg.RememberTrained(None)
            if msg.GetTrained() is None:
                self.classifier.learn(msg.asTokens(), isSpam)
                msg.RememberTrained(isSpam)

    def train_cached_message(self, id, isSpam):
        if not self.train_message_in_pop3proxy_cache(id, isSpam) and \
           not self.train_message_on_imap_server(id, isSpam):
            print ("Could not find message (%s); perhaps it was "
                   "deleted from the POP3Proxy cache or the IMAP "
                   "server.  This means that no training was done." % (id, ))

    def train_message_in_pop3proxy_cache(self, id, isSpam):
        if self.state is None:
            return False
        sourceCorpus = None
        for corpus in [self.state.unknownCorpus, self.state.hamCorpus,
                       self.state.spamCorpus]:
            if corpus.get(id) is not None:
                sourceCorpus = corpus
                break
        if sourceCorpus is None:
            return False
        if isSpam == True:
            targetCorpus = self.state.spamCorpus
        else:
            targetCorpus = self.state.hamCorpus
        targetCorpus.takeMessage(id, sourceCorpus)
        self.classifier.store()
        return True

    def train_message_on_imap_server(self, id, isSpam):
        if self.imap is None:
            return False
        msg = self.imap.FindMessage(id)
        if msg is None:
            return False
        if msg.GetTrained() == (not isSpam):
            msg.get_substance()
            msg.delSBHeaders()
            self.classifier.unlearn(msg.asTokens(), not isSpam)
            msg.RememberTrained(None)
        if msg.GetTrained() is None:
            msg.get_substance()
            msg.delSBHeaders()
            self.classifier.learn(msg.asTokens(), isSpam)
            msg.RememberTrained(isSpam)
        return True

def LoadServerInfo():
    # Load the proxy settings
    servers = []
    proxyPorts = []
    if options["smtpproxy", "remote_servers"]:
        for server in options["smtpproxy", "remote_servers"]:
            server = server.strip()
            if server.find(':') > -1:
                server, port = server.split(':', 1)
            else:
                port = '25'
            servers.append((server, int(port)))
    if options["smtpproxy", "listen_ports"]:
        splitPorts = options["smtpproxy", "listen_ports"]
        proxyPorts = map(_addressAndPort, splitPorts)
    if len(servers) != len(proxyPorts):
        print "smtpproxy:remote_servers & smtpproxy:listen_ports are " + \
              "different lengths!"
        sys.exit()
    return servers, proxyPorts    

def CreateProxies(servers, proxyPorts, trainer):
    """Create BayesSMTPProxyListeners for all the given servers."""
    proxyListeners = []
    for (server, serverPort), proxyPort in zip(servers, proxyPorts):
        listener = BayesSMTPProxyListener(server, serverPort, proxyPort,
                                          trainer)
        proxyListeners.append(listener)
    return proxyListeners

Index: Corpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Corpus.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** Corpus.py	4 May 2003 03:05:33 -0000	1.7
--- Corpus.py	19 Sep 2003 23:38:10 -0000	1.8
***************
*** 6,10 ****
      Corpus - a collection of Messages
      ExpiryCorpus - a "young" Corpus
-     Message - a subject of Spambayes training
      MessageFactory - creates a Message
  
--- 6,9 ----
***************
*** 65,69 ****
  
      MessageFactory is a required factory class, because Corpus is
!     designed to do lazy initialization of messages and as an abstract
      class, must know how to create concrete instances of the correct
      class.
--- 64,68 ----
  
      MessageFactory is a required factory class, because Corpus is
!     designed to do lazy initialization of messages and, as an abstract
      class, must know how to create concrete instances of the correct
      class.
***************
*** 74,78 ****
      '''
  
! # This module is part of the spambayes project, which is Copyright 2002
  # The Python Software Foundation and is covered by the Python Software
  # Foundation license.
--- 73,77 ----
      '''
  
! # This module is part of the spambayes project, which is Copyright 2002-3
  # The Python Software Foundation and is covered by the Python Software
  # Foundation license.
***************
*** 115,119 ****
  
      def addObserver(self, observer):
!         '''Register an observer, which must implement
          onAddMessage, onRemoveMessage'''
  
--- 114,118 ----
  
      def addObserver(self, observer):
!         '''Register an observer, which should implement
          onAddMessage, onRemoveMessage'''
  
***************
*** 192,210 ****
          '''Move a Message from another corpus to this corpus'''
  
-         # XXX Hack: Calling msg.getSubstance() here ensures that the
-         # message substance is in memory.  If it isn't, when addMessage()
-         # calls message.store(), which calls message.getSubstance(), that
-         # will try to load the substance from the as-yet-unwritten new file.
          msg = fromcorpus[key]
!         msg.getSubstance()
          fromcorpus.removeMessage(msg)
          self.addMessage(msg)
  
      def get(self, key, default=None):
-         # the old version would never return the default,
-         # it would just create a new message, even if that
-         # message did not exist in the cache
-         # we need to check for the key in our msgs, but we can't check
-         # for None, because that signifies a non-cached message
          if self.msgs.get(key, "") is "":
              return default
--- 191,200 ----
          '''Move a Message from another corpus to this corpus'''
  
          msg = fromcorpus[key]
!         msg.load() # ensure that the substance has been loaded
          fromcorpus.removeMessage(msg)
          self.addMessage(msg)
  
      def get(self, key, default=None):
          if self.msgs.get(key, "") is "":
              return default
***************
*** 214,218 ****
      def __getitem__(self, key):
          '''Corpus is a dictionary'''
- 
          amsg = self.msgs.get(key)
  
--- 204,207 ----
***************
*** 225,234 ****
      def keys(self):
          '''Message keys in the Corpus'''
- 
          return self.msgs.keys()
  
      def __iter__(self):
          '''Corpus is iterable'''
- 
          for key in self.keys():
              try:
--- 214,221 ----
***************
*** 239,248 ****
      def __str__(self):
          '''Instance as a printable string'''
- 
          return self.__repr__()
  
      def __repr__(self):
          '''Instance as a representative string'''
- 
          raise NotImplementedError
  
--- 226,233 ----
***************
*** 261,265 ****
      def __init__(self, expireBefore):
          '''Constructor'''
- 
          self.expireBefore = expireBefore
  
--- 246,249 ----
***************
*** 274,424 ****
  
  
- class Message:
-     '''Abstract Message class'''
- 
-     def __init__(self):
-         '''Constructor()'''
- 
-         # The text of the message headers and body are held in attributes
-         # called 'hdrtxt' and 'payload', created on demand in __getattr__
-         # by calling load(), which should in turn call setSubstance().
-         # This means you don't need to remember to call load() before
-         # using these attributes.
- 
-     def __getattr__(self, attributeName):
-         '''On-demand loading of the message text.'''
- 
-         if attributeName in ('hdrtxt', 'payload'):
-             self.load()
-         try:
-             return self.__dict__[attributeName]
-         except KeyError:
-             raise AttributeError, attributeName
- 
-     def load(self):
-         '''Method to load headers and body'''
- 
-         raise NotImplementedError
- 
-     def store(self):
-         '''Method to persist a message'''
- 
-         raise NotImplementedError
- 
-     def remove(self):
-         '''Method to obliterate a message'''
- 
-         raise NotImplementedError
- 
-     def __repr__(self):
-         '''Instance as a representative string'''
- 
-         raise NotImplementedError
- 
-     def __str__(self):
-         '''Instance as a printable string'''
- 
-         return self.getSubstance()
- 
-     def name(self):
-         '''Message may have a unique human readable name'''
- 
-         return self.__repr__()
- 
-     def key(self):
-         '''The key for this instance'''
- 
-         raise NotImplementedError
- 
-     def setSubstance(self, sub):
-         '''set this message substance'''
- 
-         bodyRE = re.compile(r"\r?\n(\r?\n)(.*)", re.DOTALL+re.MULTILINE)
-         bmatch = bodyRE.search(sub)
-         if bmatch:
-             self.payload = bmatch.group(2)
-             self.hdrtxt = sub[:bmatch.start(2)]
-         else:
-             # malformed message - punt
-             self.payload = sub
-             self.hdrtxt = ""
- 
-     def getSubstance(self):
-         '''Return this message substance'''
- 
-         return self.hdrtxt + self.payload
- 
-     def setSpamprob(self, prob):
-         '''Score of the last spamprob calc, may not be persistent'''
- 
-         self.spamprob = prob
- 
-     def tokenize(self):
-         '''Returns substance as tokens'''
- 
-         return tokenizer.tokenize(self.getSubstance())
- 
-     def createTimeStamp(self):
-         '''Returns the create time of this message'''
-         # Should return a timestamp like time.time()
- 
-         raise NotImplementedError
- 
-     def getFrom(self):
-         '''Return a message From header content'''
- 
-         if self.hdrtxt:
-             match = re.search(r'^From:(.*)$', self.hdrtxt, re.MULTILINE)
-             return match.group(1)
-         else:
-             return None
- 
-     def getSubject(self):
-         '''Return a message Subject header contents'''
- 
-         if self.hdrtxt:
-             match = re.search(r'^Subject:(.*)$', self.hdrtxt, re.MULTILINE)
-             return match.group(1)
-         else:
-             return None
- 
-     def getDate(self):
-         '''Return a message Date header contents'''
- 
-         if self.hdrtxt:
-             match = re.search(r'^Date:(.*)$', self.hdrtxt, re.MULTILINE)
-             return match.group(1)
-         else:
-             return None
- 
-     def getHeadersList(self):
-         '''Return a list of message header tuples'''
- 
-         hdrregex = re.compile(r'^([A-Za-z0-9-_]*): ?(.*)$', re.MULTILINE)
-         data = re.sub(r'\r?\n\r?\s',' ',self.hdrtxt,re.MULTILINE)
-         match = hdrregex.findall(data)
- 
-         return match
- 
-     def getHeaders(self):
-         '''Return message headers as text'''
- 
-         return self.hdrtxt
- 
-     def getPayload(self):
-         '''Return the message body'''
- 
-         return self.payload
- 
-     def stripSBDHeader(self):
-         '''Removes the X-Spambayes-Disposition: header from the message'''
- 
-         # This is useful for training, where a spammer may be spoofing
-         # our header, to make sure that our header doesn't become an
-         # overweight clue to hamminess
- 
-         raise NotImplementedError
- 
- 
  class MessageFactory:
      '''Abstract Message Factory'''
--- 258,261 ----
***************
*** 430,434 ****
      def create(self, key):
          '''Create a message instance'''
- 
          raise NotImplementedError
  
--- 267,270 ----

Index: FileCorpus.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/FileCorpus.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** FileCorpus.py	16 Sep 2003 04:42:32 -0000	1.6
--- FileCorpus.py	19 Sep 2003 23:38:10 -0000	1.7
***************
*** 86,89 ****
--- 86,90 ----
  
  from spambayes import Corpus
+ from spambayes import message
  from spambayes import storage
  import sys, os, gzip, fnmatch, getopt, errno, time, stat
***************
*** 116,120 ****
          # to the corpus should for the moment be handled by a complete
          # retraining.
- 
          for filename in os.listdir(directory):
              if fnmatch.fnmatch(filename, filter):
--- 117,120 ----
***************
*** 123,134 ****
      def makeMessage(self, key):
          '''Ask our factory to make a Message'''
- 
          msg = self.factory.create(key, self.directory)
- 
          return msg
  
      def addMessage(self, message):
          '''Add a Message to this corpus'''
- 
          if not fnmatch.fnmatch(message.key(), self.filter):
              raise ValueError
--- 123,131 ----
***************
*** 145,149 ****
      def removeMessage(self, message):
          '''Remove a Message from this corpus'''
- 
          if options["globals", "verbose"]:
              print 'removing',message.key(),'from corpus'
--- 142,145 ----
***************
*** 187,206 ****
  
  
! class FileMessage(Corpus.Message):
      '''Message that persists as a file system artifact.'''
  
      def __init__(self,file_name, directory):
          '''Constructor(message file name, corpus directory name)'''
! 
!         Corpus.Message.__init__(self)
          self.file_name = file_name
          self.directory = directory
  
!         # No calling of self.load() here - that's done on demand by
!         # Message.__getattr__.
  
      def pathname(self):
          '''Derive the pathname of the message file'''
- 
          return os.path.join(self.directory, self.file_name)
  
--- 183,202 ----
  
  
! class FileMessage(message.SBHeaderMessage):
      '''Message that persists as a file system artifact.'''
  
      def __init__(self,file_name, directory):
          '''Constructor(message file name, corpus directory name)'''
!         message.SBHeaderMessage.__init__(self)
          self.file_name = file_name
          self.directory = directory
+         self.loaded = False
  
!     def as_string(self):
!         self.load() # ensure that the substance is loaded
!         return message.SBHeaderMessage.as_string(self)
  
      def pathname(self):
          '''Derive the pathname of the message file'''
          return os.path.join(self.directory, self.file_name)
  
***************
*** 215,218 ****
--- 211,217 ----
          # messages gzipped.  If someone can think of a classier (pun
          # intended) way of doing this, be my guest.
+         if self.loaded:
+             return
+ 
          if options["globals", "verbose"]:
              print 'loading', self.file_name
***************
*** 227,231 ****
          else:
              try:
!                 self.setSubstance(fp.read())
              except IOError, e:
                  if str(e) == 'Not a gzipped file':
--- 226,230 ----
          else:
              try:
!                 self.setPayload(fp.read())
              except IOError, e:
                  if str(e) == 'Not a gzipped file':
***************
*** 239,246 ****
                              raise
                      else:
!                         self.setSubstance(fp.read())
                          fp.close()
              else:
                  fp.close()
  
      def store(self):
--- 238,246 ----
                              raise
                      else:
!                         self.setPayload(fp.read())
                          fp.close()
              else:
                  fp.close()
+         self.loaded = True
  
      def store(self):
***************
*** 250,258 ****
              print 'storing', self.file_name
  
!         pn = self.pathname()
!         fp = open(pn, 'wb')
!         fp.write(self.getSubstance())
          fp.close()
  
      def remove(self):
          '''Message hara-kiri'''
--- 250,261 ----
              print 'storing', self.file_name
  
!         fp = open(self.pathname(), 'wb')
!         fp.write(self.as_string())
          fp.close()
  
+     def setPayload(self, payload):
+         self.loaded = True
+         message.SBHeaderMessage.setPayload(self, payload)
+ 
      def remove(self):
          '''Message hara-kiri'''
***************
*** 275,287 ****
  
          elip = ''
!         sub = self.getSubstance()
  
!         if options["globals", "verbose"]:
!             sub = self.getSubstance()
!         else:
              if len(sub) > 20:
-                 sub = sub[:20]
                  if len(sub) > 40:
!                     sub += '...' + sub[-20:]
  
          pn = os.path.join(self.directory, self.file_name)
--- 278,289 ----
  
          elip = ''
!         sub = self.as_string()
  
!         if not options["globals", "verbose"]:
              if len(sub) > 20:
                  if len(sub) > 40:
!                     sub = sub[:20] + '...' + sub[-20:]
!                 else:
!                     sub = sub[:20]
  
          pn = os.path.join(self.directory, self.file_name)
***************
*** 293,297 ****
      def __str__(self):
          '''Instance as a printable string'''
- 
          return self.__repr__()
  
--- 295,298 ----
***************
*** 330,334 ****
          pn = self.pathname()
          gz = gzip.open(pn, 'wb')
!         gz.write(self.getSubstance())
          gz.flush()
          gz.close()
--- 331,335 ----
          pn = self.pathname()
          gz = gzip.open(pn, 'wb')
!         gz.write(self.as_string())
          gz.flush()
          gz.close()
***************
*** 390,394 ****
  
      m1 = fmClass('XMG00001', 'fctestspamcorpus')
!     m1.setSubstance(testmsg2())
  
      print '\n\nAdd a message to hamcorpus that does not match the filter'
--- 391,395 ----
  
      m1 = fmClass('XMG00001', 'fctestspamcorpus')
!     m1.setPayload(testmsg2())
  
      print '\n\nAdd a message to hamcorpus that does not match the filter'
***************
*** 445,457 ****
      msg = spamcorpus['MSG00001']
      print msg
-     print '\n\nThis is some vital information in the message'
-     print 'Date header is',msg.getDate()
-     print 'Subject header is',msg.getSubject()
-     print 'From header is',msg.getFrom()
- 
-     print 'Header text is:',msg.getHeaders()
-     print 'Headers are:',msg.getHeadersList()
-     print 'Body is:',msg.getPayload()
- 
  
  
--- 446,449 ----
***************
*** 551,563 ****
  
      m1 = fmClass('MSG00001', 'fctestspamcorpus')
!     m1.setSubstance(tm1)
      m1.store()
  
      m2 = fmClass('MSG00002', 'fctestspamcorpus')
!     m2.setSubstance(tm2)
      m2.store()
  
      m3 = fmClass('MSG00003', 'fctestunsurecorpus')
!     m3.setSubstance(tm1)
      m3.store()
  
--- 543,555 ----
  
      m1 = fmClass('MSG00001', 'fctestspamcorpus')
!     m1.setPayload(tm1)
      m1.store()
  
      m2 = fmClass('MSG00002', 'fctestspamcorpus')
!     m2.setPayload(tm2)
      m2.store()
  
      m3 = fmClass('MSG00003', 'fctestunsurecorpus')
!     m3.setPayload(tm1)
      m3.store()
  
***************
*** 571,583 ****
  
      m4 = fmClass('MSG00004', 'fctestunsurecorpus')
!     m4.setSubstance(tm1)
      m4.store()
  
      m5 = fmClass('MSG00005', 'fctestunsurecorpus')
!     m5.setSubstance(tm2)
      m5.store()
  
      m6 = fmClass('MSG00006', 'fctestunsurecorpus')
!     m6.setSubstance(tm2)
      m6.store()
  
--- 563,575 ----
  
      m4 = fmClass('MSG00004', 'fctestunsurecorpus')
!     m4.setPayload(tm1)
      m4.store()
  
      m5 = fmClass('MSG00005', 'fctestunsurecorpus')
!     m5.setPayload(tm2)
      m5.store()
  
      m6 = fmClass('MSG00006', 'fctestunsurecorpus')
!     m6.setPayload(tm2)
      m6.store()
  
***************
*** 693,697 ****
  
  if __name__ == '__main__':
- 
      try:
          opts, args = getopt.getopt(sys.argv[1:], 'estgvhcu')
--- 685,688 ----
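
Note on the FileCorpus.py change above: FileMessage now reads its file lazily. A `loaded` flag guards load(), as_string() calls load() first, and setPayload() sets the flag so that a later load() cannot overwrite freshly assigned content. The sketch below shows that pattern in isolation. `LazyFileMessage`, its plain-text storage, and the `reads` counter are hypothetical stand-ins rather than the spambayes classes, and it is written for current Python, not the Python 2 used in the diff.

```python
import os
import tempfile


class LazyFileMessage:
    """Hypothetical stand-in for FileMessage: content is read on demand."""

    def __init__(self, file_name, directory):
        self.file_name = file_name
        self.directory = directory
        self.payload = ""
        self.loaded = False
        self.reads = 0  # number of disk reads, kept only for the demo

    def pathname(self):
        return os.path.join(self.directory, self.file_name)

    def load(self):
        # Idempotent: a second call is a no-op once content is present.
        if self.loaded:
            return
        with open(self.pathname(), "r") as fp:
            self.payload = fp.read()
        self.reads += 1
        self.loaded = True

    def set_payload(self, payload):
        # Mark as loaded so load() will not clobber the new content.
        self.loaded = True
        self.payload = payload

    def as_string(self):
        self.load()  # ensure the substance is loaded
        return self.payload

    def store(self):
        with open(self.pathname(), "w") as fp:
            fp.write(self.as_string())


def demo():
    """Store a message, then read it back twice through a fresh object."""
    with tempfile.TemporaryDirectory() as directory:
        writer = LazyFileMessage("MSG00001", directory)
        writer.set_payload("Subject: test\n\nbody\n")
        writer.store()

        reader = LazyFileMessage("MSG00001", directory)
        first = reader.as_string()
        second = reader.as_string()
        return first, second, writer.reads, reader.reads
```

Calling demo() should show that the writer never touches the disk to read and that the reader does so only once, however often as_string() is called.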

Index: ImapUI.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/ImapUI.py,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** ImapUI.py	18 Sep 2003 03:58:59 -0000	1.18
--- ImapUI.py	19 Sep 2003 23:38:10 -0000	1.19
***************
*** 103,106 ****
--- 103,108 ----
      ('Tokenizer',           'summarize_email_prefixes'),
      ('Tokenizer',           'summarize_email_suffixes'),
+     ('Interface Options',   None),
+     ('html_ui',             'display_adv_find'),
  )
  
***************
*** 134,143 ****
          filter</a><br />and <a
          href='trainingfolders'>Configure folders to train</a>"""
          content = (self._buildBox('Status and Configuration',
                                    'status.gif', statusTable % stateDict)+
                     self._buildTrainBox() +
                     self._buildClassifyBox() +
!                    self._buildBox('Word query', 'query.gif',
!                                   self.html.wordQuery)
                     )
          self._writePreamble("Home")
--- 136,148 ----
          filter</a><br />and <a
          href='trainingfolders'>Configure folders to train</a>"""
+         findBox = self._buildBox('Word query', 'query.gif',
+                                  self.html.wordQuery)
+         if not options["html_ui", "display_adv_find"]:
+             del findBox.advanced
          content = (self._buildBox('Status and Configuration',
                                    'status.gif', statusTable % stateDict)+
                     self._buildTrainBox() +
                     self._buildClassifyBox() +
!                    findBox
                     )
          self._writePreamble("Home")

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.79
retrieving revision 1.80
diff -C2 -d -r1.79 -r1.80
*** Options.py	18 Sep 2003 13:55:11 -0000	1.79
--- Options.py	19 Sep 2003 23:38:10 -0000	1.80
***************
*** 759,770 ****
       IP_LIST, RESTORE),
  
!     ("display_to", "Display To: in message review", False,
       """When reviewing messages via the web user interface, you are
!      presented with the message subject, the address the message is
!      from, and its classification.  If you set this option, you will
!      also be shown the address the message was to.  This might be
!      useful if you receive mail from multiple accounts, or if you
!      want to quickly identify mail received via a mailing list.""",
       BOOLEAN, RESTORE),
  
      ("http_authentication", "HTTP Authentication", "None",
--- 759,811 ----
       IP_LIST, RESTORE),
  
!     ("display_headers", "Headers to display in message review", ("Subject", "From"),
       """When reviewing messages via the web user interface, you are
!      presented with various information about the message.  By default, you
!      are shown the subject and who the message is from.  You can add other
!      message headers to display, however, such as the address the message
!      is to, or the date that the message was sent.""",
!      HEADER_NAME, RESTORE),
! 
!     ("display_received_time", "Display date received in message review", False,
!      """When reviewing messages via the web user interface, you are
!      presented with various information about the message.  If you set
!      this option, you will be shown the date that the message was received.
!      """,
       BOOLEAN, RESTORE),
+ 
+     ("display_score", "Display score in message review", False,
+      """When reviewing messages via the web user interface, you are
+      presented with various information about the message.  If you
+      set this option, this information will include the score that
+      the message received when it was classified.  You might wish to
+      see this purely out of curiosity, or you might wish to only
+      train on messages that score towards the boundaries of the
+      classification areas.  Note that in order to use this option,
+      you must also enable the option to include the score in the
+      message headers.""",
+      BOOLEAN, RESTORE),
+ 
+     ("display_adv_find", "Display the advanced find query", False,
+      """Present advanced options in the 'Word Query' box on the front page,
+      including wildcard and regular expression searching.""",
+      BOOLEAN, RESTORE),
+ 
+     ("default_ham_action", "Default training for ham", "ham",
+      """When presented with the review list in the web interface,
+      which button would you like checked by default when the message
+      is classified as ham?""",
+      ("ham", "spam", "discard", "defer"), RESTORE),
+ 
+     ("default_spam_action", "Default training for spam", "spam",
+      """When presented with the review list in the web interface,
+      which button would you like checked by default when the message
+      is classified as spam?""",
+      ("ham", "spam", "discard", "defer"), RESTORE),
+ 
+     ("default_unsure_action", "Default training for unsure", "defer",
+      """When presented with the review list in the web interface,
+      which button would you like checked by default when the message
+      is classified as unsure?""",
+      ("ham", "spam", "discard", "defer"), RESTORE),
  
      ("http_authentication", "HTTP Authentication", "None",
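
Note on the Options.py change above: each of the three default_*_action options takes one of "ham", "spam", "discard" or "defer". In the ProxyUI.py change that follows, the review page looks up the option matching a message's classification label and pre-checks the corresponding button. A minimal sketch of that lookup is given here, with a plain dict standing in for the spambayes options object. The names `ACTIONS`, `options` (as a dict) and `default_action` are illustrative assumptions, not part of the commit, and the explicit validation step is an addition for the example.

```python
ACTIONS = ("ham", "spam", "discard", "defer")

# Stand-in for the options object, keyed by (section, name) so that
# options["html_ui", "default_spam_action"] works as in the diff.
options = {
    ("html_ui", "default_ham_action"): "ham",
    ("html_ui", "default_spam_action"): "spam",
    ("html_ui", "default_unsure_action"): "defer",
}


def default_action(label, opts=options):
    """Return the button to pre-check for a message in the given group."""
    if label == "Spam":
        name = "default_spam_action"
    elif label == "Ham":
        name = "default_ham_action"
    else:
        name = "default_unsure_action"
    action = opts["html_ui", name]
    # Validation is an assumption of this sketch, not taken from the diff.
    if action not in ACTIONS:
        raise ValueError("unknown action %r for %s" % (action, name))
    return action
```

With the defaults shown, a message in the unsure group gets "defer" pre-checked; changing `default_unsure_action` to "discard" would change only that group.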

Index: ProxyUI.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/ProxyUI.py,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** ProxyUI.py	18 Sep 2003 13:55:11 -0000	1.23
--- ProxyUI.py	19 Sep 2003 23:38:10 -0000	1.24
***************
*** 28,37 ****
  
   o Review already-trained messages, and purge them.
!  o Put in a link to view a message (plain text, html, multipart...?)
!    Include a Reply link that launches the registered email client, eg.
!    mailto:tim at fourstonesExpressions.com?subject=Re:%20pop3proxy&body=Hi%21%0D
!  o [Francois Granger] Show the raw spambrob number close to the buttons
!    (this would mean using the extra X-Hammie header by default).
!  o Add Today and Refresh buttons on the Review page.
  
  User interface improvements:
--- 28,32 ----
  
   o Review already-trained messages, and purge them.
!  o Add a Today button on the Review page.
  
  User interface improvements:
***************
*** 39,49 ****
   o Can it cleanly dynamically update its status display while having a POP3
     conversation?  Hammering reload sucks.
-  o Have both the trained evidence (if present) and current evidence on the
-    show clues page.
  
   o Suggestions?
  """
  
! # This module is part of the spambayes project, which is Copyright 2002
  # The Python Software Foundation and is covered by the Python Software
  # Foundation license.
--- 34,42 ----
   o Can it cleanly dynamically update its status display while having a POP3
     conversation?  Hammering reload sucks.
  
   o Suggestions?
  """
  
! # This module is part of the spambayes project, which is Copyright 2002-3
  # The Python Software Foundation and is covered by the Python Software
  # Foundation license.
***************
*** 62,68 ****
  
  import re
  import time
  import bisect
! import cgi
  
  import tokenizer
--- 55,67 ----
  
  import re
+ import cgi
  import time
+ import types
  import bisect
! 
! try:
!     from sets import Set
! except ImportError:
!     from compatsets import Set
  
  import tokenizer
***************
*** 83,94 ****
      ('pop3proxy',           'remote_servers'),
      ('pop3proxy',           'listen_ports'),
-     ('html_ui',             'display_to'),
-     ('html_ui',             'allow_remote_connections'),
-     ('html_ui',             'http_authentication'),
-     ('html_ui',             'http_user_name'),
-     ('html_ui',             'http_password'),
-     ('Header Options',      None),
-     ('Headers',             'notate_to'),
-     ('Headers',             'notate_subject'),
      ('SMTP Proxy Options',  None),
      ('smtpproxy',           'remote_servers'),
--- 82,85 ----
***************
*** 97,100 ****
--- 88,94 ----
      ('smtpproxy',           'spam_address'),
      ('smtpproxy',           'use_cached_message'),
+     ('Header Options',      None),
+     ('Headers',           'notate_to'),
+     ('Headers',           'notate_subject'),
      ('Storage Options',  None),
      ('Storage',             'persistent_storage_file'),
***************
*** 138,141 ****
--- 132,149 ----
      ('Tokenizer',           'summarize_email_prefixes'),
      ('Tokenizer',           'summarize_email_suffixes'),
+     ('Training Options',   None),
+     ('Hammie',              'train_on_filter'),
+     ('Interface Options',   None),
+     ('html_ui',             'display_headers'),
+     ('html_ui',             'display_received_time'),
+     ('html_ui',             'display_score'),
+     ('html_ui',             'display_adv_find'),
+     ('html_ui',             'default_ham_action'),
+     ('html_ui',             'default_spam_action'),
+     ('html_ui',             'default_unsure_action'),
+     ('html_ui',             'allow_remote_connections'),
+     ('html_ui',             'http_authentication'),
+     ('html_ui',             'http_user_name'),
+     ('html_ui',             'http_password'),
  )
  
***************
*** 150,153 ****
--- 158,162 ----
          self.state_recreator = state_recreator # ugly
          self.app_for_version = "POP3 Proxy"
+         self.previous_sort = None
  
      def onHome(self):
***************
*** 158,161 ****
--- 167,174 ----
          if not state.servers:
              statusTable.proxyDetails = "No POP3 proxies running.<br/>"
+         findBox = self._buildBox('Word query', 'query.gif',
+                                  self.html.wordQuery)
+         if not options["html_ui", "display_adv_find"]:
+             del findBox.advanced
          content = (self._buildBox('Status and Configuration',
                                    'status.gif', statusTable % stateDict)+
***************
*** 164,169 ****
                     self._buildTrainBox() +
                     self._buildClassifyBox() +
!                    self._buildBox('Word query', 'query.gif',
!                                   self.html.wordQuery) +
                     self._buildBox('Find message', 'query.gif',
                                    self.html.findMessage)
--- 177,181 ----
                     self._buildTrainBox() +
                     self._buildClassifyBox() +
!                    findBox +
                     self._buildBox('Find message', 'query.gif',
                                    self.html.findMessage)
***************
*** 183,187 ****
              messageName = state.getNewMessageName()
              message = state.unknownCorpus.makeMessage(messageName)
!             message.setSubstance(m)
              state.unknownCorpus.addMessage(message)
  
--- 195,199 ----
              messageName = state.getNewMessageName()
              message = state.unknownCorpus.makeMessage(messageName)
!             message.setPayload(m)
              state.unknownCorpus.addMessage(message)
  
***************
*** 213,219 ****
          page or zero if there isn't one, likewise the start of the given page,
          and likewise the start of the next page."""
!         # Fetch all the message keys and sort them into timestamp order.
          allKeys = state.unknownCorpus.keys()
-         allKeys.sort()
  
          # The default start timestamp is derived from the most recent message,
--- 225,230 ----
          page or zero if there isn't one, likewise the start of the given page,
          and likewise the start of the next page."""
!         # Fetch all the message keys
          allKeys = state.unknownCorpus.keys()
  
          # The default start timestamp is derived from the most recent message,
***************
*** 244,271 ****
          return keys, date, prior, start, end
  
!     def _appendMessages(self, table, keyedMessageInfo, label):
          """Appends the rows of a table of messages to 'table'."""
          stripe = 0
!         if not options["html_ui", "display_to"]:
!             del table.to_header
!         nrows = options["html_ui", "rows_per_section"]
!         for key, messageInfo in keyedMessageInfo[:nrows]:
              row = self.html.reviewRow.clone()
              if label == 'Spam':
!                 row.spam.checked = 1
              elif label == 'Ham':
!                 row.ham.checked = 1
              else:
!                 row.defer.checked = 1
!             row.subject = messageInfo.subjectHeader
!             row.subject.title = messageInfo.bodySummary
!             row.subject.href="view?key=%s&corpus=%s" % (key, label)
!             row.from_ = messageInfo.fromHeader
!             if options["html_ui", "display_to"]:
!                 row.to_ = messageInfo.toHeader
              else:
!                 del row.to_
!             subj = cgi.escape(messageInfo.subjectHeader)
              row.classify.href="showclues?key=%s&subject=%s" % (key, subj)
              setattr(row, 'class', ['stripe_on', 'stripe_off'][stripe]) # Grr!
              row = str(row).replace('TYPE', label).replace('KEY', key)
--- 255,331 ----
          return keys, date, prior, start, end
  
!     def _sortMessages(self, messages, sort_order):
!         """Sorts the message by the appropriate attribute.  If this was the
!         previous sort order, then reverse it."""
!         if sort_order is None or sort_order == "received":
!             # Default sorting, which is in reverse order of appearance.
!             # This is complicated because the 'received' info is the key.
!             messages.sort()
!             if self.previous_sort == sort_order:
!                 messages.reverse()
!                 self.previous_sort = None
!             else:
!                 self.previous_sort = 'received'
!             return messages
!         else:
!             tmplist = [(getattr(x[1], sort_order), x) for x in messages]
!         tmplist.sort()
!         if self.previous_sort == sort_order:
!             tmplist.reverse()
!             self.previous_sort = None
!         else:
!             self.previous_sort = sort_order
!         return [x for (key, x) in tmplist]
! 
!     def _appendMessages(self, table, keyedMessageInfo, label, sort_order):
          """Appends the rows of a table of messages to 'table'."""
          stripe = 0
! 
!         keyedMessageInfo = self._sortMessages(keyedMessageInfo, sort_order)
!         for key, messageInfo in keyedMessageInfo:
!             unused, unused, messageInfo.received = \
!                     self._getTimeRange(self._keyToTimestamp(key))
              row = self.html.reviewRow.clone()
              if label == 'Spam':
!                 r_att = getattr(row, options["html_ui",
!                                            "default_spam_action"])
              elif label == 'Ham':
!                 r_att = getattr(row, options["html_ui",
!                                            "default_ham_action"])
              else:
!                 r_att = getattr(row, options["html_ui",
!                                            "default_unsure_action"])
!             setattr(r_att, "checked", 1)
! 
!             row.optionalHeadersValues = '' # make way for real list
!             for header in options["html_ui", "display_headers"]:
!                 header = header.lower()
!                 text = getattr(messageInfo, "%sHeader" % (header,))
!                 if header == "subject":
!                     # Subject is special, because it links to the body.
!                     # If the user doesn't display the subject, then there
!                     # is no link to the body.
!                     h = self.html.reviewRow.linkedHeaderValue.clone()
!                     h.text.title = messageInfo.bodySummary
!                     h.text.href = "view?key=%s&corpus=%s" % (key, label)
!                 else:
!                     h = self.html.reviewRow.headerValue.clone()
!                 h.text = text
!                 row.optionalHeadersValues += h
! 
!             # Apart from any message headers, we may also wish to display
!             # the message score, and the time the message was received.
!             if options["html_ui", "display_score"]:
!                 row.score_ = messageInfo.score
              else:
!                 del row.score_
!             if options["html_ui", "display_received_time"]:
!                 row.received_ = messageInfo.received
!             else:
!                 del row.received_
! 
!             subj = messageInfo.subjectHeader
              row.classify.href="showclues?key=%s&subject=%s" % (key, subj)
+             row.tokens.href="showclues?key=%s&subject=%s&tokens=1" % (key, subj)
              setattr(row, 'class', ['stripe_on', 'stripe_off'][stripe]) # Grr!
              row = str(row).replace('TYPE', label).replace('KEY', key)
***************
*** 350,381 ****
  
          # Else if an id has been specified, just show that message
          elif params.get('find') is not None:
              key = params['find']
              error = False
              if key == "":
                  error = True
!                 page = "<p>You must enter an id to find.</p>"
!             elif state.unknownCorpus.get(key) == None:
!                 # maybe this message has been moved to the spam
!                 # or ham corpus
!                 if state.hamCorpus.get(key) != None:
!                     sourceCorpus = state.hamCorpus
!                 elif state.spamCorpus.get(key) != None:
!                     sourceCorpus = state.spamCorpus
                  else:
!                     error = True
!                     page = "<p>Could not find message with id '"
!                     page += key + "' - maybe it expired.</p>"
!             if error == True:
!                 title = "Did not find message"
!                 box = self._buildBox(title, 'status.gif', page)
!                 self.write(box)
!                 self.write(self._buildBox('Find message', 'query.gif',
!                                           self.html.findMessage))
!                 self._writePostamble()
!                 return
!             keys.append(params['find'])
!             prior = this = next = 0
!             title = "Found message"
  
          # Else show the most recent day's page, as decided by _buildReviewKeys.
--- 410,479 ----
  
          # Else if an id has been specified, just show that message
+         # Else if search criteria have been specified, show the messages
+         # that match those criteria.
          elif params.get('find') is not None:
+             prior = this = next = 0
+             keys = Set()        # so we don't end up with duplicates
+             push = keys.add
+             try:
+                 max_results = int(params['max_results'])
+             except ValueError:
+                 max_results = 1
              key = params['find']
+             if params.has_key('ignore_case'):
+                 ic = True
+             else:
+                 ic = False
              error = False
              if key == "":
                  error = True
!                 page = "<p>You must enter a search string.</p>"
!             else:
!                 if len(keys) < max_results and \
!                    params.has_key('id'):
!                     if state.unknownCorpus.get(key):
!                         push((key, state.unknownCorpus))
!                     elif state.hamCorpus.get(key):
!                         push((key, state.hamCorpus))
!                     elif state.spamCorpus.get(key):
!                         push((key, state.spamCorpus))
!                 if params.has_key('subject') or params.has_key('body') or \
!                    params.has_key('headers'):
!                     # This is an expensive operation, so let the user know
!                     # that something is happening.
!                     self.write('<p>Searching...</p>')
!                     for corp in [state.unknownCorpus, state.hamCorpus,
!                                    state.spamCorpus]:
!                         for k in corp.keys():
!                             if len(keys) >= max_results:
!                                 break
!                             msg = corp[k]
!                             msg.load()
!                             if params.has_key('subject'):
!                                 if self._contains(msg['Subject'], key, ic):
!                                     push((k, corp))
!                             if params.has_key('body'):
!                                 msg_body = msg.as_string()
!                                 msg_body = msg_body[msg_body.index('\r\n\r\n'):]
!                                 if self._contains(msg_body, key, ic):
!                                     push((k, corp))
!                             if params.has_key('headers'):
!                                 for nm, val in msg.items():
!                                     if self._contains(nm, key, ic) or \
!                                        self._contains(val, key, ic):
!                                         push((k, corp))
!                 if len(keys):
!                     title = "Found message%s" % (['','s'][len(keys)>1],)
!                     keys = list(keys)
                  else:
!                     page = "<p>Could not find any matching messages. " \
!                            "Maybe they expired?</p>"
!                     title = "Did not find message"
!                     box = self._buildBox(title, 'status.gif', page)
!                     self.write(box)
!                     self.write(self._buildBox('Find message', 'query.gif',
!                                               self.html.findMessage))
!                     self._writePostamble()
!                     return
  
          # Else show the most recent day's page, as decided by _buildReviewKeys.
***************
*** 391,398 ****
                              }
          for key in keys:
              # Parse the message, get the judgement header and build a message
              # info object for each message.
!             cachedMessage = sourceCorpus[key]
!             message = spambayes.mboxutils.get_message(cachedMessage.getSubstance())
              judgement = message[options["Headers",
                                          "classification_header_name"]]
--- 489,500 ----
                              }
          for key in keys:
+             if isinstance(key, types.TupleType):
+                 key, sourceCorpus = key
+             else:
+                 sourceCorpus = state.unknownCorpus
              # Parse the message, get the judgement header and build a message
              # info object for each message.
!             message = sourceCorpus[key]
!             message.load()
              judgement = message[options["Headers",
                                          "classification_header_name"]]
***************
*** 405,409 ****
  
          # Present the list of messages in their groups in reverse order of
!         # appearance.
          if keys:
              page = self.html.reviewtable.clone()
--- 507,511 ----
  
          # Present the list of messages in their groups in reverse order of
!         # appearance, by default, or according to the specified sort order.
          if keys:
              page = self.html.reviewtable.clone()
***************
*** 415,418 ****
--- 517,521 ----
                  del page.nextButton.disabled
              templateRow = page.reviewRow.clone()
+ 
              page.table = ""  # To make way for the real rows.
              for header, label in ((options["Headers",
***************
*** 424,432 ****
                  messages = keyedMessageInfo[header]
                  if messages:
!                     subHeader = str(self.html.reviewSubHeader)
                      subHeader = subHeader.replace('TYPE', label)
                      page.table += self.html.blankRow
                      page.table += subHeader
!                     self._appendMessages(page.table, messages, label)
  
              page.table += self.html.trainRow
--- 527,549 ----
                  messages = keyedMessageInfo[header]
                  if messages:
!                     sh = self.html.reviewSubHeader.clone()
!                     # Setup the header row
!                     sh.optionalHeaders = ''
!                     h = self.html.headerHeader.clone()
!                     for header in options["html_ui", "display_headers"]:
!                         h.headerLink.href = 'review?sort=%sHeader' % \
!                                             (header.lower(),)
!                         h.headerName = header.title()
!                         sh.optionalHeaders += h
!                     if not options["html_ui", "display_score"]:
!                         del sh.score_header
!                     if not options["html_ui", "display_received_time"]:
!                         del sh.received_header
!                     subHeader = str(sh)
                      subHeader = subHeader.replace('TYPE', label)
                      page.table += self.html.blankRow
                      page.table += subHeader
!                     self._appendMessages(page.table, messages, label,
!                                          params.get('sort'))
  
              page.table += self.html.trainRow
***************
*** 444,447 ****
--- 561,573 ----
          self._writePostamble()
  
+     def _contains(self, a, b, ignore_case=False):
+         """Return true if substring b is part of string a."""
+         assert(isinstance(a, types.StringTypes))
+         assert(isinstance(b, types.StringTypes))
+         if ignore_case:
+             a = a.lower()
+             b = b.lower()
+         return a.find(b) >= 0
+ 
      def onView(self, key, corpus):
          """View a message - linked from the Review page."""
***************
*** 449,464 ****
          message = state.unknownCorpus.get(key)
          if message:
!             self.write("<pre>%s</pre>" % cgi.escape(message.getSubstance()))
          else:
              self.write("<p>Can't find message %r. Maybe it expired.</p>" % key)
          self._writePostamble()
  
!     def onShowclues(self, key, subject):
          """Show clues for a message - linked from the Review page."""
          self._writePreamble("Message clues", parent=('review', 'Review'))
!         message = state.unknownCorpus.get(key).getSubstance()
          message = message.replace('\r\n', '\n').replace('\r', '\n') # For Macs
          if message:
!             results = self._buildCluesTable(message, subject)
              del results.classifyAnother
              self.write(results)
--- 575,591 ----
          message = state.unknownCorpus.get(key)
          if message:
!             self.write("<pre>%s</pre>" % cgi.escape(message.as_string()))
          else:
              self.write("<p>Can't find message %r. Maybe it expired.</p>" % key)
          self._writePostamble()
  
!     def onShowclues(self, key, subject, tokens='0'):
          """Show clues for a message - linked from the Review page."""
+         tokens = bool(int(tokens)) # needs the int, as bool('0') is True
          self._writePreamble("Message clues", parent=('review', 'Review'))
!         message = state.unknownCorpus.get(key).as_string()
          message = message.replace('\r\n', '\n').replace('\r', '\n') # For Macs
          if message:
!             results = self._buildCluesTable(message, subject, tokens)
              del results.classifyAnother
              self.write(results)
***************
*** 469,478 ****
      def _makeMessageInfo(self, message):
          """Given an email.Message, return an object with subjectHeader,
!         fromHeader and bodySummary attributes.  These objects are passed into
!         appendMessages by onReview - passing email.Message objects directly
!         uses too much memory."""
          subjectHeader = message["Subject"] or "(none)"
!         fromHeader = message["From"] or "(none)"
!         toHeader = message["To"] or "(none)"
          try:
              part = typed_subpart_iterator(message, 'text', 'plain').next()
--- 596,625 ----
      def _makeMessageInfo(self, message):
          """Given an email.Message, return an object with subjectHeader,
!         bodySummary and other header (as needed) attributes.  These objects
!         are passed into appendMessages by onReview - passing email.Message
!         objects directly uses too much memory."""
          subjectHeader = message["Subject"] or "(none)"
!         headers = {"subject" : subjectHeader}
!         for header in options["html_ui", "display_headers"]:
!             headers[header.lower()] = (message[header] or "(none)")
!         score = message[options["Headers", "score_header_name"]]
!         if score:
!             # the score might have the log info at the end
!             op = score.find('(')
!             if op >= 0:
!                 score = score[:op]
!             try:
!                 score = "%.2f%%" % (float(score)*100,)
!             except ValueError:
!                 # Hmm.  The score header should only contain a floating
!                 # point number.  What's going on here, then?
!                 score = "Err"  # Let the user know something is wrong.
!         else:
!             # If the lookup fails, this means that the "include_score"
!             # option isn't activated. We have the choice here to either
!             # calculate it now, which is pretty inefficient, since we have
!             # already done so, or to admit that we don't know what it is.
!             # We'll go with the latter.
!             score = "?"
          try:
              part = typed_subpart_iterator(message, 'text', 'plain').next()
***************
*** 500,506 ****
              pass
          messageInfo = _MessageInfo()
!         messageInfo.subjectHeader = self._trimHeader(subjectHeader, 50, True)
!         messageInfo.fromHeader = self._trimHeader(fromHeader, 40, True)
!         messageInfo.toHeader = self._trimHeader(toHeader, 40, True)
          messageInfo.bodySummary = self._trimHeader(text, 200)
          return messageInfo
--- 647,654 ----
              pass
          messageInfo = _MessageInfo()
!         for headerName, headerValue in headers.items():
!             headerValue = self._trimHeader(headerValue, 45, True)
!             setattr(messageInfo, "%sHeader" % (headerName,), headerValue)
!         messageInfo.score = score
          messageInfo.bodySummary = self._trimHeader(text, 200)
          return messageInfo
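
(Editor's note on the score handling in _makeMessageInfo above.) The new code turns the raw score header into a percentage for the review page, dropping the optional "(n)" logarithm suffix first. Here is a minimal standalone sketch of that logic, assuming modern Python. The function name format_score is hypothetical and is not part of ProxyUI.py.

```python
def format_score(raw):
    """Turn a raw score header value into a display string.

    Hypothetical helper mirroring the logic in the diff above:
    a missing header gives "?", an unparsable one gives "Err".
    """
    if not raw:
        # Header absent: the score was not recorded in the message.
        return "?"
    # The score may carry log info in parentheses at the end; drop it.
    paren = raw.find("(")
    if paren >= 0:
        raw = raw[:paren]
    try:
        return "%.2f%%" % (float(raw) * 100,)
    except ValueError:
        # The header should only hold a floating point number.
        return "Err"


print(format_score("0.5"))
print(format_score("0.999 (3)"))
print(format_score(None))
```

The two fallback strings match the diff: "?" means the score is simply unknown, while "Err" flags a header that is present but malformed.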

Index: UserInterface.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/UserInterface.py,v
retrieving revision 1.24
retrieving revision 1.25
diff -C2 -d -r1.24 -r1.25
*** UserInterface.py	8 Sep 2003 07:04:17 -0000	1.24
--- UserInterface.py	19 Sep 2003 23:38:10 -0000	1.25
***************
*** 262,355 ****
          self._writePostamble()
  
!     def _buildCluesTable(self, message, subject=None):
          cluesTable = self.html.cluesTable.clone()
          cluesRow = cluesTable.cluesRow.clone()
          del cluesTable.cluesRow   # Delete dummy row to make way for real ones
!         (probability, clues) = classifier.spamprob(tokenizer.tokenize(message),\
!                                                     evidence=True)
          for word, wordProb in clues:
!             cluesTable += cluesRow % (cgi.escape(word), wordProb)
  
          results = self.html.classifyResults.clone()
!         results.probability = probability
          if subject is None:
!             heading = "Clues:"
          else:
!             heading = "Clues for: " + subject
          results.cluesBox = self._buildBox(heading, 'status.gif', cluesTable)
          return results
  
!     def onWordquery(self, word):
!         wildcard_limit = 10
!         statsBoxes = []
          if word == "":
!             stats = "You must enter a word."
!             statsBoxes.append(self._buildBox("Statistics for %r" % \
!                                              cgi.escape(word),
!                                              'status.gif', stats))
          else:
!             word = word.lower()
!             if word[-1] == '*':
!                 # Wildcard search - list all words that start with word[:-1]
!                 word = word[:-1]
!                 reached_limit = False
!                 for w in classifier._wordinfokeys():
!                     if not reached_limit and len(statsBoxes) > wildcard_limit:
!                         reached_limit = True
!                         over_limit = 0
!                     if w.startswith(word):
!                         if reached_limit:
!                             over_limit += 1
!                         else:
!                             wordinfo = classifier._wordinfoget(w)
!                             stats = self.html.wordStats.clone()
!                             stats.spamcount = wordinfo.spamcount
!                             stats.hamcount = wordinfo.hamcount
!                             stats.spamprob = classifier.probability(wordinfo)
!                             box = self._buildBox("Statistics for %r" % \
!                                                  cgi.escape(w),
!                                                  'status.gif', stats)
!                             statsBoxes.append(box)
!                 if len(statsBoxes) == 0:
!                     stats = "There are no words that begin with '%s' " \
!                             "in the database." % (word,)
!                     # We build a box for each word; I'm not sure this
!                     # produces the nicest results, but it's ok with a
!                     # limited number of words.
!                     statsBoxes.append(self._buildBox("Statistics for %s" % \
!                                                      cgi.escape(word),
!                                                      'status.gif', stats))
!                 elif reached_limit:
!                     if over_limit == 1:
!                         singles = ["was", "match", "is"]
                      else:
!                         singles = ["were", "matches", "are"]
!                     stats = "There %s %d additional %s that %s not " \
!                             "shown here." % (singles[0], over_limit,
!                                              singles[1], singles[2])
!                     box = self._buildBox("Statistics for '%s*'" % \
!                                          cgi.escape(word), 'status.gif',
!                                          stats)
!                     statsBoxes.append(box)
!             else:
!                 # Optimised version for non-wildcard searches
!                 wordinfo = classifier._wordinfoget(word)
!                 if wordinfo:
!                     stats = self.html.wordStats.clone()
!                     stats.spamcount = wordinfo.spamcount
!                     stats.hamcount = wordinfo.hamcount
!                     stats.spamprob = classifier.probability(wordinfo)
                  else:
!                     stats = "%r does not exist in the database." % cgi.escape(word)
!                 statsBoxes.append(self._buildBox("Statistics for %r" % \
!                                                  cgi.escape(word),
!                                                  'status.gif', stats))
  
-         query = self.html.wordQuery.clone()
-         query.word.value = "%s" % (word,)
-         queryBox = self._buildBox("Word query", 'query.gif', query)
          self._writePreamble("Word query")
!         for box in statsBoxes:
!             self.write(box)
          self.write(queryBox)
          self._writePostamble()
--- 262,464 ----
          self._writePostamble()
  
!     ev_re = re.compile("%s:(.*?)(?:\n\S|\n\n)" % \
!                        re.escape(options["Headers",
!                                          "evidence_header_name"]),
!                        re.DOTALL)
!     sc_re = re.compile("%s:(.*)\n" % \
!                        re.escape(options["Headers", "score_header_name"]))
! 
!     def _fillCluesTable(self, clues):
!         accuracy = 6
          cluesTable = self.html.cluesTable.clone()
          cluesRow = cluesTable.cluesRow.clone()
          del cluesTable.cluesRow   # Delete dummy row to make way for real ones
!         fetchword = classifier._wordinfoget
          for word, wordProb in clues:
!             record = fetchword(word)
!             if record:
!                 nham = record.hamcount
!                 nspam = record.spamcount
!                 if wordProb is None:
!                     wordProb = classifier.probability(record)
!             elif word != "*H*" and word != "*S*":
!                 nham = nspam = 0
!             else:
!                 nham = nspam = "-"
!             if wordProb is None:
!                 wordProb = "-"
!             else:
!                 wordProb = round(float(wordProb), accuracy)
!             cluesTable += cluesRow % (cgi.escape(word), wordProb,
!                                       nham, nspam)
!         return cluesTable
!     
!     def _buildCluesTable(self, message, subject=None, show_tokens=False):
!         tokens = tokenizer.tokenize(message)
!         if show_tokens:
!             clues = []
!             for tok in tokens:
!                 clues.append((tok, None))
!             probability = classifier.spamprob(tokens)
!             cluesTable = self._fillCluesTable(clues)
!             head_name = "Tokens"
!         else:
!             (probability, clues) = classifier.spamprob(tokens, evidence=True)
!             cluesTable = self._fillCluesTable(clues)
!             head_name = "Clues"
  
          results = self.html.classifyResults.clone()
!         results.probability = "%.2f%% (%s)" % (probability*100, probability)
          if subject is None:
!             heading = "%s: (%s)" % (head_name, len(clues))
          else:
!             heading = "%s for: %s (%s)" % (head_name, subject, len(clues))
          results.cluesBox = self._buildBox(heading, 'status.gif', cluesTable)
+         if not show_tokens:
+             mo = self.sc_re.search(message)
+             if mo:
+                 # Also display the score the message received when it was
+                 # classified.
+                 prob = float(mo.group(1).strip())
+                 results.orig_prob_num = "%.2f%% (%s)" % (prob*100, prob)
+             else:
+                 del results.orig_prob
+             mo = self.ev_re.search(message)
+             if mo:
+                 # Also display the clues as they were when the message was
+                 # classified.
+                 clues = []
+                 evidence = mo.group(1).strip().split(';')
+                 for clue in evidence:
+                     word, prob = clue.strip().split(': ')
+                     clues.append((word.strip("'"), prob))
+                 cluesTable = self._fillCluesTable(clues)
+ 
+                 if subject is None:
+                     heading = "Original clues: (%s)" % (len(evidence),)
+                 else:
+                     heading = "Original clues for: %s (%s)" % (subject,
+                                                                len(evidence),)
+                 orig_results = self._buildBox(heading, 'status.gif',
+                                               cluesTable)
+                 results.cluesBox += orig_results
+         else:
+             del results.orig_prob
          return results
  
!     def onWordquery(self, word, query_type="basic", max_results='10',
!                     ignore_case=False):
!         # It would be nice if the default value for max_results here
!         # always matched the value in ui.html.
!         try:
!             max_results = int(max_results)
!         except ValueError:
!             # Ignore any invalid number, like "foo"
!             max_results = 10
! 
!         original_word = word
! 
!         query = self.html.wordQuery.clone()
!         query.word.value = "%s" % (word,)
!         for q_type in [query.advanced.basic,
!                                query.advanced.wildcard,
!                                query.advanced.regex]:
!             if query_type == q_type.id:
!                 q_type.checked = 'checked'
!                 if query_type != "basic":
!                     del query.advanced.max_results.disabled
!         if ignore_case:
!             query.advanced.ignore_case.checked = 'checked'
!         query.advanced.max_results.value = str(max_results)
!         queryBox = self._buildBox("Word query", 'query.gif', query)
!         if not options["html_ui", "display_adv_find"]:
!             del queryBox.advanced
! 
!         stats = []
          if word == "":
!             stats.append("You must enter a word.")
!         elif query_type == "basic" and not ignore_case:
!             wordinfo = classifier._wordinfoget(word)
!             if wordinfo:
!                 stat = (word, wordinfo.spamcount, wordinfo.hamcount,
!                         classifier.probability(wordinfo))
!             else:
!                 stat = "%r does not exist in the database." % \
!                        cgi.escape(word)
!             stats.append(stat)
          else:
!             if query_type != "regex":
!                 word = re.escape(word)
!             if query_type == "wildcard":
!                 word = word.replace("\\?", ".")
!                 word = word.replace("\\*", ".*")
! 
!             flags = 0
!             if ignore_case:
!                 flags = re.IGNORECASE
!             r = re.compile(word, flags)
! 
!             reached_limit = False
!             for w in classifier._wordinfokeys():
!                 if not reached_limit and len(stats) >= max_results:
!                     reached_limit = True
!                     over_limit = 0
!                 if r.match(w):
!                     if reached_limit:
!                         over_limit += 1
                      else:
!                         wordinfo = classifier._wordinfoget(w)
!                         stat = (w, wordinfo.spamcount, wordinfo.hamcount,
!                                 classifier.probability(wordinfo))
!                         stats.append(stat)
!             if len(stats) == 0 and max_results > 0:
!                 stat = "There are no words that begin with '%s' " \
!                         "in the database." % (word,)
!                 stats.append(stat)
!             elif reached_limit:
!                 if over_limit == 1:
!                     singles = ["was", "match", "is"]
                  else:
!                     singles = ["were", "matches", "are"]
!                 stat = "There %s %d additional %s that %s not " \
!                        "shown here." % (singles[0], over_limit,
!                                         singles[1], singles[2])
!                 stats.append(stat)
  
          self._writePreamble("Word query")
!         if len(stats) == 1:
!             if isinstance(stats[0], types.TupleType):
!                 stat = self.html.wordStats.clone()
!                 word = stats[0][0]
!                 stat.spamcount = stats[0][1]
!                 stat.hamcount = stats[0][2]
!                 stat.spamprob = "%.6f" % stats[0][3]
!             else:
!                 stat = stats[0]
!                 word = original_word
!             row = self._buildBox("Statistics for '%s'" % \
!                                  cgi.escape(word),
!                                  'status.gif', stat)
!             self.write(row)
!         else:
!             page = self.html.multiStats.clone()
!             page.multiTable = "" # make way for the real rows
!             page.multiTable += self.html.multiHeader.clone()
!             stripe = 0
!             for stat in stats:
!                 if isinstance(stat, types.TupleType):
!                     row = self.html.statsRow.clone()
!                     row.word, row.spamcount, row.hamcount = stat[:3]
!                     row.spamprob = "%.6f" % stat[3]
!                     setattr(row, 'class', ['stripe_on', 'stripe_off'][stripe])
!                     stripe = stripe ^ 1
!                     page.multiTable += row
!                 else:
!                     self.write(self._buildBox("Statistics for '%s'" % \
!                                               cgi.escape(original_word),
!                                               'status.gif', stat))
!             self.write(self._buildBox("Statistics for '%s'" % \
!                                       cgi.escape(original_word), 'status.gif',
!                                       page))
          self.write(queryBox)
          self._writePostamble()
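
(Editor's note on the new onWordquery above.) The advanced query turns the user's input into a compiled regular expression: basic and wildcard input is escaped first, and wildcard input then has the escaped "?" and "*" mapped to "." and ".*". A minimal standalone sketch of that translation follows, assuming modern Python. The function name build_query_pattern is hypothetical and does not appear in UserInterface.py.

```python
import re


def build_query_pattern(word, query_type="basic", ignore_case=False):
    """Compile a token query into a regular expression.

    Hypothetical helper mirroring the translation in the diff above.
    """
    if query_type != "regex":
        # Basic and wildcard queries are literal text, so escape them.
        word = re.escape(word)
    if query_type == "wildcard":
        # re.escape turned ? and * into \? and \*; map them to regex.
        word = word.replace("\\?", ".")
        word = word.replace("\\*", ".*")
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(word, flags)


pattern = build_query_pattern("fre?", "wildcard")
print(bool(pattern.match("free")))
print(bool(pattern.match("fr")))
```

Note that re.match only anchors at the start of the string, so a pattern built this way accepts any token that begins with a match. A caller wanting whole-token matches could use fullmatch on the compiled pattern instead.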

Index: mboxutils.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/mboxutils.py,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** mboxutils.py	14 Jan 2003 05:38:20 -0000	1.2
--- mboxutils.py	19 Sep 2003 23:38:10 -0000	1.3
***************
*** 106,109 ****
--- 106,115 ----
      (everything through the first blank line) are thrown out, and the
      rest of the text is wrapped in a bare email.Message.Message.
+ 
+     Note that we can't use our own message class here, because this
+     function is imported by tokenizer, and our message class imports
+     tokenizer, so we get a circular import problem.  In any case, this
+     function doesn't need anything that our message class offers, so that
+     shouldn't matter.
      """
  

Index: message.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/message.py,v
retrieving revision 1.37
retrieving revision 1.38
diff -C2 -d -r1.37 -r1.38
*** message.py	18 Sep 2003 03:58:59 -0000	1.37
--- message.py	19 Sep 2003 23:38:10 -0000	1.38
***************
*** 88,107 ****
          return not not val
  
- import sys
  import os
  import types
  import re
  
! import email            # for message_from_string
  import email.Message
  import email.Parser
  
! from spambayes.tokenizer import tokenize
  from spambayes.Options import options
  
  from cStringIO import StringIO
  
- from spambayes import dbmstorage
- import shelve
  
  CRLF_RE = re.compile(r'\r\n|\r|\n')
--- 88,109 ----
          return not not val
  
  import os
  import types
+ import math
  import re
+ import sys
+ import types
+ import shelve
  
! import email
  import email.Message
  import email.Parser
  
! from spambayes import dbmstorage
  from spambayes.Options import options
+ from spambayes.tokenizer import tokenize
  
  from cStringIO import StringIO
  
  
  CRLF_RE = re.compile(r'\r\n|\r|\n')
***************
*** 286,290 ****
  
          if options['Headers','include_score']:
!             self[options['Headers','score_header_name']] = str(prob)
  
          if options['Headers','include_thermostat']:
--- 288,300 ----
  
          if options['Headers','include_score']:
!             disp = str(prob)
!             if options["Headers", "header_score_logarithm"]:
!                 if prob<=0.005 and prob>0.0:
!                     x=-math.log10(prob)
!                     disp += " (%d)"%x
!                 if prob>=0.995 and prob<1.0:
!                     x=-math.log10(1.0-prob)
!                     disp += " (%d)"%x
!             self[options['Headers','score_header_name']] = disp
  
          if options['Headers','include_thermostat']:
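
(Editor's note on the header_score_logarithm change above.) For scores very close to 0 or 1, the new code appends the truncated negative base-10 logarithm of the distance from that extreme, which shows how extreme the score is. A minimal standalone sketch follows, assuming modern Python. The function name score_display and its show_logarithm parameter are hypothetical stand-ins for the option lookup in message.py.

```python
import math


def score_display(prob, show_logarithm=True):
    """Build the score header text, optionally with a log suffix.

    Hypothetical helper mirroring the logic in the diff above.
    """
    disp = str(prob)
    if show_logarithm:
        if 0.0 < prob <= 0.005:
            # Very hammy: how many orders of magnitude below 1.
            disp += " (%d)" % -math.log10(prob)
        if 0.995 <= prob < 1.0:
            # Very spammy: same measure, taken from the distance to 1.
            disp += " (%d)" % -math.log10(1.0 - prob)
    return disp


print(score_display(0.0005))
print(score_display(0.5))
```

Scores of exactly 0.0 or 1.0 get no suffix, since the logarithm of zero is undefined. That is why both comparisons exclude the endpoint.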




