[Spambayes] [ spambayes-Patches-670417 ]

François Granger francois.granger at free.fr
Mon Jan 20 14:13:05 EST 2003

I got the files from Tony by private mail yesterday night.

Email is a means of communication.  People should ask themselves about
the political implications of the choices (or non-choices) of their
tools and technologies.  For clean email:
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>

-------------- next part --------------
#!/usr/bin/env python

"""A POP3 proxy that works with classifier.py, and adds a simple
X-Spambayes-Classification header (ham/spam/unsure) to each incoming
email.  You point pop3proxy at your POP3 server, and configure your
email client to collect mail from the proxy then filter on the added
header.  Usage:

    pop3proxy.py [options] [<server> [<server port>]]
        <server> is the name of your real POP3 server
        <port>   is the port number of your real POP3 server, which
                 defaults to 110.

            -z      : Runs a self-test and exits.
            -t      : Runs a fake POP3 server on port 8110 (for testing).
            -h      : Displays this help message.

            -p FILE : use the named database file
            -d      : the database is a DBM file rather than a pickle
            -l port : proxy listens on this port number (default 110)
            -u port : User interface listens on this port number
                      (default 8880; Browse http://localhost:8880/)
            -b      : Launch a web browser showing the user interface.

        All command line arguments and switches take their default
        values from the [pop3proxy] and [html_ui] sections of

For safety, and to help debugging, the whole POP3 conversation is
written out to _pop3proxy.log for each run, if options.verbose is True.

To make rebuilding the database easier, uploaded messages are appended
to _pop3proxyham.mbox and _pop3proxyspam.mbox.
"""
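As a sketch of how the ham/spam/unsure value in the added header is chosen (the cutoff values below are illustrative only; the real ones come from the options file):

```python
# Illustrative cutoffs - the real values come from spambayes Options.
HAM_CUTOFF = 0.2
SPAM_CUTOFF = 0.9

def disposition(prob):
    """Map a spam probability to the X-Spambayes-Classification value."""
    if prob < HAM_CUTOFF:
        return 'ham'
    elif prob > SPAM_CUTOFF:
        return 'spam'
    else:
        return 'unsure'
```

Anything between the two cutoffs is deliberately left "unsure" so the user can review and train on it.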

# This module is part of the spambayes project, which is Copyright 2002
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

__author__ = "Richie Hindle <richie at entrian.com>"
__credits__ = "Tim Peters, Neale Pickett, Tim Stone, all the Spambayes folk."

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0

todo = """

Web training interface:

 o Functional tests.
 o Review already-trained messages, and purge them.
 o Put in a link to view a message (plain text, html, multipart...?)
   Include a Reply link that launches the registered email client, eg.
   mailto:tim at fourstonesExpressions.com?subject=Re:%20pop3proxy&body=Hi%21%0D
 o Keyboard navigation (David Ascher).  But aren't Tab and left/right
   arrow enough?
 o [Francois Granger] Show the raw spambrob number close to the buttons
   (this would mean using the extra X-Hammie header by default).
 o Add Today and Refresh buttons on the Review page.

User interface improvements:

 o Once the pieces are on separate pages, make the paste box bigger.
 o Deployment: Windows executable?  atlaxwin and ctypes?  Or just
 o Can it cleanly dynamically update its status display while having a
   POP3 conversation?  Hammering reload sucks.
 o Save the stats (num classified, etc.) between sessions.
 o "Reload database" button.

New features:

 o "Send me an email every [...] to remind me to train on new
   messages."
 o "Send me a status email every [...] telling how many mails have been
   classified, etc."
 o Possibly integrate Tim Stone's SMTP code - make it use async, make
   the training code update (rather than replace!) the database.
 o Allow use of the UI without the POP3 proxy.
 o Remove any existing X-Spambayes-Classification header from incoming
   emails.
 o Whitelist.
 o Online manual.
 o Links to project homepage, mailing list, etc.
 o List of words with stats (it would have to be paged!) a la SpamSieve.

Code quality:

 o Make a separate Dibbler plugin for serving images, so there's no
   duplication between pop3proxy and OptionConfig.
 o Move the UI into its own module.
 o Cope with the email client timing out and closing the connection.
 o Lose the trailing dot from cached messages.


 o Slightly-wordy index page; intro paragraph for each page.
 o In both stats and training results, report nham and nspam - warn if
   they're very different (for some value of 'very').
 o "Links" section (on homepage?) to project homepage, mailing list,
   etc.


 o Classify a web page given a URL.
 o Graphs.  Of something.  Who cares what?
 o NNTP proxy.
 o Zoe...!

Notes, for the sake of somewhere better to put them:

Don't proxy spams at all?  This would mean writing a full POP3 client
and server - it would download all your mail on a timer and serve to you
all the non-spams.  It could be 'safe' in that it leaves the messages in
the real POP3 account until you collect them from it (or in the case of
spams, until you collect contemporaneous hams).  The web interface would
then present all the spams so that you could correct any FPs and mark
them for collection.  The thing is no longer a proxy (because the first
POP3 command in a conversation is STAT or LIST, which tells you how many
mails there are - it wouldn't know the answer, and finding out could
take weeks over a modem - I've already had problems with clients timing
out while the proxy was downloading stuff from the server).

Adam's idea: add checkboxes to a Google results list for "Relevant" /
"Irrelevant", then submit that to build a search including the
highest-scoring tokens and excluding the lowest-scoring ones.

"""

try:
    import cStringIO as StringIO
except ImportError:
    import StringIO

import os, sys, re, operator, errno, getopt, string, time, bisect
import socket, asyncore, asynchat, cgi, urlparse, webbrowser
import mailbox, email.Header
import spambayes
from spambayes import storage, tokenizer, mboxutils, PyMeldLite, Dibbler
from spambayes.FileCorpus import FileCorpus, ExpiryFileCorpus
from spambayes.FileCorpus import FileMessageFactory, GzipFileMessageFactory
from email.Iterators import typed_subpart_iterator
from OptionConfig import OptionsConfigurator
from spambayes.Options import options

# HEADER_EXAMPLE is the longest possible header - the length of this one
# is added to the size of each message.
HEADER_FORMAT = '%s: %%s\r\n' % options.hammie_header_name
HEADER_EXAMPLE = '%s: xxxxxxxxxxxxxxxxxxxx\r\n' % options.hammie_header_name
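The size arithmetic that HEADER_EXAMPLE exists for can be sketched stand-alone (a simplified version of the STAT handling below; the function name is illustrative):

```python
import re

# Worst-case header: 26-char name, ': ', 20 x's, CRLF = 50 bytes.
HEADER_EXAMPLE = 'X-Spambayes-Classification: xxxxxxxxxxxxxxxxxxxx\r\n'

def adjust_stat(response):
    """Rewrite a '+OK <count> <size>' STAT response so the reported
    maildrop size allows for one example header per message."""
    match = re.search(r'^\+OK\s+(\d+)\s+(\d+)', response)
    count = int(match.group(1))
    size = int(match.group(2)) + len(HEADER_EXAMPLE) * count
    return '+OK %d %d\r\n' % (count, size)
```

Over-reporting by the longest possible header is safe; under-reporting could make the client truncate a download.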

IMAGES = ('helmet', 'status', 'config',
          'message', 'train', 'classify', 'query')

class ServerLineReader(Dibbler.BrighterAsyncChat):
    """An async socket that reads lines from a remote server and
    simply calls a callback with the data.  The BayesProxy object
    can't connect to the real POP3 server and talk to it
    synchronously, because that would block the process."""

    def __init__(self, serverName, serverPort, lineCallback):
        Dibbler.BrighterAsyncChat.__init__(self)
        self.lineCallback = lineCallback
        self.request = ''
        self.set_terminator('\r\n')
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            self.connect((serverName, serverPort))
        except socket.error, e:
            error = "Can't connect to %s:%d: %s" % (serverName, serverPort, e)
            print >>sys.stderr, error
            self.lineCallback('-ERR %s\r\n' % error)
            self.lineCallback('')   # "The socket's been closed."

    def collect_incoming_data(self, data):
        self.request = self.request + data

    def found_terminator(self):
        self.lineCallback(self.request + '\r\n')
        self.request = ''

    def handle_close(self):
        self.lineCallback('')   # Tell the callback the socket has closed.
        self.close()

class POP3ProxyBase(Dibbler.BrighterAsyncChat):
    """An async dispatcher that understands POP3 and proxies to a POP3
    server, calling `self.onTransaction(request, response)` for each
    transaction. Responses are not un-byte-stuffed before reaching
    self.onTransaction() (they probably should be for a totally generic
    POP3ProxyBase class, but BayesProxy doesn't need it and it would
    mean re-stuffing them afterwards).  self.onTransaction() should
    return the response to pass back to the email client - the response
    can be the verbatim response or a processed version of it.  The
    special command 'KILL' kills it (passing a 'QUIT' command to the
    server)."""

    def __init__(self, clientSocket, serverName, serverPort):
        Dibbler.BrighterAsyncChat.__init__(self, clientSocket)
        self.request = ''
        self.response = ''
        self.command = ''           # The POP3 command being processed...
        self.args = ''              # ...and its arguments
        self.isClosing = False      # Has the server closed the socket?
        self.seenAllHeaders = False # For the current RETR or TOP
        self.startTime = 0          # (ditto)
        self.serverSocket = ServerLineReader(serverName, serverPort,
                                             self.onServerLine)
        self.set_terminator('\r\n')

    def onTransaction(self, command, args, response):
        """Overide this.  Takes the raw request and the response, and
        returns the (possibly processed) response to pass back to the
        email client."""
        raise NotImplementedError

    def onServerLine(self, line):
        """A line of response has been received from the POP3 server."""
        isFirstLine = not self.response
        self.response = self.response + line

        # Is this the line that terminates a set of headers?
        self.seenAllHeaders = self.seenAllHeaders or line in ['\r\n', '\n']

        # Has the server closed its end of the socket?
        if not line:
            self.isClosing = True

        # If we're not processing a command, just echo the response.
        if not self.command:
            self.push(line)
            self.response = ''

        # Time out after 30 seconds for message-retrieval commands if
        # all the headers are down.  The rest of the message will proxy
        # straight through.
        if self.command in ['TOP', 'RETR'] and \
           self.seenAllHeaders and time.time() > self.startTime + 30:
            self.onResponse()
            self.response = ''
        # If that's a complete response, handle it.
        elif not self.isMultiline() or line == '.\r\n' or \
           (isFirstLine and line.startswith('-ERR')):
            self.onResponse()
            self.response = ''

    def isMultiline(self):
        """Returns True if the request should get a multiline
        response (assuming the response is positive)."""
        if self.command in ['USER', 'PASS', 'APOP', 'QUIT',
                            'STAT', 'DELE', 'NOOP', 'RSET', 'KILL']:
            return False
        elif self.command in ['RETR', 'TOP']:
            return True
        elif self.command in ['LIST', 'UIDL']:
            return len(self.args) == 0
        else:
            # Assume that an unknown command will get a single-line
            # response.  This should work for errors and for POP-AUTH,
            # and is harmless even for multiline responses - the first
            # line will be passed to onTransaction and ignored, then the
            # rest will be proxied straight through.
            return False

    ## This is an attempt to solve the problem whereby the email client
    ## times out and closes the connection but the ServerLineReader is still
    ## connected, so you get errors from the POP3 server next time because
    ## there's already an active connection.  But after introducing this,
    ## I kept getting unexplained "Bad file descriptor" errors in recv.
    ## def handle_close(self):
    ##     """If the email client closes the connection unexpectedly, eg.
    ##     because of a timeout, close the server connection."""
    ##     self.serverSocket.shutdown(2)
    ##     self.serverSocket.close()
    ##     self.close()

    def collect_incoming_data(self, data):
        """Asynchat override."""
        self.request = self.request + data

    def found_terminator(self):
        """Asynchat override."""
        verb = self.request.strip().upper()
        if verb == 'KILL':
            raise SystemExit
        elif verb == 'CRASH':
            # For testing
            x = 0
            y = 1/x

        self.serverSocket.push(self.request + '\r\n')
        if self.request.strip() == '':
            # Someone just hit the Enter key.
            self.command = self.args = ''
        else:
            # A proper command.
            splitCommand = self.request.strip().split(None, 1)
            self.command = splitCommand[0].upper()
            self.args = splitCommand[1:]
            self.startTime = time.time()

        self.request = ''

    def onResponse(self):
        # Pass the request and the raw response to the subclass and
        # send back the cooked response.
        if self.response:
            cooked = self.onTransaction(self.command, self.args, self.response)
            self.push(cooked)

        # If onServerLine() decided that the server has closed its
        # socket, close this one when the response has been sent.
        if self.isClosing:
            self.close_when_done()

        # Reset.
        self.command = ''
        self.args = ''
        self.isClosing = False
        self.seenAllHeaders = False

class BayesProxyListener(Dibbler.Listener):
    """Listens for incoming email client connections and spins off
    BayesProxy objects to serve them."""

    def __init__(self, serverName, serverPort, proxyPort):
        proxyArgs = (serverName, serverPort)
        Dibbler.Listener.__init__(self, proxyPort, BayesProxy, proxyArgs)
        print 'Listener on port %s is proxying %s:%d' % \
               (_addressPortStr(proxyPort), serverName, serverPort)

class BayesProxy(POP3ProxyBase):
    """Proxies between an email client and a POP3 server, inserting
    judgement headers.  It acts on the following POP3 commands:

     o STAT:
        o Adds the size of all the judgement headers to the maildrop
          size.

     o LIST:
        o With no message number: adds the size of a judgement header
          to the message size for each message in the scan listing.
        o With a message number: adds the size of a judgement header
          to the message size.

     o RETR:
        o Adds the judgement header based on the raw headers and body
          of the message.

     o TOP:
        o Adds the judgement header based on the raw headers and as
          much of the body as the TOP command retrieves.  This can
          mean that the header might have a different value for
          different calls to TOP, or for calls to TOP vs. calls to
          RETR.  I'm assuming that the email client will either not
          make multiple calls, or will cope with the headers being

    def __init__(self, clientSocket, serverName, serverPort):
        POP3ProxyBase.__init__(self, clientSocket, serverName, serverPort)
        self.handlers = {'STAT': self.onStat, 'LIST': self.onList,
                         'RETR': self.onRetr, 'TOP': self.onTop}
        state.totalSessions += 1
        state.activeSessions += 1
        self.isClosed = False
        if options.verbose:
            # Log the whole POP3 conversation (see the module docstring).
            self.logFile = open('_pop3proxy.log', 'ab', 0)

    def send(self, data):
        """Logs the data to the log file."""
        if options.verbose:
            self.logFile.write(data)
            self.logFile.flush()
        try:
            return POP3ProxyBase.send(self, data)
        except socket.error:
            # The email client has closed the connection - 40tude Dialog
            # does this immediately after issuing a QUIT command,
            # without waiting for the response.
            self.close()
            return ''

    def recv(self, size):
        """Logs the data to the log file."""
        data = POP3ProxyBase.recv(self, size)
        if options.verbose:
            self.logFile.write(data)
            self.logFile.flush()
        return data

    def close(self):
        # This can be called multiple times by async.
        if not self.isClosed:
            self.isClosed = True
            state.activeSessions -= 1
            POP3ProxyBase.close(self)

    def onTransaction(self, command, args, response):
        """Takes the raw request and response, and returns the
        (possibly processed) response to pass back to the email client."""
        handler = self.handlers.get(command, self.onUnknown)
        return handler(command, args, response)

    def onStat(self, command, args, response):
        """Adds the size of all the judgement headers to the maildrop
        size."""
        match = re.search(r'^\+OK\s+(\d+)\s+(\d+)(.*)\r\n', response)
        if match:
            count = int(match.group(1))
            size = int(match.group(2)) + len(HEADER_EXAMPLE) * count
            return '+OK %d %d%s\r\n' % (count, size, match.group(3))
        else:
            return response

    def onList(self, command, args, response):
        """Adds the size of a judgement header to the message size(s)
        in the response."""
        if response.count('\r\n') > 1:
            # Multiline: all lines but the first contain a message size.
            lines = response.split('\r\n')
            outputLines = [lines[0]]
            for line in lines[1:]:
                match = re.search('^(\d+)\s+(\d+)', line)
                if match:
                    number = int(match.group(1))
                    size = int(match.group(2)) + len(HEADER_EXAMPLE)
                    line = "%d %d" % (number, size)
                outputLines.append(line)
            return '\r\n'.join(outputLines)
        else:
            # Single line.
            match = re.search('^\+OK\s+(\d+)(.*)\r\n', response)
            if match:
                size = int(match.group(1)) + len(HEADER_EXAMPLE)
                return "+OK %d%s\r\n" % (size, match.group(2))
            else:
                return response

    def onRetr(self, command, args, response):
        """Adds the judgement header based on the raw headers and body
        of the message."""
        # Use '\n\r?\n' to detect the end of the headers in case of
        # broken emails that don't use the proper line separators.
        if re.search(r'\n\r?\n', response):
            # Break off the first line, which will be '+OK'.
            ok, messageText = response.split('\n', 1)

            # Now find the spam disposition and add the header.
            prob = state.bayes.spamprob(tokenizer.tokenize(messageText))
            if prob < options.ham_cutoff:
                disposition = options.header_ham_string
                if command == 'RETR':
                    state.numHams += 1
            elif prob > options.spam_cutoff:
                disposition = options.header_spam_string
                if command == 'RETR':
                    state.numSpams += 1
            else:
                disposition = options.header_unsure_string
                if command == 'RETR':
                    state.numUnsure += 1

            header = '%s: %s\r\n' % (options.hammie_header_name, disposition)
            headers, body = re.split(r'\n\r?\n', messageText, 1)
            headers = headers + "\n" + header + "\r\n"
            messageText = headers + body

            # Cache the message; don't pollute the cache with test messages.
            if command == 'RETR' and not state.isTest:
                # The message name is the time it arrived, with a uniquifier
                # appended if two arrive within one clock tick of each other.
                messageName = "%10.10d" % long(time.time())
                if messageName == state.lastBaseMessageName:
                    messageName = "%s-%d" % (messageName, state.uniquifier)
                    state.uniquifier += 1
                else:
                    state.lastBaseMessageName = messageName
                    state.uniquifier = 2

                # Write the message into the Unknown cache.
                message = state.unknownCorpus.makeMessage(messageName)
                message.setSubstance(messageText)
                state.unknownCorpus.addMessage(message)

            # Return the +OK and the message with the header added.
            return ok + "\n" + messageText
        else:
            # Must be an error response.
            return response

    def onTop(self, command, args, response):
        """Adds the judgement header based on the raw headers and as
        much of the body as the TOP command retrieves."""
        # Easy (but see the caveat in BayesProxy.__doc__).
        return self.onRetr(command, args, response)

    def onUnknown(self, command, args, response):
        """Default handler; returns the server's response verbatim."""
        return response

class UserInterfaceServer(Dibbler.HTTPServer):
    """Implements the web server component via a Dibbler plugin."""

    def __init__(self, uiPort):
        Dibbler.HTTPServer.__init__(self, uiPort)
        print 'User interface url is http://localhost:%d/' % (uiPort)

def readUIResources():
    """Returns ui.html and a dictionary of GIFs.  Used here and by
    OptionConfig."""

    # Using `exec` is nasty, but I couldn't figure out a way of making
    # `getattr` or `__import__` work with ResourcePackage.
    from spambayes.resources import ui_html
    images = {}
    for imageName in IMAGES:
        exec "from spambayes.resources import %s_gif" % imageName
        exec "images[imageName] = %s_gif.data" % imageName
    return ui_html.data, images

class UserInterface(Dibbler.HTTPPlugin):
    """Serves the HTML user interface of the proxy."""

    def __init__(self):
        """Load up the necessary resources: ui.html and helmet.gif."""
        htmlSource, self._images = readUIResources()
        self.html = PyMeldLite.Meld(htmlSource, readonly=True)

    def onIncomingConnection(self, clientSocket):
        """Checks the security settings."""
        return options.html_ui_allow_remote_connections or \
               clientSocket.getpeername()[0] == clientSocket.getsockname()[0]

    def _writePreamble(self, name, showImage=True):
        """Writes the HTML for the beginning of a page - time-consuming
        methlets use this and `_writePostamble` to write the page in
        pieces, including progress messages."""

        # Take the whole palette and remove the content and the footer,
        # leaving the header and an empty body.
        html = self.html.clone()
        html.mainContent = " "
        del html.footer

        # Add in the name of the page and remove the link to Home if this
        # *is* Home.
        html.title = name
        if name == 'Home':
            del html.homelink
            html.pagename = "Home"
        else:
            html.pagename = "> " + name

        # Remove the helmet image if we're not showing it - this happens on
        # shutdown because the browser might ask for the image after we've
        # exited.
        if not showImage:
            del html.helmet

        # Strip the closing tags, so we push as far as the start of the main
        # content.  We'll push the closing tags at the end.
        self.write(re.sub(r'</div>\s*</body>\s*</html>', '', str(html)))

    def _writePostamble(self):
        """Writes the end of time-consuming pages - see `_writePreamble`."""
        footer = self.html.footer.clone()
        footer.timestamp = time.asctime(time.localtime())
        self.write("</div>" + str(footer) + "</body></html>")

    def _trimHeader(self, field, limit, quote=False):
        """Trims a string, adding an ellipsis if necessary and HTML-quoting
        on request.  Also pumps it through email.Header.decode_header, which
        understands charset sections in email headers - I suspect this will
        only work for Latin character sets, but hey, it works for Francois
        Granger's name.  8-)"""

        sections = email.Header.decode_header(field)
        field = ' '.join([text for text, unused in sections])
        if len(field) > limit:
            field = field[:limit-3] + "..."
        if quote:
            field = cgi.escape(field)
        return field

    def onHome(self):
        """Serve up the homepage."""
        stateDict = state.__dict__.copy()
        statusTable = self.html.statusTable.clone()
        if not state.servers:
            statusTable.proxyDetails = "No POP3 proxies running."
        content = (self._buildBox('Status and Configuration',
                                  'status.gif', statusTable % stateDict)+
                   self._buildBox('Train on proxied messages',
                                  'train.gif', self.html.reviewText) +
                   self._buildTrainBox() +
                   self._buildClassifyBox() +
                   self._buildBox('Word query', 'query.gif',
                                  self.html.wordQuery))
        self._writePreamble("Home")
        self.write(content)
        self._writePostamble()

    def _doSave(self):
        """Saves the database."""
        self.write("<b>Saving... ")
        self.flush()
        state.bayes.store()
        self.write("Done</b>.\n")

    def onSave(self, how):
        """Command handler for "Save" and "Save & shutdown"."""
        isShutdown = how.lower().find('shutdown') >= 0
        self._writePreamble("Save", showImage=(not isShutdown))
        self._doSave()
        if isShutdown:
            self.write("<p>%s</p>" % self.html.shutdownMessage)
            self.flush()
            ## Is this still required?: self.shutdown(2)
            raise SystemExit
        self._writePostamble()

    def onTrain(self, file, text, which):
        """Train on an uploaded or pasted message."""

        # Upload or paste?  Spam or ham?
        content = file or text
        isSpam = (which == 'Train as Spam')

        # Convert platform-specific line endings into unix-style.
        content = content.replace('\r\n', '\n').replace('\r', '\n')

        # Single message or mbox?
        if content.startswith('From '):
            # Get a list of raw messages from the mbox content.
            class SimpleMessage:
                def __init__(self, fp):
                    self.guts = fp.read()
            contentFile = StringIO.StringIO(content)
            mbox = mailbox.PortableUnixMailbox(contentFile, SimpleMessage)
            messages = map(lambda m: m.guts, mbox)
        else:
            # Just the one message.
            messages = [content]

        # Append the message(s) to a file, to make it easier to rebuild
        # the database later.   This is a temporary implementation -
        # it should keep a Corpus of trained messages.
        if isSpam:
            f = open("_pop3proxyspam.mbox", "a")
        else:
            f = open("_pop3proxyham.mbox", "a")

        # Train on the uploaded message(s).
        for message in messages:
            tokens = tokenizer.tokenize(message)
            state.bayes.learn(tokens, isSpam)
            f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n")
            f.write(message)
            f.write("\n\n")
        f.close()

        # Save the database and return a link Home and another training form.
        self.write("<p>OK. Return <a href='home'>Home</a> or train again:</p>")

    def _keyToTimestamp(self, key):
        """Given a message key (as seen in a Corpus), returns the timestamp
        for that message.  This is the time that the message was received,
        not the Date header."""
        return long(key[:10])

    def _getTimeRange(self, timestamp):
        """Given a unix timestamp, returns a 3-tuple: the start timestamp
        of the given day, the end timestamp of the given day, and the
        formatted date of the given day."""
        # This probably works on Summertime-shift days; time will tell.  8-)
        this = time.localtime(timestamp)
        start = (this[0], this[1], this[2], 0, 0, 0, this[6], this[7], this[8])
        end = time.localtime(time.mktime(start) + 36*60*60)
        end = (end[0], end[1], end[2], 0, 0, 0, end[6], end[7], end[8])
        date = time.strftime("%A, %B %d, %Y", start)
        return time.mktime(start), time.mktime(end), date

    def _buildReviewKeys(self, timestamp):
        """Builds an ordered list of untrained message keys, ready for output
        in the Review list.  Returns a 5-tuple: the keys, the formatted date
        for the list (eg. "Friday, November 15, 2002"), the start of the prior
        page or zero if there isn't one, likewise the start of the given page,
        and likewise the start of the next page."""
        # Fetch all the message keys and sort them into timestamp order.
        allKeys = state.unknownCorpus.keys()

        # The default start timestamp is derived from the most recent message,
        # or the system time if there are no messages (not that it gets used).
        if not timestamp:
            if allKeys:
                timestamp = self._keyToTimestamp(allKeys[-1])
            else:
                timestamp = time.time()
        start, end, date = self._getTimeRange(timestamp)

        # Find the subset of the keys within this range.
        startKeyIndex = bisect.bisect(allKeys, "%d" % long(start))
        endKeyIndex = bisect.bisect(allKeys, "%d" % long(end))
        keys = allKeys[startKeyIndex:endKeyIndex]

        # What timestamps to use for the prior and next days?  If there are
        # any messages before/after this day's range, use the timestamps of
        # those messages - this will skip empty days.
        prior = end = 0
        if startKeyIndex != 0:
            prior = self._keyToTimestamp(allKeys[startKeyIndex-1])
        if endKeyIndex != len(allKeys):
            end = self._keyToTimestamp(allKeys[endKeyIndex])

        # Return the keys and their date.
        return keys, date, prior, start, end

    def _makeMessageInfo(self, message):
        """Given an email.Message, return an object with subjectHeader,
        fromHeader and bodySummary attributes.  These objects are passed into
        appendMessages by onReview - passing email.Message objects directly
        uses too much memory."""
        subjectHeader = message["Subject"] or "(none)"
        fromHeader = message["From"] or "(none)"
        try:
            part = typed_subpart_iterator(message, 'text', 'plain').next()
            text = part.get_payload()
        except StopIteration:
            try:
                part = typed_subpart_iterator(message, 'text', 'html').next()
                text = part.get_payload()
                text, unused = tokenizer.crack_html_style(text)
                text, unused = tokenizer.crack_html_comment(text)
                text = tokenizer.html_re.sub(' ', text)
                text = '(this message only has an HTML body)\n' + text
            except StopIteration:
                text = '(this message has no text body)'
        text = text.replace('&nbsp;', ' ')      # Else they'll be quoted
        text = re.sub(r'(\s)\s+', r'\1', text)  # Eg. multiple blank lines
        text = text.strip()

        class _MessageInfo:
            pass
        messageInfo = _MessageInfo()
        messageInfo.subjectHeader = self._trimHeader(subjectHeader, 50, True)
        messageInfo.fromHeader = self._trimHeader(fromHeader, 40, True)
        messageInfo.bodySummary = self._trimHeader(text, 200)
        return messageInfo

    def _appendMessages(self, table, keyedMessageInfo, label):
        """Appends the rows of a table of messages to 'table'."""
        stripe = 0
        for key, messageInfo in keyedMessageInfo:
            row = self.html.reviewRow.clone()
            if label == 'Spam':
                row.spam.checked = 1
            elif label == 'Ham':
                row.ham.checked = 1
            else:
                row.defer.checked = 1
            row.subject = messageInfo.subjectHeader
            row.subject.title = messageInfo.bodySummary
            row.from_ = messageInfo.fromHeader
            setattr(row, 'class', ['stripe_on', 'stripe_off'][stripe]) # Grr!
            row = str(row).replace('TYPE', label).replace('KEY', key)
            table += row
            stripe = stripe ^ 1

    def onReview(self, **params):
        """Present a list of messages for (re)training."""
        # Train/discard submitted messages.
        id = ''
        numTrained = 0
        numDeferred = 0
        for key, value in params.items():
            if key.startswith('classify:'):
                id = key.split(':')[2]
                if value == 'spam':
                    targetCorpus = state.spamCorpus
                elif value == 'ham':
                    targetCorpus = state.hamCorpus
                elif value == 'discard':
                    targetCorpus = None
                    try:
                        state.unknownCorpus.removeMessage(
                            state.unknownCorpus[id])
                    except KeyError:
                        pass  # Must be a reload.
                else: # defer
                    targetCorpus = None
                    numDeferred += 1
                if targetCorpus:
                    try:
                        targetCorpus.takeMessage(id, state.unknownCorpus)
                        if numTrained == 0:
                            self.write("<p><b>Training... ")
                        numTrained += 1
                    except KeyError:
                        pass  # Must be a reload.

        # Report on any training, and save the database if there was any.
        if numTrained > 0:
            plural = ''
            if numTrained != 1:
                plural = 's'
            self.write("Trained on %d message%s. " % (numTrained, plural))
            self._doSave()

        # If any messages were deferred, show the same page again.
        if numDeferred > 0:
            start = self._keyToTimestamp(id)

        # Else after submitting a whole page, display the prior page or the
        # next one.  Derive the day of the submitted page from the ID of the
        # last processed message.
        elif id:
            start = self._keyToTimestamp(id)
            unused, unused, prior, unused, next = self._buildReviewKeys(start)
            if prior:
                start = prior
            else:
                start = next

        # Else if they've hit Previous or Next, display that page.
        elif params.get('go') == 'Next day':
            start = self._keyToTimestamp(params['next'])
        elif params.get('go') == 'Previous day':
            start = self._keyToTimestamp(params['prior'])

        # Else show the most recent day's page, as decided by _buildReviewKeys.
        else:
            start = 0

        # Build the lists of messages: spams, hams and unsure.
        keys, date, prior, this, next = self._buildReviewKeys(start)
        keyedMessageInfo = {options.header_spam_string: [],
                            options.header_ham_string: [],
                            options.header_unsure_string: []}
        for key in keys:
            # Parse the message, get the judgement header and build a message
            # info object for each message.
            cachedMessage = state.unknownCorpus[key]
            message = mboxutils.get_message(cachedMessage.getSubstance())
            judgement = message[options.hammie_header_name] or \
                        options.header_unsure_string
            messageInfo = self._makeMessageInfo(message)
            keyedMessageInfo[judgement].append((key, messageInfo))

        # Present the list of messages in their groups in reverse order of
        # appearance.
        if keys:
            page = self.html.reviewtable.clone()
            if prior:
                page.prior.value = prior
                del page.priorButton.disabled
            if next:
                page.next.value = next
                del page.nextButton.disabled
            templateRow = page.reviewRow.clone()
            page.table = ""  # To make way for the real rows.
            for header, label in ((options.header_spam_string, 'Spam'),
                                  (options.header_ham_string, 'Ham'),
                                  (options.header_unsure_string, 'Unsure')):
                messages = keyedMessageInfo[header]
                if messages:
                    subHeader = str(self.html.reviewSubHeader)
                    subHeader = subHeader.replace('TYPE', label)
                    page.table += self.html.blankRow
                    page.table += subHeader
                    self._appendMessages(page.table, messages, label)

            page.table += self.html.trainRow
            title = "Untrained messages received on %s" % date
            box = self._buildBox(title, None, page)  # No icon, to save space.
        else:
            page = "<p>There are no untrained messages to display. "
            page += "Return <a href='home'>Home</a>.</p>"
            title = "No untrained messages"
            box = self._buildBox(title, 'status.gif', page)
        self.write(box)


    def onClassify(self, file, text, which):
        """Classify an uploaded or pasted message."""
        message = file or text
        message = message.replace('\r\n', '\n').replace('\r', '\n') # For Macs
        tokens = tokenizer.tokenize(message)
        probability, clues = state.bayes.spamprob(tokens, evidence=True)

        cluesTable = self.html.cluesTable.clone()
        cluesRow = cluesTable.cluesRow.clone()
        del cluesTable.cluesRow   # Delete dummy row to make way for real ones
        for word, wordProb in clues:
            cluesTable += cluesRow % (word, wordProb)

        results = self.html.classifyResults.clone()
        results.probability = probability
        results.cluesBox = self._buildBox("Clues:", 'status.gif', cluesTable)
        results.classifyAnother = self._buildClassifyBox()
        self.write(results)

    def onWordquery(self, word):
        word = word.lower()
        wordinfo = state.bayes._wordinfoget(word)
        if wordinfo:
            stats = self.html.wordStats.clone()
            stats.spamcount = wordinfo.spamcount
            stats.hamcount = wordinfo.hamcount
            stats.spamprob = state.bayes.probability(wordinfo)
        else:
            stats = "%r does not exist in the database." % word

        query = self.html.wordQuery.clone()
        query.word.value = word
        statsBox = self._buildBox("Statistics for %r" % word,
                                  'status.gif', stats)
        queryBox = self._buildBox("Word query", 'query.gif', query)
        self._writePreamble("Word query")
        self.write(statsBox + queryBox)

    def _writeImage(self, image):
        """Serves one of the UI's GIF images."""
        # Assumes IMAGES maps image names to their raw GIF data.
        self.writeOKHeaders('image/gif')
        self.write(IMAGES[image])

    # If you are easily offended, look away now...
    for imageName in IMAGES:
        exec "def %s(self): self._writeImage('%s')" % \
             ("on%sGif" % imageName.capitalize(), imageName)

    def _buildBox(self, heading, icon, content):
        """Builds a yellow-headed HTML box."""
        box = self.html.headedBox.clone()
        box.heading = heading
        if icon:
            box.icon.src = icon
            del box.iconCell
        box.boxContent = content
        return box

    def _buildClassifyBox(self):
        """Returns a "Classify a message" box.  This is used on both the Home
        page and the classify results page.  The Classify form is based on the
        Upload form."""

        form = self.html.upload.clone()
        del form.or_mbox
        del form.submit_spam
        del form.submit_ham
        form.action = "classify"
        return self._buildBox("Classify a message", 'classify.gif', form)

    def _buildTrainBox(self):
        """Returns a "Train on a given message" box.  This is used on both
        the Home page and the training results page.  The Train form is
        based on the Upload form."""

        form = self.html.upload.clone()
        del form.submit_classify
        return self._buildBox("Train on a given message", 'message.gif', form)

    def reReadOptions(self):
        """Called by the config page when the user saves some new options, or
        restores the defaults."""
        # Reload the options.
        global state
        global options
        from spambayes.Options import options

        # Recreate the state.
        state = State()

        # Close the existing listeners and create new ones.  This won't
        # affect any running proxies - once a listener has created a proxy,
        # that proxy is then independent of it.
        for proxy in proxyListeners:
            proxy.close()
        del proxyListeners[:]
        _createProxies(state.servers, state.proxyPorts)

# This keeps the global state of the module - the command-line options,
# statistics like how many mails have been classified, the handle of the
# log file, the Classifier and FileCorpus objects, and so on.
class State:
    def __init__(self):
        """Initialises the State object that holds the state of the app.
        The default settings are read from Options.py and bayescustomize.ini
        and are then overridden by the command-line processing code in the
        __main__ code below."""
        # Open the log file.
        if options.verbose:
            self.logFile = open('_pop3proxy.log', 'wb', 0)

        # Load up the old proxy settings from Options.py / bayescustomize.ini
        # and give warnings if they're present.   XXX Remove these soon.
        self.servers = []
        self.proxyPorts = []
        if options.pop3proxy_port != 110 or \
           options.pop3proxy_server_name != '' or \
           options.pop3proxy_server_port != 110:
            print "\n    pop3proxy_port, pop3proxy_server_name and"
            print "    pop3proxy_server_port are deprecated!  Please use"
            print "    pop3proxy_servers and pop3proxy_ports instead.\n"
            self.servers = [(options.pop3proxy_server_name,
                             options.pop3proxy_server_port)]
            self.proxyPorts = [options.pop3proxy_port]

        # Load the new proxy settings - these will override the old ones
        # if both are present.
        if options.pop3proxy_servers:
            for server in options.pop3proxy_servers.split(','):
                server = server.strip()
                if server.find(':') > -1:
                    server, port = server.split(':', 1)
                else:
                    port = '110'
                self.servers.append((server, int(port)))

        if options.pop3proxy_ports:
            splitPorts = options.pop3proxy_ports.split(',')
            self.proxyPorts = map(_addressAndPort, splitPorts)

        if len(self.servers) != len(self.proxyPorts):
            print "pop3proxy_servers & pop3proxy_ports are different lengths!"

        # Load up the other settings from Option.py / bayescustomize.ini
        self.useDB = options.pop3proxy_persistent_use_database
        self.uiPort = options.html_ui_port
        self.launchUI = options.html_ui_launch_browser
        self.gzipCache = options.pop3proxy_cache_use_gzip
        self.cacheExpiryDays = options.pop3proxy_cache_expiry_days
        self.runTestServer = False
        self.isTest = False

        # Set up the statistics.
        self.totalSessions = 0
        self.activeSessions = 0
        self.numSpams = 0
        self.numHams = 0
        self.numUnsure = 0

        # Unique names for cached messages - see BayesProxy.onRetr
        self.lastBaseMessageName = ''
        self.uniquifier = 2

    def buildServerStrings(self):
        """After the server details have been set up, this creates string
        versions of the details, for display in the Status panel."""
        serverStrings = ["%s:%s" % (s, p) for s, p in self.servers]
        self.serversString = ', '.join(serverStrings)
        self.proxyPortsString = ', '.join(map(_addressPortStr, self.proxyPorts))

    def createWorkers(self):
        """Using the options that were initialised in __init__ and then
        possibly overridden by the driver code, create the Bayes object,
        the Corpuses, the Trainers and so on."""
        print "Loading database...",
        if self.isTest:
            self.useDB = True
            options.pop3proxy_persistent_storage_file = \
                        '_pop3proxy_test.pickle'   # This is never saved.
        if self.useDB:
            self.bayes = storage.DBDictClassifier(
                            options.pop3proxy_persistent_storage_file)
        else:
            self.bayes = storage.PickledClassifier(
                            options.pop3proxy_persistent_storage_file)
        print "Done."

        # Don't set up the caches and training objects when running the self-test,
        # so as not to clutter the filesystem.
        if not self.isTest:
            def ensureDir(dirname):
                try:
                    os.mkdir(dirname)
                except OSError, e:
                    if e.errno != errno.EEXIST:
                        raise

            # Create/open the Corpuses.  Use small cache sizes to avoid hogging
            # lots of memory.
            map(ensureDir, [options.pop3proxy_spam_cache,
                            options.pop3proxy_ham_cache,
                            options.pop3proxy_unknown_cache])
            if self.gzipCache:
                factory = GzipFileMessageFactory()
            else:
                factory = FileMessageFactory()
            age = options.pop3proxy_cache_expiry_days*24*60*60
            self.spamCorpus = ExpiryFileCorpus(age, factory,
                                               options.pop3proxy_spam_cache)
            self.hamCorpus = ExpiryFileCorpus(age, factory,
                                              options.pop3proxy_ham_cache)
            self.unknownCorpus = FileCorpus(factory,
                                            options.pop3proxy_unknown_cache)

            # Expire old messages from the trained corpuses.
            self.spamCorpus.removeExpiredMessages()
            self.hamCorpus.removeExpiredMessages()

            # Create the Trainers.
            self.spamTrainer = storage.SpamTrainer(self.bayes)
            self.hamTrainer = storage.HamTrainer(self.bayes)
            self.spamCorpus.addObserver(self.spamTrainer)
            self.hamCorpus.addObserver(self.hamTrainer)

# option-parsing helper functions

def _addressAndPort(s):
    """Decode a string representing a port to bind to, with optional
    address."""
    s = s.strip()
    if ':' in s:
        addr, port = s.split(':')
        return addr, int(port)
    else:
        return '', int(s)

def _addressPortStr((addr, port)):
    """Encode an (address, port) pair as a display string."""
    if not addr:
        return str(port)
    else:
        return '%s:%d' % (addr, port)
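The behaviour of these two helpers is easiest to see with a round trip. This sketch repeats the same parsing logic in modern Python, purely for illustration:

```python
def address_and_port(s):
    # 'addr:port' -> (addr, port); a bare 'port' -> ('', port)
    s = s.strip()
    if ':' in s:
        addr, port = s.split(':')
        return addr, int(port)
    return '', int(s)

def address_port_str(pair):
    # Inverse of the above, for display purposes.
    addr, port = pair
    return '%s:%d' % (addr, port) if addr else str(port)
```

So `'localhost:8110'` decodes to `('localhost', 8110)` and a bare `'110'` to `('', 110)`, meaning "bind port 110 on all addresses".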

state = State()
proxyListeners = []
def _createProxies(servers, proxyPorts):
    """Create BayesProxyListeners for all the given servers."""
    for (server, serverPort), proxyPort in zip(servers, proxyPorts):
        listener = BayesProxyListener(server, serverPort, proxyPort)
        proxyListeners.append(listener)

def main(servers, proxyPorts, uiPort, launchUI):
    """Runs the proxy forever or until a 'KILL' command is received or
    someone hits Ctrl+Break."""
    _createProxies(servers, proxyPorts)
    httpServer = UserInterfaceServer(uiPort)
    proxyUI = UserInterface()
    httpServer.register(proxyUI, OptionsConfigurator(proxyUI))
    Dibbler.run(launchBrowser=launchUI)

# ===================================================================
# Test code.
# ===================================================================

# One example of spam and one of ham - both are used to train, and are
# then classified.  Not a good test of the classifier, but a perfectly
# good test of the POP3 proxy.  The bodies of these came from the
# spambayes project, and I added the headers myself because the
# originals had no headers.

spam1 = """From: friend at public.com
Subject: Make money fast

Hello tim_chandler , Want to save money ?
Now is a good time to consider refinancing. Rates are low so you can cut
your current payments and save money.

Take off list on site [s5]
"""

good1 = """From: chris at example.com
Subject: ZPT and DTML

Jean Jordaan wrote:
> 'Fraid so ;>  It contains a vintage dtml-calendar tag.
>   http://www.zope.org/Members/teyc/CalendarTag
> Hmm I think I see what you mean: one needn't manually pass on the
> namespace to a ZPT?

Yeah, Page Templates are a bit more clever, sadly, DTML methods aren't :-(
"""


class TestListener(Dibbler.Listener):
    """Listener for TestPOP3Server.  Works on port 8110, to co-exist
    with real POP3 servers."""

    def __init__(self, socketMap=asyncore.socket_map):
        Dibbler.Listener.__init__(self, 8110, TestPOP3Server,
                                  (socketMap,), socketMap=socketMap)

class TestPOP3Server(Dibbler.BrighterAsyncChat):
    """Minimal POP3 server, for testing purposes.  Doesn't support
    UIDL.  USER, PASS, APOP, DELE and RSET simply return "+OK"
    without doing anything.  Also understands the 'KILL' command, to
    kill it.  The mail content is the example messages above.

    def __init__(self, clientSocket, socketMap):
        # Grumble: asynchat.__init__ doesn't take a 'map' argument,
        # hence the two-stage construction.
        Dibbler.BrighterAsyncChat.set_socket(self, clientSocket, socketMap)
        self.maildrop = [spam1, good1]
        self.okCommands = ['USER', 'PASS', 'APOP', 'NOOP',
                           'DELE', 'RSET', 'QUIT', 'KILL']
        self.handlers = {'STAT': self.onStat,
                         'LIST': self.onList,
                         'RETR': self.onRetr,
                         'TOP': self.onTop}
        self.push("+OK ready\r\n")
        self.request = ''
        self.set_terminator('\r\n')

    def collect_incoming_data(self, data):
        """Asynchat override."""
        self.request = self.request + data

    def found_terminator(self):
        """Asynchat override."""
        if ' ' in self.request:
            command, args = self.request.split(None, 1)
        else:
            command, args = self.request, ''
        command = command.upper()
        if command in self.okCommands:
            self.push("+OK (we hope)\r\n")
            if command == 'QUIT':
                self.close_when_done()
            if command == 'KILL':
                raise SystemExit
        else:
            handler = self.handlers.get(command, self.onUnknown)
            self.push(handler(command, args))   # Or push_slowly for testing
        self.request = ''

    def push_slowly(self, response):
        """Useful for testing."""
        for c in response:
            self.push(c)
            time.sleep(0.02)

    def onStat(self, command, args):
        """POP3 STAT command."""
        maildropSize = reduce(operator.add, map(len, self.maildrop))
        maildropSize += len(self.maildrop) * len(HEADER_EXAMPLE)
        return "+OK %d %d\r\n" % (len(self.maildrop), maildropSize)

    def onList(self, command, args):
        """POP3 LIST command, with optional message number argument."""
        if args:
            try:
                number = int(args)
            except ValueError:
                number = -1
            if 0 < number <= len(self.maildrop):
                return "+OK %d\r\n" % len(self.maildrop[number-1])
            else:
                return "-ERR no such message\r\n"
        else:
            returnLines = ["+OK"]
            for messageIndex in range(len(self.maildrop)):
                size = len(self.maildrop[messageIndex])
                returnLines.append("%d %d" % (messageIndex + 1, size))
            return '\r\n'.join(returnLines) + '\r\n'
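For reference, a client would decode a multi-line LIST response like the one built above roughly as follows. This is a sketch only; note that this toy server omits the terminating `'.'` line that a real POP3 server sends after a multi-line response:

```python
def parse_list_response(response):
    # '+OK\r\n1 412\r\n2 903\r\n' -> {1: 412, 2: 903}
    lines = response.split('\r\n')
    if not lines[0].startswith('+OK'):
        raise ValueError('server returned an error: %r' % lines[0])
    sizes = {}
    for line in lines[1:]:
        if line and line != '.':
            number, size = line.split()
            sizes[int(number)] = int(size)
    return sizes
```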

    def _getMessage(self, number, maxLines):
        """Implements the POP3 RETR and TOP commands."""
        if 0 < number <= len(self.maildrop):
            message = self.maildrop[number-1]
            headers, body = message.split('\n\n', 1)
            bodyLines = body.split('\n')[:maxLines]
            message = headers + '\r\n\r\n' + '\n'.join(bodyLines)
            return "+OK\r\n%s\r\n.\r\n" % message
        else:
            return "-ERR no such message\r\n"

    def onRetr(self, command, args):
        """POP3 RETR command."""
            number = int(args)
        except ValueError:
            number = -1
        return self._getMessage(number, 12345)

    def onTop(self, command, args):
        """POP3 RETR command."""
            number, lines = map(int, args.split())
        except ValueError:
            number, lines = -1, -1
        return self._getMessage(number, lines)

    def onUnknown(self, command, args):
        """Unknown POP3 command."""
        return "-ERR Unknown command: %s\r\n" % repr(command)

def test():
    """Runs a self-test using TestPOP3Server, a minimal POP3 server
    that serves the example emails above.
    # Run a proxy and a test server in separate threads with separate
    # asyncore environments.
    import threading
    state.isTest = True
    testServerReady = threading.Event()
    def runTestServer():
        testSocketMap = {}

    proxyReady = threading.Event()
    def runUIAndProxy():
        httpServer = UserInterfaceServer(8881)
        proxyUI = UserInterface()
        httpServer.register(proxyUI, OptionsConfigurator(proxyUI))
        BayesProxyListener('localhost', 8110, 8111)
        state.bayes.learn(tokenizer.tokenize(spam1), True)
        state.bayes.learn(tokenizer.tokenize(good1), False)
        proxyReady.set()
        Dibbler.run()
    threading.Thread(target=runUIAndProxy).start()
    proxyReady.wait()


    # Connect to the proxy.
    proxy = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    proxy.connect(('localhost', 8111))
    response = proxy.recv(100)
    assert response == "+OK ready\r\n"

    # Stat the mailbox to get the number of messages.
    proxy.send("stat\r\n")
    response = proxy.recv(100)
    count, totalSize = map(int, response.split()[1:3])
    assert count == 2

    # Loop through the messages ensuring that they have judgement
    # headers.
    for i in range(1, count+1):
        response = ""
        proxy.send("retr %d\r\n" % i)
        while response.find('\n.\r\n') == -1:
            response = response + proxy.recv(1000)
        assert response.find(options.hammie_header_name) >= 0

    # Smoke-test the HTML UI.
    httpServer = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    httpServer.connect(('localhost', 8881))
    httpServer.sendall("get / HTTP/1.0\r\n\r\n")
    response = ''
    while 1:
        packet = httpServer.recv(1000)
        if not packet: break
        response += packet
    assert re.search(r"(?s)<html>.*Spambayes proxy.*</html>", response)

    # Kill the proxy and the test server.
    proxy.sendall("kill\r\n")
    proxy.recv(100)
    pop3Server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    pop3Server.connect(('localhost', 8110))
    pop3Server.sendall("kill\r\n")
    pop3Server.recv(100)

# ===================================================================
# __main__ driver.
# ===================================================================

def run():
    # Read the arguments.
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'htdbzp:l:u:')
    except getopt.error, msg:
        print >>sys.stderr, str(msg) + '\n\n' + __doc__
        sys.exit()

    runSelfTest = False
    for opt, arg in opts:
        if opt == '-h':
            print >>sys.stderr, __doc__
            sys.exit()
        elif opt == '-t':
            state.isTest = True
            state.runTestServer = True
        elif opt == '-b':
            state.launchUI = True
        elif opt == '-d':
            state.useDB = True
        elif opt == '-p':
            options.pop3proxy_persistent_storage_file = arg
        elif opt == '-l':
            state.proxyPorts = [_addressAndPort(arg)]
        elif opt == '-u':
            state.uiPort = int(arg)
        elif opt == '-z':
            state.isTest = True
            runSelfTest = True

    # Do whatever we've been asked to do...
    if runSelfTest:
        print "\nRunning self-test...\n"
        state.buildServerStrings()
        test()
        print "Self-test passed."   # ...else it would have asserted.

    elif state.runTestServer:
        print "Running a test POP3 server on port 8110..."
        TestListener()
        asyncore.loop()

    elif 0 <= len(args) <= 2:
        # Normal usage, with optional server name and port number.
        if len(args) == 1:
            state.servers = [(args[0], 110)]
        elif len(args) == 2:
            state.servers = [(args[0], int(args[1]))]

        state.buildServerStrings()
        main(state.servers, state.proxyPorts, state.uiPort, state.launchUI)

    else:
        print >>sys.stderr, __doc__

if __name__ == '__main__':
    run()
-------------- next part --------------


Dibbler is a Python web application framework.  It lets you create web-based
applications by writing independent plug-in modules that don't require any
networking code.  Dibbler takes care of the HTTP side of things, leaving you
to write the application code.

*Plugins and Methlets*

Dibbler uses a system of plugins to implement the application logic.  Each
page maps to a 'methlet', which is a method of a plugin object that serves
that page, and is named after the page it serves.  The address
`http://server/spam` calls the methlet `onSpam`.  `onHome` is a reserved
methlet name for the home page, `http://server/`.  For resources that need a
file extension (eg. images) you can use a URL such as `http://server/eggs.gif`
to map to the `onEggsGif` methlet.  All the registered plugins are searched
for the appropriate methlet, so you can combine multiple plugins to build
your application.
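The URL-to-methlet mapping described above can be sketched as a small function (illustrative only, not Dibbler's actual code):

```python
def methlet_name(path):
    # '/spam' -> 'onSpam'; '/' -> 'onHome'; '/eggs.gif' -> 'onEggsGif'
    name = path.strip('/')
    if not name:
        return 'onHome'   # reserved methlet name for the home page
    parts = name.replace('.', ' ').split()
    return 'on' + ''.join(p.capitalize() for p in parts)
```

Dispatch then amounts to looking up that name on each registered plugin with `getattr` until one of them has it.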

A methlet needs to call `self.writeOKHeaders('text/html')` followed by
`self.write(content)`.  You can pass whatever content-type you like to
`writeOKHeaders`, so serving images, PDFs, etc. is no problem.  If a methlet
wants to return an HTTP error code, it should call (for example)
`self.writeError(403, "Forbidden")` instead of `writeOKHeaders`
and `write`.  If it wants to write its own headers (for instance to return
a redirect) it can simply call `write` with the full HTTP response.

If a methlet raises an exception, it is automatically turned into a "500
Server Error" page with a full traceback in it.


Methlets can take parameters, the values of which are taken from form
parameters submitted by the browser.  So if your form says
`<form action='subscribe'><input type="text" name="email"/> ...` then your
methlet should look like `def onSubscribe(self, email=None)`.  It's good
practice to give all the parameters default values, in case the user navigates
to that URL without submitting a form, or submits the form without filling in
any parameters.  If you have lots of parameters, or their names are determined
at runtime, you can define your methlet like this:
`def onComplex(self, **params)` to get a dictionary of parameters.
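A sketch of the idea (this is not Dibbler's actual parsing code): the query string a browser submits becomes the keyword arguments the methlet receives.

```python
from urllib.parse import parse_qs

def form_params(query):
    # 'email=fred%40example.com' -> {'email': 'fred@example.com'}
    return {key: values[0] for key, values in parse_qs(query).items()}

class SubscribePlugin:
    """Hypothetical plugin holding the onSubscribe methlet from the text."""
    def onSubscribe(self, email=None):
        return 'subscribed %s' % email if email else 'no address given'
```

Calling `SubscribePlugin().onSubscribe(**form_params('email=fred%40example.com'))` shows why default values matter: with an empty query string the methlet is still callable.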


Here's a web application server that serves a calendar for a given year:

>>> import Dibbler, calendar
>>> class Calendar(Dibbler.HTTPPlugin):
...     _form = '''<html><body><h3>Calendar Server</h3>
...                <form action='/'>
...                Year: <input type='text' name='year' size='4'>
...                <input type='submit' value='Go'></form>
...                <pre>%s</pre></body></html>'''
...     def onHome(self, year=None):
...         if year:
...             result = calendar.calendar(int(year))
...         else:
...             result = ""
...         self.writeOKHeaders('text/html')
...         self.write(self._form % result)
>>> httpServer = Dibbler.HTTPServer(8888)
>>> httpServer.register(Calendar())
>>> Dibbler.run(launchBrowser=True)

Your browser will start, and you can ask for a calendar for the year of
your choice.  If you don't want to start the browser automatically, just call
`run()` with no arguments - the application is available at
http://localhost:8888/ .  You'll have to kill the server manually because it
provides no way to stop it; a real application would have some kind of
'shutdown' methlet that called `sys.exit()`.

By combining Dibbler with an HTML manipulation library like
PyMeld (shameless plug - see http://entrian.com/PyMeld for details) you can
keep the HTML and Python code separate.

*Building applications*

You can run several plugins together like this:

>>> httpServer = Dibbler.HTTPServer()
>>> httpServer.register(plugin1, plugin2, plugin3)
>>> Dibbler.run()

...so many plugin objects, each implementing a different set of pages,
can cooperate to implement a web application.  See also the `HTTPServer`
documentation for details of how to run multiple `Dibbler` environments
simultaneously in different threads.

*Controlling connections*

There are times when your code needs to be informed the moment an incoming
connection is received, before any HTTP conversation begins.  For instance,
you might want to only accept connections from `localhost` for security
reasons.  If this is the case, your plugin should implement the
`onIncomingConnection` method.  This will be passed the incoming socket
before any reads or writes have taken place, and should return True to allow
the connection through or False to reject it.  Here's an implementation of
the `localhost`-only idea:

>>> def onIncomingConnection(self, clientSocket):
>>>     return clientSocket.getpeername()[0] == clientSocket.getsockname()[0]
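That check can be exercised with an ordinary loopback socket, independently of Dibbler. A minimal sketch using only the standard socket module:

```python
import socket

def is_local(client_socket):
    # True when the peer connected from the same address our end is bound to.
    return client_socket.getpeername()[0] == client_socket.getsockname()[0]

# Quick demonstration over the loopback interface:
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))          # any free port
server.listen(1)
client = socket.create_connection(server.getsockname())
conn, _ = server.accept()
print(is_local(conn))                  # True for a loopback connection
conn.close(); client.close(); server.close()
```

A connection from another machine would carry that machine's address in `getpeername()`, so `is_local` would return False and the plugin could veto it.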

*Advanced usage: Dibbler Contexts*

If you want to run several independent Dibbler environments (in different
threads for example) then each should use its own `Context`.  Normally
you'd say something like:

>>> httpServer = Dibbler.HTTPServer()
>>> httpServer.register(MyPlugin())
>>> Dibbler.run()

but that's only safe to do from one thread.  Instead, you can say:

>>> myContext = Dibbler.Context()
>>> httpServer = Dibbler.HTTPServer(context=myContext)
>>> httpServer.register(MyPlugin())
>>> Dibbler.run(myContext)

in as many threads as you like.

*Dibbler and asyncore*

If this section means nothing to you, you can safely ignore it.

Dibbler is built on top of Python's asyncore library, which means that it
integrates into other asyncore-based applications, and you can write other
asyncore-based components and run them as part of the same application.

By default, Dibbler uses the default asyncore socket map.  This means that
`Dibbler.run()` also runs your asyncore-based components, provided they're
using the default socket map.  If you want to tell Dibbler to use a
different socket map, either to co-exist with other asyncore-based components
using that map or to insulate Dibbler from such components by using a
different map, you need to use a `Dibbler.Context`.  If you're using your own
socket map, give it to the context: `context = Dibbler.Context(myMap)`.  If
you want Dibbler to use its own map: `context = Dibbler.Context({})`.

You can either call `Dibbler.run(context)` to run the async loop, or call
`asyncore.loop()` directly - the only difference is that the former has a
few more options, like launching the web browser automatically.


Running `Dibbler.py` directly as a script runs the example calendar server
plus a self-test.

# Dibbler is released under the Python Software Foundation license; see
# http://www.python.org/

__author__ = "Richie Hindle <richie at entrian.com>"
__credits__ = "Tim Stone"

try:
    import cStringIO as StringIO
except ImportError:
    import StringIO

import os, sys, re, time, traceback
import socket, asyncore, asynchat, cgi, urlparse, webbrowser

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0

class BrighterAsyncChat(asynchat.async_chat):
    """An asynchat.async_chat that doesn't give spurious warnings on
    receiving an incoming connection, lets SystemExit cause an exit, can
    flush its output, and will correctly remove itself from a non-default
    socket map on `close()`."""

    def __init__(self, conn=None, map=None):
        """See `asynchat.async_chat`."""
        asynchat.async_chat.__init__(self, conn)
        self._map = map

    def handle_connect(self):
        """Suppresses the asyncore "unhandled connect event" warning."""

    def handle_error(self):
        """Let SystemExit cause an exit."""
        type, v, t = sys.exc_info()
        if type == socket.error and v[0] == 9:  # Why?  Who knows...
            pass
        elif type == SystemExit:
            raise
        else:
            asynchat.async_chat.handle_error(self)

    def flush(self):
        """Flush everything in the output buffer."""
        while self.producer_fifo or self.ac_out_buffer:
            self.initiate_send()

    def close(self):
        """Remove this object from the correct socket map."""

class Context:
    """See the main documentation for details of `Dibbler.Context`."""
    def __init__(self, asyncMap=asyncore.socket_map):
        self._HTTPPort = None  # Stores the port for `run(launchBrowser=True)`
        self._map = asyncMap

_defaultContext = Context()

class Listener(asyncore.dispatcher):
    """Generic listener class used by all the different types of server.
    Listens for incoming socket connections and calls a factory function
    to create handlers for them."""

    def __init__(self, port, factory, factoryArgs,
                 socketMap=asyncore.socket_map):
        """Creates a listener object, which will listen for incoming
        connections when Dibbler.run is called:

         o port: The TCP/IP (address, port) to listen on. Usually '' -
           meaning bind to all IP addresses that the machine has - will be
           passed as the address. For backwards interface compatibility, if
           port is just an int, an address of '' will be assumed.

         o factory: The function to call to create a handler (can be a
           class).

         o factoryArgs: The arguments to pass to the handler factory.  For
           proper context support, this should include a `context` argument
           (or a `socketMap` argument for pure asyncore listeners).  The
           incoming socket will be prepended to this list, and passed as the
           first argument.  See `HTTPServer` for an example.

         o socketMap: Optional.  The asyncore socket map to use.  If you're
           using a `Dibbler.Context`, pass context._map.

        See `HTTPServer` for an example `Listener` - it's a good deal smaller
        than this description!"""

        asyncore.dispatcher.__init__(self, map=socketMap)
        self.socketMap = socketMap
        self.factory = factory
        self.factoryArgs = factoryArgs
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(False)
        self.set_socket(s, self.socketMap)
        self.set_reuse_addr()
        if type(port) != type(()):
            port = ('', port)
        self.bind(port)
        self.listen(5)

    def handle_accept(self):
        """Asyncore override."""
        # If an incoming connection is instantly reset, eg. by following a
        # link in the web interface then instantly following another one or
        # hitting stop, handle_accept() will be triggered but accept() will
        # return None.
        result = self.accept()
        if result:
            clientSocket, clientAddress = result
            args = [clientSocket] + list(self.factoryArgs)
            self.factory(*args)

class HTTPServer(Listener):
    """A web server with which you can register `HTTPPlugin`s to serve up
    your content - see `HTTPPlugin` for detailed documentation and examples.

    `port` specifies the TCP/IP (address, port) on which to run, defaulting 
    to ('', 80).

    `context` optionally specifies a `Dibbler.Context` for the server.
    """

    def __init__(self, port=('', 80), context=_defaultContext):
        """Create an `HTTPServer` for the given port."""
        Listener.__init__(self, port, _HTTPHandler,
                          (self, context), context._map)
        self._plugins = []
        context._HTTPPort = port

    def register(self, *plugins):
        """Registers one or more `HTTPPlugin`-derived objects with the
        server."""
        for plugin in plugins:
            self._plugins.append(plugin)

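# A minimal usage sketch (the names `myServer` and `MyPlugin` are
# illustrative, not part of this module): create a server, register a
# plugin, then enter the event loop via `run()`:
#
#     myServer = HTTPServer(('localhost', 8080))
#     myServer.register(MyPlugin())
#     run()
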
class _HTTPHandler(BrighterAsyncChat):
    """This is a helper for the HTTP server class - one of these is created
    for each incoming request, and does the job of decoding the HTTP traffic
    and driving the plugins."""

    def __init__(self, clientSocket, server, context):
        # Grumble: asynchat.__init__ doesn't take a 'map' argument,
        # hence the two-stage construction.
        BrighterAsyncChat.__init__(self, map=context._map)
        BrighterAsyncChat.set_socket(self, clientSocket, context._map)
        self._server = server
        self._request = ''

        # Because a methlet is likely to call `writeOKHeaders` before doing
        # anything else, an unexpected exception won't send back a 500, which
        # is poor.  So we buffer any sent headers until either a plain `write`
        # happens or the methlet returns.
        self._bufferedHeaders = []
        self._headersWritten = False

        # Tell the plugins about the connection, letting them veto it.
        for plugin in self._server._plugins:
            if not plugin.onIncomingConnection(clientSocket):
                self.close()
                return

    def collect_incoming_data(self, data):
        """Asynchat override."""
        self._request = self._request + data

    def found_terminator(self):
        """Asynchat override."""
        # Parse the HTTP request.
        requestLine, headers = (self._request+'\r\n').split('\r\n', 1)
        try:
            method, url, version = requestLine.strip().split()
        except ValueError:
            self.pushError(400, "Malformed request: '%s'" % requestLine)
            self.close_when_done()
            return

        # Parse the URL, and deal with POST vs. GET requests.
        method = method.upper()
        unused, unused, path, unused, query, unused = urlparse.urlparse(url)
        cgiParams = cgi.parse_qs(query, keep_blank_values=True)
        if self.get_terminator() == '\r\n\r\n' and method == 'POST':
            # We need to read the body - set a numeric async_chat terminator
            # equal to the Content-Length.
            match = re.search(r'(?i)content-length:\s*(\d+)', headers)
            contentLength = int(match.group(1))
            if contentLength > 0:
                self.set_terminator(contentLength)
                self._request = self._request + '\r\n\r\n'
                return

        # Have we just read the body of a POSTed request?  Decode the body,
        # which will contain parameters and possibly uploaded files.
        if type(self.get_terminator()) is type(1):
            body = self._request.split('\r\n\r\n', 1)[1]
            match = re.search(r'(?i)content-type:\s*([^\r\n]+)', headers)
            contentTypeHeader = match.group(1)
            contentType, pdict = cgi.parse_header(contentTypeHeader)
            if contentType == 'multipart/form-data':
                # multipart/form-data - probably a file upload.
                bodyFile = StringIO.StringIO(body)
                cgiParams.update(cgi.parse_multipart(bodyFile, pdict))
            else:
                # A normal x-www-form-urlencoded form post.
                cgiParams.update(cgi.parse_qs(body, keep_blank_values=True))

        # Convert the cgi params into a simple dictionary.
        params = {}
        for name, value in cgiParams.iteritems():
            params[name] = value[0]

        # Find and call the methlet.  '/eggs.gif' becomes 'onEggsGif'.
        if path == '/':
            path = '/Home'
        pieces = path[1:].split('.')
        name = 'on' + ''.join([piece.capitalize() for piece in pieces])
        for plugin in self._server._plugins:
            if hasattr(plugin, name):
                # The plugin's APIs (`write`, etc) reflect back to us via
                # `plugin._handler`.
                plugin._handler = self
                try:
                    # Call the methlet.
                    getattr(plugin, name)(**params)
                    if self._bufferedHeaders:
                        # The methlet returned without writing anything other
                        # than headers.  This isn't unreasonable - it might
                        # have written a 302 or something.  Flush the buffered
                        # headers.
                        self.write(None)
                except:
                    # The methlet raised an exception - send the traceback to
                    # the browser, unless it's SystemExit in which case we let
                    # it go.
                    eType, eValue, eTrace = sys.exc_info()
                    if eType == SystemExit:
                        raise
                    message = """<h3>500 Server error</h3><pre>%s</pre>"""
                    details = traceback.format_exception(eType, eValue, eTrace)
                    details = '\n'.join(details)
                    self.writeError(500, message % cgi.escape(details))
                plugin._handler = None
                break
        else:
            self.onUnknown(path, params)

        # `close_when_done` and `Connection: close` ensure that we don't
        # support keep-alives or pipelining.  There are problems with some
        # browsers, for instance with extra characters being appended after
        # the body of a POSTed request.
        self.close_when_done()

    def onUnknown(self, path, params):
        """Handler for unknown URLs.  Returns a 404 page."""
        self.writeError(404, "Not found: '%s'" % path)

    def writeOKHeaders(self, contentType, extraHeaders={}):
        """Reflected from `HTTPPlugin`s."""
        # Buffer the headers until there's a `write`, in case an error occurs.
        timeNow = time.gmtime(time.time())
        httpNow = time.strftime('%a, %d %b %Y %H:%M:%S GMT', timeNow)
        headers = []
        headers.append("HTTP/1.1 200 OK")
        headers.append("Connection: close")
        headers.append("Content-Type: %s" % contentType)
        headers.append("Date: %s" % httpNow)
        for name, value in extraHeaders.items():
            headers.append("%s: %s" % (name, value))
        self._bufferedHeaders = headers

    def writeError(self, code, message):
        """Reflected from `HTTPPlugin`s."""
        # Writing an error overrides any buffered headers, but obviously
        # doesn't want to write any headers if some have already gone.
        headers = []
        if not self._headersWritten:
            headers.append("HTTP/1.0 %d Error" % code)
            headers.append("Connection: close")
            headers.append("Content-Type: text/html")
        self.push("%s<html><body>%s</body></html>" % \
                  ('\r\n'.join(headers), message))

    def write(self, content):
        """Reflected from `HTTPPlugin`s."""
        # The methlet is writing, so write any buffered headers first.
        headers = []
        if self._bufferedHeaders:
            headers = self._bufferedHeaders
            self._bufferedHeaders = None
            self._headersWritten = True

        # `write(None)` just flushes buffered headers.
        if content is None:
            content = ''
        self.push('\r\n'.join(headers) + str(content))

class HTTPPlugin:
    """Base class for HTTP server plugins.  See the main documentation for
    details."""

    def __init__(self):
        # self._handler is filled in by `_HTTPHandler.found_terminator()`.
        self._handler = None

    def onIncomingConnection(self, clientSocket):
        """Implement this and return False to veto incoming connections."""
        return True

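    # An illustrative override (hypothetical, not part of this module): a
    # plugin could veto connections from anywhere but localhost like this:
    #
    #     def onIncomingConnection(self, clientSocket):
    #         return clientSocket.getpeername()[0] == '127.0.0.1'
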
    def writeOKHeaders(self, contentType, extraHeaders={}):
        """A methlet should call this with the Content-Type and optionally
        a dictionary of extra headers (eg. Expires) before calling
        `write()`."""
        return self._handler.writeOKHeaders(contentType, extraHeaders)

    def writeError(self, code, message):
        """A methlet should call this instead of `writeOKHeaders()` /
        `write()` to report an HTTP error (eg. 403 Forbidden)."""
        return self._handler.writeError(code, message)

    def write(self, content):
        """A methlet should call this after `writeOKHeaders` to write the
        page's content."""
        return self._handler.write(content)

    def flush(self):
        """A methlet can call this after calling `write`, to ensure that
        the content is written immediately to the browser.  This isn't
        necessary most of the time, but if you're writing "Please wait..."
        before performing a long operation, calling `flush()` is a good
        idea."""
        return self._handler.flush()

    def close(self, flush=True):
        """Closes the connection to the browser.  You should call `close()`
        before calling `sys.exit()` in any 'shutdown' methlets you write."""
        if flush:
            self.flush()
        return self._handler.close()
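
# An illustrative plugin (`HelloPlugin` is hypothetical, not part of this
# module).  A methlet named `onHello` serves the URL '/hello':
#
#     class HelloPlugin(HTTPPlugin):
#         def onHello(self):
#             self.writeOKHeaders('text/html')
#             self.write("<html><body>Hello, world.</body></html>")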

def run(launchBrowser=False, context=_defaultContext):
    """Runs a `Dibbler` application.  Servers listen for incoming connections
    and route requests through to plugins until a plugin calls `sys.exit()`
    or raises a `SystemExit` exception."""

    if launchBrowser:
        webbrowser.open_new("http://localhost:%d/" % context._HTTPPort)
    asyncore.loop(map=context._map)

def runTestServer(readyEvent=None):
    """Runs the calendar server example, with an added `/shutdown` URL."""
    import Dibbler, calendar
    class Calendar(Dibbler.HTTPPlugin):
        _form = '''<html><body><h3>Calendar Server</h3>
                   <form action='/'>
                   Year: <input type='text' name='year' size='4'>
                   <input type='submit' value='Go'></form>
                   <pre>%s</pre></body></html>'''

        def onHome(self, year=None):
            if year:
                result = calendar.calendar(int(year))
            else:
                result = ""
            self.write(self._form % result)

        def onShutdown(self):
            self.writeOKHeaders('text/html')
            self.write("<html><body><p>OK.</p></body></html>")
            self.close()
            sys.exit()

    httpServer = Dibbler.HTTPServer(8888)
    httpServer.register(Calendar())
    if readyEvent:
        # Tell the self-test code that the test server is up and running.
        readyEvent.set()
    Dibbler.run(launchBrowser=True)

def test():
    """Run a self-test."""
    # Run the calendar server in a separate thread.
    import re, threading, urllib
    testServerReady = threading.Event()
    threading.Thread(target=runTestServer, args=(testServerReady,)).start()
    testServerReady.wait()

    # Connect to the server and ask for a calendar.
    page = urllib.urlopen("http://localhost:8888/?year=2003").read()
    if page.find('January') != -1:
        print "Self-test passed."
    else:
        print "Self-test failed!"

    # Wait for a key while the user plays with his browser.
    raw_input("Press any key to shut down the application server...")

    # Ask the server to shut down.
    page = urllib.urlopen("http://localhost:8888/shutdown").read()
    if page.find('OK') != -1:
        print "Shutdown OK."
    else:
        print "Shutdown failed!"

if __name__ == '__main__':
    test()