[spambayes-dev] New sort+group.py

Tim Peters tim.one at comcast.net
Sat Dec 27 04:58:31 EST 2003


Attached is a major rewrite of testtools/sort+group.py.  Anyone who uses
that, please give it a try.  If nobody gripes, I'll check it in.  (If you're
on Linux, the attached probably has Windows line ends, and you may need to
change that.)

It's used exactly the same way as before, and creates filenames with the
same pattern as before, *except* that any pre-existing extension (like
".txt" on Windows) is preserved.  Extensions are necessary for sane life on
Windows, but the code currently checked in strips extensions as part of
renaming.
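
A minimal sketch of the naming rule, separate from the attached script: new_basename is a hypothetical helper and the sample file names are made up. It assumes only what the attachment itself does, namely taking the extension from os.path.splitext() and appending it to the "%04d-%06d" group-and-sequence pattern.

```python
import os.path

def new_basename(basename, group, index):
    # Hypothetical helper for illustration only.
    # splitext() returns (root, ext); ext is "" when there is no extension,
    # so files without one get exactly the old-style name.
    extension = os.path.splitext(basename)[-1]
    return "%04d-%06d%s" % (group, index, extension)

print(new_basename("msg0042.txt", 3, 17))   # 0003-000017.txt
print(new_basename("msg0042", 3, 17))       # 0003-000017
```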

The major thrust of the changes is to order msgs by full-precision UTC
timestamp.  The old code sorted just by date (not time), and didn't account
for the fact that different ISPs may be in different time zones.  It also
failed to parse many of the Received headers in my email, partly because
Comcast's Received headers don't make any attempt to keep the date-time part
on a single line.  Other failures were due to "unusual" spellings in the
date-time part.  Instead, email.Utils.parsedate_tz() is now used to parse
this stuff, and that hasn't failed on any of the email I've tried so far.
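
The parsing step can be sketched roughly as follows. This is an illustration under stated assumptions, not the attachment: the Received header is invented, received_timestamp is a hypothetical name, and the import uses the lowercase email.utils spelling found in current Python rather than the email.Utils spelling in the script.

```python
import time
from email.utils import parsedate_tz, mktime_tz

# A made-up Received header whose date-time part is folded across lines.
received = ("Received: from mx.example.com ([192.0.2.1])\r\n"
            "\tby mail.example.net with SMTP; Wed,\r\n"
            "\t22 Oct 2003 14:05:09 -0400\r\n")

def received_timestamp(received):
    # Hypothetical helper: return a UTC timestamp, or None on failure.
    i = received.rfind(';')          # the date-time follows the last semicolon
    if i < 0:
        return None
    # Collapse the folding whitespace so the date is on one logical line.
    datestring = ' '.join(received[i+1:].split())
    as_tuple = parsedate_tz(datestring)
    if as_tuple is None:
        return None
    return mktime_tz(as_tuple)       # applies the zone offset, giving UTC

ts = received_timestamp(received)
print(time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(ts)))  # 2003-10-22 18:05:09
```

Collapsing whitespace before calling parsedate_tz() is what makes the folded "Wed,\r\n\t22 Oct ..." form harmless.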

Almost all Received headers I see have hour:minute:second info, and since I
do incremental training during the day, as email comes in, it's important to
me that the email be ordered at finer granularity than "a day".  A second
should be good enough <wink>.  My various ISPs are in different time zones
too, and normalizing to UTC should help model effects like this one: the
first time I see a new spam campaign, it's much more likely to arrive from
my MSN account than from my Comcast account.
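
To make the effect of UTC normalization concrete, here is a small sketch with invented date strings (the labels "msn", "comcast", and "later" are only placeholders, and the email.utils spelling is the current-Python one). By local wall-clock time the first message looks earlier, but in UTC it arrived half an hour after the second; the day groups are then counted from the earliest message, the same way the attachment computes them.

```python
from email.utils import parsedate_tz, mktime_tz

SECONDS_PER_DAY = 24 * 60 * 60

# Made-up dates standing in for accounts whose servers are in different zones.
dates = {
    "msn":     "Wed, 22 Oct 2003 09:30:00 -0700",   # 16:30:00 UTC
    "comcast": "Wed, 22 Oct 2003 12:00:00 -0400",   # 16:00:00 UTC
    "later":   "Thu, 23 Oct 2003 17:00:00 -0400",   # 21:00:00 UTC, next day
}

# Sort by UTC timestamp, as the script does with (timestamp, path) pairs.
stamps = sorted((mktime_tz(parsedate_tz(d)), name) for name, d in dates.items())

earliest = stamps[0][0]
groups = [(name, int((ts - earliest) / SECONDS_PER_DAY)) for ts, name in stamps]
print(groups)   # [('comcast', 0), ('msn', 0), ('later', 1)]
```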
-------------- next part --------------
#! /usr/bin/env python

### Sort and group the messages in the Data hierarchy.
### Run this prior to mksets.py for setting stuff up for
### testing of chronological incremental training.

"""Usage: sort+group.py

This program has no options!  Muahahahaha!
"""

import sys
import os
import glob
import time

from email.Utils import parsedate_tz, mktime_tz

loud = True
SECONDS_PER_DAY = 24 * 60 * 60

# Scan the file with path fpath for its first Received header, and return
# a UTC timestamp for the date-time it specifies.  If anything goes wrong
# (can't find a Received header; can't parse the date), return None.
# This is the best guess about when we received the msg.
def get_time(fpath):
    fh = file(fpath, "rb")
    # Find first Received header.
    for line in fh:
        if line.lower().startswith("received:"):
            break
    else:
        print "\nNo Received header found."
        fh.close()
        return None
    # Paste on the continuation lines.
    received = line
    for line in fh:
        if line[0] in ' \t':
            received += line
        else:
            break
    fh.close()
    # RFC 2822 says the date-time field must follow a semicolon at the end.
    i = received.rfind(';')
    if i < 0:
        print "\n" + received
        print "No semicolon found in Received header."
        return None
    # We only want the part after the semicolon.
    datestring = received[i+1:]
    # It may still be split across lines (like "Wed, \r\n\t22 Oct ...").
    datestring = ' '.join(datestring.split())
    as_tuple = parsedate_tz(datestring)
    if as_tuple is None:
        print "\n" + received
        print "Couldn't parse the date: %r" % datestring
        return None
    return mktime_tz(as_tuple)

def main():
    """Main program: scan, sort, and rename the messages."""

    data = []   # list of (time_received, path) pairs
    now = time.time()
    if loud:
        print "Scanning everything"
    for name in glob.glob('Data/*/*/*'):
        if loud:
            sys.stdout.write("%-78s\r" % name)
            sys.stdout.flush()
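        when_received = get_time(name)
        # If no usable timestamp was found, fall back to the current time,
        # so such msgs sort as if they had just arrived.
        data.append((when_received or now, name))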

    if loud:
        print ""
        print "Sorting ..."
    data.sort()

    # First rename all the files to a form we can't produce in the end.
    # This is to protect against name clashes in case the files are
    # already named according to the scheme we use.
    if loud:
        print "Renaming first pass ..."
    for dummy, name in data:
        dirname = os.path.dirname(name)
        basename = os.path.basename(name)
        os.rename(name, os.path.join(dirname, "-"+basename))

    if loud:
        print "Renaming second pass ..."
    if not data:
        return  # nothing matched Data/*/*/*, so there is nothing to rename
    earliest = data[0][0]  # timestamp of earliest msg received
    for i, (when_received, name) in enumerate(data):
        dirname = os.path.dirname(name)
        basename = os.path.basename(name)
        extension = os.path.splitext(basename)[-1]
        group = int((when_received - earliest) / SECONDS_PER_DAY)
        newbasename = "%04d-%06d%s" % (group, i, extension)
        os.rename(os.path.join(dirname, "-"+basename),
                  os.path.join(dirname, newbasename))

if __name__ == "__main__":
    main()

