
I'd like to produce improved bounce statistics from Mailman's logs; for example, I'd like to track, per message, how many recipients there were and how many of them bounced.
This means the logs need to include the message ID. Most of Mailman's logs (smtp, post, vette) include the ID, but the critical missing piece is the bounce log. To close the loop, I'll need to extract the ID of the message that was bounced. (The message ID of the bounce message itself isn't useful, and may not even exist; qmail's bounces apparently don't have IDs.)
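To make the goal concrete, here is a minimal sketch of the per-message tally once both sides carry a message ID. It assumes the logs have already been parsed into simple Python values; the function name bounce_stats() and the sample records are hypothetical, and no real Mailman log format is implied.

```python
from collections import Counter

def bounce_stats(sent, bounced):
    """Return {message_id: (recipient_count, bounce_count)}.

    sent    -- iterable of (message_id, recipient_count) pairs (assumed
               to come from the smtp/post logs)
    bounced -- iterable of message IDs, one per logged bounce (assumed
               to come from the bounce log once it records the ID)
    """
    recipients = Counter()
    for msgid, nrecips in sent:
        recipients[msgid] += nrecips
    bounces = Counter(bounced)
    return {msgid: (total, bounces.get(msgid, 0))
            for msgid, total in recipients.items()}

# Made-up sample data, angle brackets included in the IDs.
sent = [('<a@example.com>', 100), ('<b@example.com>', 50)]
bounced = ['<a@example.com>', '<a@example.com>', '<b@example.com>']
stats = bounce_stats(sent, bounced)
```

The join only works if every log writes the ID in the same form, which is why the ID of the original message, not of the bounce, is the one that matters.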
How should I write the code to extract the ID? Looking through the bounce test messages, there are various formats so we'll need several functions, similar to how there are several bounced-address parsers in Mailman.Bouncers. Should I:
1) add an extract_message_id() function to all of the modules in Mailman.Bouncers, which currently just have a process() function?
2) have a new package, Mailman.Bouncers.MessageId or whatever, that has several modules, analogous to the existing bounce analysis?
3) have a bunch of analysis functions in one module, and a single master function that tries all of them?
I think 3) is the simplest course, but not too simple to be workable. Looking at the test bounces, finding the message ID is much simpler than finding the bounce address; searching for a small number of strings such as 'Original message follows' will often find the original headers. Can anyone see a reason that the more complicated 1) or 2) would be necessary?
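For illustration, the single-module option (several analysis functions plus one master function that tries each in turn) might look roughly like this. It is only a sketch: the two helper functions and the EXTRACTORS list are hypothetical names, and the patterns are guesses at what such helpers could check rather than anything taken from Mailman.

```python
import re

_FLAGS = re.IGNORECASE | re.MULTILINE
_MSGID = re.compile(r'^message-id:\s*(<[^>]*>)', _FLAGS)
_QUOTED_MSGID = re.compile(r'^>\s*message-id:\s*(<[^>]*>)', _FLAGS)
_MARKER = re.compile(r'original message follows', re.IGNORECASE)

def _from_original_headers(body):
    """Find a Message-ID header after an 'Original message follows' marker."""
    marker = _MARKER.search(body)
    if marker is None:
        return None
    mo = _MSGID.search(body, marker.end())
    return mo.group(1) if mo else None

def _from_quoted_headers(body):
    """Find a Message-ID header quoted with a leading '>'."""
    mo = _QUOTED_MSGID.search(body)
    return mo.group(1) if mo else None

# Hypothetical registry; a new bounce format would add one function here.
EXTRACTORS = [_from_original_headers, _from_quoted_headers]

def extract_message_id(body):
    """Master function: return the first ID any extractor finds, else None."""
    for extractor in EXTRACTORS:
        msgid = extractor(body)
        if msgid is not None:
            return msgid
    return None
```

Each helper returns None when its format does not apply, so the order of EXTRACTORS only matters when more than one format could match the same bounce.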
(I'll start a branch for this too, aimed at getting the change into 2.2.)
--amk

A.M. Kuchling wrote:
How should I write the code to extract the ID? Looking through the bounce test messages, there are various formats so we'll need several functions, similar to how there are several bounced-address parsers in Mailman.Bouncers. Should I:
I don't think we do.
I ran the following
    import os
    import re
    import email

    hre = re.compile('^>?\s*message-id:\s*(<.*>)', re.IGNORECASE)
    for f in os.listdir('.'):
        if not f.endswith('.txt'):
            continue
        msg = email.message_from_file(open(f))
        messageid = None
        inheaders = True
        for line in msg.as_string().splitlines():
            if inheaders:
                # Skip the DSN's own headers; its Message-ID is not wanted.
                if line == '':
                    inheaders = False
                continue
            mo = hre.search(line)
            if mo:
                messageid = mo.group(1)
                break
        print '%s: %s' % (f, messageid)
in current Mailman's test/bounces/ directory which contains 86 DSNs. Of those 86, 12 have no message id for the original message. Of the remaining 74, all message ids are found with the above.
If the re is changed to
hre = re.compile('^message-id:\s*(<.*>)', re.IGNORECASE)
73 of the 74 are found. llnl_01.txt has the 'original message' quoted with '>' characters. A few messages have the message id in a report section with leading whitespace, but they all have it later as well without leading whitespace.
In any case, I think the
hre = re.compile('^>?\s*message-id:\s*(<.*>)', re.IGNORECASE)
re will likely find anything to be found and is unlikely to find false hits.
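A small side-by-side check of the two patterns, as a sketch: the sample lines are made up for illustration and are not taken from the test bounces.

```python
import re

# The two patterns from the message, written as raw strings.
loose = re.compile(r'^>?\s*message-id:\s*(<.*>)', re.IGNORECASE)
strict = re.compile(r'^message-id:\s*(<.*>)', re.IGNORECASE)

samples = [
    'Message-ID: <1@example.com>',             # plain header line
    '>Message-Id: <2@example.com>',            # quoted with '>'
    '    message-id: <3@example.com>',         # leading whitespace
    'X-Original-Message-ID: <4@example.com>',  # a different header name
]

# One (loose matched, strict matched) pair per sample line.
results = [(bool(loose.search(s)), bool(strict.search(s))) for s in samples]
```

Because both patterns are anchored at the start of the line, a header whose name merely ends in "message-id" is not picked up by either one.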
-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

On Fri, Mar 07, 2008 at 04:02:12PM -0800, Mark Sapiro wrote:
In any case, I think the
hre = re.compile('^>?\s*message-id:\s*(<.*>)', re.IGNORECASE)
re will likely find anything to be found and is unlikely to find false hits.
Excellent point! I've written up a patch using the regex: https://sourceforge.net/tracker/?func=detail&atid=300103&aid=1911318&group_id=103
One minor test-suite oddity I noticed: about a third of the bounce messages in tests/bounces/ have mbox-style 'From VM Wed Mar 21 22:20:23 2001' header lines. Most of them seem harmless, but the one in hotpop_01.txt breaks the email.Message() parser because the line is ">From daemon Tue Nov 13 13:43:50 2001". Worth fixing?
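The effect of that line can be reproduced with a tiny made-up message. This is a sketch for Python 3's email package, not the Mailman test code: only the ">From" line is the one quoted above, the rest of RAW is invented, and drop_escaped_from() is a hypothetical helper.

```python
import email

RAW = (">From daemon Tue Nov 13 13:43:50 2001\n"
       "Subject: failure notice\n"
       "\n"
       "body text\n")

def drop_escaped_from(text):
    """Drop a leading '>From ' line so the real headers come first."""
    first, _sep, rest = text.partition('\n')
    return rest if first.startswith('>From ') else text

# The '>From ' line is not a valid header, so the parser treats the
# whole text as body and no headers are found.
broken = email.message_from_string(RAW)

# With the line removed, the headers parse normally.
fixed = email.message_from_string(drop_escaped_from(RAW))
```

A plain 'From ' envelope line is recognised by the parser, which would explain why the other test messages are unaffected.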
--amk

A.M. Kuchling wrote:
On Fri, Mar 07, 2008 at 04:02:12PM -0800, Mark Sapiro wrote:
In any case, I think the
hre = re.compile('^>?\s*message-id:\s*(<.*>)', re.IGNORECASE)
re will likely find anything to be found and is unlikely to find false hits.
Excellent point! I've written up a patch using the regex: https://sourceforge.net/tracker/?func=detail&atid=300103&aid=1911318&group_id=103
I've looked at the patch, and I wonder why it doesn't include the '<' and '>' in the log entries. According to RFC 2822, the '<' and '>' are part of the msg-id, and more importantly, all existing bounce, post, smtp and vette log entries with msg-ids include them.
One minor test-suite oddity I noticed: about a third of the bounce messages in tests/bounces/ have mbox-style 'From VM Wed Mar 21 22:20:23 2001' header lines. Most of them seem harmless, but the one in hotpop_01.txt breaks the email.Message() parser because the line is ">From daemon Tue Nov 13 13:43:50 2001". Worth fixing?
Yes. I'll remove that bogus line from the test message.
-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

On Mon, Mar 10, 2008 at 02:24:57PM -0700, Mark Sapiro wrote:
I've looked at the patch, and I wonder why it doesn't include the '<' and '>' in the log entries. According to RFC 2822, the '<' and '>' are
I had a vague, mistaken impression that Mailman didn't include them in the logs; I've uploaded an updated version of the patch that keeps the angle brackets.
--amk
participants (2)
- A.M. Kuchling
- Mark Sapiro