This is more a cautionary tale than a real problem, but it brings up a couple of issues to chew on.
I started having major problems with mailman when I brought lists.apple.com live (I'll have more to say about this site later, since there are a couple of things I need to look into more fully before I core-dump on that install...)
The main problem was that I was getting huge numbers of messages to the -admin addresses that were blank. Zero. First, I thought it was a corrupted list database. Then I thought it was a corrupted request database. Then I thought it was a corrupted message in the qfile dir that was causing corrupted messages to multiply. Then I didn't know what to think, so I just started taking the system apart piece by piece and running qfile messages through ONE AT A TIME to see where the problem came from. My favorite way to spend a weekend, that's for sure... (grin)
End result -- one minor configuration error in the mailer. One of the hostnames I use wasn't set up as a local name, so sendmail kept erroring out trying to talk to itself in one special case. But the bigger issue was -- the system was doing exactly what I told it to do.
I use demime to strip incoming e-mail to the text part. This works really pretty well. At some point, however, instead of just attaching demime to the posting and -request addresses, I also added it to the admin address.
Most incoming bounces now are in MIME format. End result: they come to the -admin address, the MIME gets stripped, and an empty message results. Since it's no longer recognizable as a bounce message, it gets sent to the admin. Load in a fairly dirty subscriber list and start sending messages -- and you get 10K blank messages in your mailbox in the morning.
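One way to avoid the failure described above is to leave MIME bounce reports alone before any stripping happens. The following is only a minimal sketch using Python's standard `email` package, not how demime or Mailman actually work: `looks_like_mime_bounce` and `filter_for_admin` are hypothetical names, and `strip` stands in for whatever MIME-stripping filter is attached to the address. It assumes bounces arrive as `multipart/report`, the container type defined for delivery status notifications; bounces in other formats would not be caught.

```python
from email import message_from_string


def looks_like_mime_bounce(raw_message):
    """Return True if the message is a multipart/report (DSN-style) bounce."""
    msg = message_from_string(raw_message)
    return msg.get_content_type() == "multipart/report"


def filter_for_admin(raw_message, strip):
    """Apply `strip` (a stand-in for the MIME stripper) unless the message
    looks like a MIME bounce, in which case pass it through untouched so
    bounce detection downstream still sees the original structure."""
    if looks_like_mime_bounce(raw_message):
        return raw_message
    return strip(raw_message)
```

With a guard like this on the -admin address, ordinary mail is still reduced to text, while structured bounces reach the bounce handler intact.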
Cautionary note: after you double-check all your configuration files for problems, make sure you also double-check all the custom stuff you did, to confirm you did it right. The "good" thing about this particular problem is that while I was busy mailbombing myself and my admins all weekend fighting this beast, to the end user, the site worked fine... If you HAVE to have problems, problems that aren't visible to the end user are preferable...
But it brings up a couple of issues I see with qrunner.
First, it seems like qrunner re-stats the qfiles dir and reloads its idea of what needs to be run. This creates a problem when you have lots of messages, since it's not processing things FIFO -- I found that some older messages were simply NEVER being run, because however qrunner was choosing messages out of qfiles, it wasn't choosing them. On a busy system, this can be a problem. I suggest instead that qrunner start up, grab the list of messages to run, and run them, oldest first, then exit. Let the next qrunner handle what comes in in the meantime. That way, things are run in something closer to FIFO order, and you don't get into the lost-stepchild queue file problem.
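The suggestion above (list the queue once, run oldest first, exit) can be sketched roughly as follows. This is not qrunner's actual code: `snapshot_oldest_first` and `run_queue_once` are hypothetical names, file modification time is assumed to be a good enough proxy for arrival order, and `process` is whatever callable delivers a single queue entry.

```python
import os


def snapshot_oldest_first(qdir):
    """List the queue directory exactly once; return paths, oldest first."""
    entries = []
    for name in os.listdir(qdir):
        path = os.path.join(qdir, name)
        try:
            mtime = os.stat(path).st_mtime
        except FileNotFoundError:
            continue  # entry vanished between listdir() and stat()
        entries.append((mtime, name, path))
    entries.sort()  # by mtime, then by name as a tie-breaker
    return [path for _, _, path in entries]


def run_queue_once(qdir, process):
    """Process the snapshot in order and return; anything that arrives
    during the run is left for the next invocation."""
    handled = []
    for path in snapshot_oldest_first(qdir):
        process(path)
        handled.append(path)
    return handled
```

Because the directory is never re-listed within a run, an old entry cannot be pushed aside indefinitely by newer arrivals in the same run.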
Second, qrunner isn't good at letting me know what it's doing. If I'm trying to figure out what it's processing, it's not telling me. When trying to debug a possible corrupted file, that's a real hassle. It'd be nice if it put something in qfiles that told me what fileset it was working on, just so I can whack at it if I need to.
All in all, it's been a, um, fun weekend. But I now have demime doing what it's supposed to be doing, and it is working a LOT better. And it explains (in retrospect) why, knowing the subscriber lists were dirty, I wasn't seeing very many bounces... (grumble. That should have been a hint. Hindsight is fun...)
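A marker file is one cheap way to get the "what is it working on" visibility asked for above. This is a hypothetical sketch, not an existing Mailman feature: the `working_on` helper and the `CURRENT` file name are invented for illustration. The marker is removed only when the entry finishes cleanly, so a crash or exception leaves it behind naming the suspect fileset.

```python
import contextlib
import os


@contextlib.contextmanager
def working_on(qdir, fileset, marker_name="CURRENT"):
    """Record which fileset is being processed in a marker file in qdir."""
    marker = os.path.join(qdir, marker_name)
    with open(marker, "w") as fp:
        fp.write(fileset + "\n")
    # If the body raises, control never reaches the remove() below, so the
    # marker stays on disk and still names the entry that was in progress.
    yield marker
    os.remove(marker)
```

A queue runner would wrap each entry in `with working_on(qdir, basename):` so that `cat qfiles/CURRENT` shows the fileset in progress.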
*now* it's stable... (I think)
-- Chuq Von Rospach - Plaidworks Consulting (mailto:chuqui@plaidworks.com) Apple Mail List Gnome (mailto:chuq@apple.com)
Be just, and fear not.
"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
CVR> first, it seems like qrunner re-stats the qfiles dir and
CVR> reloads its idea of what needs to be run.
It shouldn't. As soon as it enters main(), it does a listdir() of the qfiles dir. Once it's processed everything it sees in that listing, it exits. qrunner can exit sooner if a few kludgey resource management parameters are exceeded, but a single invocation of qrunner should never list the directory a second time.
It /could/ be that if you've just got tons of messages in the queue and Mailman has a hard time keeping up, files that are unlucky enough to always show up at the end of the directory listing will never get processed.
CVR> This creates a problem when you have lots of messages, since
CVR> it's not processing things FIFO -- I found that some older
CVR> messages were simply NEVER being run, because however qrunner
CVR> was choosing messages out of qfiles, it wasn't choosing them.
If you see files that are never getting run, and you don't think you're seeing the problem above, do a dumpdb of the corresponding .db file. If you see a `pipeline' entry, say with SMTPDirect in the pipeline, chances are you're getting errors in that delivery module and Mailman's keeping it on the queue. Check logs/smtp for details.
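The check described above can be approximated in a few lines. This is a sketch under a loud assumption: that the queue's .db metadata file is a dictionary serialized with Python's `marshal` module. If the on-disk format differs, use Mailman's own dumpdb tool instead. `pending_pipeline` is a hypothetical helper name.

```python
import marshal


def load_metadata(db_path):
    """Load a queue metadata file, ASSUMING it is a marshal-serialized dict."""
    with open(db_path, "rb") as fp:
        return marshal.load(fp)


def pending_pipeline(db_path):
    """Return the handlers still left to run for this entry, or None if the
    metadata has no `pipeline' key (i.e. no interrupted delivery recorded)."""
    return load_metadata(db_path).get("pipeline")
```

An entry whose remaining pipeline starts with SMTPDirect is the case described above: the delivery step failed and the message was kept on the queue for a retry.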
CVR> second, qrunner isn't good at letting me know what it's
CVR> doing.
A sin of much of the system currently. I hope I can revamp and improve the logging facility for 2.1.
-Barry
At 3:59 PM -0500 10/30/00, Barry A. Warsaw wrote:
It shouldn't. As soon as it enters main(), it does a listdir() of the qfiles dir. Once it's processed everything it sees in that listing, it exits. qrunner can exit sooner if a few kludgey resource management parameters are exceeded, but a single invocation of qrunner should never list the directory a second time.
It /could/ be that if you've just got tons of messages in the queue and Mailman has a hard time keeping up, files that are unlucky enough to always show up at the end of the directory listing will never get processed.
Okay, interesting. It sure seemed like stuff was hanging out from run to run, but the system was having a few issues at the time, so it could have been error-related as well.
If you see files that are never getting run, and you don't think you're seeing the problem above, do a dumpdb of the corresponding .db file. If you see a `pipeline' entry, say with SMTPDirect in the pipeline, chances are you're getting errors in that delivery module and Mailman's keeping it on the queue. Check logs/smtp for details.
Ah, that probably explains it. There were circumstances where the SMTP host started rejecting due to load, and those probably cascaded.
That seems to open up mailman to duplicate deliveries, FWIW. I think I ran into the same issue here that Chris Kolar did...
-- Chuq Von Rospach - Plaidworks Consulting (mailto:chuqui@plaidworks.com) Apple Mail List Gnome (mailto:chuq@apple.com)
Be just, and fear not.
"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
CVR> ah, that probably explains it. There were circumstances where
CVR> the SMTP host started rejecting due to load, and those
CVR> probably cascaded.
CVR> That seems to open up mailman to duplicate deliveries,
CVR> FWIW. I think I ran into the same issue here Chris Kolar
CVR> did...
Can you explain in more detail what "SMTP host started rejecting due to load" means? Do you mean the socket connect failed, or the SMTP server returned error codes, or something else?
I ask because I have a simple Python smtpd for testing, and if I can configure it to reproduce exactly the error conditions you're seeing with sendmail, I can try to debug the dups problem. I'd /really/ like to do that before 2.0 final goes out, since others are seeing this problem too (mostly sendmail users, it seems, though).
Thanks, -Barry
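For the second possibility raised above (the server accepts the connection but answers with an error code), a throwaway test server is easy to sketch with the standard library. This is not Barry's smtpd.py; `BusyHandler` and `start_busy_server` are hypothetical names, and whether sendmail actually answers this way under load is exactly the open question in the thread. The server greets every client with a 421 temporary-failure reply and hangs up.

```python
import socketserver
import threading


class BusyHandler(socketserver.StreamRequestHandler):
    """Greet every client with a 421 reply, then close the connection."""

    def handle(self):
        self.wfile.write(b"421 4.3.2 Service not available, try again later\r\n")


def start_busy_server(host="127.0.0.1", port=0):
    """Start the fake server in a background thread.

    port=0 lets the OS pick a free port; read the real one from
    server.server_address. Call shutdown() and server_close() when done."""
    server = socketserver.TCPServer((host, port), BusyHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Pointing a client at `server.server_address` exercises the "connected, then rejected" path, which is distinct from the "connection refused" path where nothing is listening at all.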
barry@wooz.org wrote:
"CVR" == Chuq Von Rospach <chuqui@plaidworks.com> writes:
CVR> ah, that probably explains it. There were circumstances where
CVR> the SMTP host started rejecting due to load, and those
CVR> probably cascaded.
CVR> That seems to open up mailman to duplicate deliveries,
CVR> FWIW. I think I ran into the same issue here Chris Kolar
CVR> did...
Can you explain in more detail what "SMTP host started rejecting due to load" means? Do you mean the socket connect failed, or the SMTP server returned error codes, or something else?
I ask because I have a simple Python smtpd for testing, and if I can configure it to reproduce exactly the error conditions you're seeing with sendmail, I can try to debug the dups problem. I'd /really/ like to do that before 2.0 final goes out, since others are seeing this problem too (mostly sendmail users, it seems, though).
Thanks, -Barry
I'm not sure if this is relevant or not, but I did report some time ago a problem in python's smtplib.py that leaked fd's, and brought about a bug in MailList.py (Mailman 1.1). As far as I'm aware, this bug was not fixed in Mailman's copy of smtplib.py, nor in the new version of Python. It lost file descriptors when sendmail quit accepting connections due to too high of a load. -Dan
-- Dan A. Dickey ddickey@wamnet.com
At 3:28 PM -0600 10/30/00, Dan A. Dickey wrote:
I'm not sure if this is relevant or not, but I did report some time ago a problem in python's smtplib.py that leaked fd's, and brought about a bug in MailList.py (Mailman 1.1). As far as I'm aware, this bug was not fixed in Mailman's copy of smtplib.py, nor in the new version of Python. It lost file descriptors when sendmail quit accepting connections due to too high of a load. -Dan
and when you run out of fd's, you get an error attempting to connect and exit. I'll bet that's it, Dan.
-- Chuq Von Rospach - Plaidworks Consulting (mailto:chuqui@plaidworks.com) Apple Mail List Gnome (mailto:chuq@apple.com)
Be just, and fear not.
Okay, let me try to reproduce this. I guess I don't even need my smtpd.py :)
@anthem[[~/projects/mailman:1073]]% telnet localhost 9999
Trying 127.0.0.1...
telnet: Unable to connect to remote host: Connection refused
-Barry
I'm still having trouble reproducing dups. I set SMTPPORT=9999 in mm_cfg.py, sent a bunch of messages into the system, and manually ran cron/qrunner a bunch of times. They all fail as expected (connection refused), and I see the log messages in smtp/post, exactly as I expect. The .db files look right -- they all have entries for `pipeline' which start with SMTPDirect.py.
I comment out the SMTPPORT, re-run qrunner and all the messages go through exactly once.
;(
Any other ideas? Do you see any other relevant messages in any of the other log files?
-Barry
Chuq Von Rospach wrote:
At 3:28 PM -0600 10/30/00, Dan A. Dickey wrote:
I'm not sure if this is relevant or not, but I did report some time ago a problem in python's smtplib.py that leaked fd's, and brought about a bug in MailList.py (Mailman 1.1). As far as I'm aware, this bug was not fixed in Mailman's copy of smtplib.py, nor in the new version of Python. It lost file descriptors when sendmail quit accepting connections due to too high of a load. -Dan
and when you run out of fd's, you get an error attempting to connect and exit. I'll bet that's it, Dan.
I won't bet on it, but I will go so far as to say it has a possibility.
barry@wooz.org wrote:
I'm still having trouble reproducing dups. I set SMTPPORT=9999 in mm_cfg.py, sent a bunch of messages into the system, and manually ran cron/qrunner a bunch of times. They all fail as expected (connection refused), and I see the log messages in smtp/post, exactly as I expect. The .db files look right -- they all have entries for `pipeline' which start with SMTPDirect.py.
I comment out the SMTPPORT, re-run qrunner and all the messages go through exactly once.
;(
Any other ideas? Do you see any other relevant messages in any of the other log files?
Running out of fd's is somewhat of a problem. It was a bit tricky to find, since once you are out of fd's you can't really open up a file to drop a log message into it.
Bleah. I was just looking around for my patches so I could attach them, and I'm sorry for a bit of misinformation - the problem is not in MailList.py; that was a different change I made to Mailman. The problem is indeed directly in smtplib.py. The patch I made to it to fix the fd leak problem is attached. If this fixes the problem, you win your bet, Chuq. :) -Dan
P.S. - Please keep in mind that this patch was against pythonlibs/smtplib.py from Mailman 1.1. I have yet to move up to 2.0 (waiting for it to become stable).
P.P.S. - Yes, this bug and patch need to get to the Python group. Sooner the better, I'd say.
-- Dan A. Dickey ddickey@wamnet.com
*** pythonlib/smtplib.py.orig Thu Dec 9 08:48:44 1999
--- pythonlib/smtplib.py Mon Apr 24 10:09:28 2000
*** 213,219 ****
          if not port: port = SMTP_PORT
          self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
          if self.debuglevel > 0: print 'connect:', (host, port)
!         self.sock.connect(host, port)
          (code,msg)=self.getreply()
          if self.debuglevel >0 : print "connect:", msg
          return (code,msg)
--- 213,224 ----
          if not port: port = SMTP_PORT
          self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
          if self.debuglevel > 0: print 'connect:', (host, port)
!         try:
!             self.sock.connect(host, port)
!         except:
!             if self.debuglevel > 0: print 'connect failed, raising sock.error'
!             self.close()
!             raise socket.error, "connect failed"
          (code,msg)=self.getreply()
          if self.debuglevel >0 : print "connect:", msg
          return (code,msg)
How about this patch instead? If it looks good to you, I'll add it to pythonlib/smtplib.py and upload it to the Python project's patch manager.
-Barry
-------------------- snip snip --------------------
Index: smtplib.py
===================================================================
RCS file: /cvsroot/python/python/dist/src/Lib/smtplib.py,v
retrieving revision 1.29
diff -u -r1.29 smtplib.py
--- smtplib.py 2000/09/01 06:40:07 1.29
+++ smtplib.py 2000/10/31 15:55:51
@@ -214,7 +214,11 @@
         if not port: port = SMTP_PORT
         self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
         if self.debuglevel > 0: print 'connect:', (host, port)
-        self.sock.connect((host, port))
+        try:
+            self.sock.connect((host, port))
+        except socket.error:
+            self.close()
+            raise
         (code,msg)=self.getreply()
         if self.debuglevel >0 : print "connect:", msg
         return (code,msg)
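The idea behind the patch above, stripped of the smtplib context, is simply "if connect() fails, close the socket before re-raising, so the descriptor is not leaked". A standalone sketch in current Python follows; `connect_or_close` is a hypothetical helper, not part of smtplib, and the timeout is an added assumption. In Python 3, `socket.error` is an alias of `OSError`, which is why the sketch catches the latter.

```python
import socket


def connect_or_close(host, port, timeout=5.0):
    """Return a connected TCP socket, or close it and re-raise on failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
    except OSError:
        # Without this close(), every failed attempt would leave one file
        # descriptor open until the object is garbage-collected.
        sock.close()
        raise
    return sock
```

A caller that retries in a loop against an overloaded or refusing server then uses at most one descriptor at a time instead of accumulating them.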
"DAD" == Dan A Dickey <ddickey@wamnet.com> writes:
>> How about this patch instead? If it looks good to you, I'll
>> add it to pythonlib/smtplib.py and upload it to the Python
>> project's patch manager.
DAD> Looks good to me. -Dan
Cool, done.
participants (3)
- barry@wooz.org
- Chuq Von Rospach
- Dan A. Dickey