[ mailman-Bugs-558988 ] bad performance for big queue dirs
Bugs item #558988, was opened at 2002-05-21 23:15
You can respond by visiting: http://sourceforge.net/tracker/?func=detail&atid=100103&aid=558988&group_id=103
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Norbert Bollow (bollow)
Assigned to: Nobody/Anonymous (nobody)
Summary: bad performance for big queue dirs
Initial Comment: Many filesystems (e.g. the popular ext2) have horrible performance when there are many files in the same directory. The queue system should be modified to avoid this situation. As a test case, try adding 20,000 test addresses in such a way that Mailman will try to send a welcome message to each of them.
Comment By: Barry Warsaw (bwarsaw) Date: 2002-05-22 22:08
Message: Logged In: YES user_id=12800
I think you have a valid complaint, but I'm loath to change something so fundamental this late in the game. I'll leave this bug report open because if we can recommend some other, more big-queue-friendly filesystem (reiserfs? I don't know) then we should document this.
Don't MTAs have the same problem? Do they all implement multiple subdirectories for queued messages?
Many filesystems (e.g. the popular ext2) have horrible performance when there are many files in the same directory. The queue system should be modified to avoid this situation. As a test case, try adding 20,000 test addresses in such a way that Mailman will try to send a welcome message to each of them.
Isn't this optimising for a rather uncommon case? Typically the qfiles directory holds a couple of minutes of transactions plus messages awaiting moderation. [Actually that comment is somewhat Mailman 2.0.x-centric, although I think it will hold for later versions.]
Don't MTAs have the same problem? Do they all implement multiple subdirectories for queued messages?
It is done in some MTAs (exim, for example, as an option). Frankly, for many cases the additional overhead of searching n directories outweighs the advantage of faster per-message access *unless* you typically run huge queues (in which case there are other advantages, like splitting the queue run).
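The split-queue idea can be pictured with a minimal sketch. It assumes a hypothetical layout in which each queue file lands in one of a fixed number of subdirectories chosen by hashing its file name. The function name `bucket_path` and the default bucket count are illustrative only and are not taken from Mailman or exim.

```python
import hashlib
from pathlib import Path


def bucket_path(queue_dir, filename, buckets=16):
    """Return the path for `filename` inside one of `buckets` subdirectories.

    Hypothetical layout (an assumption, not Mailman's or exim's): the
    subdirectory is derived from the first byte of the SHA-1 hash of the
    file name, so a given name always maps to the same bucket.
    """
    digest = hashlib.sha1(filename.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return Path(queue_dir) / f"{bucket:02x}" / filename
```

With 16 buckets, a queue of 20,000 files would hold roughly 1,250 files per directory, at the price of having to visit 16 directories on every queue run, which is the trade-off described above.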
Nigel.
-- [ Nigel Metheringham Nigel.Metheringham@InTechnology.co.uk ] [ - Comments in this message are my own and not ITO opinion/policy - ]
On 5/23/02 1:33 AM, "Nigel Metheringham" <Nigel.Metheringham@dev.InTechnology.co.uk> wrote:
Isn't this optimising for a rather uncommon case? Typically the qfiles directory holds a couple of minutes of transactions plus messages awaiting moderation.
Not on a large site, no.
But 2.1 changes the queueing enough that I'd put this on hold until we see if it's still a problem in 2.1. I certainly don't think it should be examined in 2.0.x, and we don't have any real info on whether 2.1 needs tweaking for really large/high-volume sites.
-- Chuq Von Rospach, Architech chuqui@plaidworks.com -- http://www.chuqui.com/
No! No! Dead girl, OFF the table! -- Shrek
Many filesystems (e.g. the popular ext2) have horrible performance when there are many files in the same directory. The queue system should be modified to avoid this situation. As a test case, try adding 20,000 test addresses in such a way that Mailman will try to send a welcome message to each of them.
Isn't this optimising for a rather uncommon case?
It may be an uncommon case, but if/when it happens, processing the queue becomes so slow that it's a real problem.
I've just taken some timings:
Adding 20,000 test addresses took about 40 minutes. (They're all addresses at a remote machine where the whole domain is forwarded to /dev/null.) I consider this to be acceptable.
Now the welcome messages are going out, at a rate of one message every 6.6 seconds. This is on inexpensive hardware (a Cobalt RaQ3) but the server is otherwise idle. At this rate, it would take more than 36 hours to send them all out. I consider this to be unacceptably slow (but it's not catastrophic unless it's normal for your server to put ten or more messages into the virgin queue per minute on average).
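The 36-hour figure follows directly from the two numbers quoted above. A minimal check, assuming the 6.6-second rate stays constant for the whole run (the constant names are illustrative):

```python
SECONDS_PER_MESSAGE = 6.6   # observed delivery rate quoted in the message
MESSAGES = 20_000           # welcome messages waiting in the virgin queue

# Total time to drain the queue if the per-message rate never changes.
total_seconds = SECONDS_PER_MESSAGE * MESSAGES
total_hours = total_seconds / 3600

print(f"{total_hours:.1f} hours to send all welcome messages")
```

This comes to about 36.7 hours, consistent with "more than 36 hours".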
What I'm really concerned about is the possibility of a similar thing happening on the bounces queue. On a busy system, you can easily have bounces coming in at a rate greater than 10 bounces per minute. Temporary network problems can easily result in you getting a large number of bounces at the same time. This could put so many bounces into the queue that processing bounces becomes slower than the rate at which new bounces arrive. Then the bounce-handler is permanently screwed.
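The concern can be illustrated with a deliberately simple model, assuming constant arrival and processing rates. The function `backlog_after` is hypothetical, and the 6.6-second figure is borrowed from the welcome-message timing above purely as an example, since no bounce-queue timing was measured.

```python
def backlog_after(minutes, arrivals_per_minute, seconds_per_item, initial=0):
    """Estimate queue length after `minutes` under constant rates.

    Simplifying assumption: the processing time per item does not change
    with queue size. If processing slows down as the queue grows, the
    real backlog would be worse than this estimate.
    """
    service_per_minute = 60.0 / seconds_per_item
    backlog = initial + (arrivals_per_minute - service_per_minute) * minutes
    return max(0.0, backlog)
```

At 6.6 seconds per item the queue drains about 9.1 items per minute, so any sustained arrival rate of 10 or more per minute makes the backlog grow without bound in this model.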
Greetings, Norbert.
-- A founder of the http://DotGNU.org project and Steering Committee member Norbert Bollow, Weidlistr.18, CH-8624 Gruet (near Zurich, Switzerland) Tel +41 1 972 20 59 Fax +41 1 972 20 69 http://norbert.ch List hosting with GNU Mailman on your own domain name http://cisto.com
"NB" == Norbert Bollow <nb@thinkcoach.com> writes:
NB> Now the welcome messages are going out, at a rate of one
NB> message every 6.6 seconds. This is on inexpensive hardware (a
NB> Cobalt RaQ3) but the server is otherwise idle. At this rate,
NB> it would take more than 36 hours to send them all out. I
NB> consider this to be unacceptably slow (but it's not
NB> catastrophic unless it's normal for your server to put ten or
NB> more messages into the virgin queue per minute on the
NB> average).
There are a lot of unknowns here though. For example, we don't know if your rate limits are caused by MTA or network throttles. If your server is idle then i/o is suspect but we don't know if creating more outgoing qrunner processes would help you (by splitting up the queue hash space among parallel processes). Is your MTA spooling to the same filesystem that Mailman is spooling off of? If your MTA is throttling and it's running synchronously with the outgoing qrunner then Mailman may just be sitting around blocked on output to your MTA. Or maybe your disk subsystem can't take the pressure of Mailman reading off of it while your MTA is writing to it. What happens if you put them on different disks and/or controllers? And what effect does using some other filesystem (e.g. reiserfs) have on throughput?
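"Splitting up the queue hash space among parallel processes" can be pictured as follows. This is a sketch under stated assumptions, not Mailman's actual code: it assumes each queue entry is identified by a SHA-1 hash of its name and that each runner owns one equal-sized range of that hash space. `owning_slice` and `files_for_runner` are hypothetical names.

```python
import hashlib

HASH_MAX = 2 ** 160  # number of possible SHA-1 digest values


def owning_slice(name, numslices):
    """Return which of `numslices` equal ranges of the hash space `name` falls in."""
    value = int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16)
    return value * numslices // HASH_MAX


def files_for_runner(names, slice_index, numslices):
    """Select the queue entries that the runner for `slice_index` would process."""
    return [n for n in names if owning_slice(n, numslices) == slice_index]
```

Each entry belongs to exactly one slice, so runners never compete for the same file. Note that in this sketch every runner still has to list the whole directory, so it divides the processing work but not the cost of scanning a large directory.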
NB> What I'm really concerned about is the possibility of a similar
NB> thing happening on the bounces queue. On a busy system, you
NB> can easily have bounces coming in with a rate greater than 10
NB> bounces per minute. Temporary network problems can easily
NB> result in you getting a large number of bounces at the same
NB> time. This could put so many bounces into the queue that
NB> processing bounces becomes slower than the rate at which new
NB> bounces arrive. Then the bounce-handler is permanently
NB> screwed.
I don't think we can correlate your outgoing queue draining behavior with any supposed behavior of the internal bounce queue draining.
-Barry
On 5/27/02 12:16 PM, "Barry A. Warsaw" <barry@zope.com> wrote:
NB> Now the welcome messages are going out, at a rate of one message every 6.6 seconds. This is on inexpensive hardware (a Cobalt RaQ3) but the server is otherwise idle.
There are a lot of unknowns here though. For example, we don't know if your rate limits are caused by MTA or network throttles.
If it's a Cobalt, it's probably sendmail. If it's sendmail, it's probably the DNS-on-accept delays that sendmail has (unless you turn on deliverymode=defer, which disables relay checks). So I'm willing to bet there's a lot of DNS delay here.
You can get around this by setting up a special incoming port attached only to localhost so that nobody else can use it, and then you can use the defer mode on that port safely, and speed things up hugely.
Then you need to look at disk I/O and whether your MTA is thrashing because of access contention on the mqueue directory(s). I'd put the chance of qfile issues really low on the list of things to check here.
-- Chuq Von Rospach, Architech chuqui@plaidworks.com -- http://www.chuqui.com/
IMHO: Jargon. Acronym for In My Humble Opinion. Used to flag as an opinion something that is clearly from context an opinion to everyone except the mentally dense. Opinions flagged by IMHO are actually rarely humble. IMHO. (source: third unabridged dictionary of chuqui-isms).
Chuq Von Rospach <chuqui@plaidworks.com> wrote:
NB> Now the welcome messages are going out, at a rate of one message every 6.6 seconds. This is on inexpensive hardware (a Cobalt RaQ3) but the server is otherwise idle.
There are a lot of unknowns here though. For example, we don't know if your rate limits are caused by MTA or network throttles.
Trust me, they aren't. Whenever Mailman would pass one of the messages to the MTA, it would go out almost instantaneously.
If it's a cobalt, it's probably sendmail.
It's qmail with a couple of patches, including big_concurrency, which isn't relevant for this test because the MTA gets one message per 6.6 seconds only.
Then you need to look at disk I/O and whether your MTA is thrashing because of access contention on the mqueue directory(s). I'd put the chance of qfile issues really low on the list of things to check here.
Well, I deleted those almost 40000 files from welcome messages (as I didn't have 36 hours of patience :-), added 20000 real subscribers (_without_sending_them_welcome_messages_) and it all works ok now.
This issue is certainly related to the size of Mailman's queues, in this case the virgin queue. If it's not a matter of filesystem (in)efficiency, then it must be Python or Mailman doing something that scales extremely badly with queue size.
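One way to start separating the two possibilities is to time a bare directory listing at the queue sizes in question, independent of Mailman. A minimal sketch, assuming a throwaway temporary directory on the filesystem under test; `time_listing` is a hypothetical helper, and the `.msg` suffix is only an example file name.

```python
import os
import tempfile
import time


def time_listing(num_files):
    """Create `num_files` empty files in a temp directory and time one os.listdir().

    Returns (number of entries seen, elapsed seconds for the listing).
    The temporary directory and its files are removed afterwards.
    """
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(num_files):
            open(os.path.join(tmp, f"{i:08d}.msg"), "w").close()
        start = time.perf_counter()
        entries = os.listdir(tmp)
        elapsed = time.perf_counter() - start
    return len(entries), elapsed
```

Running it for, say, 1,000, 10,000 and 40,000 files shows how listing time grows on a given filesystem. If a single listing is fast even at 40,000 entries, the slowdown is more likely in how often the directory is rescanned or in per-file work than in the filesystem itself.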
Greetings, Norbert.
-- A founder of the http://DotGNU.org project and Steering Committee member Norbert Bollow, Weidlistr.18, CH-8624 Gruet (near Zurich, Switzerland) Tel +41 1 972 20 59 Fax +41 1 972 20 69 http://norbert.ch List hosting with GNU Mailman on your own domain name http://cisto.com
On Mon, May 27, 2002 at 03:16:43PM -0400, Barry A. Warsaw wrote:
There are a lot of unknowns here though. For example, we don't know if your rate limits are caused by MTA or network throttles. If your server is idle then i/o is suspect but we don't know if creating more outgoing qrunner processes would help you (by splitting up the queue hash space among parallel processes). Is your MTA spooling to the same filesystem that Mailman is spooling off of? If your MTA is throttling and it's running synchronously with the outgoing qrunner then Mailman may just be sitting around blocked on output to your MTA. Or maybe your disk subsystem can't take the pressure of Mailman reading off of it while your MTA is writing to it. What happens if you put them on different disks and/or controllers? And what effect does using some other filesystem (e.g. reiserfs) have on throughput?
Hmmm... suggestion for future point releases: having some instrumentation that would make it easier to decide where the throttles *are* would be Really Useful. Mailman has lots of *configuration* knobs, but few *runtime* knobs, that I've seen, and almost *no* meters.
It's a Subsystem; it needs that stuff.
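As an illustration of what such a "meter" might look like, here is a minimal sketch of a per-stage timer. It is an assumption about one possible shape for the instrumentation, not an existing Mailman facility; `meter`, `totals`, and `counts` are hypothetical names.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)   # stage name -> accumulated seconds
counts = defaultdict(int)     # stage name -> number of timed calls


@contextmanager
def meter(stage):
    """Accumulate wall-clock time spent inside the `with` block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[stage] += time.perf_counter() - start
        counts[stage] += 1
```

Wrapping the directory scan, the message load, and the hand-off to the MTA in separate `with meter("...")` blocks would show which stage accounts for the seconds per message, which is the kind of evidence this thread is missing.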
Cheers, -- jra
Jay R. Ashworth jra@baylink.com Member of the Technical Staff Baylink RFC 2100 The Suncoast Freenet The Things I Think Tampa Bay, Florida http://baylink.pitas.com +1 727 647 1274
"If you don't have a dream; how're you gonna have a dream come true?" -- Captain Sensible, The Damned (from South Pacific's "Happy Talk")
"JRA" == Jay R Ashworth <jra@baylink.com> writes:
JRA> Hmmm... suggestion for future point releases: having some
JRA> instrumentation that would make it easier to decide where the
JRA> throttles *are* would be Really Useful. Mailman has lots of
JRA> *configuration* knobs, but few *runtime* knobs, that I've
JRA> seen, and almost *no* meters.
JRA> It's a Subsystem; it needs that stuff.
I couldn't agree more if you paid me to add it. :)
-Barry
participants (7)
- barry@zope.com
- Chuq Von Rospach
- Jay R. Ashworth
- Nigel Metheringham
- Norbert Bollow
- Norbert Bollow
- noreply@sourceforge.net