Options for increasing throughput
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
Hi, I am observing periods of qfiles/in backlogs in the 400-600 message count range that take 1-2 hours to clear with the standard Mailman 2.1.9 + SpamAssassin (the vette log shows these messages process in an average of ~10 seconds each).
Is there an easy way to parallelize what looks like a single serialized Mailman queue? I see some posts re: multi-slice but nothing definitive.
I would also like the option of working this into an overall load-balancing scheme where I have multiple SMTP nodes behind an F5 load balancer and the nodes share an NFS backend...
Many thanks for pointers to any HOWTO docs or wiki notes on this.
Fletcher.
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
Is SpamAssassin invoked from Mailman or from the MTA before Mailman? If this is plain Mailman, 10 seconds is a hugely long time to process a single post through IncomingRunner.
If you have some SpamAssassin interface like <http://sourceforge.net/tracker/index.php?func=detail&aid=640518&group_id=103&atid=300103> that calls spamd from a Mailman handler, you might consider moving SpamAssassin ahead of Mailman and using something like <http://sourceforge.net/tracker/index.php?func=detail&aid=840426&group_id=103&atid=300103> or just header_filter_rules instead.
See the section of Defaults.py headed with
```
#####
# Qrunner defaults
#####
```
In order to run multiple, parallel IncomingRunner processes, you can either copy the entire QRUNNERS definition from Defaults.py to mm_cfg.py and change
('IncomingRunner', 1), # posts from the outside world
to
('IncomingRunner', 4), # posts from the outside world
which says run 4 IncomingRunner processes, or you can just add something like
QRUNNERS[QRUNNERS.index(('IncomingRunner',1))] = ('IncomingRunner',4)
to mm_cfg.py. You can use any power of two for the number.
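For example, the one-line override approach looks like this in mm_cfg.py (a sketch only; 4 slices is just the value used above, and the count must be a power of two):

```python
# mm_cfg.py -- a stock install already begins with this import.
from Defaults import *

# Replace the single IncomingRunner entry with 4 parallel slices.
QRUNNERS[QRUNNERS.index(('IncomingRunner', 1))] = ('IncomingRunner', 4)
```

A bin/mailmanctl restart is needed before the extra runners start.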
The following search will return some information.
<http://www.google.com/search?q=site%3Amail.python.org++inurl%3Amailman++%22l...>
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
"The highway is for gamblers, better use your sense" - B. Dylan
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
Mark, many thanks for your (as always) very helpful response - I added the one-liner to mm_cfg.py to increase the IncomingRunner count to 16. Now I am observing (via memory trend graphs) an acceleration of what looks like a memory leak - maybe from Python, currently at 2.4.
I am compiling the latest 2.5.2 to see if that helps - for now the workaround is to restart Mailman occasionally.
(and yes, the SpamAssassin checks are the source of the 4-10 second delay - now those happen in parallel x16 - so no spikes in the backlog...)
Thanks again
On 6/20/08 9:01 AM, "Mark Sapiro" <mark@msapiro.net> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/7bdecdef03708b218939094eb05e8b35.jpg?s=120&d=mm&r=g)
On 6/23/08, Fletcher Cocquyt wrote:
(and yes the spamassassin checks are the source of the 4-10 second delay - now those happen in parallel x16 - so no spikes in the backlog...)
Search the FAQ for "performance". Do all such spam/virus/DNS/etc... checking up front, and run a second copy of your MTA with all these checks disabled. Have Mailman deliver to the second copy of the MTA, because you will have already done all this stuff on input.
There's no sense running the same message through SpamAssassin (or whatever) thousands of times, if you can do it once on input and then never again.
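If it helps, here is a minimal sketch of the Mailman side of that arrangement; the second MTA instance listening on localhost port 1025 is an assumption for illustration, and SMTPHOST/SMTPPORT are the stock Mailman 2.1 settings used by the SMTPDirect delivery module:

```python
# mm_cfg.py -- sketch only: deliver outgoing list mail to a second MTA
# listener (assumed here to be localhost:1025) that has all spam/virus/DNS
# checks disabled, since the message was already checked on the way in.
DELIVERY_MODULE = 'SMTPDirect'
SMTPHOST = 'localhost'
SMTPPORT = 1025
```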
-- Brad Knowles <brad@shub-internet.org> LinkedIn Profile: <http://tinyurl.com/y8kpxu>
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
An update - I've upgraded to the latest stable Python (2.5.2) and it's made no difference to the process growth.
Config: Solaris 10 x86, Python 2.5.2, Mailman 2.1.9 (8 Incoming queue runners - the leak rate increases with this), SpamAssassin 3.2.5
At this point I am looking for ways to isolate the suspected memory leak - I am looking at using dtrace: http://blogs.sun.com/sanjeevb/date/200506
Any other tips appreciated!
Initial (immediately after a /etc/init.d/mailman restart):
```
last pid: 10330;  load averages: 0.45, 0.19, 0.15   09:13:33
93 processes: 92 sleeping, 1 on cpu
CPU states: 98.6% idle, 0.4% user, 1.0% kernel, 0.0% iowait, 0.0% swap
Memory: 1640M real, 1160M free, 444M swap in use, 2779M swap free

  PID USERNAME LWP PRI NICE  SIZE   RES STATE  TIME   CPU COMMAND
10314 mailman    1  59    0 9612K 7132K sleep  0:00 0.35% python
10303 mailman    1  59    0 9604K 7080K sleep  0:00 0.15% python
10305 mailman    1  59    0 9596K 7056K sleep  0:00 0.14% python
10304 mailman    1  59    0 9572K 7036K sleep  0:00 0.14% python
10311 mailman    1  59    0 9572K 7016K sleep  0:00 0.13% python
10310 mailman    1  59    0 9572K 7016K sleep  0:00 0.13% python
10306 mailman    1  59    0 9556K 7020K sleep  0:00 0.14% python
10302 mailman    1  59    0 9548K 6940K sleep  0:00 0.13% python
10319 mailman    1  59    0 9516K 6884K sleep  0:00 0.15% python
10312 mailman    1  59    0 9508K 6860K sleep  0:00 0.12% python
10321 mailman    1  59    0 9500K 6852K sleep  0:00 0.14% python
10309 mailman    1  59    0 9500K 6852K sleep  0:00 0.13% python
10307 mailman    1  59    0 9500K 6852K sleep  0:00 0.13% python
10308 mailman    1  59    0 9500K 6852K sleep  0:00 0.12% python
10313 mailman    1  59    0 9500K 6852K sleep  0:00 0.12% python
```
After 8 hours:
```
last pid: 9878;  load averages: 0.14, 0.12, 0.13   09:12:18
97 processes: 96 sleeping, 1 on cpu
CPU states: 97.2% idle, 1.2% user, 1.6% kernel, 0.0% iowait, 0.0% swap
Memory: 1640M real, 179M free, 2121M swap in use, 1100M swap free

  PID USERNAME LWP PRI NICE  SIZE   RES STATE  TIME   CPU COMMAND
10123 mailman    1  59    0  314M  311M sleep  1:57 0.02% python
10131 mailman    1  59    0  310M  307M sleep  1:35 0.01% python
10124 mailman    1  59    0  309M   78M sleep  0:45 0.10% python
10134 mailman    1  59    0  307M   81M sleep  1:27 0.01% python
10125 mailman    1  59    0  307M   79M sleep  0:42 0.01% python
10133 mailman    1  59    0   44M   41M sleep  0:14 0.01% python
10122 mailman    1  59    0   34M   30M sleep  0:43 0.39% python
10127 mailman    1  59    0   31M   27M sleep  0:40 0.26% python
10130 mailman    1  59    0   30M   26M sleep  0:15 0.03% python
10129 mailman    1  59    0   28M   24M sleep  0:19 0.10% python
10126 mailman    1  59    0   28M   25M sleep  1:07 0.59% python
10132 mailman    1  59    0   27M   24M sleep  1:00 0.46% python
10128 mailman    1  59    0   27M   24M sleep  0:16 0.01% python
10151 mailman    1  59    0 9516K 3852K sleep  0:05 0.01% python
10150 mailman    1  59    0 9500K 3764K sleep  0:00 0.00% python
```
On 6/23/08 8:55 PM, "Fletcher Cocquyt" <fcocquyt@stanford.edu> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/994f3710d594babd3293d1bad2a1c09d.jpg?s=120&d=mm&r=g)
I'd start by installing 2.1.11, which was just released yesterday.
MB
e-mail: vidiot@vidiot.com - URL: http://vidiot.com/
[I've been to Earth. I know where it is. And I'm gonna take us there. - Starbuck 3/25/07]
(ASCII Ribbon Campaign Against HTML Email)
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
I'm having a hard time finding the release notes for 2.1.11 - can you please provide a link? (I want to see where it details any memory leak fixes since 2.1.9)
thanks
On 7/1/08 10:09 AM, "Vidiot" <brown@mrvideo.vidiot.com> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/994f3710d594babd3293d1bad2a1c09d.jpg?s=120&d=mm&r=g)
There should be one on the list.org website. If not, I do not know where it is. Should also be in the package.
MB
e-mail: vidiot@vidiot.com - URL: http://vidiot.com/
[I've been to Earth. I know where it is. And I'm gonna take us there. - Starbuck 3/25/07]
(ASCII Ribbon Campaign Against HTML Email)
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
Not finding a "leak" ref - save a irrelevant (for this runner issue) admindb one:
```
god@irt-smtp-02:mailman-2.1.11 1:26pm 58 # ls
ACKNOWLEDGMENTS  Mailman      README-I18N.en    STYLEGUIDE.txt  configure     doc              misc           templates
BUGS             Makefile.in  README.CONTRIB    TODO            configure.in  gnu-COPYING-GPL  mkinstalldirs  tests
FAQ              NEWS         README.NETSCAPE   UPGRADING       contrib       install-sh       scripts
INSTALL          README       README.USERAGENT  bin             cron          messages         src
god@irt-smtp-02:mailman-2.1.11 1:26pm 59 # egrep -i leak *
NEWS:        (Tokio Kikuchi's i18n patches), 862906 (unicode prefix leak in admindb),
```
Thanks
On 7/1/08 1:05 PM, "Vidiot" <brown@mrvideo.vidiot.com> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
Not finding a "leak" ref - save a irrelevant (for this runner issue) admindb
Nothing has been done in Mailman to fix any memory leaks. As far as I know, nothing has been done to create any either.
If there is a leak, it is most likely in the underlying Python and not a Mailman issue per se.
I am curious. You say this problem was exacerbated when you went from one IncomingRunner to eight (sliced) IncomingRunners. The IncomingRunner instances themselves should be processing fewer messages each, and I would expect them to leak less. The other runners are doing the same as before so I would expect them to be the same unless by solving your 'in' queue backlog, you're just handling a whole lot more messages.
Also, in an 8 hour period, I would expect that RetryRunner and CommandRunner and, unless you are doing a lot of mail -> news gatewaying, NewsRunner to have done virtually nothing.
In this snapshot
```
  PID USERNAME LWP PRI NICE  SIZE   RES STATE  TIME   CPU COMMAND
10123 mailman    1  59    0  314M  311M sleep  1:57 0.02% python
10131 mailman    1  59    0  310M  307M sleep  1:35 0.01% python
10124 mailman    1  59    0  309M   78M sleep  0:45 0.10% python
10134 mailman    1  59    0  307M   81M sleep  1:27 0.01% python
10125 mailman    1  59    0  307M   79M sleep  0:42 0.01% python
10133 mailman    1  59    0   44M   41M sleep  0:14 0.01% python
10122 mailman    1  59    0   34M   30M sleep  0:43 0.39% python
10127 mailman    1  59    0   31M   27M sleep  0:40 0.26% python
10130 mailman    1  59    0   30M   26M sleep  0:15 0.03% python
10129 mailman    1  59    0   28M   24M sleep  0:19 0.10% python
10126 mailman    1  59    0   28M   25M sleep  1:07 0.59% python
10132 mailman    1  59    0   27M   24M sleep  1:00 0.46% python
10128 mailman    1  59    0   27M   24M sleep  0:16 0.01% python
10151 mailman    1  59    0 9516K 3852K sleep  0:05 0.01% python
10150 mailman    1  59    0 9500K 3764K sleep  0:00 0.00% python
```
Which processes correspond to which runners? And why are the two processes that have apparently done the least the ones that have grown the most?
In fact, why are none of these 15 PIDs the same as the ones from 8 hours earlier, or was that snapshot actually from after the above were restarted?
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
"The highway is for gamblers, better use your sense" - B. Dylan
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
On 7/1/08 3:37 PM, "Mark Sapiro" <mark@msapiro.net> wrote:
OK, thanks for confirming that - I will not prioritize a Mailman 2.1.9 -> 2.1.11 upgrade.
If there is a leak, it is most likely in the underlying Python and not a Mailman issue per se.
Agreed - hence my first priority was to upgrade from Python 2.4.x to 2.5.2 (the latest on python.org) - but upgrading did not help this.
Here is the current leaked state since the cron 13:27 restart only 3 hours ago:
```
last pid: 20867;  load averages: 0.53, 0.47, 0.24   16:04:15
91 processes: 90 sleeping, 1 on cpu
CPU states: 99.1% idle, 0.3% user, 0.6% kernel, 0.0% iowait, 0.0% swap
Memory: 1640M real, 77M free, 1509M swap in use, 1699M swap free

  PID USERNAME LWP PRI NICE  SIZE   RES STATE  TIME   CPU COMMAND
24167 mailman    1  59    0  311M  309M sleep  0:28 0.02% python
24158 mailman    1  59    0  308M  305M sleep  0:30 0.01% python
24169 mailman    1  59    0  303M  301M sleep  0:28 0.01% python
24165 mailman    1  59    0   29M   27M sleep  0:09 0.03% python
24161 mailman    1  59    0   29M   27M sleep  0:12 0.07% python
24164 mailman    1  59    0   28M   26M sleep  0:07 0.01% python
24172 mailman    1  59    0   26M   24M sleep  0:04 0.01% python
24160 mailman    1  59    0   26M   24M sleep  0:08 0.01% python
24162 mailman    1  59    0   26M   23M sleep  0:10 0.01% python
24166 mailman    1  59    0   26M   23M sleep  0:04 0.01% python
24171 mailman    1  59    0   25M   23M sleep  0:04 0.02% python
24163 mailman    1  59    0   24M   22M sleep  0:04 0.01% python
24168 mailman    1  59    0   19M   17M sleep  0:03 0.02% python
24170 mailman    1  59    0 9516K 6884K sleep  0:01 0.01% python
24159 mailman    1  59    0 9500K 6852K sleep  0:00 0.00% python
```
And the mapping to the runners:
```
god@irt-smtp-02:mailman-2.1.11 4:16pm 66 # /usr/ucb/ps auxw | egrep mailman | awk '{print $2 " " $11}'
24167 --runner=IncomingRunner:5:8
24165 --runner=BounceRunner:0:1
24158 --runner=IncomingRunner:7:8
24162 --runner=VirginRunner:0:1
24163 --runner=IncomingRunner:1:8
24166 --runner=IncomingRunner:0:8
24168 --runner=IncomingRunner:4:8
24169 --runner=IncomingRunner:2:8
24171 --runner=IncomingRunner:6:8
24172 --runner=IncomingRunner:3:8
24160 --runner=CommandRunner:0:1
24161 --runner=OutgoingRunner:0:1
24164 --runner=ArchRunner:0:1
24170 /bin/python
24159 /bin/python
```
Thanks for the analysis, Fletcher
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
What are these last 2? Presumably they are the missing NewsRunner and RetryRunner, but what is the extra stuff in the ps output causing $11 to be the python command and not the runner option? And again, why are these two, which presumably have done nothing, seemingly the biggest?
Here are some additional thoughts.
Are you sure there is an actual leak? Do you know that if you just let them run, they don't reach some stable size and remain there, as opposed to growing so large that they eventually throw a MemoryError exception and get restarted by mailmanctl?
If you allowed them to do that once, the MemoryError traceback might provide a clue.
Caveat! I know very little about Python's memory management. Some of what follows may be wrong.
Here's what I think - Python allocates more memory (from the OS) as needed to import additional modules and create new objects. Imports don't go away, but objects that are destroyed or become unreachable (eg a file object that is closed or a message object whose only reference gets assigned to something else) become candidates for garbage collection and ultimately the memory allocated to them is collected and reused (assuming no leaks). I *think* however, that no memory is ever actually freed back to the OS. Thus, Python processes that run for a long time can grow, but don't shrink.
Now, IncomingRunner in particular can get very large if large messages are arriving, even if those messages are ultimately not processed very far. Incoming runner reads the entire message into memory and then parses it into a message object which is even bigger than the message string. So, if someone happens to send a 100MB attachment to a list, IncomingRunner is going to need over 200MB before it ever looks at the message itself. This memory will later become available for other use within that IncomingRunner instance, but I don't think it is ever freed back to the OS.
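A quick way to see that behaviour outside of Mailman is a throwaway script like the one below (an illustration only, not Mailman code; run it under the same Python build and watch the process with pmap from another shell):

```python
# watch_heap.py -- allocate a large string the way IncomingRunner does when it
# reads a big message, free it, and observe that the CPython 2.x heap
# typically does not shrink back to its earlier size.
import gc
import os

print 'pid:', os.getpid()
raw_input('run `pmap <pid> | grep heap`, then press Enter to allocate...')

blob = 'x' * (100 * 1024 * 1024)      # roughly a 100MB "message body"
raw_input('heap has grown; press Enter to free the string...')

del blob
gc.collect()
raw_input('string freed inside Python; check pmap again, then press Enter to exit')
```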
Also, I see very little memory change between the 3 hour old snapshot above and the 8 hour old one from your prior post. If this is really a memory leak, I'd expect the 8 hour old ones to be perhaps twice as big as the 3 hour old ones.
Also, do you have any really big lists with big config.pck files? If so, Runners will grow as they instantiate that (those) big list(s).
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
"The highway is for gamblers, better use your sense" - B. Dylan
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Mark Sapiro wrote:
and
Doh? I finally noticed these are in K and the others are in M so that question is answered at least - the two that haven't done anything actually haven't grown.
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
"The highway is for gamblers, better use your sense" - B. Dylan
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
pmap shows it's the heap:
```
god@irt-smtp-02:in 8:08pm 64 # pmap 24167
24167: /bin/python /opt/mailman-2.1.9/bin/qrunner --runner=IncomingRunner:5:8
08038000      64K rwx--  [ stack ]
08050000     940K r-x--  /usr/local/stow/Python-2.5.2/bin/python
0814A000     172K rwx--  /usr/local/stow/Python-2.5.2/bin/python
08175000  312388K rwx--  [ heap ]
CF210000      64K rwx--  [ anon ]
<-- many small libs -->
 total    318300K
```
Whether it's a leak or not - we need to understand why the heap is growing and put a limit on its growth to avoid exhausting memory and swapping into oblivion...
None of the lists seem too big:
```
god@irt-smtp-02:lists 8:24pm 73 # du -sk */*pck | sort -nr | head | awk '{print $1}'
1392
1240
1152
1096
912
720
464
168
136
112
```
Researching Python heap allocation...
thanks
On 7/1/08 6:14 PM, "Mark Sapiro" <mark@msapiro.net> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
At this point, I don't think it's a leak.
Your runners start out at about 9.5 MB. Most of your working runners grow to about the 20-40 MB range which I don't think is unusual for a site with some config.pck files approaching 1.4 MB.
Only your IncomingRunners seem to grow really big, and I think that is because you are seeing occasional, very large messages, or perhaps it has something to do with your custom spam filtering interface.
Does your MTA limit incoming message size?
In any case, I know you're reluctant to just let it run, but I think if you did let it run for a couple of days that the IncomingRunners wouldn't get any bigger than the 310 +- MB that you're already seeing after 3 hours, and the rest of the runners would remain in the 10 - 50 MB range.
I don't think you'll see a lot of paging activity of that 300+MB because I suspect that most of the time nothing is going on in most of that memory
You may also be interested in the FAQ article at <http://wiki.list.org/x/94A9>.
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
"The highway is for gamblers, better use your sense" - B. Dylan
![](https://secure.gravatar.com/avatar/7bdecdef03708b218939094eb05e8b35.jpg?s=120&d=mm&r=g)
On 7/1/08, Mark Sapiro wrote:
The thing I've discovered when doing detailed memory/performance analysis of Mailman queue runners in the past is that, by far, the vast amount of memory that is in use is actually shared across all the processes, so in this case you'd only take that ~310MB hit once.
Some OSes make it more clear than others that this memory is being shared, and conversely some OSes appear to count this shared memory as actually belonging to multiple separate processes and end up vastly overstating the amount of real memory that is being allocated.
You may also be interested in the FAQ article at <http://wiki.list.org/x/94A9>.
It would be interesting to have all those same commands run for the system in question, to compare with the numbers in the FAQ.
-- Brad Knowles <brad@shub-internet.org> LinkedIn Profile: <http://tinyurl.com/y8kpxu>
![](https://secure.gravatar.com/avatar/7bdecdef03708b218939094eb05e8b35.jpg?s=120&d=mm&r=g)
On 7/1/08, Fletcher Cocquyt wrote:
And when I do the same thing on the mail server for python.org (which hosts over 100 lists, including some pretty active lists with large numbers of subscribers), on the largest queue runner we have (ArchRunner at 41m), I see:
```
# pmap 1040 | sort -nr -k 2 | head
 total     45800K
0815f000   23244K rwx--  [ anon ]
40f61000    4420K rw---  [ anon ]
40a0f000    2340K rw---  [ anon ]
408aa000    1300K rw---  [ anon ]
40745000    1300K rw---  [ anon ]
40343000    1160K r-x--  /usr/lib/i686/cmov/libcrypto.so.0.9.8
4009c000    1092K r-x--  /lib/libc-2.3.6.so
41844000    1040K rw---  [ anon ]
08048000     944K r-x--  /usr/local/bin/python
```
No heap showing up anywhere. Doing the same for our IncomingRunner, I get:
```
# pmap 1043 | sort -nr -k 2 | head
 total     23144K
0815f000    7740K rwx--  [ anon ]
40b12000    1560K rw---  [ anon ]
40745000    1300K rw---  [ anon ]
40cb8000    1168K rw---  [ anon ]
40347000    1160K r-x--  /usr/lib/i686/cmov/libcrypto.so.0.9.8
4009c000    1092K r-x--  /lib/libc-2.3.6.so
4098d000    1040K rw---  [ anon ]
08048000     944K r-x--  /usr/local/bin/python
4063b000     936K rw---  [ anon ]
```
Again, no heap.
Where did you do this? In the /usr/local/mailman directory?
When I did this in /usr/local/mailman, all of the .pck files that showed up were actually held messages in the data/ directory, not in lists/. This would mean that they were individual messages that had been pickled and then held for moderation, not pickles for lists.
Doing the same in /usr/local/mailman/lists, I find that one of our smaller mailing lists (python-help, seventeen recipients) has the largest list pickle (1044 kilobytes). We have a total of 150 lists, and here's the current subscription count of the five biggest lists:
4075 Python-list
3305 Tutor
2600 Mailman-Users
2329 Mailman-announce
1528 Python-announce-list
Of these, python-list and tutor frequently get between twenty and a hundred or more messages in a day. However, here are their respective config.pck files, using the same "du -sk" script from above:
```
904  tutor/config.pck
652  python-list/config.pck
476  mailman-users/config.pck
324  mailman-announce/config.pck
208  python-announce-list/config.pck
```
-- Brad Knowles <brad@shub-internet.org> LinkedIn Profile: <http://tinyurl.com/y8kpxu>
![](https://secure.gravatar.com/avatar/7bdecdef03708b218939094eb05e8b35.jpg?s=120&d=mm&r=g)
On 7/1/08, Fletcher Cocquyt wrote:
BTW, in case it hasn't come through yet -- I am very sensitive to your issues. In my "real" life, I am currently employed as a Sr. System Administrator at the University of Texas at Austin, with about 50,000 students and 20,000 faculty and staff, and one of my jobs is helping out with both the mail system administration and the mailing list system administration.
So, just because I post messages quoting the current statistics we're seeing on python.org, that doesn't mean I'm not sensitive to the problems you're seeing. All I'm saying is that we're not currently seeing them on python.org, so it may be a bit more difficult for us to directly answer your questions, although we'll certainly do everything we can to help.
-- Brad Knowles <brad@shub-internet.org> LinkedIn Profile: <http://tinyurl.com/y8kpxu>
![](https://secure.gravatar.com/avatar/7bdecdef03708b218939094eb05e8b35.jpg?s=120&d=mm&r=g)
On 7/1/08, Mark Sapiro wrote:
In contrast, the mail server for python.org shows the following:
```
top - 06:54:48 up 29 days, 9:09, 4 users, load average: 1.05, 1.08, 0.95
Tasks: 151 total, 1 running, 149 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.2% user, 1.1% system, 0.0% nice, 98.7% idle

  PID USER    PR  VIRT NI  RES  SHR S %CPU     TIME+ %MEM COMMAND
 1040 mailman  9 42960  0  41m  12m S    0 693:59.44  2.1 ArchRunner:0:1 -s
 1041 mailman  9 22876  0  20m 7488 S    0 478:18.62  1.0 BounceRunner:0:1
 1045 mailman  9 20412  0  19m  10m S    0   3031:12  0.9 OutgoingRunner:0:
 1043 mailman  9 20476  0  18m 4968 S    0 127:02.62  0.9 IncomingRunner:0:
 1042 mailman  9 18564  0  17m 7316 S    0  11:34.14  0.9 CommandRunner:0:1
 1046 mailman 11 17276  0  15m  10m S    1  66:32.16  0.8 VirginRunner:0:1
 1044 mailman  9 11568  0 9964 5184 S    0  12:34.04  0.5 NewsRunner:0:1 -s
```
And those are the only Python-related processes that show up in the first twenty lines.
-- Brad Knowles <brad@shub-internet.org> LinkedIn Profile: <http://tinyurl.com/y8kpxu>
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
I did a test - I disabled the SpamAssassin integration and watched the heap grow steadily - I do not believe it's SA-related:
```
god@irt-smtp-02:mailman-2.1.9 10:51pm 68 # pmap 22804 | egrep heap
08175000   14060K rwx--  [ heap ]
god@irt-smtp-02:mailman-2.1.9 10:51pm 69 # pmap 22804 | egrep heap
08175000   16620K rwx--  [ heap ]
god@irt-smtp-02:mailman-2.1.9 10:52pm 70 # pmap 22804 | egrep heap
08175000   16620K rwx--  [ heap ]
god@irt-smtp-02:mailman-2.1.9 10:53pm 75 # pmap 22804 | egrep heap
08175000   18924K rwx--  [ heap ]
god@irt-smtp-02:mailman-2.1.9 10:54pm 81 # pmap 22804 | egrep heap
08175000   19692K rwx--  [ heap ]
god@irt-smtp-02:mailman-2.1.9 10:55pm 82 # pmap 22804 | egrep heap
08175000   19692K rwx--  [ heap ]
```
Trying to find a way to look at the contents of the heap, or at least limit its growth. Or is there not a way to expire & restart Mailman processes, analogous to the Apache httpd process expiration (designed to mitigate this kind of resource growth over time)?
thanks
On 7/1/08 9:58 PM, "Brad Knowles" <brad@shub-internet.org> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
I did a test - I disabled the SpamAssassin integration and watched the heap grow steadily - I do not believe its SA related:
OK.
Does your MTA limit the size of incoming messages? Can it?
At some point in the next day or so, I'm going to make a modified scripts/post script which will queue incoming messages in qfiles/bad and then move them to qfiles/in only if they are under a certain size. I'm really curious to see if that will help.
bin/mailmanctl could be modified to do this automatically, but currently it only does it on command (restart) or signal (SIGINT). I gather you're already running a cron job that does a periodic restart.
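For reference, that kind of periodic restart is usually nothing more than a crontab entry along these lines (the schedule is arbitrary and the install path is taken from the pmap output earlier in the thread; adjust both):

```
# hypothetical cron entry: restart the qrunners every night at 03:27
27 3 * * * /opt/mailman-2.1.9/bin/mailmanctl restart
```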
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
"The highway is for gamblers, better use your sense" - B. Dylan
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
On 7/2/08 8:15 AM, "Mark Sapiro" <mark@msapiro.net> wrote:
Yes, having a global incoming max message size limit and handler (what will the sender receive back?) for Mailman would be useful.
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
The attached 'post' file is a modified version of scripts/post.
It does the following compared to the normal script.
The normal script reads the message from the pipe from the MTA and queues it in the 'in' queue for processing by an IncomingRunner. This script receives the message and instead queues it in the 'bad' queue. It then looks at the size of the 'bad' queue entry (a Python pickle that will be just slightly larger than the message text). If the size is less than MAXSIZE bytes (a parameter near the beginning of the script, currently set to 1000000, but which you can change as you desire), it moves the queue entry from the 'bad' queue to the 'in' queue for processing.
The end result is queue entries smaller than MAXSIZE will be processed normally, and entries >= MAXSIZE will be left in the 'bad' queue for manual examination (with bin/dumpdb or bin/show_qfiles) and either manual deletion or manual moving to the 'in' queue for processing.
The delivery is accepted by the MTA in either case so the poster sees nothing unusual.
This is not intended to be used in a normal production environment. It is only intended as a debug aid to see if IncomingRunners will not grow so large if incoming message size is limited.
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
"The highway is for gamblers, better use your sense" - B. Dylan
![](https://secure.gravatar.com/avatar/173371753ea2206b9934a9be1bdce423.jpg?s=120&d=mm&r=g)
On Jul 2, 2008, at 11:01 PM, Mark Sapiro wrote:
The attached 'post' file is a modified version of scripts/post.
Hi Mark, there was no attachment.
I'm not sure 'bad' should be used. Perhaps a separate queue called
'raw'? It is nice that files > MAXSIZE need only be left in 'bad'.
-Barry
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Barry Warsaw wrote:
| On Jul 2, 2008, at 11:01 PM, Mark Sapiro wrote:
|
|> The attached 'post' file is a modified version of scripts/post.
|
| Hi Mark, there was no attachment.
Yes, I know. I was just about to resend. It is attached here. The MUA I used to send the previous message gives any attachment without an extension Content-Type: application/octet-stream, so the list's content filtering removed it.
|> It does the following compared to the normal script.
|
|> The normal script reads the message from the pipe from the MTA and
|> queues it in the 'in' queue for processing by an IncomingRunner. This
|> script receives the message and instead queues it in the 'bad' queue.
|> It then looks at the size of the 'bad' queue entry (a Python pickle
|> that will be just slightly larger than the message text). If the size
|> is less than MAXSIZE bytes (a parameter near the beginning of the
|> script, currently set to 1000000, but which you can change as you
|> desire), it moves the queue entry from the 'bad' queue to the 'in'
|> queue for processing.
|
| I'm not sure 'bad' should be used. Perhaps a separate queue called
| 'raw'? It is nice that files > MAXSIZE need only be left in 'bad'.
If we're going to do something like this going forward, we can certainly change the queue. For this 'debug' effort, I wanted to keep it simple and use an existing mm_cfg queue name.
Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
"The highway is for gamblers, better use your sense" - B. Dylan
```python
# -*- python -*-
#
# Copyright (C) 1998,1999,2000,2001,2002 by the Free Software Foundation, Inc.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

"""Accept posts to a list and handle them properly.

The main advertised address for a list should be filtered to this program,
through the mail wrapper.  E.g. for list `test@yourdomain.com', the `test'
alias would deliver to this script.

Stdin is the mail message, and argv[1] is the name of the target mailing list.
"""

import os
import sys

import paths
from Mailman import mm_cfg
from Mailman import Utils
from Mailman.i18n import _
from Mailman.Queue.sbcache import get_switchboard
from Mailman.Logging.Utils import LogStdErr

LogStdErr("error", "post")

MAXSIZE = 1000000


def main():
    # TBD: If you've configured your list or aliases so poorly as to get
    # either of these first two errors, there's little that can be done to
    # save your messages.  They will be lost.  Minimal testing of new lists
    # should avoid either of these problems.
    try:
        listname = sys.argv[1]
    except IndexError:
        print >> sys.stderr, _('post script got no listname.')
        sys.exit(1)
    # Make sure the list exists
    if not Utils.list_exists(listname):
        print >> sys.stderr, _('post script, list not found: %(listname)s')
        sys.exit(1)
    # Immediately queue the message for the incoming qrunner to process.  The
    # advantage to this approach is that messages should never get lost --
    # some MTAs have a hard limit to the time a filter prog can run.  Postfix
    # is a good example; if the limit is hit, the proc is SIGKILL'd giving us
    # no chance to save the message.
    bdq = get_switchboard(mm_cfg.BADQUEUE_DIR)
    filebase = bdq.enqueue(sys.stdin.read(),
                           listname=listname, tolist=1, _plaintext=1)
    frompath= os.path.join(mm_cfg.BADQUEUE_DIR, filebase + '.pck')
    topath= os.path.join(mm_cfg.INQUEUE_DIR, filebase + '.pck')
    if os.stat(frompath).st_size < MAXSIZE:
        os.rename(frompath,topath)


if __name__ == '__main__':
    main()
```
![](https://secure.gravatar.com/avatar/173371753ea2206b9934a9be1bdce423.jpg?s=120&d=mm&r=g)
On Jul 2, 2008, at 11:15 PM, Mark Sapiro wrote:
Ah, np.
Excellent point. A couple of very minor comments on the file, but
other than that, it looks great. (I know you copied this from the
original file, but still I can't resist. ;)
# Copyright (C) 1998,1999,2000,2001,2002 by the Free Software
Foundation, Inc.
1998-2008
Should probably use True there instead of 1.
frompath= os.path.join(mm_cfg.BADQUEUE_DIR, filebase + '.pck')
topath= os.path.join(mm_cfg.INQUEUE_DIR, filebase + '.pck')

Space in front of the =

if os.stat(frompath).st_size < MAXSIZE:
    os.rename(frompath,topath)

Space after the comma.

if __name__ == '__main__':
    main()
-Barry
![](https://secure.gravatar.com/avatar/173371753ea2206b9934a9be1bdce423.jpg?s=120&d=mm&r=g)
On Jul 2, 2008, at 11:15 AM, Mark Sapiro wrote:
This should be moved to mailman-developers, but in general it's an
interesting idea. In MM3 I've split the incoming queue into two
separate queues. The incoming queue now solely determines the
disposition of the message, i.e. held, rejected, discarded or
accepted. If accepted, the message is moved to a pipeline queue where
it's munged for delivery (i.e. headers and footers added, etc.).
MM3 also has an LMTP queue runner, which I'd like to make the default
delivery mechanism for 3.0 and possibly 2.2 (yes, I still have a todo
to back port MM3's new process architecture to 2.2). Although it's
not there right now, it would be trivial to add a check on the raw
size of the message before it's parsed. If it's too large then it can
be rejected before the email package attempts to parse it, and that
would give the upstream LMTP client (i.e. your MTA) a better diagnostic.
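A rough sketch of that check (the function name and the limit are assumptions for illustration; the real MM3 LMTP runner may do it differently):

```python
# Reject on raw size *before* handing the bytes to the email parser.
from email import message_from_string

MAX_RAW_SIZE = 10 * 1024 * 1024        # hypothetical per-message ceiling


def handle_raw_message(raw):
    if len(raw) >= MAX_RAW_SIZE:
        # Cheap length check on the unparsed string; at this point the LMTP
        # client (the MTA) could be handed a 5xx diagnostic instead.
        return False
    msg = message_from_string(raw)     # only parse messages under the limit
    return msg is not None
```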
It still makes sense to put a size limit in your MTA so it never hits
the LMTP server because the string will still be in the Python
process's memory. But at least you won't pay the penalty for parsing
such a huge message just to reject it later.
This is a good idea. It might be better to do this in
Runner._doperiodic().
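A very rough sketch of what such a check could look like (purely illustrative: the ceiling, the use of resource.getrusage(), and the assumption that the master watcher simply restarts a runner that exits are all simplifications):

```python
# Hypothetical periodic check inside a runner: exit cleanly once the process
# has grown past a ceiling, and let the master start a fresh one.
import resource
import sys

MAX_RSS_KB = 300 * 1024                # assumed ceiling, roughly 300MB


def _doperiodic(self):
    # ...the existing periodic housekeeping would run here...
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # Note: ru_maxrss units vary by platform (KB on many systems, 0 on some).
    if usage.ru_maxrss > MAX_RSS_KB:
        sys.exit(0)                    # exit between messages rather than mid-delivery
```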
-Barry
![](https://secure.gravatar.com/avatar/7bdecdef03708b218939094eb05e8b35.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
You can do "mailmanctl restart", but that's not really a proper solution to this problem.
-- Brad Knowles <brad@shub-internet.org> LinkedIn Profile: <http://tinyurl.com/y8kpxu>
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
I am hopeful our esteemed code maintainers are thinking the built-in restart idea is a good one:
BW wrote:
This is a good idea. It might be better to do this in Runner._doperiodic().
On 7/2/08 9:22 AM, "Brad Knowles" <brad@shub-internet.org> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/173371753ea2206b9934a9be1bdce423.jpg?s=120&d=mm&r=g)
On Jul 2, 2008, at 1:12 PM, Fletcher Cocquyt wrote:
Optionally, yes. By default, I'm not so sure.
-Barry
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
I had a parallel thread on the dtrace list to get memleak.d running
http://blogs.sun.com/sanjeevb/date/200506
- I just got this stack trace from a 10-second sample of the most actively growing Python (Mailman) process - the output is explained by Sanjeev on his blog, but I'm hoping the stack trace will point the analysis towards a cause for why my Mailman processes are growing abnormally.
I will see if the findleaks.pl analysis of this output returns anything
Thanks!
```
0 42246 realloc:return Ptr=0x824c268 Oldptr=0x0 Size=16
              libc.so.1`realloc+0x33a
              python`addcleanup+0x45
              python`convertsimple+0x145d
              python`vgetargs1+0x259
              python`_PyArg_ParseTuple_SizeT+0x1d
              python`posix_listdir+0x55
              python`PyEval_EvalFrameEx+0x59ff
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalFrameEx+0x49ff
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalCode+0x22
              python`PyRun_FileExFlags+0xaf
              python`PyRun_SimpleFileExFlags+0x156
              python`Py_Main+0xa6b
              python`main+0x17
              python`_start+0x80

0 42249 free:entry Ptr=0x824c268

0 42244 lmalloc:return Ptr=0xcf890300 Size=16
              libc.so.1`lmalloc+0x143
              libc.so.1`opendir+0x3e
              python`posix_listdir+0x6d
              python`PyEval_EvalFrameEx+0x59ff
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalFrameEx+0x49ff
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalCode+0x22
              python`PyRun_FileExFlags+0xaf
              python`PyRun_SimpleFileExFlags+0x156
              python`Py_Main+0xa6b
              python`main+0x17
              python`_start+0x80

0 42244 lmalloc:return Ptr=0xcf894000 Size=8192
              libc.so.1`lmalloc+0x143
              libc.so.1`opendir+0x3e
              python`posix_listdir+0x6d
              python`PyEval_EvalFrameEx+0x59ff
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalFrameEx+0x49ff
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalCode+0x22
              python`PyRun_FileExFlags+0xaf
              python`PyRun_SimpleFileExFlags+0x156
              python`Py_Main+0xa6b
              python`main+0x17
              python`_start+0x80

0 42249 free:entry Ptr=0x86d78f0
^C
```
On 7/2/08 10:14 AM, "Barry Warsaw" <barry@list.org> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
Below is the findleaks output from a ~5-minute sample of a Python runner - I will take a larger sample to see if this is representative or not (again, the reference is http://blogs.sun.com/sanjeevb/date/200506).
Thanks
```
god@irt-smtp-02:~ 2:10pm 67 # ./findleaks.pl ./ml.out
Ptr=0xcf890340 Size=16
              libc.so.1`lmalloc+0x143
              libc.so.1`opendir+0x3e
              python`posix_listdir+0x6d
              python`PyEval_EvalFrameEx+0x59ff
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalFrameEx+0x49ff
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalCode+0x22
              python`PyRun_FileExFlags+0xaf
              python`PyRun_SimpleFileExFlags+0x156
              python`Py_Main+0xa6b
              python`main+0x17
              python`_start+0x80

Ptr=0xcf894000 Size=8192
              libc.so.1`lmalloc+0x143
              libc.so.1`opendir+0x3e
              python`posix_listdir+0x6d
              python`PyEval_EvalFrameEx+0x59ff
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalFrameEx+0x49ff
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalCode+0x22
              python`PyRun_FileExFlags+0xaf
              python`PyRun_SimpleFileExFlags+0x156
              python`Py_Main+0xa6b
              python`main+0x17
              python`_start+0x80
```
On 7/2/08 1:54 PM, "Fletcher Cocquyt" <fcocquyt@stanford.edu> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/2906a5f4f0a9e4fdd7799b8cb351f1c5.jpg?s=120&d=mm&r=g)
Back at the beginning of this thread, Fletcher Cocquyt wrote:
With Solaris 10, you can interpose the libumem library when starting those python processes. This gives you different malloc()/free() allocators including extra instrumentation that is low enough in overhead to run in a production environment, and (when combined with mdb) a powerful set of debugging tools.
Set LD_PRELOAD, UMEM_DEBUG, UMEM_LOGGING environment variables in the parent process before starting python so they will inherit the settings. If you have to, you could replace 'python' with a script that sets what you want in the environment and then runs the python executable.
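For instance, the wrapper script idea might look roughly like this (the Python path is the one from the pmap output earlier in the thread; the UMEM_DEBUG/UMEM_LOGGING values are the usual libumem debugging settings):

```
#!/bin/sh
# hypothetical wrapper: run the real interpreter with libumem interposed so
# the qrunners inherit the debugging allocator.
LD_PRELOAD=libumem.so.1
UMEM_DEBUG=default
UMEM_LOGGING=transaction
export LD_PRELOAD UMEM_DEBUG UMEM_LOGGING
exec /usr/local/stow/Python-2.5.2/bin/python "$@"
```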
I know this will be looking at the lower, native layers of the problem, and you may not see the upper (python) part of the stack very well, but libumem has been a big help to me so I thought I would mention it.
Here are two references... there are many more if you start searching:
Identifying Memory Management Bugs Within Applications Using the libumem Library http://access1.sun.com/techarticles/libumem.html
Solaris Modular Debugger Guide http://docs.sun.com/db/doc/806-6545
Hope this helps - this is too long, so I'll stop now.
Tim
![](https://secure.gravatar.com/avatar/efd26422479c7fa3f0f080b07b6986bb.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
Search the FAQ for performance. The short URL for the web page is <http://wiki.list.org/x/AgA3>.
-- Brad Knowles <brad@python.org> Member of the Python.org Postmaster Team & Co-Moderator of the mailman-users and mailman-developers mailing lists
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
Is Spamassassin invoked from Mailman or from the MTA before Mailman? If this plain Mailman, 10 seconds is a hugely long time to process a single post through IncomingRunner.
If you have some Spamassassin interface like <http://sourceforge.net/tracker/index.php?func=detail&aid=640518&group_id=103&atid=300103> that calls spamd from a Mailman handler, you might consider moving Spamassassin ahead of Mailman and using something like <http://sourceforge.net/tracker/index.php?func=detail&aid=840426&group_id=103&atid=300103> or just header_filter_rules instead.
See the section of Defaults.py headed with
##### # Qrunner defaults #####
In order to run multiple, parallel IncomingRunner processes, you can either copy the entire QRUNNERS definition from Defaults.py to mm_cfg.py and change
('IncomingRunner', 1), # posts from the outside world
to
('IncomingRunner', 4), # posts from the outside world
which says run 4 IncomingRunner processes, or you can just add something like
QRUNNERS[QRUNNERS.index(('IncomingRunner',1))] = ('IncomingRunner',4)
to mm_cfg.py. You can use any power of two for the number.
The following search will return some information.
<http://www.google.com/search?q=site%3Amail.python.org++inurl%3Amailman++%22l...>
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
Mike, many thanks for your (as always) very helpful response - I added the 1 liner to mm_cfg.py to increase the threads to 16. Now I am observing (via memory trend graphs) an acceleration of what looks like a memory leak - maybe from python - currently at 2.4
I am compiling the latest 2.5.2 to see if that helps - for now the workaround is to restart mailman occasionally.
(and yes the spamassassin checks are the source of the 4-10 second delay - now those happen in parallel x16 - so no spikes in the backlog...)
Thanks again
On 6/20/08 9:01 AM, "Mark Sapiro" <mark@msapiro.net> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/7bdecdef03708b218939094eb05e8b35.jpg?s=120&d=mm&r=g)
On 6/23/08, Fletcher Cocquyt wrote:
(and yes the spamassassin checks are the source of the 4-10 second delay - now those happen in parallel x16 - so no spikes in the backlog...)
Search the FAQ for "performance". Do all such spam/virus/DNS/etc... checking up front, and run a second copy of your MTA with all these checks disabled. Have Mailman deliver to the second copy of the MTA, because you will have already done all this stuff on input.
There's no sense running the same message through SpamAssassin (or whatever) thousands of times, if you can do it once on input and then never again.
-- Brad Knowles <brad@shub-internet.org> LinkedIn Profile: <http://tinyurl.com/y8kpxu>
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
An update - I've upgraded to the latest stable python (2.5.2) and its made no difference to the process growth: Config: Solaris 10 x86 Python 2.5.2 Mailman 2.1.9 (8 Incoming queue runners - the leak rate increases with this) SpamAssassin 3.2.5
At this point I am looking for ways to isolate the suspected memory leak - I am looking at using dtrace: http://blogs.sun.com/sanjeevb/date/200506
Any other tips appreciated!
Initial (immediately after a /etc/init.d/mailman restart): last pid: 10330; load averages: 0.45, 0.19, 0.15 09:13:33 93 processes: 92 sleeping, 1 on cpu CPU states: 98.6% idle, 0.4% user, 1.0% kernel, 0.0% iowait, 0.0% swap Memory: 1640M real, 1160M free, 444M swap in use, 2779M swap free
PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND 10314 mailman 1 59 0 9612K 7132K sleep 0:00 0.35% python 10303 mailman 1 59 0 9604K 7080K sleep 0:00 0.15% python 10305 mailman 1 59 0 9596K 7056K sleep 0:00 0.14% python 10304 mailman 1 59 0 9572K 7036K sleep 0:00 0.14% python 10311 mailman 1 59 0 9572K 7016K sleep 0:00 0.13% python 10310 mailman 1 59 0 9572K 7016K sleep 0:00 0.13% python 10306 mailman 1 59 0 9556K 7020K sleep 0:00 0.14% python 10302 mailman 1 59 0 9548K 6940K sleep 0:00 0.13% python 10319 mailman 1 59 0 9516K 6884K sleep 0:00 0.15% python 10312 mailman 1 59 0 9508K 6860K sleep 0:00 0.12% python 10321 mailman 1 59 0 9500K 6852K sleep 0:00 0.14% python 10309 mailman 1 59 0 9500K 6852K sleep 0:00 0.13% python 10307 mailman 1 59 0 9500K 6852K sleep 0:00 0.13% python 10308 mailman 1 59 0 9500K 6852K sleep 0:00 0.12% python 10313 mailman 1 59 0 9500K 6852K sleep 0:00 0.12% python
After 8 hours: last pid: 9878; load averages: 0.14, 0.12, 0.13 09:12:18 97 processes: 96 sleeping, 1 on cpu CPU states: 97.2% idle, 1.2% user, 1.6% kernel, 0.0% iowait, 0.0% swap Memory: 1640M real, 179M free, 2121M swap in use, 1100M swap free
PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND 10123 mailman 1 59 0 314M 311M sleep 1:57 0.02% python 10131 mailman 1 59 0 310M 307M sleep 1:35 0.01% python 10124 mailman 1 59 0 309M 78M sleep 0:45 0.10% python 10134 mailman 1 59 0 307M 81M sleep 1:27 0.01% python 10125 mailman 1 59 0 307M 79M sleep 0:42 0.01% python 10133 mailman 1 59 0 44M 41M sleep 0:14 0.01% python 10122 mailman 1 59 0 34M 30M sleep 0:43 0.39% python 10127 mailman 1 59 0 31M 27M sleep 0:40 0.26% python 10130 mailman 1 59 0 30M 26M sleep 0:15 0.03% python 10129 mailman 1 59 0 28M 24M sleep 0:19 0.10% python 10126 mailman 1 59 0 28M 25M sleep 1:07 0.59% python 10132 mailman 1 59 0 27M 24M sleep 1:00 0.46% python 10128 mailman 1 59 0 27M 24M sleep 0:16 0.01% python 10151 mailman 1 59 0 9516K 3852K sleep 0:05 0.01% python 10150 mailman 1 59 0 9500K 3764K sleep 0:00 0.00% python
On 6/23/08 8:55 PM, "Fletcher Cocquyt" <fcocquyt@stanford.edu> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/994f3710d594babd3293d1bad2a1c09d.jpg?s=120&d=mm&r=g)
I'd start by installing 2.1.11, which was just released yesterday.
MB
e-mail: vidiot@vidiot.com /~\ The ASCII [I've been to Earth. I know where it is. ] \ / Ribbon Campaign [And I'm gonna take us there. Starbuck 3/25/07] X Against Visit - URL: http://vidiot.com/ / \ HTML Email
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
I'm having a hard time finding the release notes for 2.1.11 - can you please provide a link? (I want to see where it details any memory leak fixes since 2.1.9)
thanks
On 7/1/08 10:09 AM, "Vidiot" <brown@mrvideo.vidiot.com> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/994f3710d594babd3293d1bad2a1c09d.jpg?s=120&d=mm&r=g)
There should be one on the list.org website. If not, I do not know where it is. Should also be in the package.
MB
e-mail: vidiot@vidiot.com /~\ The ASCII [I've been to Earth. I know where it is. ] \ / Ribbon Campaign [And I'm gonna take us there. Starbuck 3/25/07] X Against Visit - URL: http://vidiot.com/ / \ HTML Email
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
Not finding a "leak" ref - save a irrelevant (for this runner issue) admindb one:
god@irt-smtp-02:mailman-2.1.11 1:26pm 58 # ls ACKNOWLEDGMENTS Mailman README-I18N.en STYLEGUIDE.txt configure doc misc templates BUGS Makefile.in README.CONTRIB TODO configure.in gnu-COPYING-GPL mkinstalldirs tests FAQ NEWS README.NETSCAPE UPGRADING contrib install-sh scripts INSTALL README README.USERAGENT bin cron messages src god@irt-smtp-02:mailman-2.1.11 1:26pm 59 # egrep -i leak * NEWS: (Tokio Kikuchi's i18n patches), 862906 (unicode prefix leak in admindb),
Thanks
On 7/1/08 1:05 PM, "Vidiot" <brown@mrvideo.vidiot.com> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
Not finding a "leak" ref - save a irrelevant (for this runner issue) admindb
Nothing has been done in Mailman to fix any memory leaks. As far as I know, nothing has been done to create any either.
If there is a leak, it is most likely in the underlying Python and not a Mailman issue per se.
I am curious. You say this problem was exacerbated when you went from one IncomingRunner to eight (sliced) IncomingRunners. The IncomingRunner instances themselves should be processing fewer messages each, and I would expect them to leak less. The other runners are doing the same as before so I would expect them to be the same unless by solving your 'in' queue backlog, you're just handling a whole lot more messages.
Also, in an 8 hour period, I would expect that RetryRunner and CommandRunner and, unless you are doing a lot of mail -> news gatewaying, NewsRunner to have done virtually nothing.
In this snapshot
PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND 10123 mailman 1 59 0 314M 311M sleep 1:57 0.02% python 10131 mailman 1 59 0 310M 307M sleep 1:35 0.01% python 10124 mailman 1 59 0 309M 78M sleep 0:45 0.10% python 10134 mailman 1 59 0 307M 81M sleep 1:27 0.01% python 10125 mailman 1 59 0 307M 79M sleep 0:42 0.01% python 10133 mailman 1 59 0 44M 41M sleep 0:14 0.01% python 10122 mailman 1 59 0 34M 30M sleep 0:43 0.39% python 10127 mailman 1 59 0 31M 27M sleep 0:40 0.26% python 10130 mailman 1 59 0 30M 26M sleep 0:15 0.03% python 10129 mailman 1 59 0 28M 24M sleep 0:19 0.10% python 10126 mailman 1 59 0 28M 25M sleep 1:07 0.59% python 10132 mailman 1 59 0 27M 24M sleep 1:00 0.46% python 10128 mailman 1 59 0 27M 24M sleep 0:16 0.01% python 10151 mailman 1 59 0 9516K 3852K sleep 0:05 0.01% python 10150 mailman 1 59 0 9500K 3764K sleep 0:00 0.00% python
Which processes correspond to which runners. And why are the two processes that have apparently done the least the ones that have grown the most.
In fact, why are none of these 15 PIDs the same as the ones from 8 hours earlier, or was that snapshot actually from after the above were restarted?
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
On 7/1/08 3:37 PM, "Mark Sapiro" <mark@msapiro.net> wrote:
Ok, thanks for confirming that - I will not prioritize a mailman 2.1.9->2.1.11 upgrade
If there is a leak, it is most likely in the underlying Python and not a Mailman issue per se.
Agreed - hence my first priority to upgrade from python 2.4.x to 2.5.2 (the latest on python.org) - but upgrading did not help this
Here is the current leaked state since the the cron 13:27 restart only 3 hours ago: last pid: 20867; load averages: 0.53, 0.47, 0.24 16:04:15 91 processes: 90 sleeping, 1 on cpu CPU states: 99.1% idle, 0.3% user, 0.6% kernel, 0.0% iowait, 0.0% swap Memory: 1640M real, 77M free, 1509M swap in use, 1699M swap free
PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND 24167 mailman 1 59 0 311M 309M sleep 0:28 0.02% python 24158 mailman 1 59 0 308M 305M sleep 0:30 0.01% python 24169 mailman 1 59 0 303M 301M sleep 0:28 0.01% python 24165 mailman 1 59 0 29M 27M sleep 0:09 0.03% python 24161 mailman 1 59 0 29M 27M sleep 0:12 0.07% python 24164 mailman 1 59 0 28M 26M sleep 0:07 0.01% python 24172 mailman 1 59 0 26M 24M sleep 0:04 0.01% python 24160 mailman 1 59 0 26M 24M sleep 0:08 0.01% python 24162 mailman 1 59 0 26M 23M sleep 0:10 0.01% python 24166 mailman 1 59 0 26M 23M sleep 0:04 0.01% python 24171 mailman 1 59 0 25M 23M sleep 0:04 0.02% python 24163 mailman 1 59 0 24M 22M sleep 0:04 0.01% python 24168 mailman 1 59 0 19M 17M sleep 0:03 0.02% python 24170 mailman 1 59 0 9516K 6884K sleep 0:01 0.01% python 24159 mailman 1 59 0 9500K 6852K sleep 0:00 0.00% python
And the mapping to the runners: god@irt-smtp-02:mailman-2.1.11 4:16pm 66 # /usr/ucb/ps auxw | egrep mailman | awk '{print $2 " " $11}' 24167 --runner=IncomingRunner:5:8 24165 --runner=BounceRunner:0:1 24158 --runner=IncomingRunner:7:8 24162 --runner=VirginRunner:0:1 24163 --runner=IncomingRunner:1:8 24166 --runner=IncomingRunner:0:8 24168 --runner=IncomingRunner:4:8 24169 --runner=IncomingRunner:2:8 24171 --runner=IncomingRunner:6:8 24172 --runner=IncomingRunner:3:8 24160 --runner=CommandRunner:0:1 24161 --runner=OutgoingRunner:0:1 24164 --runner=ArchRunner:0:1 24170 /bin/python 24159 /bin/python
Thanks for the analysis, Fletcher
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
What are these last 2? Presumably they are the missing NewsRunner and RetryRunner, but what is the extra stuff in the ps output causing $11 to be the python command and not the runner option? And again, why are these two, which presumably have done nothing, seemingly the biggest?
Here are some additional thoughts.
Are you sure there is an actual leak? Do you know that if you just let them run, they don't reach some stable size and remain there, as opposed to growing so large that they eventually throw a MemoryError exception and get restarted by mailmanctl?
If you allowed them to do that once, the MemoryError traceback might provide a clue.
Caveat! I know very little about Python's memory management. Some of what follows may be wrong.
Here's what I think - Python allocates more memory (from the OS) as needed to import additional modules and create new objects. Imports don't go away, but objects that are destroyed or become unreachable (e.g., a file object that is closed, or a message object whose only reference gets assigned to something else) become candidates for garbage collection, and ultimately the memory allocated to them is collected and reused (assuming no leaks). I *think*, however, that no memory is ever actually freed back to the OS. Thus, Python processes that run for a long time can grow, but don't shrink.
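As a rough, Python 2-era illustration of that growth pattern (a sketch only, not anything from Mailman: it shells out to ps, so it assumes a ps that accepts "-o rss=", and the exact numbers vary by platform, allocator and Python version):

import os

def rss_kb():
    # Ask ps for this process's resident set size, in KB.
    return int(os.popen('ps -o rss= -p %d' % os.getpid()).read().strip())

print 'at start:    %d KB' % rss_kb()
big = ['x' * 1024 + str(i) for i in xrange(100000)]   # roughly 100 MB of strings
print 'after alloc: %d KB' % rss_kb()
del big                                # the objects are freed internally...
print 'after del:   %d KB' % rss_kb()  # ...but RSS usually does not drop back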
Now, IncomingRunner in particular can get very large if large messages are arriving, even if those messages are ultimately not processed very far. Incoming runner reads the entire message into memory and then parses it into a message object which is even bigger than the message string. So, if someone happens to send a 100MB attachment to a list, IncomingRunner is going to need over 200MB before it ever looks at the message itself. This memory will later become available for other use within that IncomingRunner instance, but I don't think it is ever freed back to the OS.
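For a feel of that doubling, here is a small sketch (illustrative only; 'big_message.eml' is a placeholder for a message with a large attachment, and it uses the standard email package directly rather than Mailman's own Message wrapper):

import email

fp = open('big_message.eml')
try:
    text = fp.read()                      # first copy: the raw message string
finally:
    fp.close()

msg = email.message_from_string(text)     # second, larger copy: the parsed object tree

print 'raw message:     %d bytes' % len(text)
print 'flattened again: %d bytes' % len(msg.as_string())
# Both 'text' and 'msg' are alive at the same time, and the in-memory
# Message object graph is bigger than either string.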
Also, I see very little memory change between the 3 hour old snapshot above and the 8 hour old one from your prior post. If this is really a memory leak, I'd expect the 8 hour old ones to be perhaps twice as big as the 3 hour old ones.
Also, do you have any really big lists with big config.pck files? If so, runners will grow as they instantiate that (those) big list(s).
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Mark Sapiro wrote:
and
D'oh - I finally noticed these are in K and the others are in M, so that question is answered at least - the two that haven't done anything actually haven't grown.
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
pmap shows it's the heap:
god@irt-smtp-02:in 8:08pm 64 # pmap 24167
24167:  /bin/python /opt/mailman-2.1.9/bin/qrunner --runner=IncomingRunner:5:8
08038000      64K rwx--    [ stack ]
08050000     940K r-x--  /usr/local/stow/Python-2.5.2/bin/python
0814A000     172K rwx--  /usr/local/stow/Python-2.5.2/bin/python
08175000  312388K rwx--    [ heap ]
CF210000      64K rwx--    [ anon ]
<-- many small libs -->
 total    318300K
Whether it's a leak or not, we need to understand why the heap is growing and put a limit on its growth, to avoid exhausting memory and swapping into oblivion...
None of the lists seem too big:

god@irt-smtp-02:lists 8:24pm 73 # du -sk */*pck | sort -nr | head | awk '{print $1}'
1392
1240
1152
1096
912
720
464
168
136
112
Researching Python heap allocation...
thanks
On 7/1/08 6:14 PM, "Mark Sapiro" <mark@msapiro.net> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
At this point, I don't think it's a leak.
Your runners start out at about 9.5 MB. Most of your working runners grow to about the 20-40 MB range which I don't think is unusual for a site with some config.pck files approaching 1.4 MB.
Only your IncomingRunners seem to grow really big, and I think that is because you are seeing occasional, very large messages, or perhaps it has something to do with your custom spam filtering interface.
Does your MTA limit incoming message size?
In any case, I know you're reluctant to just let it run, but I think if you did let it run for a couple of days, the IncomingRunners wouldn't get any bigger than the roughly 310 MB you're already seeing after 3 hours, and the rest of the runners would remain in the 10 - 50 MB range.

I don't think you'll see a lot of paging activity in that 300+ MB, because I suspect that most of the time nothing is going on in most of that memory.
You may also be interested in the FAQ article at <http://wiki.list.org/x/94A9>.
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
![](https://secure.gravatar.com/avatar/7bdecdef03708b218939094eb05e8b35.jpg?s=120&d=mm&r=g)
On 7/1/08, Mark Sapiro wrote:
The thing I've discovered when doing detailed memory/performance analysis of Mailman queue runners in the past is that, by far, the largest share of the memory in use is actually shared across all the processes, so in this case you'd only take that ~310MB hit once.
Some OSes make it more clear than others that this memory is being shared, and conversely some OSes appear to count this shared memory as actually belonging to multiple separate processes and end up vastly overstating the amount of real memory that is being allocated.
You may also be interested in the FAQ article at <http://wiki.list.org/x/94A9>.
It would be interesting to have all those same commands run for the system in question, to compare with the numbers in the FAQ.
-- Brad Knowles <brad@shub-internet.org> LinkedIn Profile: <http://tinyurl.com/y8kpxu>
![](https://secure.gravatar.com/avatar/7bdecdef03708b218939094eb05e8b35.jpg?s=120&d=mm&r=g)
On 7/1/08, Fletcher Cocquyt wrote:
And when I do the same thing on the mail server for python.org (which hosts over 100 lists, including some pretty active lists with large numbers of subscribers), on the largest queue runner we have (ArchRunner at 41m), I see:
# pmap 1040 | sort -nr -k 2 | head
 total     45800K
0815f000   23244K rwx--    [ anon ]
40f61000    4420K rw---    [ anon ]
40a0f000    2340K rw---    [ anon ]
408aa000    1300K rw---    [ anon ]
40745000    1300K rw---    [ anon ]
40343000    1160K r-x--  /usr/lib/i686/cmov/libcrypto.so.0.9.8
4009c000    1092K r-x--  /lib/libc-2.3.6.so
41844000    1040K rw---    [ anon ]
08048000     944K r-x--  /usr/local/bin/python
No heap showing up anywhere. Doing the same for our IncomingRunner, I get:
# pmap 1043 | sort -nr -k 2 | head
 total     23144K
0815f000    7740K rwx--    [ anon ]
40b12000    1560K rw---    [ anon ]
40745000    1300K rw---    [ anon ]
40cb8000    1168K rw---    [ anon ]
40347000    1160K r-x--  /usr/lib/i686/cmov/libcrypto.so.0.9.8
4009c000    1092K r-x--  /lib/libc-2.3.6.so
4098d000    1040K rw---    [ anon ]
08048000     944K r-x--  /usr/local/bin/python
4063b000     936K rw---    [ anon ]
Again, no heap.
Where did you do this? In the /usr/local/mailman directory?
When I did this in /usr/local/mailman, all of the .pck files that showed up were actually held messages in the data/ directory, not in lists/. This would mean that they were individual messages that had been pickled and then held for moderation, not pickles for lists.
Doing the same in /usr/local/mailman/lists, I find that one of our smaller mailing lists (python-help, seventeen recipients) has the largest list pickle (1044 kilobytes). We have a total of 150 lists, and here's the current subscription count of the five biggest lists:
4075 Python-list
3305 Tutor
2600 Mailman-Users
2329 Mailman-announce
1528 Python-announce-list
Of these, python-list and tutor frequently get between twenty and a hundred or more messages in a day. However, here are their respective config.pck files, using the same "du -sk" script from above:
904   tutor/config.pck
652   python-list/config.pck
476   mailman-users/config.pck
324   mailman-announce/config.pck
208   python-announce-list/config.pck
-- Brad Knowles <brad@shub-internet.org> LinkedIn Profile: <http://tinyurl.com/y8kpxu>
![](https://secure.gravatar.com/avatar/7bdecdef03708b218939094eb05e8b35.jpg?s=120&d=mm&r=g)
On 7/1/08, Fletcher Cocquyt wrote:
BTW, in case it hasn't come through yet -- I am very sensitive to your issues. In my "real" life, I am currently employed as a Sr. System Administrator at the University of Texas at Austin, with about 50,000 students and 20,000 faculty and staff, and one of my jobs is helping out with both the mail system administration and the mailing list system administration.
So, just because I post messages quoting the current statistics we're seeing on python.org, that doesn't mean I'm not sensitive to the problems you're seeing. All I'm saying is that we're not currently seeing them on python.org, so it may be a bit more difficult for us to directly answer your questions, although we'll certainly do everything we can to help.
-- Brad Knowles <brad@shub-internet.org> LinkedIn Profile: <http://tinyurl.com/y8kpxu>
![](https://secure.gravatar.com/avatar/7bdecdef03708b218939094eb05e8b35.jpg?s=120&d=mm&r=g)
On 7/1/08, Mark Sapiro wrote:
In contrast, the mail server for python.org shows the following:
top - 06:54:48 up 29 days, 9:09, 4 users, load average: 1.05, 1.08, 0.95
Tasks: 151 total, 1 running, 149 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.2% user, 1.1% system, 0.0% nice, 98.7% idle

  PID USER     PR  VIRT NI  RES  SHR S %CPU     TIME+ %MEM COMMAND
 1040 mailman   9 42960  0  41m  12m S    0 693:59.44  2.1 ArchRunner:0:1 -s
 1041 mailman   9 22876  0  20m 7488 S    0 478:18.62  1.0 BounceRunner:0:1
 1045 mailman   9 20412  0  19m  10m S    0   3031:12  0.9 OutgoingRunner:0:
 1043 mailman   9 20476  0  18m 4968 S    0 127:02.62  0.9 IncomingRunner:0:
 1042 mailman   9 18564  0  17m 7316 S    0  11:34.14  0.9 CommandRunner:0:1
 1046 mailman  11 17276  0  15m  10m S    1  66:32.16  0.8 VirginRunner:0:1
 1044 mailman   9 11568  0 9964 5184 S    0  12:34.04  0.5 NewsRunner:0:1 -s
And those are the only Python-related processes that show up in the first twenty lines.
-- Brad Knowles <brad@shub-internet.org> LinkedIn Profile: <http://tinyurl.com/y8kpxu>
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
I did a test - I disabled the SpamAssassin integration and watched the heap grow steadily - I do not believe it's SA-related:
god@irt-smtp-02:mailman-2.1.9 10:51pm 68 # pmap 22804 | egrep heap
08175000   14060K rwx--    [ heap ]
god@irt-smtp-02:mailman-2.1.9 10:51pm 69 # pmap 22804 | egrep heap
08175000   16620K rwx--    [ heap ]
god@irt-smtp-02:mailman-2.1.9 10:52pm 70 # pmap 22804 | egrep heap
08175000   16620K rwx--    [ heap ]
god@irt-smtp-02:mailman-2.1.9 10:53pm 75 # pmap 22804 | egrep heap
08175000   18924K rwx--    [ heap ]
god@irt-smtp-02:mailman-2.1.9 10:54pm 81 # pmap 22804 | egrep heap
08175000   19692K rwx--    [ heap ]
god@irt-smtp-02:mailman-2.1.9 10:55pm 82 # pmap 22804 | egrep heap
08175000   19692K rwx--    [ heap ]
Trying to find a way to look at the contents of the heap, or at least limit its growth. Or is there a way to expire & restart mailman processes, analogous to the Apache httpd process expiration (designed to mitigate this kind of resource growth over time)?
thanks
On 7/1/08 9:58 PM, "Brad Knowles" <brad@shub-internet.org> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
I did a test - I disabled the SpamAssassin integration and watched the heap grow steadily - I do not believe it's SA-related:
OK.
Does your MTA limit the size of incoming messages? Can it?
At some point in the next day or so, I'm going to make a modified scripts/post script which will queue incoming messages in qfiles/bad and then move them to qfiles/in only if they are under a certain size. I'm really curious to see if that will help.
bin/mailmanctl could be modified to do this automatically, but currently it only does it on command (restart) or signal (SIGINT); in any case, I gather you're already running a cron job that does a periodic restart.
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
On 7/2/08 8:15 AM, "Mark Sapiro" <mark@msapiro.net> wrote:
Yes, having a global incoming max message size limit and handler (what will the sender receive back?) for Mailman would be useful.
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
The attached 'post' file is a modified version of scripts/post.
It does the following compared to the normal script.
The normal script reads the message from the pipe from the MTA and queues it in the 'in' queue for processing by an IncomingRunner. This script receives the message and instead queues it in the 'bad' queue. It then looks at the size of the 'bad' queue entry (a Python pickle that will be just slightly larger than the message text). If the size is less than MAXSIZE bytes (a parameter near the beginning of the script, currently set to 1000000, but which you can change as you desire), it moves the queue entry from the 'bad' queue to the 'in' queue for processing.
The end result is queue entries smaller than MAXSIZE will be processed normally, and entries >= MAXSIZE will be left in the 'bad' queue for manual examination (with bin/dumpdb or bin/show_qfiles) and either manual deletion or manual moving to the 'in' queue for processing.
The delivery is accepted by the MTA in either case so the poster sees nothing unusual.
This is not intended to be used in a normal production environment. It is only intended as a debug aid to see if IncomingRunners will not grow so large if incoming message size is limited.
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
![](https://secure.gravatar.com/avatar/173371753ea2206b9934a9be1bdce423.jpg?s=120&d=mm&r=g)
On Jul 2, 2008, at 11:01 PM, Mark Sapiro wrote:
The attached 'post' file is a modified version of scripts/post.
Hi Mark, there was no attachment.
I'm not sure 'bad' should be used. Perhaps a separate queue called 'raw'? It is nice that files > MAXSIZE need only be left in 'bad'.
-Barry
![](https://secure.gravatar.com/avatar/56f108518d7ee2544412cc80978e3182.jpg?s=120&d=mm&r=g)
Barry Warsaw wrote:
| On Jul 2, 2008, at 11:01 PM, Mark Sapiro wrote:
|
|> The attached 'post' file is a modified version of scripts/post.
|
| Hi Mark, there was no attachment.
Yes, I know. I was just about to resend. It is attached here. The MUA I used to send the previous message gives any attachment without an extension Content-Type: application/octet-stream, so the list's content filtering removed it.
|> It does the following compared to the normal script.
|
|> The normal script reads the message from the pipe from the MTA and
|> queues it in the 'in' queue for processing by an IncomingRunner. This
|> script receives the message and instead queues it in the 'bad' queue.
|> It then looks at the size of the 'bad' queue entry (a Python pickle
|> that will be just slightly larger than the message text). If the size
|> is less than MAXSIZE bytes (a parameter near the beginning of the
|> script, currently set to 1000000, but which you can change as you
|> desire), it moves the queue entry from the 'bad' queue to the 'in'
|> queue for processing.
|
| I'm not sure 'bad' should be used. Perhaps a separate queue called
| 'raw'? It is nice that files > MAXSIZE need only be left in 'bad'.
If we're going to do something like this going forward, we can certainly change the queue. For this 'debug' effort, I wanted to keep it simple and use an existing mm_cfg queue name.
Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
# -*- python -*-
#
# Copyright (C) 1998,1999,2000,2001,2002 by the Free Software Foundation, Inc.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
"""Accept posts to a list and handle them properly.
The main advertised address for a list should be filtered to this program,
through the mail wrapper. E.g. for list `test@yourdomain.com', the `test'
alias would deliver to this script.
Stdin is the mail message, and argv[1] is the name of the target mailing list.
"""
import os
import sys

import paths
from Mailman import mm_cfg
from Mailman import Utils
from Mailman.i18n import _
from Mailman.Queue.sbcache import get_switchboard
from Mailman.Logging.Utils import LogStdErr

LogStdErr("error", "post")

MAXSIZE = 1000000


def main():
    # TBD: If you've configured your list or aliases so poorly as to get
    # either of these first two errors, there's little that can be done to
    # save your messages.  They will be lost.  Minimal testing of new lists
    # should avoid either of these problems.
    try:
        listname = sys.argv[1]
    except IndexError:
        print >> sys.stderr, _('post script got no listname.')
        sys.exit(1)
    # Make sure the list exists
    if not Utils.list_exists(listname):
        print >> sys.stderr, _('post script, list not found: %(listname)s')
        sys.exit(1)
    # Immediately queue the message for the incoming qrunner to process.  The
    # advantage to this approach is that messages should never get lost --
    # some MTAs have a hard limit to the time a filter prog can run.  Postfix
    # is a good example; if the limit is hit, the proc is SIGKILL'd giving us
    # no chance to save the message.
    bdq = get_switchboard(mm_cfg.BADQUEUE_DIR)
    filebase = bdq.enqueue(sys.stdin.read(), listname=listname,
                           tolist=1, _plaintext=1)
    frompath= os.path.join(mm_cfg.BADQUEUE_DIR, filebase + '.pck')
    topath= os.path.join(mm_cfg.INQUEUE_DIR, filebase + '.pck')
    if os.stat(frompath).st_size < MAXSIZE:
        os.rename(frompath,topath)


if __name__ == '__main__':
    main()
![](https://secure.gravatar.com/avatar/173371753ea2206b9934a9be1bdce423.jpg?s=120&d=mm&r=g)
On Jul 2, 2008, at 11:15 PM, Mark Sapiro wrote:
Ah, np.
Excellent point. A couple of very minor comments on the file, but other than that, it looks great. (I know you copied this from the original file, but still I can't resist. ;)
# Copyright (C) 1998,1999,2000,2001,2002 by the Free Software Foundation, Inc.

1998-2008
Should probably use True there instead of 1.
frompath= os.path.join(mm_cfg.BADQUEUE_DIR, filebase + '.pck')
topath= os.path.join(mm_cfg.INQUEUE_DIR, filebase + '.pck')
Space in front of the =
if os.stat(frompath).st_size < MAXSIZE:
    os.rename(frompath,topath)
Space after the comma.
if __name__ == '__main__':
    main()
-Barry
![](https://secure.gravatar.com/avatar/173371753ea2206b9934a9be1bdce423.jpg?s=120&d=mm&r=g)
On Jul 2, 2008, at 11:15 AM, Mark Sapiro wrote:
This should be moved to mailman-developers, but in general it's an interesting idea. In MM3 I've split the incoming queue into two separate queues. The incoming queue now solely determines the disposition of the message, i.e. held, rejected, discarded or accepted. If accepted, the message is moved to a pipeline queue where it's munged for delivery (i.e. headers and footers added, etc.).
MM3 also has an LMTP queue runner, which I'd like to make the default delivery mechanism for 3.0 and possibly 2.2 (yes, I still have a todo to back-port MM3's new process architecture to 2.2). Although it's not there right now, it would be trivial to add a check on the raw size of the message before it's parsed. If it's too large then it can be rejected before the email package attempts to parse it, and that would give the upstream LMTP client (i.e. your MTA) a better diagnostic.
It still makes sense to put a size limit in your MTA so it never hits the LMTP server, because the string will still be in the Python process's memory. But at least you won't pay the penalty for parsing such a huge message just to reject it later.
This is a good idea. It might be better to do this in Runner._doperiodic().
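As a sketch of that idea (not code from this thread: it assumes the Mailman 2.1 Runner API, where _doperiodic() runs once per processing loop and stop() ends the runner's main loop, and it assumes mailmanctl will start a replacement for a runner that exits; the class name and threshold are invented):

from Mailman.Queue.IncomingRunner import IncomingRunner

MAX_LOOPS = 5000      # arbitrary "retire after this many loops" threshold


class ExpiringIncomingRunner(IncomingRunner):
    """IncomingRunner that retires itself periodically, in the spirit of
    Apache's MaxRequestsPerChild, so any heap the process has accumulated
    is given back when a fresh process takes over."""

    def __init__(self, slice=None, numslices=1):
        IncomingRunner.__init__(self, slice, numslices)
        self._loops = 0

    def _doperiodic(self):
        IncomingRunner._doperiodic(self)
        self._loops += 1
        if self._loops >= MAX_LOOPS:
            # Ask the runner to exit cleanly; the watchdog (mailmanctl) is
            # expected to start a replacement -- verify that behavior (and
            # how exit status affects it) before relying on this.
            self.stop()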
-Barry
![](https://secure.gravatar.com/avatar/7bdecdef03708b218939094eb05e8b35.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
You can do "mailmanctl restart", but that's not really a proper solution to this problem.
-- Brad Knowles <brad@shub-internet.org> LinkedIn Profile: <http://tinyurl.com/y8kpxu>
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
I am hopeful our esteemed code maintainers are thinking the built-in restart idea is a good one:
BW wrote:
This is a good idea. It might be better to do this in Runner._doperiodic().
On 7/2/08 9:22 AM, "Brad Knowles" <brad@shub-internet.org> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/173371753ea2206b9934a9be1bdce423.jpg?s=120&d=mm&r=g)
On Jul 2, 2008, at 1:12 PM, Fletcher Cocquyt wrote:
Optionally, yes. By default, I'm not so sure.
-Barry
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
I had a parallel thread on the dtrace list to get memleak.d running
http://blogs.sun.com/sanjeevb/date/200506
I just got this stack trace from a 10-second sample of the most actively growing python mailman process. The output is explained by Sanjeev on his blog, but I'm hoping the stack trace will point the analysis towards a cause for why my mailman processes are growing abnormally.
I will see if the findleaks.pl analysis of this output returns anything
Thanks!
0 42246 realloc:return Ptr=0x824c268 Oldptr=0x0 Size=16
              libc.so.1`realloc+0x33a
              python`addcleanup+0x45
              python`convertsimple+0x145d
              python`vgetargs1+0x259
              python`_PyArg_ParseTuple_SizeT+0x1d
              python`posix_listdir+0x55
              python`PyEval_EvalFrameEx+0x59ff
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalFrameEx+0x49ff
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalCode+0x22
              python`PyRun_FileExFlags+0xaf
              python`PyRun_SimpleFileExFlags+0x156
              python`Py_Main+0xa6b
              python`main+0x17
              python`_start+0x80

0 42249 free:entry Ptr=0x824c268

0 42244 lmalloc:return Ptr=0xcf890300 Size=16
              libc.so.1`lmalloc+0x143
              libc.so.1`opendir+0x3e
              python`posix_listdir+0x6d
              python`PyEval_EvalFrameEx+0x59ff
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalFrameEx+0x49ff
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalCode+0x22
              python`PyRun_FileExFlags+0xaf
              python`PyRun_SimpleFileExFlags+0x156
              python`Py_Main+0xa6b
              python`main+0x17
              python`_start+0x80

0 42244 lmalloc:return Ptr=0xcf894000 Size=8192
              libc.so.1`lmalloc+0x143
              libc.so.1`opendir+0x3e
              python`posix_listdir+0x6d
              python`PyEval_EvalFrameEx+0x59ff
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalFrameEx+0x49ff
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalCode+0x22
              python`PyRun_FileExFlags+0xaf
              python`PyRun_SimpleFileExFlags+0x156
              python`Py_Main+0xa6b
              python`main+0x17
              python`_start+0x80

0 42249 free:entry Ptr=0x86d78f0
^C
On 7/2/08 10:14 AM, "Barry Warsaw" <barry@list.org> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/160625f937fa3658100cc5bce57f608d.jpg?s=120&d=mm&r=g)
Below is the findleaks output from a ~5 minute sample of a python runner - I will take a larger sample to see if this is representative or not (again, the reference is http://blogs.sun.com/sanjeevb/date/200506):
Thanks
god@irt-smtp-02:~ 2:10pm 67 # ./findleaks.pl ./ml.out
Ptr=0xcf890340 Size=16
              libc.so.1`lmalloc+0x143
              libc.so.1`opendir+0x3e
              python`posix_listdir+0x6d
              python`PyEval_EvalFrameEx+0x59ff
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalFrameEx+0x49ff
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalCode+0x22
              python`PyRun_FileExFlags+0xaf
              python`PyRun_SimpleFileExFlags+0x156
              python`Py_Main+0xa6b
              python`main+0x17
              python`_start+0x80

Ptr=0xcf894000 Size=8192
              libc.so.1`lmalloc+0x143
              libc.so.1`opendir+0x3e
              python`posix_listdir+0x6d
              python`PyEval_EvalFrameEx+0x59ff
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalFrameEx+0x49ff
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalFrameEx+0x6133
              python`PyEval_EvalCodeEx+0x57f
              python`PyEval_EvalCode+0x22
              python`PyRun_FileExFlags+0xaf
              python`PyRun_SimpleFileExFlags+0x156
              python`Py_Main+0xa6b
              python`main+0x17
              python`_start+0x80
On 7/2/08 1:54 PM, "Fletcher Cocquyt" <fcocquyt@stanford.edu> wrote:
-- Fletcher Cocquyt Senior Systems Administrator Information Resources and Technology (IRT) Stanford University School of Medicine
Email: fcocquyt@stanford.edu Phone: (650) 724-7485
![](https://secure.gravatar.com/avatar/2906a5f4f0a9e4fdd7799b8cb351f1c5.jpg?s=120&d=mm&r=g)
Back at the beginning of this thread, Fletcher Cocquyt wrote:
With Solaris 10, you can interpose the libumem library when starting those python processes. This gives you different malloc()/free() allocators including extra instrumentation that is low enough in overhead to run in a production environment, and (when combined with mdb) a powerful set of debugging tools.
Set LD_PRELOAD, UMEM_DEBUG, UMEM_LOGGING environment variables in the parent process before starting python so they will inherit the settings. If you have to, you could replace 'python' with a script that sets what you want in the environment and then runs the python executable.
I know this will be looking at the lower, native layers of the problem, and you may not see the upper (python) part of the stack very well, but libumem has been a big help to me so I thought I would mention it.
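A sketch of such a wrapper (hypothetical: the interpreter path is the one from the pmap output earlier in the thread, and the UMEM_* settings should be checked against the libumem documentation for your Solaris release):

import os
import sys

REAL_PYTHON = '/usr/local/stow/Python-2.5.2/bin/python'   # adjust to your install

env = dict(os.environ)
env['LD_PRELOAD'] = 'libumem.so'       # interpose libumem's allocator
env['UMEM_DEBUG'] = 'default'          # standard debugging checks
env['UMEM_LOGGING'] = 'transaction'    # keep a transaction log for mdb

# Replace this wrapper with the real interpreter, passing the qrunner
# arguments along; every child started this way inherits the settings.
os.execve(REAL_PYTHON, [REAL_PYTHON] + sys.argv[1:], env)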
Here are two references... there are many more if you start searching:
Identifying Memory Management Bugs Within Applications Using the libumem Library http://access1.sun.com/techarticles/libumem.html
Solaris Modular Debugger Guide http://docs.sun.com/db/doc/806-6545
Hope this helps - this is too long, so I'll stop now.
Tim
![](https://secure.gravatar.com/avatar/efd26422479c7fa3f0f080b07b6986bb.jpg?s=120&d=mm&r=g)
Fletcher Cocquyt wrote:
Search the FAQ for performance. The short URL for the web page is <http://wiki.list.org/x/AgA3>.
-- Brad Knowles <brad@python.org> Member of the Python.org Postmaster Team & Co-Moderator of the mailman-users and mailman-developers mailing lists
participants (7)
- Barry Warsaw
- Brad Knowles
- Brad Knowles
- Fletcher Cocquyt
- Mark Sapiro
- Tim Bell
- Vidiot