I believe I've found out how to reliably reproduce the performance problems I've noticed here at VA and at Kanga.Nu, and which Barry and another (forget name, sorry) have observed as well:
1) Create a moderated list.
2) Subscribe 200 addresses to the list (can be bogus addresses, but the local MTA must accept them).
3) Post at least 30 messages of at least 2K average size to the list (a scripted way to do this is sketched below).
4) Go to the moderation page, approve every message, and hit submit.
5) Watch your system load peg and stay there for an obscenely long time.
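For anyone who wants to script the posting step, here's a rough sketch (the list address, sender, and localhost MTA are hypothetical placeholders, and step 2 still has to be done through Mailman itself, e.g. the admin mass-subscribe page): it just fires 30 messages of roughly 2K each at the list through the local MTA.

    import smtplib

    # Hypothetical test list and sender -- adjust for your site.  Assumes
    # the MTA on localhost will accept and deliver mail for the list.
    LIST_ADDR = 'loadtest@lists.example.com'
    SENDER = 'poster@example.com'
    BODY = 'x' * 2048                  # roughly 2K of payload per message

    server = smtplib.SMTP('localhost')
    for i in range(30):
        msg = ('From: %s\nTo: %s\nSubject: load test %d\n\n%s\n'
               % (SENDER, LIST_ADDR, i, BODY))
        server.sendmail(SENDER, [LIST_ADDR], msg)
    server.quit()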
--
J C Lawrence                               Home: claw@kanga.nu
---------(*)                  Linux/IA64 - Work: claw@varesearch.com
... Beware of cromagnons wearing chewing gum and palm pilots ...
"JCL" == J C Lawrence <claw@varesearch.com> writes:
JCL> I believe I've found out how to reliably reproduce the
JCL> performance problems I've noticed here at VA and at Kanga.Nu,
JCL> and which Barry and another (forget name, sorry) have
JCL> observed as well:
JCL> 1) Create a moderated list.
JCL> 2) Subscribe 200 addresses to the list (can be bogus
JCL> addresses but the local MTA must accept them)
JCL> 3) Post at least 30 messages of an average of at least 2K
JCL> size to the list.
JCL> 4) Go to the moderation page, approve every message, and hit
JCL> submit.
JCL> 5) Watch your system load peg and stay there for an
JCL> obscenely long time.
Just a quick note 'cause I have very little time. I'm currently seeing python.org massively pegged, and Guido and I were talking about some Python tools we'd like to develop that would help debug situations like this. What I wanted was something like gdb's ability to attach to and print stack traces of running external programs. We got into some brainstorming and came up with A Certified Very Cool Trick[1].
This yielded a traceback for where at least two pegged processes are spinning. Seems to make sense, but I'm not very familiar with the archiving guts, so I'm posting this traceback to spur some discussion. Maybe Scott or Harald can craft a fix.
Here's the traceback:
Looks like the archiver is doing way too much work for every message it has to process. When python.org came back up today, it got slammed with incoming mail for a bazillion lists. Each message spins in this HyperDatabase.clearIndex() loop.
-Barry
[1] CVCT:
Use gdb to attach to the running Python program, then type this at the gdb prompt:
(gdb) call PyRun_SimpleString("import sys, traceback; sys.stderr=open('/tmp/tb','w',0); traceback.print_stack()")
Sitting in /tmp/tb will be the stack trace of where the Python program was when you stopped it. There's reason to believe this will not always work, but it likely will, and you can even detach the program and let it continue on.
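If you want to drive that from a script, here's a rough wrapper of my own (not part of Mailman, and it assumes a gdb new enough to understand the -p, -ex, and -batch options): it replays the exact command above against a pid given on the command line, leaves the Python-level stack in /tmp/tb, and detaches when gdb exits.

    import subprocess
    import sys

    # The command from above, verbatim; note the open(..., 'w', 0) form
    # assumes a Python 2-era interpreter in the target process.
    GDB_CMD = ('call PyRun_SimpleString("import sys, traceback; '
               "sys.stderr=open('/tmp/tb','w',0); traceback.print_stack()\")")

    def dump_py_stack(pid):
        # Attach, run the command, then let gdb exit, which detaches and
        # lets the target process continue running.
        subprocess.check_call(['gdb', '-batch', '-p', str(pid), '-ex', GDB_CMD])

    if __name__ == '__main__':
        dump_py_stack(int(sys.argv[1]))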
Hmm... A few thoughts..
I've never seen the load problem on my Mailman site, even though I run several reasonably well-trafficked lists. HOWEVER, I run a slightly customized version of HyperArch.py which still uses bsddb for data storage.
Also, my site does not immediately archive messages; it runs an archiving cronjob every few hours. (It still doesn't draw much CPU when the cronjob kicks in, though.)
Really, this archiving system was never meant to be used in the current method of operation (being invoked on each incoming message), and its design is probably rather non-optimal for this use. It spends substantial time building & tearing down & (un)pickling various structures, incurring a bit of overhead.
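To put a rough number on that, here's a toy measurement (the structure and sizes are made up, not the real pipermail indices): the per-message cost is dominated by unpickling and repickling the whole index, so it grows with the size of the archive rather than with the size of the one new message.

    import pickle
    import time

    # Fake stand-in for a pipermail index that gets loaded and saved
    # around every delivery.
    index = {}
    for i in range(50000):
        index[i] = 'subject of article %d' % i

    t0 = time.time()
    blob = pickle.dumps(index, 1)     # roughly what saving the index costs
    index2 = pickle.loads(blob)       # roughly what (re)opening it costs
    print('pickle round trip for %d entries: %.2f seconds'
          % (len(index), time.time() - t0))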
-The Dragon De Monsyne
"TDDM" == The Dragon De Monsyne <dragondm@integral.org> writes:
TDDM> Hmm... A few thoughts.. I've never seen the load
TDDM> problem on my Mailman site, even though I run several
TDDM> reasonably well-trafficked lists. HOWEVER, I run a
TDDM> slightly customized version of HyperArch.py which still
TDDM> uses bsddb for data storage.
TDDM> Also, my site does not immediately archive messages; it runs
TDDM> an archiving cronjob every few hours. (It still doesn't draw
TDDM> much CPU when the cronjob kicks in, though.)
Both those differences (using bsddb and not archiving immediately) explain why you wouldn't see the hit. The bug (now fixed) was in the DumbBTree implementation.
TDDM> really, this archiving system was never meant to be
TDDM> used in the current method of operation (being invoked on
TDDM> each incoming message), and its design is probably rather
TDDM> non-optimal for this use. It spends substantial time
TDDM> building & tearing down & (un)pickling various structures
TDDM> incurring a bit of overhead.
Wonderful understatement Dragon! :)
Running with the patches in the CVS tree, I think the current system can work for a site as heavily trafficked as python.org, but it is very inefficient. Maybe we should document that for high traffic sites, you might want to use an external archiver which only runs from a cronjob. Unfortunately, I haven't done this, so if anybody can contribute a HOWTO, I'd appreciate it.
-Barry
I think I have identified at least one performance bottleneck in Mailman. Hopefully, /the/ only bottleneck :) I think the culprit is in HyperDatabase.py, namely the DumbBTree class. This stuff is the interface between Mailman and Pipermail, and as such I am really quite unfamiliar with this code, but using the trick I outlined in a previous message, I found that nearly every time I printed the stack trace, the process was sitting in HyperDatabase.clearIndex().
I think the algorithm of using key=dumbtree.next(); del dumbtree[key] is extremely inefficient. Take a look at DumbBTree.__delitem__() to get the picture.
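To see why, here's a toy model of that access pattern (the class below is mine, not the real DumbBTree): if __delitem__ has to keep a sorted key list in shape, as the real one appears to, then emptying an n-entry index one key at a time costs O(n^2), while throwing the whole thing away at once is effectively free.

    # Toy stand-in for the DumbBTree access pattern -- not the real class.
    class ToyBTree:
        def __init__(self):
            self.dict = {}
            self.sorted = []

        def __setitem__(self, key, value):
            self.dict[key] = value
            self.sorted = sorted(self.dict.keys())

        def first(self):
            key = self.sorted[0]
            return key, self.dict[key]

        def __delitem__(self, key):
            del self.dict[key]
            self.sorted.remove(key)   # linear fixup on every delete

        def clear(self):
            # analogous to the clear() the patch below adds: one rebind
            # instead of n individual deletions
            self.dict = {}
            self.sorted = []

    # The old clearIndex() style: one-at-a-time deletion, quadratic overall.
    def slow_clear(tree):
        while tree.dict:
            key, msgid = tree.first()
            del tree[key]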
So here's an experimental patch that adds a clear() method to the DumbBTree class; clearIndex() will use it if available, falling back to the old approach (which I assume follows some standard bsddb btree API, even though Mailman doesn't currently use bsddb).
Near as I can tell, this doesn't break anything, archive threads still get created properly, and while I haven't tested it live on python.org, it ought to speed at least this part up a lot. We'll see if this fixes the problem some of us have seen.
I'm going to try to test this some more before I check it in. I may install it on python.org to see what happens. I'd love some feedback. Does it solve the performance problems? Does anything break because of this patch? Do we need to investigate further?
-Barry
-------------------- snip snip --------------------
Index: HyperDatabase.py
RCS file: /projects/cvsroot/mailman/Mailman/Archiver/HyperDatabase.py,v
retrieving revision 1.3
diff -c -r1.3 HyperDatabase.py
*** HyperDatabase.py	1998/11/04 23:49:03	1.3
--- HyperDatabase.py	1999/06/30 22:12:24
***************
*** 88,95 ****
          else:
              self.current_index = self.current_index + 1
  
! 
! 
  
      def first(self):
          if not self.sorted:
--- 88,97 ----
          else:
              self.current_index = self.current_index + 1
  
!     def clear(self):
!         # bulk clearing much faster than deleting each item, esp. with the
!         # implementation of __delitem__() above :(
!         self.dict = {}
  
      def first(self):
          if not self.sorted:
***************
*** 296,302 ****
      def newArchive(self, archive): pass
      def clearIndex(self, archive, index):
          self.__openIndices(archive)
!         index=getattr(self, index+'Index')
          finished=0
          try:
              key, msgid=self.threadIndex.first()
--- 298,307 ----
      def newArchive(self, archive): pass
      def clearIndex(self, archive, index):
          self.__openIndices(archive)
! ##        index=getattr(self, index+'Index')
!         if hasattr(self.threadIndex, 'clear'):
!             self.threadIndex.clear()
!             return
          finished=0
          try:
              key, msgid=self.threadIndex.first()