Re: [Twisted-Python] Twisted server is 5 times SLOWER on Solaris than Linux?
On 11:35 pm, jarrod@vertigrated.com wrote:
There is a "backend" C module that our Twisted server front ends, and it is highly multi-threaded. So the T1000 is PERFECT for our application, except that now Twisted is the bottleneck. :-(
This seems odd to me.

If all the CPUs are going to be busy doing a multi-threaded back-end's work, and Twisted is just doing the I/O, then it seems the T1000 would still be a benefit. The benchmark you mentioned was completely static; there was no backend library, no multithreaded CPU load. Is the performance disparity similar when you're running actual workloads?

Sure, Twisted isn't going to be able to dole out as much work as something optimized to balance the I/O management CPU across N+1 cores; but if those cores are going to be busy anyway in realistic use, then presumably having Twisted contending for all of them wouldn't be much of a performance boost.

I actually do have a little experience with Twisted-*like* software on Solaris, although not Twisted itself. The proprietary system which originally inspired Twisted's networking core was actually designed to run on Solaris, and took Sparc hardware advantages into account. It still ran all of its I/O in a single thread.
So we either scrap our Twisted implementation and have to spend extra time on another network handling layer, or run 5 times as many instances of our server to service the same number of concurrent clients.
Congratulations. For years, I've been warning people that Twisted cannot transparently take full advantage of vertical scaling with SMP. While I've heard a lot of uninformed whinging about how this is a huge problem, you are the first person to report an actual performance problem related to that fact :).

Running 5 times as many instances of the server does make sense, and shouldn't have a significant downside. The parallelism strategy I've used pretty much everywhere is multiprocessing rather than multithreading, and it works well.

If the issue is that you don't want to have that many different open ports on each machine, would it be possible to have a small front-end server accept()ing and sending sockets to N+1 (where N is the number of cores) other Twisted processes? I don't know how this might be accomplished on Solaris, but if it's possible, it should be transparent to the clients and let Twisted itself take advantage of the hardware. It would take some work, but not as much as a rewrite.

Again, it seems weird to me that this is necessary if the back-end library is really utilizing all the CPUs already and you are not I/O bound.
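Purely for illustration, here is roughly what that accept()-and-hand-off could look like. This is not anything Twisted ships; it's a minimal sketch assuming Python's socket.send_fds()/recv_fds() (3.9+) for SCM_RIGHTS descriptor passing over AF_UNIX sockets, and WORKER_PATHS and serve_client() are invented names standing in for your own setup.

# Sketch only: an acceptor process that hands accepted connections to worker
# processes over AF_UNIX control sockets (SCM_RIGHTS fd passing).
# WORKER_PATHS and serve_client() are placeholders, not real Twisted APIs.
import os
import socket
from itertools import cycle

WORKER_PATHS = ["/tmp/worker-%d.sock" % i for i in range(4)]  # one per core

def acceptor(listen_port=8000):
    # Connect to each worker's control socket, then round-robin descriptors.
    workers = []
    for path in WORKER_PATHS:
        w = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        w.connect(path)
        workers.append(w)
    rr = cycle(workers)

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", listen_port))
    srv.listen(128)
    while True:
        conn, _addr = srv.accept()
        # Ship the descriptor to the next worker, then drop our copy of it.
        socket.send_fds(next(rr), [b"x"], [conn.fileno()])
        conn.close()

def worker(path):
    # Each worker runs its own event loop and serves the descriptors it is sent.
    if os.path.exists(path):
        os.unlink(path)
    ctl = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    ctl.bind(path)
    ctl.listen(1)
    conn, _addr = ctl.accept()
    while True:
        msg, fds, _flags, _from = socket.recv_fds(conn, 1024, 1)
        if not msg:
            break
        for fd in fds:
            serve_client(socket.socket(fileno=fd))  # e.g. wrap in a transport

On the Twisted side, the hand-off would amount to wrapping the inherited descriptor in a transport; current reactors expose reactor.adoptStreamConnection() for exactly that, though whether the descriptor-passing piece is convenient on Solaris is a separate question.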
On 1/17/07, glyph@divmod.com <glyph@divmod.com> wrote:
On 11:35 pm, jarrod@vertigrated.com wrote:
There is a "backend" C module that our Twisted server front ends, and it is highly multi-threaded. So the T1000 is PERFECT for our application, except that now Twisted is the bottleneck. :-(
This seems odd to me.
If all the CPUs are going to be busy doing a multi-threaded back-end's work, and Twisted is just doing the I/O, then it seems the T1000 would still be a benefit. The benchmark you mentioned was completely static; there was no backend library, no multithreaded CPU load. Is the performance disparity similar when you're running actual workloads?
snipped a lot of good information :-)

Again, it seems weird to me that this is necessary if the back-end library is really utilizing all the CPUs already and you are not I/O bound.
Here is what we are doing, basically. Twisted takes in data, and in a C extension we send the data to multiple backends in parallel to do processing on it. Then we aggregate the results and send information back to the client. This is basically a fancy proxy that parallelizes and distributes work to other machines on the network.

All the clients run in "keep-alive" mode, so they don't create new connections for each piece of work they send to the system; once they are all connected, they stay connected for their lifetime (a long time).

On the Dell 2850s without any backend code, we see 600 ms latency with a test suite of 400 clients. With the Solaris SPARC machines (T1000 and V210) we see 4000-5000 ms latency with the same no-op code and the same 400 clients. With the backend code we see about an additional 250 ms latency on both platforms; since the "backend" code is just taking the data and sending it out across the network to process, it just sits waiting on responses. The backend code is just not doing enough work to stress the machine, basically.

We have LOTS and LOTS of test harness code and profiling code to pinpoint where the bottlenecks are. We are going to have to process a couple of terabytes a day thru this system. Latency thru the system is a high priority because of what kind of system it is. We can get up to about 1400 clients on the Dell 2850 hardware before latency starts climbing out of control. The SPARC hardware is falling over at 400 clients :-(

Thanks to everyone for all the ideas and help.
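To make the shape concrete, here is a rough sketch of that fan-out/aggregate proxy written directly in Twisted. It is illustrative only; the real front end drives the C extension instead, and the line-oriented framing, backend addresses, and join-the-answers aggregation below are invented placeholders.

# Sketch only: a line-oriented stand-in for the real proxy.  Backend hosts,
# framing, and the aggregation step are invented for illustration.
from twisted.internet import reactor, defer
from twisted.internet.protocol import ClientCreator, Factory
from twisted.protocols.basic import LineReceiver


class BackendClient(LineReceiver):
    """One persistent connection to a backend worker machine."""

    def connectionMade(self):
        self.pending = []

    def lineReceived(self, line):
        # Assume one response line per request, answered in order.
        if self.pending:
            self.pending.pop(0).callback(line)

    def process(self, payload):
        d = defer.Deferred()
        self.pending.append(d)
        self.sendLine(payload)
        return d


class FrontEnd(LineReceiver):
    """A keep-alive client connection: fan each request out to every backend."""

    def lineReceived(self, line):
        ds = [backend.process(line) for backend in self.factory.backends]
        defer.gatherResults(ds).addCallback(self.aggregate)

    def aggregate(self, results):
        # Placeholder aggregation: just join whatever the backends answered.
        self.sendLine(b" ".join(results))


def main():
    factory = Factory()
    factory.protocol = FrontEnd
    factory.backends = []

    for host, port in [("backend1", 9000), ("backend2", 9000)]:  # invented
        d = ClientCreator(reactor, BackendClient).connectTCP(host, port)
        d.addCallback(factory.backends.append)

    reactor.listenTCP(8000, factory)
    reactor.run()


if __name__ == "__main__":
    main()

Everything on the front-end side (accepting the keep-alive clients, fanning out, gathering results with Deferreds) happens in a single reactor thread, which is the part that cannot spread across the T1000's cores.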