[Twisted-Python] Twisted Python vs. "Blocking" Python: Weird performance on small operations.
Hello Everyone!

My name is Dirk Moors, and for the past four years I have been involved in developing a cloud computing platform, using Python as the programming language. A year ago I discovered Twisted Python, and it got me very interested, up to the point where I decided to convert our platform (a work in progress) to a Twisted platform. One year later I am still very enthusiastic about the overall performance and stability, but last week I encountered something I didn't expect: it appeared to be less efficient to run small "atomic" operations in separate deferred callbacks than to run those same "atomic" operations together in "blocking" mode. Am I doing something wrong here?

To prove the problem to myself, I created the following example (full source and test code is attached):

---------------------------------------------------------------------
import struct
from twisted.internet import defer, reactor

def int2binAsync(anInteger):
    def packStruct(i):
        # Packs an integer; the result is 4 bytes
        return struct.pack("i", i)

    d = defer.Deferred()
    d.addCallback(packStruct)
    reactor.callLater(0, d.callback, anInteger)
    return d

def bin2intAsync(aBin):
    def unpackStruct(p):
        # Unpacks a byte string into an integer
        return struct.unpack("i", p)[0]

    d = defer.Deferred()
    d.addCallback(unpackStruct)
    reactor.callLater(0, d.callback, aBin)
    return d

def int2binSync(anInteger):
    # Packs an integer; the result is 4 bytes
    return struct.pack("i", anInteger)

def bin2intSync(aBin):
    # Unpacks a byte string into an integer
    return struct.unpack("i", aBin)[0]
---------------------------------------------------------------------

While running the test code I got the following results. (1 run = converting an integer to a byte string, converting that byte string back to an integer, and finally checking whether that last integer is the same as the input integer.)

*** Starting Synchronous Benchmarks. (No Twisted => "blocking" code)
-> Synchronous Benchmark (1 runs) Completed in 0.0 seconds.
-> Synchronous Benchmark (10 runs) Completed in 0.0 seconds.
-> Synchronous Benchmark (100 runs) Completed in 0.0 seconds.
-> Synchronous Benchmark (1000 runs) Completed in 0.00399994850159 seconds.
-> Synchronous Benchmark (10000 runs) Completed in 0.0369999408722 seconds.
-> Synchronous Benchmark (100000 runs) Completed in 0.362999916077 seconds.
*** Synchronous Benchmarks Completed in 0.406000137329 seconds.

*** Starting Asynchronous Benchmarks. (Twisted => "non-blocking" code)
-> Asynchronous Benchmark (1 runs) Completed in 34.5090000629 seconds.
-> Asynchronous Benchmark (10 runs) Completed in 34.5099999905 seconds.
-> Asynchronous Benchmark (100 runs) Completed in 34.5130000114 seconds.
-> Asynchronous Benchmark (1000 runs) Completed in 34.5859999657 seconds.
-> Asynchronous Benchmark (10000 runs) Completed in 35.2829999924 seconds.
-> Asynchronous Benchmark (100000 runs) Completed in 41.492000103 seconds.
*** Asynchronous Benchmarks Completed in 42.1460001469 seconds.

Am I really seeing a factor of 100x?? I really hope I made a huge reasoning error here, but I just can't find it. If my results are correct, then I really need to go and check my entire cloud platform for the places where I decided to split functions into atomic operations, thinking it would improve performance when in fact it did the opposite.

I personally suspect that I am losing my CPU cycles to the reactor scheduling the deferred callbacks. Would that assumption make any sense? The place where I need these conversion functions is in marshalling/protocol reading and writing throughout the cloud platform, which implies that these functions will be called constantly, so I need them to be very fast. I always thought I had to split the entire marshalling process into small atomic (deferred-callback) functions to be efficient, but these figures tell me otherwise.

I really hope someone can help me out here.

Thanks in advance,
Best regards,
Dirk Moors
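The attached test harness itself is not shown in the thread. As a rough, stdlib-only sketch of the synchronous side of the benchmark described above (the use of `timeit` and the sample value 12345 are assumptions, not taken from the attachment):

```python
# Stdlib-only sketch of the synchronous benchmark described above.
# The attached twistedbenchmark.py is not shown in the thread, so the
# harness details (timeit, the sample value 12345) are assumptions.
import struct
import timeit

def int2binSync(anInteger):
    # Packs an integer; the result is 4 bytes
    return struct.pack("i", anInteger)

def bin2intSync(aBin):
    # Unpacks a byte string into an integer
    return struct.unpack("i", aBin)[0]

def one_run(n=12345):
    # 1 run = int -> bytes -> int, then verify the round-trip
    assert bin2intSync(int2binSync(n)) == n

# Time 100000 runs, matching the largest benchmark in the post
elapsed = timeit.timeit(one_run, number=100000)
print("100000 runs completed in %.6f seconds" % elapsed)
```

On a modern machine this finishes in well under a second, consistent with the synchronous numbers Dirk reports.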
Dirk,

Using a Deferred directly in your bin2intAsync() may be somewhat less efficient than the approach described in Recipe 439358, "[Twisted] From blocking functions to deferred functions" (http://code.activestate.com/recipes/439358/).

You would get the same effect (asynchronous execution), but potentially more efficiently, by just decorating your synchronous methods as:

from twisted.internet.threads import deferToThread
deferred = deferToThread.__get__
....

@deferred
def int2binAsync(anInteger):
    # Packs an integer; the result is 4 bytes
    return struct.pack("i", anInteger)

@deferred
def bin2intAsync(aBin):
    # Unpacks a byte string into an integer
    return struct.unpack("i", aBin)[0]

Kind regards,
Valeriy Pogrebitskiy
vpogrebi@verizon.net

On Oct 13, 2009, at 9:18 AM, Dirk Moors wrote:
> [Dirk's original message quoted in full; trimmed]
<twistedbenchmark.py>
_______________________________________________
Twisted-Python mailing list
Twisted-Python@twistedmatrix.com
http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
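The `deferred = deferToThread.__get__` trick in Valeriy's reply works because calling `f.__get__(g)` on a plain function `f` returns a bound method whose first argument is fixed to `g`, so the decorated name ends up invoking `f(g, *args)`. A minimal stdlib-only sketch of the mechanism, using a hypothetical `run_through` stand-in for deferToThread (which would instead run the function in a thread pool and return a Deferred):

```python
# Demonstrating the f.__get__ decorator trick with a plain function.
# `run_through` is a hypothetical stand-in for deferToThread: it just
# calls the wrapped function directly instead of deferring to a thread.

def run_through(func, *args):
    # Stand-in for deferToThread: simply call func with the given args
    return func(*args)

decorated = run_through.__get__   # the same trick as deferToThread.__get__

@decorated
def double(x):
    return x * 2

# `double` is now a bound method of run_through with the original
# function as its fixed first argument, so double(21) calls
# run_through(original_double, 21).
print(double(21))  # prints 42
```

With the real `deferToThread.__get__`, each call would likewise become `deferToThread(original_function, *args)`, returning a Deferred that fires with the function's result.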
Hi Dirk,

I took a look at your code sample and got the async benchmark to run with the following values:

*** Starting Asynchronous Benchmarks.
-> Asynchronous Benchmark (1 runs) Completed in 0.000181913375854 seconds.
-> Asynchronous Benchmark (10 runs) Completed in 0.000736951828003 seconds.
-> Asynchronous Benchmark (100 runs) Completed in 0.00641012191772 seconds.
-> Asynchronous Benchmark (1000 runs) Completed in 0.0741751194 seconds.
-> Asynchronous Benchmark (10000 runs) Completed in 0.675071001053 seconds.
-> Asynchronous Benchmark (100000 runs) Completed in 7.29738497734 seconds.
*** Asynchronous Benchmarks Completed in 8.16032314301 seconds.

Which, though still quite a bit slower than the synchronous version, is much better than the 40-second mark you were seeing. My modified version simply returns defer.succeed from your async block-compute functions. That is, instead of your initial example:

def int2binAsync(anInteger):
    def packStruct(i):
        # Packs an integer; the result is 4 bytes
        return struct.pack("i", i)

    d = defer.Deferred()
    d.addCallback(packStruct)
    reactor.callLater(0, d.callback, anInteger)
    return d

my version does:

def int2binAsync(anInteger):
    return defer.succeed(struct.pack('i', anInteger))

A few things to note in general, however:

1) Twisted shines for blocking I/O operations, i.e. networking. A compute-intensive process will not necessarily gain any performance from Twisted, since Python's GIL (a global lock) still exists.

2) If your computations use a C module (unfortunately, struct before Python 2.6 doesn't use one, I believe), there is a chance the C module releases the GIL, allowing you to run those computations in a thread. In that case you would be better off using deferToThread, as suggested earlier.

3) There is some overhead (usually minimal, but it exists) to using Twisted. Instead of computing a bunch of stuff serially and returning your answer as in your sync example, you're wrapping everything up in Deferreds and starting a reactor; it's definitely going to be a bit slower than the pure synchronous version for this case.

Hope that makes sense.

Cheers,
Reza

--
Reza Lotun
mobile: +44 (0)7521 310 763
email: rlotun@gmail.com
work: reza@tweetdeck.com
twitter: @rlotun
participants (3)
- Dirk Moors
- Reza Lotun
- Valeriy Pogrebitskiy