reducing multiprocessing.Queue contention

Hello,

Currently, the multiprocessing.Queue put() and get() methods hold locks for the entire duration of the write/read to the backing Connection (which can be a pipe, a Unix domain socket, or whatever it's called on Windows). For example, here's what the feeder thread does:

"""
else:
    wacquire()
    try:
        send(obj)
        # Delete references to object. See issue16284
        del obj
    finally:
        wrelease()
"""

Connection.send() and Connection.recv() have to serialize the data with pickle before writing it to the underlying file descriptor. While the locking is necessary to guarantee an atomic read/write (well, it's not necessary if you're writing less than PIPE_BUF to a pipe, and writes seem to be atomic on Windows), the locks don't have to be held while the data is being serialized.

Although I didn't make any measurements, my gut feeling is that this serialization can take a non-negligible part of the overall sending/receiving time for large data items. If that's the case, then holding the lock only for the duration of the read()/write() syscall (and not during serialization) could reduce contention when sending/receiving large data.

One way to do that would be to refactor the code a bit, perhaps providing a (private) AtomicConnection that would encapsulate the necessary locking. Another advantage is that this would hide the platform-dependent code inside Connection (right now, Queue only uses a lock for sending on Unix platforms, since writes are apparently atomic on Windows).

Thoughts?
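
Roughly, the sending side of such a (hypothetical) AtomicConnection could look like the following untested sketch; the class name, the length-prefix framing and the shared multiprocessing.Lock are assumptions of mine, not actual multiprocessing code:

"""
import multiprocessing
import os
import pickle
import struct

class AtomicConnection:
    # Hypothetical wrapper: pickling happens outside the lock; the lock is
    # only held while the bytes are actually written to the shared pipe.

    def __init__(self, wfd):
        self._wfd = wfd
        # Shared by all writers (inherited by child processes on fork).
        self._wlock = multiprocessing.Lock()

    def send(self, obj):
        # Serialize first: for large objects this is the expensive part,
        # and it doesn't need to exclude other writers.
        payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        data = struct.pack("!i", len(payload)) + payload
        with self._wlock:
            # Hold the lock only across the write syscalls, so messages
            # from concurrent writers can't interleave on the pipe.
            while data:
                written = os.write(self._wfd, data)
                data = data[written:]
"""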

On 23/01/2013 11:16am, Charles-François Natali wrote:
But you can only rely on the atomicity of writing less than PIPE_BUF bytes if you know that no other process is currently trying to send a message longer than PIPE_BUF. Otherwise the short message could be embedded in the long message (even if the process sending the long message is holding the lock). -- Richard

Maybe I wasn't clear. I'm not suggesting that the lock be dropped when sending less than PIPE_BUF, since that wouldn't work in the case you describe above. I'm suggesting that the data be serialized before acquiring the writer lock, to reduce contention (and deserialized after releasing the reader lock). (I only mentioned PIPE_BUF because I was sad to see that Windows supports atomic messages, and this comforted me a bit :-)
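
For the receiving side, the same idea in sketch form (again untested, with made-up names and framing): the lock is held only while the length-prefixed bytes are read from the pipe, and unpickling happens after it has been released:

"""
import os
import pickle
import struct

def atomic_recv(rlock, rfd):
    # Read the length-prefixed message while holding the reader lock...
    with rlock:
        (size,) = struct.unpack("!i", _read_exact(rfd, 4))
        payload = _read_exact(rfd, size)
    # ...but unpickle only after releasing it, so another reader can
    # already start pulling the next message off the pipe.
    return pickle.loads(payload)

def _read_exact(fd, n):
    # Read exactly n bytes from the file descriptor.
    buf = b""
    while len(buf) < n:
        chunk = os.read(fd, n - len(buf))
        if not chunk:
            raise EOFError("pipe closed before a full message was read")
        buf += chunk
    return buf
"""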

I was curious, so I wrote a quick and dirty patch (it doesn't support timed get()/put(), so I won't post it here). I used the attached script as a benchmark: basically, it just spawns a bunch of processes that repeatedly put()/get() some data to/from a queue (a list of 1024 ints, 10000 times), and returns when everything has been sent and received.

The following tests were run on an 8-core box, from 1 reader/1 writer up to 4 readers/4 writers (it would be interesting to try a single writer with multiple readers, but the readers would just keep waiting for input, so that requires another benchmark).

Without the patch:
"""
$ ./python /tmp/multi_queue.py
took 0.7993290424346924 seconds with 1 workers
took 1.8892168998718262 seconds with 2 workers
took 3.075777053833008 seconds with 3 workers
took 4.050479888916016 seconds with 4 workers
"""

With the patch:
"""
$ ./python /tmp/multi_queue.py
took 0.7730131149291992 seconds with 1 workers
took 0.7471320629119873 seconds with 2 workers
took 0.752316951751709 seconds with 3 workers
took 0.8303961753845215 seconds with 4 workers
"""
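
The attached script isn't reproduced here; purely to illustrate the setup described above, a benchmark along these lines (guessed names and structure, not the actual attachment) would exercise the same pattern:

"""
import multiprocessing
import time

N_ITEMS = 10000               # puts per writer
PAYLOAD = list(range(1024))   # a list of 1024 ints

def writer(q):
    for _ in range(N_ITEMS):
        q.put(PAYLOAD)

def reader(q):
    for _ in range(N_ITEMS):
        q.get()

def bench(n):
    q = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=writer, args=(q,)) for _ in range(n)]
    procs += [multiprocessing.Process(target=reader, args=(q,)) for _ in range(n)]
    start = time.time()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print("took %s seconds with %s workers" % (time.time() - start, n))

if __name__ == "__main__":
    for n in range(1, 5):
        bench(n)
"""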

participants (4)
- Antoine Pitrou
- Charles-François Natali
- Eli Bendersky
- Richard Oudkerk