
Aahz wrote:
Maybe that's how it seems to you; to others of us who have been looking at this problem for a while, the real question is how to get a better multi-process control and IPC library in Python, preferably one that is cross-platform. You can investigate that right now, and you don't even need to discuss it with other people.
(Despite my oft-stated fondness for threading, I do recognize the problems with threading, and if there were a way to make processes as simple as threads from a programming standpoint, I'd be much more willing to push processes.)
<plug> The processing package at http://cheeseshop.python.org/pypi/processing is multi-platform and mostly follows the API of threading. It also allows use of 'shared objects' which live in a manager process. For example, the following code is almost identical to the equivalent written with threads:

    from processing import Process, Manager

    def f(q):
        for i in range(10):
            q.put(i*i)
        q.put('STOP')

    if __name__ == '__main__':
        manager = Manager()
        queue = manager.Queue(maxsize=3)
        p = Process(target=f, args=[queue])
        p.start()

        result = None
        while result != 'STOP':
            result = queue.get()
            print result

        p.join()

Josiah wrote:
Without work, #2 isn't "fast" using processes in Python. It is trivial using threads. But here's the thing: with work, #2 can be made fast. Using Unix domain sockets (on Linux, 3.4 GHz P4 Xeons, DDR2-PC4200 memory (you can get 50% faster memory nowadays)), I've been able to push 400 megs/second between processes. Maybe anonymous or named pipes, or perhaps a shared mmap with some sort of synchronization, would allow for IPC that is cross-platform and just about as fast.
The IPC uses sockets or (on Windows) named pipes. Linux and Windows are roughly equal in speed. On a P4 2.5 GHz laptop one can retrieve an element from a shared dict about 20,000 times/sec. Not sure if that qualifies as fast enough. </plug>

Richard

"Richard Oudkerk" <r.m.oudkerk@googlemail.com> wrote:
Josiah wrote:
Without work, #2 isn't "fast" using processes in Python. It is trivial using threads. But here's the thing: with work, #2 can be made fast. Using Unix domain sockets (on Linux, 3.4 GHz P4 Xeons, DDR2-PC4200 memory (you can get 50% faster memory nowadays)), I've been able to push 400 megs/second between processes. Maybe anonymous or named pipes, or perhaps a shared mmap with some sort of synchronization, would allow for IPC that is cross-platform and just about as fast.
The IPC uses sockets or (on Windows) named pipes. Linux and Windows are roughly equal in speed. On a P4 2.5 GHz laptop one can retrieve an element from a shared dict about 20,000 times/sec. Not sure if that qualifies as fast enough.
Depends on what the element is, but I suspect it isn't fast enough. Fairly large native dictionaries seem to run on the order of 1.3 million fetches/second on my 2.8 GHz machine:

    >>> import time
    >>> d = dict.fromkeys(xrange(65536))
    >>> if 1:
    ...     t = time.time()
    ...     for j in xrange(1000000):
    ...         _ = d[j&65535]
    ...     print 1000000/(time.time()-t)
    ...
    1305482.97346
    >>>

But really, transferring little bits of data back and forth isn't my concern in terms of speed. My real concern is transferring nontrivial blocks of data; I usually benchmark blocks of sizes 1k, 4k, 16k, 64k, 256k, 1M, 4M, 16M, and 64M. Those are usually pretty good for discovering the "sweet spot" of a particular implementation, and they also let a person discover whether or not their system can be used for nontrivial processor loads.

- Josiah
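PS. For concreteness, here is a rough sketch of the kind of block-transfer benchmark I mean. It is Unix-only (it relies on os.fork and socket.socketpair), and the block sizes and total volume are just illustrative:

    import os, socket, time

    BLOCK_SIZES = [1<<10, 4<<10, 16<<10, 64<<10, 256<<10,
                   1<<20, 4<<20, 16<<20, 64<<20]
    TOTAL = 256 << 20   # push ~256 MB through for each block size

    parent, child = socket.socketpair()
    if os.fork() == 0:
        # child process: just consume everything the parent sends
        parent.close()
        while child.recv(65536):
            pass
        os._exit(0)

    child.close()
    for size in BLOCK_SIZES:
        block = 'x' * size
        count = TOTAL // size
        t = time.time()
        for i in xrange(count):
            parent.sendall(block)
        elapsed = time.time() - t
        print '%10d byte blocks: %6.1f MB/s' % (size, TOTAL / elapsed / 1e6)
    parent.close()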

On 3/25/07, Josiah Carlson <jcarlson@uci.edu> wrote:
But really, transferring little bits of data back and forth isn't my concern in terms of speed. My real concern is transferring nontrivial blocks of data; I usually benchmark blocks of sizes 1k, 4k, 16k, 64k, 256k, 1M, 4M, 16M, and 64M. Those are usually pretty good for discovering the "sweet spot" of a particular implementation, and they also let a person discover whether or not their system can be used for nontrivial processor loads.
Not directly relevant to the discussion, but I recently attended a talk by the main developer of STXXL (http://stxxl.sourceforge.net/), an STL-compatible library for handling huge volumes of data. The keys to efficient processing are support for parallel disks, explicit overlapping of I/O and computation, and I/O pipelining. More details are available at http://i10www.ira.uka.de/dementiev/stxxl/report/.

George
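PS. The overlap idea translates naturally to Python. Here is a toy sketch in which a background thread reads blocks from disk while the main thread computes; the names and buffer sizes are made up for illustration and have nothing to do with STXXL's actual API:

    import threading, Queue

    def reader(path, q, blocksize=1<<20):
        # producer: read fixed-size blocks from disk while the consumer computes
        f = open(path, 'rb')
        try:
            while True:
                block = f.read(blocksize)
                q.put(block)
                if not block:   # empty string signals end of file
                    break
        finally:
            f.close()

    def process(path):
        q = Queue.Queue(maxsize=4)   # small bound: limits memory, keeps both sides busy
        t = threading.Thread(target=reader, args=(path, q))
        t.start()
        total = 0
        while True:
            block = q.get()
            if not block:
                break
            total += len(block)      # stand-in for real computation
        t.join()
        return total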

On 26/03/07, Josiah Carlson <jcarlson@uci.edu> wrote:
But really, transferring little bits of data back and forth isn't my concern in terms of speed. My real concern is transferring nontrivial blocks of data; I usually benchmark blocks of sizes 1k, 4k, 16k, 64k, 256k, 1M, 4M, 16M, and 64M. Those are usually pretty good for discovering the "sweet spot" of a particular implementation, and they also let a person discover whether or not their system can be used for nontrivial processor loads.
The "20,000 fetches/sec" was just for retreving a "small" object (an integer), so it only really reflects the server overhead. (Sending integer objects directly between processes is maybe 6 times faster.) Fetching string objects of particular sizes from a shared dict gives the following results on the same computer: string size fetches/sec throughput ----------- ----------- ---------- 1 kb 15,000 15 Mb/s 4 kb 13,000 52 Mb/s 16 kb 8,500 130 Mb/s 64 kb 1,800 110 Mb/s 256 kb 196 49 Mb/s 1 Mb 50 50 Mb/s 4 Mb 13 52 Mb/s 16 Mb 3.2 51 Mb/s 64 Mb 0.84 54 Mb/s

"Richard Oudkerk" <r.m.oudkerk@googlemail.com> wrote:
On 26/03/07, Josiah Carlson <jcarlson@uci.edu> wrote:
But really, transferring little bits of data back and forth isn't my concern in terms of speed. My real concern is transferring nontrivial blocks of data; I usually benchmark blocks of sizes 1k, 4k, 16k, 64k, 256k, 1M, 4M, 16M, and 64M. Those are usually pretty good for discovering the "sweet spot" of a particular implementation, and they also let a person discover whether or not their system can be used for nontrivial processor loads.
The "20,000 fetches/sec" was just for retreving a "small" object (an integer), so it only really reflects the server overhead. (Sending integer objects directly between processes is maybe 6 times faster.)
That's a positive sign.
Fetching string objects of particular sizes from a shared dict gives the following results on the same computer:
Those numbers look pretty good. Would I be correct in assuming that there is a speedup sending blocks directly between processes? (though perhaps not the 6x that integer sending gains) I will definitely have to dig deeper; this could be the library that we've been looking for.

- Josiah

On 27/03/07, Josiah Carlson <jcarlson@uci.edu> wrote:
Those numbers look pretty good. Would I be correct in assuming that there is a speedup sending blocks directly between processes? (though perhaps not the 6x that integer sending gains)
Yes, sending blocks directly between processes is over 3 times faster for 1k blocks, and twice as fast for 4k blocks, but after that it makes little difference. (This is using the 'processing.connection' sub-package which is partly written in C.) Of course, since these blocks are string data you can avoid the pickle translation, which makes things faster still: the peak bandwidth I get is 40,000 x 16k blocks/sec = 630 Mb/s.

PS. It would be nice if the standard library had support for sending message-oriented data over a connection, so that you could just do 'recv()' and 'send()' without worrying about whether the whole message was successfully read/written. You can use 'socket.makefile()' for line-oriented text messages but not for binary data.

"Richard Oudkerk" <r.m.oudkerk@googlemail.com> wrote:
On 27/03/07, Josiah Carlson <jcarlson@uci.edu> wrote:
Those numbers look pretty good. Would I be correct in assuming that there is a speedup sending blocks directly between processes? (though perhaps not the 6x that integer sending gains)
Yes, sending blocks directly between processes is over 3 times faster for 1k blocks, and twice as fast for 4k blocks, but after that it makes little difference. (This is using the 'processing.connection' sub-package which is partly written in C.)
I'm surprised that larger objects see little gain from removing an encoding/decoding step and an extra transfer.
Of course since these blocks are string data you can avoid the pickle translation which makes things get faster still: the peak bandwidth I get is 40,000 x 16k blocks / sec = 630 Mb/s.
Very nice.
PS. It would be nice if the standard library had support for sending message-oriented data over a connection, so that you could just do 'recv()' and 'send()' without worrying about whether the whole message was successfully read/written. You can use 'socket.makefile()' for line-oriented text messages but not for binary data.
Well, there's also the problem that sockets, files, and pipes behave differently on Windows. If one is only concerned with sockets, there are various lightly defined protocols that can be simply implemented on top of asyncore/asynchat; among them is sending a 32-bit length field in network byte order, followed immediately by the data. Taking some of those methods and tossing them into a synchronous sockets package wouldn't be terribly difficult (I've done a variant of this for a commercial project). Doing this generally may not find support, though; my idea of sharing encoding/decoding/internal state transitions/etc. between sync and async servers was shot down at least a year ago.

- Josiah
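PS. To make that framing concrete, here is a minimal sketch over a plain socket (send_msg, recv_msg, and recv_exactly are illustrative names, not an existing API):

    import struct

    def send_msg(sock, data):
        # 32-bit length prefix in network byte order, then the payload
        sock.sendall(struct.pack('!I', len(data)) + data)

    def recv_exactly(sock, n):
        # keep reading until exactly n bytes have arrived
        chunks = []
        while n > 0:
            chunk = sock.recv(n)
            if not chunk:
                raise EOFError('connection closed mid-message')
            chunks.append(chunk)
            n -= len(chunk)
        return ''.join(chunks)

    def recv_msg(sock):
        (length,) = struct.unpack('!I', recv_exactly(sock, 4))
        return recv_exactly(sock, length)

With this, recv_msg() always returns a whole message or raises an error, no matter how the underlying stream fragments the bytes.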

On 28/03/07, Josiah Carlson <jcarlson@uci.edu> wrote:
Well, there's also the problem that sockets, files, and pipes behave differently on Windows.
Windows named pipes have a native message mode.
If one is only concerned with sockets, there are various lightly defined protocols that can be simply implemented on top of asyncore/asynchat; among them is sending a 32-bit length field in network byte order, followed immediately by the data.
That's exactly what I was doing.