[Twisted-Python] pushing out same message on 100k TCPs

Hi there, what is the most efficient/performant way of doing the following? I have a short message prepared, say a string of 100 octets, and I want to push out that _same_ string on 100k connected TCPs (on a server).

My thinking was: ideally, the 100 bytes would be transferred to kernel/NIC space just _once_, and then the kernel would only be told to resend that buffer on the 100k connected TCPs. Does that make sense, and is it even possible, with Twisted or in general?

Currently I do the naive thing: transport.write(msg) x 100k times .. which I guess involves quite a bit of copying around and user-to-kernel space transitions.

\Tobias
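[For reference, a minimal sketch of the naive approach described above, assuming a plain Twisted TCP server that tracks connected transports in a shared set; the Broadcast/BroadcastFactory names and the port are illustrative, not from the original post:]

    from twisted.internet import protocol, reactor

    clients = set()  # transports of all currently connected clients

    class Broadcast(protocol.Protocol):
        def connectionMade(self):
            clients.add(self.transport)

        def connectionLost(self, reason):
            clients.discard(self.transport)

    class BroadcastFactory(protocol.Factory):
        protocol = Broadcast

    def broadcast(msg):
        # The naive fan-out: one transport.write() per connection, i.e. one
        # buffered copy of msg queued in user space for every client.
        for transport in clients:
            transport.write(msg)

    reactor.listenTCP(9000, BroadcastFactory())
    reactor.run()

[Each transport.write() only appends to Twisted's per-connection send buffer; the actual send() syscalls then happen once per socket as the reactor flushes those buffers.]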

AFAIK that's not possible in TCP, only in multicast, and the kernel will make copies of that string no matter what, but I am highly unknowledgeable in this area. On 2/10/12, Tobias Oberstein <tobias.oberstein@tavendo.de> wrote:
-- When the facts change, I change my mind. What do you do, sir? ~ Keynes Corbin Simpson <MostAwesomeDude@gmail.com>

On 10/02/12 16:56, Tobias Oberstein wrote:
Not really, no. The problem is that TCP requires the sender of data to buffer it so that it can re-send. The only way to store one copy of the data whilst doing this would be to store the socket buffer as a (fairly complex) linked list of reference-counted blocks, and use scatter-gather IO to the network card.

So the kernel would have to copy the data 100k times anyway, to store it in the per-socket buffer until it was ACKed, or maintain a large and complex data structure so that it could use one copy. Therefore, by moving the work to the kernel, all you've done is consume valuable kernel memory in return for saving the syscall overhead. Classic space/time tradeoff.

If you were using UDP, then in theory this might be possible, but there are no APIs that I know of, except for multicast (where you only send one copy of the data, and the network duplicates it).

In short; this kind of thing seems easy and desirable but actually it's really hard and not useful.

Resends -- oh, yes, right. Forgot about that.
store the socket buffer as a (fairly complex) linked list of reference-counted blocks, and use scatter-gather IO to the network card.
Doesn't a (modern) kernel do something like that for virtual memory pages anyway? When processes mmap files or load shared objects, the kernel must keep track of which pages are needed by active processes. So in theory, if my message fits in 1 VM page, the kernel could pin that page (make it non-swappable etc.) until all TCP ACKs have arrived ..
In short; this kind of thing seems easy and desirable but actually it's really hard
Probably quite hard ..
and not useful.
If VM pages could be used (_if_ that is possible at all) and thus the data never duplicated, then why not useful? However, I can now see that it's likely outside of what can be done with a stock kernel .. if it can be done at all .. Thanks!

On 02/10/2012 08:20 PM, Tobias Oberstein wrote:
Possibly. My knowledge of kernel memory management is a lot more patchy than network stacks.

One option you could investigate, that I was going to suggest in my original reply but didn't have the time to write up, is the sendfile() API. If you write your message to a temporary file, you could call sendfile() on all 100k connections using the same file descriptor. So, something like:

    fd = os.open(PATH, os.O_RDWR)
    os.write(fd, message)
    os.unlink(PATH)
    for connection in biglist:
        connection.sendfile(fd, offset=0, len=100)
    os.close(fd)

Now, as I understand it, sendfile() will perform zero-copy IO; since the contents of the file will undoubtedly be in the page cache, it should in theory DMA the data straight from the (single copy of the) data in RAM to the NIC buffers. It should also handle refcounting for you - you unlink the filename after obtaining a descriptor, and close() the FD once you've called sendfile, and the kernel *should* in theory free the inode and the page containing the file data once all TCP ACKs have been received. You'll still have to make 100k syscalls, and you may find the kernel chooses to copy the data anyway.

However - AFAIK Twisted does not support sendfile(), and it can be tricky to make it work with non-blocking IO. :o(

You may also want to look at the splice(), vmsplice() and tee() syscalls added to recent Linux kernels. tee() in particular can copy data from pipe to pipe without consuming it, so it can be repeated multiple times. It may be possible to assemble something that will do this task efficiently from those building blocks, but the APIs aren't available in Twisted.
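[As a concrete illustration of the sendfile() idea above, a minimal blocking sketch using Python's os.sendfile (available since Python 3.3); the temporary path, socket list and error handling are assumptions, and partial sends on non-blocking sockets are deliberately ignored:]

    import os

    def broadcast_sendfile(sockets, message, path="/tmp/broadcast.msg"):
        # Write the message to a file and unlink it immediately; the page
        # cache then holds the single copy of the data, refcounted by the
        # kernel until the descriptor is closed and the data has been sent.
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.unlink(path)
        os.write(fd, message)

        try:
            for sock in sockets:
                # os.sendfile(out_fd, in_fd, offset, count): ask the kernel to
                # transmit the same page-cache pages on this socket. Still one
                # syscall per connection, and a short write is possible on
                # non-blocking sockets (not handled here).
                os.sendfile(sock.fileno(), fd, 0, len(message))
        finally:
            os.close(fd)

[This bypasses Twisted's transport layer, so in a real reactor-driven server it would have to be reconciled with the per-connection write buffers and producer/consumer flow control.]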
Sorry, I should have been more precise - it's probably not often useful. There are not very many applications where sending the same TCP stream to that many clients at the same time is helpful - realtime video/audio over TCP spring to mind, and typically those need to adapt to slow clients by dropping them to a lower rate, i.e. not the same stream any more. As Glyph has mentioned, encryption is also a factor on today's internet. I'm kind of curious about what your application is!

I see. So using sendfile .. probably with the message as a file on RAMFS .. or using the Linux syscalls you mention below, it _might_ be possible to avoid the copy overhead, but not the context-switching overhead .. ok.
However - AFAIK Twisted does not support sendfile(), and it can be tricky to make it work with non-blocking IO.
;( Apart from that, we're on FreeBSD .. I guess there are similar syscalls (maybe with slightly different semantics) there also.
Thanks a lot! This is all very interesting .. from the "tee" man page:

"""
Though we talk of copying, actual copies are generally avoided. The kernel does this by implementing a pipe buffer as a set of reference-counted pointers to pages of kernel memory. The kernel creates "copies" of pages in a buffer by creating new pointers (for the output buffer) referring to the pages, and increasing the reference counts for the pages: only pointers are copied, not the pages of the buffer.
"""

Which sounds a lot like the refcounting you were talking about in your other reply ..

For reference, these guys are talking about PACKET_MMAP:

http://www.linuxquestions.org/questions/programming-9/vectored-write-to-many...
http://dank.qemfd.net/dankwiki/index.php/Fast_UNIX_Servers

The former (very end of page) claims that it achieves zero-copy (which I get), and also claims you could reduce the context-switch overhead for the 1-msg-TX-to-many-clients case .. which I can't see how it's done.
The application is PubSub over WebSockets with massive numbers of clients .. Application message payloads are short (<1k) and JSON/UTF-8. Those are then framed into WebSocket messages (which basically means prepending a WS frame header).
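[For context, "prepending a WS frame header" for a short server-to-client text message looks roughly like this under RFC 6455 - a sketch only; the payload is made up, and server-to-client frames are sent unmasked:]

    import struct

    def ws_text_frame(payload):
        # FIN=1, opcode=0x1 (text) -> first byte 0x81; no mask bit is set
        # for server-to-client frames.
        n = len(payload)
        if n < 126:
            header = bytes([0x81, n])
        elif n < 2 ** 16:
            header = bytes([0x81, 126]) + struct.pack("!H", n)
        else:
            header = bytes([0x81, 127]) + struct.pack("!Q", n)
        return header + payload

    frame = ws_text_frame(b'{"topic": "news", "body": "hello"}')

[Since the header depends only on payload length and type, the framing can be done once per published message and the resulting bytes reused for every connection.]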

On Feb 10, 2012, at 12:49 PM, Phil Mayers wrote:
Not to mention the fact that inevitably, you probably are going to want some security on those connections, which means TLS, which means individual crypto connections.

I believe the best model for this kind of high-volume reliable-multicast-over-unicast is a spanning tree, like what IRC uses for server-to-server communication. If you have 100,000 concurrent, active connections, you're already probably beyond the CPU constraints of a single machine. So have your clients connect (via round-robin DNS, or whatever other mechanism makes sense) to a variety of different servers, then have the servers connected to each other in a pattern such that one server tells ten friends, and they each tell ten friends, and so on, until each server is only responsible for spewing out data to a reasonable number of connections (5000 maybe?).

-glyph
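[A minimal sketch of the tell-ten-friends structure Glyph describes; the names, the fan-out factor and the in-process representation are illustrative assumptions, not a prescribed design:]

    class RelayNode:
        """One node in the distribution tree: it forwards each message to a
        bounded set of downstream relay nodes and to its own clients."""

        def __init__(self, downstream=(), clients=()):
            self.downstream = list(downstream)  # child RelayNode instances
            self.clients = list(clients)        # transports of local clients

        def publish(self, msg):
            for node in self.downstream:        # e.g. ~10 peers per node
                node.publish(msg)
            for transport in self.clients:      # e.g. a few thousand clients
                transport.write(msg)

[With a fan-out of 10 and a few thousand clients per leaf, a tree only a couple of levels deep already covers hundreds of thousands of connections, and no single node ever writes to more than its own bounded share.]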

On 02/10/2012 09:54 PM, Glyph Lefkowitz wrote:
I believe the best model for this kind of high-volume reliable-multicast-over-unicast is a spanning tree, like what IRC
For what it's worth, real IP multicast is quite commonly used for distributing short messages to many clients in realtime in some closed networks, such as financial trading systems. With good network equipment that can handle low- or zero-loss timely delivery, it does work very well. Over the internet - less well, and even if your target networks are multicast-enabled, something like NORM or FEC is required to fill in the gaps, which inevitably adds latency. Add to which the fact that you want your protocol to be congestion-control compatible, which is seriously challenging with real multicast, and if you want crypto you've got to have some kind of group ciphering system..
I agree, this is probably the best model at the moment. It has the advantage that if you need to scale globally, you can push the "caching" nodes closer to the clients, and use geography-aware routing to push clients onto closer nodes.

Emphasis on "good network equipment" ;) Yes, stuff that a) even supports IP Multicast and b) has a very low loss rate .. If possible, I'd like to avoid IP Multicast .. just a gut feeling (I have no experience with it in practice) .. and apart from that, we currently just don't have IP-Multicast-aware switches deployed ;(

If there is a need for encryption, then yes .. but it's not always needed.
I believe the best model for this kind of high-volume reliable-multicast-over-unicast is a spanning tree, like what IRC uses for server-to-server
Ok. Makes sense. I see .. one could have frontend nodes which act as aggregators, condensing the messages from many incoming connections into 1 uplink TCP connection to a backend node .. or a hierarchy. This is quite interesting .. for I am now beginning to see why Google is pushing the discussion on the WebSocket multiplexing extension on Hybi ;) Until now, I thought that would be primarily because Chrome has a process-per-tab design .. but that's probably only one reason. The cloud side being the other one ..
communication. If you have 100,000 concurrent, active connections, you're already probably beyond the CPU constraints of a single machine. So have your
Not really. I've been testing up to 180k connections on a 2-core virtual machine, and when doing PubSub over those (dispatching 1 msg to all 180k), CPU load isn't maxed out .. but I'm hitting a limit of around 20k dispatched msgs/s. My suspicion is that it's more interrupt/syscall overhead bound ..
5k is much too low .. that would require too many servers. My target is 200k connections per node, with as high a msgs/s dispatch rate as possible. A design with frontend nodes aggregating TCP connections is probably inevitable .. Thanks!

Participants (4): Corbin Simpson, Glyph Lefkowitz, Phil Mayers, Tobias Oberstein