Performance issues of socket.recv_into()

I have been testing the network I/O performance of Python and PyPy. I used the small script in the attachment to test TCP bandwidth between two powerful servers; each has two 10Gb NICs with multi-queue hardware support bonded together, so the physical bandwidth between the servers is 20Gbps. The script is a simple socket client/server with multiple connections on multiple threads. The server tries its best to receive all data, and the client tries its best to send data; after 30 seconds, the client stops and calculates the data sent.

There is a huge performance difference depending on whether the server uses PyPy or CPython (python = CPython, pypy = PyPy, 100.102.4.7 is my test server IP address, and 32 is the number of concurrent connections).

With python testio.py -s on the other server:

  pypy testio.py -c 100.102.4.7 32
  total time = 30.05s, total send = 66601 MB, speed = 17.31 Gbps
  python testio.py -c 100.102.4.7 32
  total time = 30.08s, total send = 67008 MB, speed = 17.40 Gbps

With pypy testio.py -s on the other server:

  pypy testio.py -c 100.102.4.7 32
  total time = 30.36s, total send = 5838 MB, speed = 1.50 Gbps
  python testio.py -c 100.102.4.7 32
  total time = 30.36s, total send = 5742 MB, speed = 1.48 Gbps

But when I change

  while s.recv_into(recv_buf, 0x200000):

to

  while s.recv(0x200000):

the performance difference disappears, and both the PyPy server and the CPython server perform well (17+ Gbps).

It is worth pointing out that when using socket.recv_into(), the PyPy server only uses 100% CPU (one core), while CPython uses much more. Maybe there are some unexpected issues with socket.recv_into(), e.g. the GIL not being released?

P.S. It is interesting that although I thought recv_into() should be more efficient than recv(), since it reduces extra object creation/destruction, the test results show that recv() outperforms recv_into(), even with CPython. With CPython, the server with recv() seems to cost less CPU time than with recv_into(), while having the same I/O performance.

2016-08-08 hubo
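Since the attachment is not included in the archive, here is a minimal loopback sketch of the receive loop described above. The helper names and the socketpair transport are my own simplifications; the real testio.py runs over TCP between two machines with many connections and threads:

```python
import socket
import threading

BUF_SIZE = 0x200000  # 2 MiB receive buffer, as in the original test

def run_server(sock, result):
    # Receive until the peer closes, counting bytes -- the same shape
    # as the "while s.recv_into(recv_buf, 0x200000):" loop in testio.py.
    recv_buf = bytearray(BUF_SIZE)
    total = 0
    while True:
        n = sock.recv_into(recv_buf, BUF_SIZE)
        if not n:
            break
        total += n
    result.append(total)

def run_client(sock, payload, repeats):
    # Send the payload repeatedly, then close so the server loop ends.
    for _ in range(repeats):
        sock.sendall(payload)
    sock.close()

# Loopback demo with a socketpair instead of real 10Gb NICs.
server_sock, client_sock = socket.socketpair()
received = []
t = threading.Thread(target=run_server, args=(server_sock, received))
t.start()
run_client(client_sock, b"x" * 65536, 16)
t.join()
server_sock.close()
print(received[0])  # 1048576 bytes = 16 * 64 KiB
```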

By the way, the result also reproduces on localhost (127.0.0.1), and changing the ctypes.create_string_buffer to a bytearray does not change the result.

2016-08-08 hubo

From: "hubo" <hubo@jiedaibao.com>
Sent: 2016-08-08 22:00
Subject: [pypy-dev] Performance issues of socket.recv_into()
To: "PyPy Developer Mailing List" <pypy-dev@python.org>

Hi,

On 8 August 2016 at 16:09, hubo <hubo@jiedaibao.com> wrote:
As you found out, recv_into() is not better than recv() in your use case. (There are other use cases where it can be useful, e.g. to receive into the middle of a buffer where there is already some data at the beginning of the buffer and you want all the data concatenated.)

Indeed, the current PyPy implementation of recv_into() is definitely bad. It just does a regular recv(), and then it manually copies the data into the buffer! So it always did one more copy of the data than recv(). I just fixed it in e53ea5c9c384. Now in PyPy (like in CPython), both recv() and recv_into() don't copy the received data at all. The difference between the two in this case is completely lost in the noise: it's the difference between doing a malloc or not, which costs nothing compared to transferring 2MB of data from the kernel to userspace.

A bientôt,

Armin.
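The "middle of a buffer" use case mentioned above can be done without any extra copies by passing a memoryview slice to recv_into(), so the new bytes land right after the data already present. A minimal loopback sketch (the socketpair is just for demonstration):

```python
import socket

a, b = socket.socketpair()
b.sendall(b"world")

buf = bytearray(64)
buf[:5] = b"hello"          # data already at the start of the buffer
filled = 5

# recv_into() a memoryview slice: the kernel writes directly after
# the existing data, with no intermediate bytes object or copy.
n = a.recv_into(memoryview(buf)[filled:], 16)
filled += n

print(bytes(buf[:filled]))  # b'helloworld'
a.close()
b.close()
```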

Unfortunately, "to receive into the middle of a buffer where there is already some data at the beginning of the buffer and you want all the data concatenated" is exactly what I really want... This test script is just a simplified version. So I was really sad to find out that recv_into() is not very efficient.

What I am really interested in is that it seems to work exactly like two lines of Python code that first do a recv() and then copy the data into the buffer, which means the copy is probably done with the GIL held. I think that is the source of the huge performance difference: in CPython, recv_into() executes with the GIL released, so multiple threads can use multiple CPUs, which improves the performance a lot.

2016-08-09 hubo

From: Armin Rigo <arigo@tunes.org>
Sent: 2016-08-09 17:30
Subject: Re: [pypy-dev] Performance issues of socket.recv_into()
To: "hubo" <hubo@jiedaibao.com>
Cc: "PyPy Developer Mailing List" <pypy-dev@python.org>
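The slow path hubo suspects, a plain recv() followed by a Python-level copy into the target buffer, can be sketched like this (a hypothetical reconstruction for illustration, not PyPy's actual source):

```python
import socket

def slow_recv_into(sock, buf, nbytes):
    # recv() allocates a fresh bytes object...
    data = sock.recv(nbytes)
    # ...and the copy into the caller's buffer happens in Python-level
    # code, i.e. with the GIL held, so concurrent receiving threads
    # serialize on it instead of copying in parallel.
    buf[:len(data)] = data
    return len(data)

a, b = socket.socketpair()
b.sendall(b"payload")
buf = bytearray(32)
n = slow_recv_into(a, buf, 32)
print(bytes(buf[:n]))  # b'payload'
a.close()
b.close()
```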

Thanks a lot, I'm really looking forward to it!

2016-08-09 hubo

From: Armin Rigo <arigo@tunes.org>
Sent: 2016-08-09 21:10
Subject: Re: [pypy-dev] Performance issues of socket.recv_into()
To: "hubo" <hubo@jiedaibao.com>
Cc: "PyPy Developer Mailing List" <pypy-dev@python.org>

Hi,

On 9 August 2016 at 15:08, hubo <hubo@jiedaibao.com> wrote:
As I said: I have fixed it now. The next release of PyPy will contain an efficient recv_into().

A bientôt,

Armin.

participants (3):
- Armin Rigo
- hubo
- Vincent Legoll