[Twisted-Python] Unable to write to "stuck" TCP client connections

Hi, I am encountering a weird bug where some TCP client connections get into a state where the server is able to read data sent from the client, but is not able to send any data with transport.write(). With some help from the #twisted IRC channel, I was able to gather the following information regarding the bug. While I'm still unable to provide steps to reproduce this bug, I am able to reliably find clients who are in this state.

I am running 24 instances of the Twisted server (epoll reactor) on Ubuntu, with a peak traffic of >130k users. At any given time, there are < 20 TCP connections stuck in this state. Here is some information about the bug:

1. transport.write() does not send anything down the socket.
2. transport.doWrite() will send all the data that has been buffered up, and then stop sending any new data.
3. transport.writeSomeData() will send data.
4. reactor.getWriters() will return a list of transports that are all stuck in this state, and the writers will remain in this list.
5. Calling reactor.removeWriter(transport) will "unstick" the transport and data gets streamed once again.
6. A small number of clients will receive data for a while, and return to this stuck state. Most return to normal once reactor.removeWriter() is called.
7. Based on the suggestion from IRC user _habnabit, I used strace after removing the writer; here is the output:

epoll_ctl(3, EPOLL_CTL_MOD, 6504, {EPOLLIN, {u32=6504, u64=22205092589476200}}) = 0
epoll_ctl(3, EPOLL_CTL_MOD, 6504, {EPOLLIN|EPOLLOUT, {u32=6504, u64=22205092589476200}}) = 0
epoll_ctl(3, EPOLL_CTL_MOD, 6504, {EPOLLIN, {u32=6504, u64=22205092589476200}}) = 0
epoll_ctl(3, EPOLL_CTL_MOD, 6504, {EPOLLIN|EPOLLOUT, {u32=6504, u64=22205092589476200}}) = 0
epoll_ctl(3, EPOLL_CTL_MOD, 6504, {EPOLLIN, {u32=6504, u64=22205092589476200}}) = 0

For now, I am using a LoopingCall to check for and remove transports that are stuck in getWriters() (a rough sketch of this is below). I am using Twisted 12.3.0 on Ubuntu 12.04 - 3.2.0-35-generic #55-Ubuntu SMP Wed Dec 5 17:42:16 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux.

Has anyone else experienced this weird problem? I'd love to provide more information regarding this bug.
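For reference, the workaround looks roughly like this. This is only a minimal sketch: the function name, the one-minute interval, and the "seen on two consecutive checks" heuristic are illustrative, and the heuristic could also remove writers that legitimately still have data buffered.

from twisted.internet import reactor, task

# Writers that were already registered at the previous check.
_suspected_stuck = set()

def unstick_stuck_writers():
    """Remove writers that have stayed in reactor.getWriters() across two checks."""
    global _suspected_stuck
    current = set(reactor.getWriters())
    for writer in current & _suspected_stuck:
        # Per observation 5 above, removeWriter() "unsticks" the transport.
        reactor.removeWriter(writer)
    # Anything registered right now becomes a suspect for the next check.
    _suspected_stuck = current

checker = task.LoopingCall(unstick_stuck_writers)
checker.start(60)  # interval in seconds; the value here is arbitrary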

On Feb 24, 2013, at 4:22 PM, Wenxiang Wu <wenxiang@zopim.com> wrote:
> Has anyone else experienced this weird problem? I'd love to provide more information regarding this bug.
I have seen bugs which _might_ be this problem, but I'm not sure. I don't think I've ever been able to reproduce it. Can you test with any reactors other than epoll? Do you have a test case which will reproduce it deterministically? -g
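P.S. In case it's useful: switching reactors is just a matter of installing one before anything else imports the default. A minimal sketch, using the poll reactor as an example:

# Must run before any other twisted.internet imports, otherwise the
# default (epoll) reactor gets installed first.
from twisted.internet import pollreactor
pollreactor.install()

from twisted.internet import reactor  # this is now the poll reactor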

Unfortunately, for performance reasons, I cannot test this with any other reactor. And the only time I have ever been able to reproduce this was when I left my client running overnight. However, due to the amount of traffic we are seeing, I am able to reliably find connections stuck in this state. I get 2-3 connections stuck every hour.

On Mon, Feb 25, 2013 at 3:25 PM, Glyph <glyph@twistedmatrix.com> wrote:
-- Wenxiang Wu VP Partying / Engineering www.zopim.com [US] +1 (408) 680-9345 [SG] +65 9457-5822

On Feb 25, 2013, at 4:17 PM, Wenxiang Wu <wenxiang@zopim.com> wrote:
> Unfortunately, for performance reasons, I cannot test this with any other reactor. And the only time I have ever been able to reproduce this was when I left my client running overnight.
Understandable; I've never managed to reliably reproduce it in a test environment either.
> However, due to the amount of traffic we are seeing, I am able to reliably find connections stuck in this state. I get 2-3 connections stuck every hour.
Do these connections share any obvious attributes? For example, are they to clients on some particular network? -g

From what I can tell, there isn't a specific pattern. But from this small sample size, users that run into this problem are those with higher levels of usage, i.e. connections with more traffic.

Also, this *seems* to affect users in the same LAN more. The 2 groups of affected users I worked closely with regarding this issue were both connected to the same WiFi router.

On Mon, Feb 25, 2013 at 4:36 PM, Glyph <glyph@twistedmatrix.com> wrote:
-- Wenxiang Wu VP Partying / Engineering www.zopim.com [US] +1 (408) 680-9345 [SG] +65 9457-5822

On Feb 25, 2013, at 10:24 PM, Wenxiang Wu <wenxiang@zopim.com> wrote:
> From what I can tell, there isn't a specific pattern. But from this small sample size, users that run into this problem are those with higher levels of usage, i.e. connections with more traffic.
> Also, this *seems* to affect users in the same LAN more. The 2 groups of affected users I worked closely with regarding this issue were both connected to the same WiFi router.
So, it sounds like it's happening with connections that are (A) faster and (B) shipping around more traffic. Smells like a possible race condition - and maybe not in Twisted. Is it possible for you to test with any other OS / kernel variations, to see if it behaves differently on other Linux versions, ideally some with changes to epoll?

I don't know epoll quite well enough to read the strace and tell what's going on. If those are the only calls being traced... there's no gettimeofday or send or recv calls? That suggests it's just thrashing, but if your workaround works, other stuff must be going on. Hmm. I guess I don't have a clear enough picture for any useful conjecture yet :).

Can you open a bug, and attach a more complete strace, assuming that the one with just the epoll calls was filtered?

-glyph

participants (5)
- Gelin Yan
- Glyph
- Itamar Turner-Trauring
- Itamar Turner-Trauring
- Wenxiang Wu