Buildbot timing out - test suite failure - test_socket issue with UDP6?
I just had a major crash on the system that hosts the angelico-debian-amd64 buildbot, and as usual, checked it carefully after bringing everything up. It seems now to be timing out after an hour of operation: http://buildbot.python.org/all/builders/AMD64%20Debian%20root%203.x/builds/3... This is happening across all the 3.* branches on my buildbot, but NOT on 2.7, and not on any other buildbots. That makes it look like some kind of config problem at my end. In seeking to diagnose it, I duplicated the 3.x build directory and manually ran the commands to run a test, and it stalled out (I didn't let it go to the whole hour but it sat there for some minutes) at the same place: in test_socket. Running just that test file: $ ./python Lib/test/test_socket.py ... chomp lots of lines ... testRecvmsgPeek (__main__.RecvmsgUDP6Test) ... seems to indicate that the stall is due to IPv6 and UDP. The VM should have full IPv6 support, although my ISPs don't carry IPv6 traffic, so it won't be able to reach the internet proper; but it should be able to do all manner of local tests. The test runs just fine on my main system, so it's only failing in the buildbot VM. Any ideas as to what's going on or how to diagnose? By the way, this looks odd: make buildbottest TESTOPTS= TESTPYTHONOPTS= TESTTIMEOUT=3600 in dir /root/buildarea/3.x.angelico-debian-amd64/build (timeout 3900 secs) The parameter says 3600 (which corresponds to the error message at the end), but it echoes back that the timeout is 3900 seconds. ChrisA
On Sat, Jan 23, 2016 at 12:03 AM, Chris Angelico
By the way, this looks odd:
make buildbottest TESTOPTS= TESTPYTHONOPTS= TESTTIMEOUT=3600 in dir /root/buildarea/3.x.angelico-debian-amd64/build (timeout 3900 secs)
The parameter says 3600 (which corresponds to the error message at the end), but it echoes back that the timeout is 3900 seconds.
I'm no help on the main issue, but to explain the timeout difference: TESTTIMEOUT is a makefile variable that sets the faulthandler timeout that tries to make Python bail out with a stack trace instead of letting buildbot kill Python silently. The 3900 second timeout is buildbot's "there's been no output in this long, assume it's hung and kill it" timeout. The difference between the two is to give faulthandler a chance to do its thing. -- Zach
On Sat, Jan 23, 2016 at 5:39 PM, Zachary Ware
On Sat, Jan 23, 2016 at 12:03 AM, Chris Angelico
wrote: By the way, this looks odd:
make buildbottest TESTOPTS= TESTPYTHONOPTS= TESTTIMEOUT=3600 in dir /root/buildarea/3.x.angelico-debian-amd64/build (timeout 3900 secs)
The parameter says 3600 (which corresponds to the error message at the end), but it echoes back that the timeout is 3900 seconds.
I'm no help on the main issue, but to explain the timeout difference: TESTTIMEOUT is a makefile variable that sets the faulthandler timeout that tries to make Python bail out with a stack trace instead of letting buildbot kill Python silently. The 3900 second timeout is buildbot's "there's been no output in this long, assume it's hung and kill it" timeout. The difference between the two is to give faulthandler a chance to do its thing.
Ah, cool. That's one mystery cleared up, at least. ChrisA
3600 seconds is the maximum duration of a single test file. We may reduce it since a single test file should not take longer than 30 min. Maybe we can do better and put the timeout on a single test function. Victor
On Sun, Jan 24, 2016 at 1:37 AM, Victor Stinner
3600 seconds is the maximum duration of a single test file. We may reduce it since a single test file should not take longer than 30 min. Maybe we can do better and put the timeout on a single test function.
I'd be inclined to put some strong timeouts on test_socket.py (at some level or other of granularity). Most of those tests should be finished in a fraction of a second; a few might take a few seconds, maybe. None should take a whole minute. But they might easily sit around that long. But I'd rather know what I messed up in my recreation of the VM's configs. ChrisA
2016-01-23 7:03 GMT+01:00 Chris Angelico
I just had a major crash on the system that hosts the angelico-debian-amd64 buildbot, and as usual, checked it carefully after bringing everything up. It seems now to be timing out after an hour of operation:
http://buildbot.python.org/all/builders/AMD64%20Debian%20root%203.x/builds/3...
I opened http://bugs.python.org/issue26206 to track this issue.
Running just that test file:
$ ./python Lib/test/test_socket.py ... chomp lots of lines ... testRecvmsgPeek (__main__.RecvmsgUDP6Test) ...
seems to indicate that the stall is due to IPv6 and UDP. The VM should have full IPv6 support, although my ISPs don't carry IPv6 traffic, so it won't be able to reach the internet proper; but it should be able to do all manner of local tests.
Try to apply attached patch and run:
$ ./python -m test -v -m testRecvmsgPeek test_socket
(...)
testRecvmsgPeek (test.test_socket.RecvmsgUDP6Test) ... CLI SOCK
On Wed, Jan 27, 2016 at 8:39 PM, Victor Stinner
2016-01-23 7:03 GMT+01:00 Chris Angelico
: Running just that test file:
$ ./python Lib/test/test_socket.py ... chomp lots of lines ... testRecvmsgPeek (__main__.RecvmsgUDP6Test) ...
seems to indicate that the stall is due to IPv6 and UDP. The VM should have full IPv6 support, although my ISPs don't carry IPv6 traffic, so it won't be able to reach the internet proper; but it should be able to do all manner of local tests.
Try to apply attached patch and run:
$ ./python -m test -v -m testRecvmsgPeek test_socket (...) testRecvmsgPeek (test.test_socket.RecvmsgUDP6Test) ... CLI SOCK
SERV SOCK CLI SOCK ('::1', 44347, 0, 0) SERV SOCK ('::1', 40488, 0, 0) ok testRecvmsgPeek (test.test_socket.RecvmsgIntoUDP6Test) ... CLI SOCK SERV SOCK CLI SOCK ('::1', 52721, 0, 0) SERV SOCK ('::1', 43967, 0, 0) ok (...) As you can see: the test uses the local loopback interface. Inet6TestBase.host is "::1".
Confirmed. It does two tests of IPv4 which run just fine, and then:
testRecvmsgPeek (test.test_socket.RecvmsgUDP6Test) ... CLI SOCK
You can try to run a UDP server using netcat: "nc -l -u ::1 12345". Keep the command running in a terminal, and then run in a different terminal: "echo abc | nc -u ::1 12345". You should receive abc in the server.
(Oddly, the default netcat-traditional doesn't seem to support this, but installing netcat-openbsd adds IPv6 support.) Yep, and that works flawlessly. It's nothing weird about that particular port, either - nc can use 42421 without a problem. After digging through test_socket.py for over an hour (the MRO for RecvmsgUDP6Test is enormous!!), I've boiled the issue down to this: import socket MSG = b'asdf qwer zxcv' serv = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) serv.bind(("::1", 0)) cli = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) cli.bind(("::1", 0)) cli.sendto(MSG, serv.getsockname()) print(serv.recvmsg(len(MSG) - 3, 0, socket.MSG_PEEK)) print(serv.recvmsg(len(MSG), 0, socket.MSG_PEEK)) print(serv.recvmsg(len(MSG))) On my main system, this produces three lines of output: the first has truncated text, the second has full text, and the third also has full text. This proves that MSG_PEEK is working correctly. On the buildbot, though, the first one stalls out. Commenting that line out produces correct results - peek the full data, then read it, and all is well. Any idea why partial read on a datagram socket would sometimes stall? ChrisA
After digging through test_socket.py for over an hour (the MRO for RecvmsgUDP6Test is enormous!!), I've boiled the issue down to this:
import socket MSG = b'asdf qwer zxcv' serv = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) serv.bind(("::1", 0)) cli = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) cli.bind(("::1", 0)) cli.sendto(MSG, serv.getsockname()) print(serv.recvmsg(len(MSG) - 3, 0, socket.MSG_PEEK)) print(serv.recvmsg(len(MSG), 0, socket.MSG_PEEK)) print(serv.recvmsg(len(MSG)))
On my main system, this produces three lines of output: the first has truncated text, the second has full text, and the third also has full text. This proves that MSG_PEEK is working correctly. On the buildbot, though, the first one stalls out. Commenting that line out produces correct results - peek the full data, then read it, and all is well.
Any idea why partial read on a datagram socket would sometimes stall?
I think it would stall if there is no data to receive. Maybe check the return value of sendto(), to ensure it is sending the whole message. Attached is a C program which should do the equivalent of your boiled-down Python script, in case that helps: $ gcc -Wall peek-udp6.c -o peek-udp6 $ ./peek-udp6 Bytes sent: 14 Received [asdf qwer z] Received [asdf qwer zxcv] Received [asdf qwer zxcv] Other things that come to mind are to see if there is anything odd about the buildbot’s Linux kernel and glibc versions. Maybe run the Python script under “strace” to see if anything strange is going on.
On Thu, Jan 28, 2016 at 7:35 PM, Martin Panter
Other things that come to mind are to see if there is anything odd about the buildbot’s Linux kernel and glibc versions. Maybe run the Python script under “strace” to see if anything strange is going on.
I did that, and a few other things. Most notably, commenting out the partial-read resulted in a flawless run, and strace showed it stalling in the partial read too. However, as I was doing so (and I just discarded a draft message where I'd been typing up notes), my entire system went kerblooie, and I've spent the last day rebuilding stuff from scratch. When I get around to it, I'll rebuild the buildbot VM - and it'll be Debian Jessie (current stable), because there's no particular reason to use Wheezy (oldstable). So the problem will almost certainly disappear. ChrisA
On Thu, Jan 28, 2016 at 9:41 PM, Chris Angelico
However, as I was doing so (and I just discarded a draft message where I'd been typing up notes), my entire system went kerblooie, and I've spent the last day rebuilding stuff from scratch. When I get around to it, I'll rebuild the buildbot VM - and it'll be Debian Jessie (current stable), because there's no particular reason to use Wheezy (oldstable). So the problem will almost certainly disappear.
Sure enough, the problem is no longer reproducible. Sorry folks. Hopefully it wasn't actually a bug anywhere, but was some bizarre piece of VM setup weirdness. The "AMD64 Debian root" buildbot (angelico-debian-amd64) is now running Debian Jessie, rather than the Wheezy it was before. ChrisA
participants (4)
-
Chris Angelico
-
Martin Panter
-
Victor Stinner
-
Zachary Ware