socket.SOL_REUSEADDR: different semantics between Windows vs Unix (or why test_asynchat is sometimes dying on Windows)
I started looking into this:

http://www.python.org/dev/buildbot/all/x86%20W2k8%20trunk/builds/289/step-te...

Pertinent part:

    test_asyncore
    <snip>
    test_asynchat

    command timed out: 1200 seconds without output
    SIGKILL failed to kill process
    using fake rc=-1
    program finished with exit code -1

    remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last):
    Failure: buildbot.slave.commands.TimeoutError: SIGKILL failed to kill process
    ]

I tried to replicate it on the buildbot in order to debug, which, surprisingly, I could do consistently by just running rt.bat -q -d -uall test_asynchat. As the log above indicates, the python process becomes completely and utterly wedged, to the point that I can't even attach a remote debugger and step into it.

Digging further, I noticed that if I ran the following code in two different python consoles, EADDRINUSE was *NOT* being raised by socket.bind():

    import socket
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(('127.0.0.1', 54322))

However, take out the setsockopt line, and voilà, the second s.bind() will raise EADDRINUSE, as expected.

This manifests as a really bizarre issue with test_asynchat in particular, as subsequent sock.accept() calls on the socket put python into the uber wedged state (can't even ctrl-c out at the console, need to kill the process directly).

Have to leave the office and head home so I don't have any more time to look at it tonight -- just wanted to post here for others to mull over.

Trent.
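The observation above can be turned into a small probe. This is only a sketch in Python 3 (the thread itself uses Python 2), and it binds both sockets inside one process rather than in two consoles, which is an assumption made to keep it self-contained. The helper name `second_bind_raises` is hypothetical.

```python
import errno
import socket


def second_bind_raises(reuse_addr):
    """Bind two TCP sockets to the same 127.0.0.1 port and report whether
    the second bind() raises EADDRINUSE.

    reuse_addr toggles SO_REUSEADDR on both sockets before binding.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse_addr:
            for s in (first, second):
                s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        first.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            if exc.errno != errno.EADDRINUSE:
                raise  # some other failure; do not hide it
            return True
        return False
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    for flag in (False, True):
        print("SO_REUSEADDR=%s -> EADDRINUSE raised: %s"
              % (flag, second_bind_raises(flag)))
```

Without the option the second bind should be refused; with it, the answer depends on the platform, which is the point of the thread.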
I've raised issue 2550 to track this problem. I've also provided a patch on the tracker to test_socket.py that reproduces the issue. Anyone mind if I commit this to trunk? I'd like to observe if any other platforms exhibit different behaviour via buildbots. It'll cause all the Windows slaves to fail on test_socket though. (I can revert it once I've seen how the buildbots behave, until I can come up with an actual patch for Windows that fixes the issue.)

http://bugs.python.org/issue2550
http://bugs.python.org/file9939/test_socket.py.patch

Trent.

________________________________________
From: python-dev-bounces+tnelson=onresolve.com@python.org [python-dev-bounces+tnelson=onresolve.com@python.org] On Behalf Of Trent Nelson [tnelson@onresolve.com]
Sent: 03 April 2008 22:40
To: python-dev@python.org
Subject: [Python-Dev] socket.SOL_REUSEADDR: different semantics between Windows vs Unix (or why test_asynchat is sometimes dying on Windows)

<snip>

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/tnelson%40onresolve.com
Interesting results! I committed the patch to test_socket.py in r62152. I was expecting all other platforms except for Windows to behave consistently (i.e. pass). That is, given the following:

    import socket
    host = '127.0.0.1'
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    port = sock.getsockname()[1]
    sock.close()
    del sock

    sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock1.bind((host, port))
    sock2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock2.bind((host, port))
    ^^^^

....the second bind should fail with EADDRINUSE, at least according to the 'SO_REUSEADDR and SO_REUSEPORT Socket Options' section in chapter 7.5 of Stevens' UNIX Network Programming Volume 1 (2nd Ed):

    "With TCP, we are never able to start multiple servers that bind the same IP address and same port: a completely duplicate binding. That is, we cannot start one server that binds 198.69.10.2 port 80 and start another that also binds 198.69.10.2 port 80, even if we set the SO_REUSEADDR socket option for the second server."

The results: both Windows *and* Linux fail the patched test; none of the buildbots for either platform encountered an EADDRINUSE socket.error after the second bind(). FreeBSD, OS X, Solaris and Tru64 pass the test -- EADDRINUSE is raised on the second bind. (Interesting that all the ones that passed have a BSD lineage.)

I've just reverted the test in r62156 as planned. The real issue now is that there are tests that are calling test_support.bind_port() with the assumption that the port returned by this method is 'unbound', when in fact, the current implementation can't guarantee this:

    def bind_port(sock, host='', preferred_port=54321):
        for port in [preferred_port, 9907, 10243, 32999, 0]:
            try:
                sock.bind((host, port))
                if port == 0:
                    port = sock.getsockname()[1]
                return port
            except socket.error, (err, msg):
                if err != errno.EADDRINUSE:
                    raise
                print >>sys.__stderr__, \
                    '  WARNING: failed to listen on port %d, trying another' % port

This logic is only correct for platforms other than Windows and Linux. I haven't looked into all the networking test cases that rely on bind_port(), but I would think an implementation such as this would be much more reliable than what we've got for returning an unused port:

    def bind_port(sock, host='127.0.0.1', *args):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind((host, 0))
        port = s.getsockname()[1]
        s.close()
        del s

        sock.bind((host, port))
        return port

Actually, FWIW, I just ran a full regrtest.py against trunk on Win32 with this change in place and all the tests still pass.

Thoughts?

Trent.

________________________________________
From: python-dev-bounces+tnelson=onresolve.com@python.org [python-dev-bounces+tnelson=onresolve.com@python.org] On Behalf Of Trent Nelson [tnelson@onresolve.com]
Sent: 04 April 2008 17:07
To: python-dev@python.org
Subject: Re: [Python-Dev] socket.SOL_REUSEADDR: different semantics between Windows vs Unix (or why test_asynchat is sometimes dying on Windows)

<snip>
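The ephemeral-port idea proposed for bind_port() in the third message of this thread can be sketched in Python 3 roughly as follows. The name `bind_to_unused_port` is hypothetical and this is not the code from the tracker patch; it only illustrates asking the OS for a free port with a throwaway socket and then binding the real socket to that port.

```python
import socket


def bind_to_unused_port(sock, host="127.0.0.1"):
    """Bind sock to a port the OS reported as free and return that port."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((host, 0))  # port 0: the OS chooses an unused port
        port = probe.getsockname()[1]
    finally:
        probe.close()  # release the port so the caller's socket can take it

    # No SO_REUSEADDR here, so bind() fails if the port has been taken
    # in the meantime.
    sock.bind((host, port))
    return port


if __name__ == "__main__":
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        print("bound to port", bind_to_unused_port(server))
    finally:
        server.close()
```

One caveat with this approach: there is a short window between closing the probe socket and binding the real one in which another process could claim the same port, so the result is "very likely unused" rather than guaranteed.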
On Fri, 4 Apr 2008 13:24:49 -0700, Trent Nelson wrote:
> Interesting results! I committed the patch to test_socket.py in r62152. I was expecting all other platforms except for Windows to behave consistently (i.e. pass). That is, given the following:
>
>     import socket
>     host = '127.0.0.1'
>     sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>     sock.bind((host, 0))
>     port = sock.getsockname()[1]
>     sock.close()
>     del sock
>
>     sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>     sock1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
>     sock1.bind((host, port))
>     sock2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>     sock2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
>     sock2.bind((host, port))
>     ^^^^
>
> ....the second bind should fail with EADDRINUSE, at least according to the 'SO_REUSEADDR and SO_REUSEPORT Socket Options' section in chapter 7.5 of Stevens' UNIX Network Programming Volume 1 (2nd Ed):
>
>     "With TCP, we are never able to start multiple servers that bind the same IP address and same port: a completely duplicate binding. That is, we cannot start one server that binds 198.69.10.2 port 80 and start another that also binds 198.69.10.2 port 80, even if we set the SO_REUSEADDR socket option for the second server."
>
> The results: both Windows *and* Linux fail the patched test; none of the buildbots for either platform encountered an EADDRINUSE socket.error after the second bind(). FreeBSD, OS X, Solaris and Tru64 pass the test -- EADDRINUSE is raised on the second bind. (Interesting that all the ones that passed have a BSD lineage.)

Notice that the quoted text explains that you cannot start multiple servers that etc. Since you didn't call listen on either socket, it's arguable that you didn't start any servers, so there should be no surprise regarding the behavior. Try adding listen calls at various places in the example and you'll see something different happen.

FWIW, AIUI, SO_REUSEADDR behaves just as described in the above quote on Linux/BSD/UNIX/etc. On Windows, however, that option actually means something quite different. It means that the address should be stolen from any process which happens to be using it at the moment.

There is another option, SO_EXCLUSIVEADDRUSE, only on Windows I think, which, AIUI, makes it impossible for another process to steal the port using SO_REUSEADDR.

Hope this helps,

Jean-Paul

>> "With TCP, we are never able to start multiple servers that bind the same IP address and same port: a completely duplicate binding. That is, we cannot start one server that binds 198.69.10.2 port 80 and start another that also binds 198.69.10.2 port 80, even if we set the SO_REUSEADDR socket option for the second server."
>
> Notice that the quoted text explains that you cannot start multiple servers that etc. Since you didn't call listen on either socket, it's arguable that you didn't start any servers, so there should be no surprise regarding the behavior. Try adding listen calls at various places in the example and you'll see something different happen.

I agree in principle; Stevens says nothing about what happens if you *do* try and bind two sockets on two identical host/port addresses. Even so, test_support.bind_port() makes an assumption that bind() will raise EADDRINUSE if the port is not available, which, as has been demonstrated, won't be the case on Windows or Linux.

> FWIW, AIUI, SO_REUSEADDR behaves just as described in the above quote on Linux/BSD/UNIX/etc. On Windows, however, that option actually means something quite different. It means that the address should be stolen from any process which happens to be using it at the moment.

Probably explains why the python process wedges when this happens on Windows...

> There is another option, SO_EXCLUSIVEADDRUSE, only on Windows I think, which, AIUI, makes it impossible for another process to steal the port using SO_REUSEADDR.

Nod, if SO_EXCLUSIVEADDRUSE is used instead in the code I posted, Windows raises EADDRINUSE on the second bind().

I don't have access to any Linux boxes at the moment, so I can't test what sort of error is raised with the example I posted if listen() and accept() are called on the two sockets bound to identical addresses. Can anyone else shed some light on this? I'd be interested in knowing if the process wedges on Linux as badly as it does on Windows (to the point where it's not respecting ctrl-c or sigkill).

Trent.
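The SO_EXCLUSIVEADDRUSE point can be illustrated with a small Python 3 sketch. The helper name `make_exclusive_listener` is hypothetical, and the sketch assumes the constant is exposed by the socket module only on Windows, so it is applied behind a hasattr() check and skipped elsewhere.

```python
import socket


def make_exclusive_listener(host="127.0.0.1", backlog=5):
    """Return (listening socket, port), requesting exclusive use of the
    address where the platform offers SO_EXCLUSIVEADDRUSE."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Assumption: this constant is only defined on Windows builds.
        if hasattr(socket, "SO_EXCLUSIVEADDRUSE"):
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_EXCLUSIVEADDRUSE, 1)
        sock.bind((host, 0))  # port 0: let the OS pick a free port
        sock.listen(backlog)
    except OSError:
        sock.close()  # do not leak the socket if setup fails
        raise
    return sock, sock.getsockname()[1]


if __name__ == "__main__":
    listener, port = make_exclusive_listener()
    try:
        print("listening exclusively on port", port)
    finally:
        listener.close()
```

SO_REUSEADDR is deliberately not set on the listener, since on Windows that is the option the discussion above identifies as allowing the address to be taken over.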
I've attached a patch (http://bugs.python.org/file9966/trunk.2550.patch) to issue 2550 that addresses the original problem here: test_support.bind_port() potentially returning ports that have already been bound to. The patch updates the tests that relied on this method, such that they call it with the new calling convention (test_ftplib, test_httplib, test_socket, test_ssl_socket, test_asynchat, test_telnetlib).

Any objections to the patch? Would like to commit it sooner rather than later, as it'll stop my buildbots from wedging on test_asynchat at the very least.

Trent.

________________________________________
From: python-dev-bounces+tnelson=onresolve.com@python.org [python-dev-bounces+tnelson=onresolve.com@python.org] On Behalf Of Trent Nelson [tnelson@onresolve.com]
Sent: 05 April 2008 18:22
To: Jean-Paul Calderone; python-dev@python.org
Subject: Re: [Python-Dev] socket.SOL_REUSEADDR: different semantics between Windows vs Unix (or why test_asynchat is sometimes dying on Windows)

<snip>
On Sat, Apr 5, 2008 at 1:22 PM, Trent Nelson wrote:
> Nod, if SO_EXCLUSIVEADDRUSE is used instead in the code I posted, Windows raises EADDRINUSE on the second bind().
>
> I don't have access to any Linux boxes at the moment, so I can't test what sort of error is raised with the example I posted if listen() and accept() are called on the two sockets bound to identical addresses. Can anyone else shed some light on this? I'd be interested in knowing if the process wedges on Linux as badly as it does on Windows (to the point where it's not respecting ctrl-c or sigkill).

When I call sock1.listen(5) after sock1.bind(), the test passes for me on SuSE Linux 10.1.

Thanks,
Raghu
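The variation Raghu describes, calling listen() on the first socket before the second bind(), can be sketched in Python 3 like this. The helper name `second_bind_raises_after_listen` is hypothetical; both sockets set SO_REUSEADDR, as in the example earlier in the thread, and the result is reported rather than assumed because it differs by platform.

```python
import errno
import socket


def second_bind_raises_after_listen(backlog=5):
    """Bind and listen on one socket, then try to bind a second socket to
    the same address; return True if the second bind() raises EADDRINUSE."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        for s in (first, second):
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        first.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        first.listen(backlog)         # the first socket is now a "server"
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            if exc.errno != errno.EADDRINUSE:
                raise  # some other failure; do not hide it
            return True
        return False
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print("EADDRINUSE on second bind after listen():",
          second_bind_raises_after_listen())
```

On Linux this is expected to report True, in line with the result above; on Windows the outcome may differ given the SO_REUSEADDR semantics described by Jean-Paul.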
participants (3)
- Jean-Paul Calderone
- Raghuram Devarakonda
- Trent Nelson