test_multiprocessing: test_listener_client flakiness
I gave my Windows buildbots a little bit of TLC last night. This little chestnut in test_multiprocessing.py around line 1346 is causing my buildbots to wedge more often than not: def test_listener_client(self): for family in self.connection.families: l = self.connection.Listener(family=family) p = self.Process(target=self._test, args=(l.address,)) p.set_daemon(True) p.start() conn = l.accept() self.assertEqual(conn.recv(), 'hello') p.join() l.close() The wedging will be a result of that accept() call. Not knowing anything about the module or the test suite, I can only assume that there's a race condition introduced between when the subprocess attempts to connect to the listener, versus when the l.accept() call is actually entered. (On the basis that a race condition would explain why sometimes it wedges and sometimes it doesn't.) Just FYI, the error in the buildbot log (http://www.python.org/dev/buildbot/all/x86%20W2k8%20trunk/builds/810/step-te...) when this occurs is as follows: test_multiprocessing command timed out: 1200 seconds without output SIGKILL failed to kill process using fake rc=-1 program finished with exit code -1 remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last): Failure: buildbot.slave.commands.TimeoutError: SIGKILL failed to kill process ] (The fact it can't be killed cleanly is a bug in Twisted's signalProcess('KILL') method, which doesn't work against Python processes that have entered accept() calls on Windows (which present the 'wedged' behaviour and have to be forcibly killed with OpenProcess/TerminateProcess).) Trent.
Hello,
2008/6/18 Trent Nelson
I gave my Windows buildbots a little bit of TLC last night. This little chestnut in test_multiprocessing.py around line 1346 is causing my buildbots to wedge more often than not:
def test_listener_client(self): for family in self.connection.families: l = self.connection.Listener(family=family) p = self.Process(target=self._test, args=(l.address,)) p.set_daemon(True) p.start() conn = l.accept() self.assertEqual(conn.recv(), 'hello') p.join() l.close()
The wedging will be a result of that accept() call. Not knowing anything about the module or the test suite, I can only assume that there's a race condition introduced between when the subprocess attempts to connect to the listener, versus when the l.accept() call is actually entered. (On the basis that a race condition would explain why sometimes it wedges and sometimes it doesn't.)
Just FYI, the error in the buildbot log (http://www.python.org/dev/buildbot/all/x86%20W2k8%20trunk/builds/810/step-te...) when this occurs is as follows:
test_multiprocessing
command timed out: 1200 seconds without output SIGKILL failed to kill process using fake rc=-1 program finished with exit code -1 remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last): Failure: buildbot.slave.commands.TimeoutError: SIGKILL failed to kill process ]
(The fact it can't be killed cleanly is a bug in Twisted's signalProcess('KILL') method, which doesn't work against Python processes that have entered accept() calls on Windows (which present the 'wedged' behaviour and have to be forcibly killed with OpenProcess/TerminateProcess).)
I just found the cause of the problem ten minutes ago: It seems that when a socket listens on the address "127.0.0.1" or "localhost", another process cannot connect to it using the machine's name (even from the same machine). The best seems to listen with the empty address "". Index: Lib/multiprocessing/connection.py =================================================================== --- Lib/multiprocessing/connection.py (revision 64374) +++ Lib/multiprocessing/connection.py (working copy) @@ -49,7 +49,7 @@ Return an arbitrary free address for the given family ''' if family == 'AF_INET': - return ('localhost', 0) + return ('', 0) elif family == 'AF_UNIX': return tempfile.mktemp(prefix='listener-', dir=get_temp_dir()) elif family == 'AF_PIPE': And the test started to pass for me. Can you please check this in if it works; I don't have svn access for the moment. -- Amaury Forgeot d'Arc
On Wed, Jun 18, 2008 at 12:35 PM, Amaury Forgeot d'Arc
Hello,
2008/6/18 Trent Nelson
: I gave my Windows buildbots a little bit of TLC last night. This little chestnut in test_multiprocessing.py around line 1346 is causing my buildbots to wedge more often than not:
def test_listener_client(self): for family in self.connection.families: l = self.connection.Listener(family=family) p = self.Process(target=self._test, args=(l.address,)) p.set_daemon(True) p.start() conn = l.accept() self.assertEqual(conn.recv(), 'hello') p.join() l.close()
The wedging will be a result of that accept() call. Not knowing anything about the module or the test suite, I can only assume that there's a race condition introduced between when the subprocess attempts to connect to the listener, versus when the l.accept() call is actually entered. (On the basis that a race condition would explain why sometimes it wedges and sometimes it doesn't.)
Just FYI, the error in the buildbot log (http://www.python.org/dev/buildbot/all/x86%20W2k8%20trunk/builds/810/step-te...) when this occurs is as follows:
test_multiprocessing
command timed out: 1200 seconds without output SIGKILL failed to kill process using fake rc=-1 program finished with exit code -1 remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last): Failure: buildbot.slave.commands.TimeoutError: SIGKILL failed to kill process ]
(The fact it can't be killed cleanly is a bug in Twisted's signalProcess('KILL') method, which doesn't work against Python processes that have entered accept() calls on Windows (which present the 'wedged' behaviour and have to be forcibly killed with OpenProcess/TerminateProcess).)
I just found the cause of the problem ten minutes ago: It seems that when a socket listens on the address "127.0.0.1" or "localhost", another process cannot connect to it using the machine's name (even from the same machine). The best seems to listen with the empty address "".
Index: Lib/multiprocessing/connection.py =================================================================== --- Lib/multiprocessing/connection.py (revision 64374) +++ Lib/multiprocessing/connection.py (working copy) @@ -49,7 +49,7 @@ Return an arbitrary free address for the given family ''' if family == 'AF_INET': - return ('localhost', 0) + return ('', 0) elif family == 'AF_UNIX': return tempfile.mktemp(prefix='listener-', dir=get_temp_dir()) elif family == 'AF_PIPE':
And the test started to pass for me. Can you please check this in if it works; I don't have svn access for the moment.
-- Amaury Forgeot d'Arc
I am testing the patch locally now.
Amaury Forgeot d'Arc
if family == 'AF_INET': - return ('localhost', 0) + return ('', 0) elif family == 'AF_UNIX': return tempfile.mktemp(prefix='listener-', dir=get_temp_dir()) elif family == 'AF_PIPE':
Doesn't it have important security implications?
On Wed, 18 Jun 2008 18:35:26 +0200, Amaury Forgeot d'Arc
[snip]
I just found the cause of the problem ten minutes ago: It seems that when a socket listens on the address "127.0.0.1" or "localhost", another process cannot connect to it using the machine's name (even from the same machine).
That's because when you try to connect to A:B you won't connect to a server listening on X:B - somewhat by design. Changing the test to listen on A:B and X:B might fix it, but so would changing it to connect to the same address it listens to.
The best seems to listen with the empty address "".
This will cause it to listen on all available interfaces, including public ones. It's rather unlikely that someone from the internet will connect to the port while the test suite is running and use it to do terrible things to you, but it's not /impossible/. I'd suggest changing the client to connect to "127.0.0.1" or "localhost", instead. Jean-Paul
Hello,
2008/6/18 Jean-Paul Calderone
On Wed, 18 Jun 2008 18:35:26 +0200, Amaury Forgeot d'Arc
wrote: [snip]
I just found the cause of the problem ten minutes ago: It seems that when a socket listens on the address "127.0.0.1" or "localhost", another process cannot connect to it using the machine's name (even from the same machine).
That's because when you try to connect to A:B you won't connect to a server listening on X:B - somewhat by design. Changing the test to listen on A:B and X:B might fix it, but so would changing it to connect to the same address it listens to.
The best seems to listen with the empty address "".
This will cause it to listen on all available interfaces, including public ones. It's rather unlikely that someone from the internet will connect to the port while the test suite is running and use it to do terrible things to you, but it's not /impossible/.
I'd suggest changing the client to connect to "127.0.0.1" or "localhost", instead.
OK, I am not a tcp expert. How about the following patch? It seems to me that it is wrong for the server to lie about the address it listens to. Index: ../../Lib/multiprocessing/connection.py =================================================================== --- ../../Lib/multiprocessing/connection.py (revision 64378) +++ ../../Lib/multiprocessing/connection.py (working copy) @@ -216,8 +216,6 @@ self._socket.bind(address) self._socket.listen(backlog) address = self._socket.getsockname() - if type(address) is tuple: - address = (socket.getfqdn(address[0]),) + address[1:] self._address = address self._family = family self._last_accepted = None -- Amaury Forgeot d'Arc
Amaury Forgeot d'Arc
OK, I am not a tcp expert. How about the following patch? It seems to me that it is wrong for the server to lie about the address it listens to.
Maybe I'm missing something but... if the client part defaults to connecting to "localhost" in TCP mode, why doesn't the server part also default to listening on "localhost" in TCP mode? Having different address detection algorithms on both ends sounds like a recipe for bugs like this. In any case, perhaps one should ask guidance to the original author before making important changes like this one.
On Wed, Jun 18, 2008 at 1:51 PM, Antoine Pitrou
Amaury Forgeot d'Arc
writes: OK, I am not a tcp expert. How about the following patch? It seems to me that it is wrong for the server to lie about the address it listens to.
Maybe I'm missing something but... if the client part defaults to connecting to "localhost" in TCP mode, why doesn't the server part also default to listening on "localhost" in TCP mode? Having different address detection algorithms on both ends sounds like a recipe for bugs like this.
In any case, perhaps one should ask guidance to the original author before making important changes like this one.
The server component *is* listening to localhost by default. The client in the code I pasted has the server's .address attribute passed into it to tell the client where to connect to. -jesse
Jesse Noller
The server component *is* listening to localhost by default. The client in the code I pasted has the server's .address attribute passed into it to tell the client where to connect to.
So I take it that the .address attribute is different from "localhost", but why? Is there any point in trying to be clever here?
On Wed, Jun 18, 2008 at 2:09 PM, Antoine Pitrou
Jesse Noller
writes: The server component *is* listening to localhost by default. The client in the code I pasted has the server's .address attribute passed into it to tell the client where to connect to.
So I take it that the .address attribute is different from "localhost", but why? Is there any point in trying to be clever here?
No, it is *not* different than localhost - it *is* localhost as that is the object.address of the server.
2008/6/18 Jesse Noller
On Wed, Jun 18, 2008 at 2:09 PM, Antoine Pitrou
wrote: Jesse Noller
writes: The server component *is* listening to localhost by default. The client in the code I pasted has the server's .address attribute passed into it to tell the client where to connect to.
So I take it that the .address attribute is different from "localhost", but why? Is there any point in trying to be clever here?
No, it is *not* different than localhost - it *is* localhost as that is the object.address of the server.
Well, on my win2k machine getfqdn('127.0.0.1') returns the actual name of the machine. This is the name that was stored in server.address. Hence my patch that simply remove the call to getfqdn. -- Amaury Forgeot d'Arc
Amaury Forgeot d'Arc wrote:
2008/6/18 Jesse Noller
: On Wed, Jun 18, 2008 at 2:09 PM, Antoine Pitrou
wrote: Jesse Noller
writes: The server component *is* listening to localhost by default. The client in the code I pasted has the server's .address attribute passed into it to tell the client where to connect to. So I take it that the .address attribute is different from "localhost", but why? Is there any point in trying to be clever here? No, it is *not* different than localhost - it *is* localhost as that is the object.address of the server.
Well, on my win2k machine getfqdn('127.0.0.1') returns the actual name of the machine. This is the name that was stored in server.address. Hence my patch that simply remove the call to getfqdn.
Point of interest: I was involved in developing an application where we gave up on trying to persuade getfqdn() to return reliable results on Windows machines - regardless of which local network interface you use to do the lookup, Windows 2k3 is likely to return an answer of 'computername.computerdomain' without bothering to consult the local DNS server (annoyingly, it used to do this right on previous Windows versions). So I'm inclined to consider any code which expects a useful answer out of getfqdn() to be non-portable at best and outright broken at worst. We worked around the problem by only doing forward DNS lookups and always working with addresses in the IP domain. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
Well, on my win2k machine getfqdn('127.0.0.1') returns the actual name of the machine. This is the name that was stored in server.address. Hence my patch that simply remove the call to getfqdn.
+1 to ditching getfqdn, following patch fixes the issue on my buildbot server: Index: connection.py =================================================================== --- connection.py (revision 64369) +++ connection.py (working copy) @@ -215,10 +215,7 @@ self._socket = socket.socket(getattr(socket, family)) self._socket.bind(address) self._socket.listen(backlog) - address = self._socket.getsockname() - if type(address) is tuple: - address = (socket.getfqdn(address[0]),) + address[1:] - self._address = address + self._address = self._socket.getsockname() self._family = family self._last_accepted = None Trent.
Thanks Trent, I'll apply the diff and run the tests in a tight loop
after re-enabling the client listener tests. I appreciate you tracking
this down
On Thu, Jun 19, 2008 at 2:07 AM, Trent Nelson
Well, on my win2k machine getfqdn('127.0.0.1') returns the actual name of the machine. This is the name that was stored in server.address. Hence my patch that simply remove the call to getfqdn.
+1 to ditching getfqdn, following patch fixes the issue on my buildbot server:
Index: connection.py =================================================================== --- connection.py (revision 64369) +++ connection.py (working copy) @@ -215,10 +215,7 @@ self._socket = socket.socket(getattr(socket, family)) self._socket.bind(address) self._socket.listen(backlog) - address = self._socket.getsockname() - if type(address) is tuple: - address = (socket.getfqdn(address[0]),) + address[1:] - self._address = address + self._address = self._socket.getsockname() self._family = family self._last_accepted = None
Trent.
This is a common problem. Binding to '127.0.0.1' will bind to *only* that address; binding to "" will bind to *all* addresses the machine is known by. If the client uses a different way of getting the address than the literal '127.0.0.1' (like a call to getfqdn(), which has pretty indeterminate results), you should use the "" form when calling bind. If you know you want to talk to '127.0.0.1', just use the tuple ('127.0.0.1', <PORT>) as the address. Bill
Well, on my win2k machine getfqdn('127.0.0.1') returns the actual name of the machine. This is the name that was stored in server.address. Hence my patch that simply remove the call to getfqdn.
+1 to ditching getfqdn, following patch fixes the issue on my buildbot server:
Index: connection.py =================================================================== --- connection.py (revision 64369) +++ connection.py (working copy) @@ -215,10 +215,7 @@ self._socket = socket.socket(getattr(socket, family)) self._socket.bind(address) self._socket.listen(backlog) - address = self._socket.getsockname() - if type(address) is tuple: - address = (socket.getfqdn(address[0]),) + address[1:] - self._address = address + self._address = self._socket.getsockname() self._family = family self._last_accepted = None
Trent. _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/janssen%40parc.com
This is a common problem. Binding to '127.0.0.1' will bind to *only* that address;
Indeed.
binding to "" will bind to *all* addresses the machine is known by.
Agreed again. I believe what we're dealing with here though is a lack of clarity regarding what role the 'address' attribute exposed by multiprocess.connection.Listener should play. The way test_listener_client() is written, it effectively treats 'address' as an end-point that can be connected to directly (irrespective of the underlying family (i.e. AF_INET, AF_UNIX, AF_PIPE)). I believe the problems we've run into stem from the fact that the API doesn't provide any guarantees as to what 'address' represents. The test suite assumes it always reflects a connectable end-point, which I think is more than reasonable. Unfortunately, nothing stops us from breaking this invariant by constructing the object as Listener(family='AF_INET', address=('0.0.0.0', 0)). How do I connect to an AF_INET Listener (i.e. SocketListener) instance whose 'address' attribute reports '0.0.0.0' as the host? I can't. So, for now, I think we should enforce this invariant by raising an exception in Listener.__init__() if self._socket.getsockbyname()[0] returns '0.0.0.0'. In effect, tightening up the API such that we can guarantee Listener.address will always represent a connectable end-point. We can look at how to service 'listen on all available interfaces' semantics at a later date -- that adds far less value IMO than being able to depend on the said guarantee. Thoughts? Trent.
The server component *is* listening to localhost by default. The client in the code I pasted has the server's .address attribute passed into it to tell the client where to connect to.
Interestingly enough, I can't coerce test_multiprocessing.test_listener_client to wedge on my XP laptop. When I run it on my Windows Server 2008 box (same box the buildbots run on), it wedges every time. So, I suspect this isn't actually a race like I first thought, but something quirky with 2008. Will dig around for a bit longer and see what I can find... Trent.
I just found the cause of the problem ten minutes ago: It seems that when a socket listens on the address "127.0.0.1" or "localhost", another process cannot connect to it using the machine's name (even from the same machine). The best seems to listen with the empty address "".
That doesn't seem right. If that were the case, none of the hundreds of network-oriented tests would work either. These all bind against ('localhost', 0) by default which seems to Do The Right Thing on all the platforms we have buildbots for. Trent.
On Wed, Jun 18, 2008 at 12:20 PM, Trent Nelson
I gave my Windows buildbots a little bit of TLC last night. This little chestnut in test_multiprocessing.py around line 1346 is causing my buildbots to wedge more often than not:
def test_listener_client(self): for family in self.connection.families: l = self.connection.Listener(family=family) p = self.Process(target=self._test, args=(l.address,)) p.set_daemon(True) p.start() conn = l.accept() self.assertEqual(conn.recv(), 'hello') p.join() l.close()
The wedging will be a result of that accept() call. Not knowing anything about the module or the test suite, I can only assume that there's a race condition introduced between when the subprocess attempts to connect to the listener, versus when the l.accept() call is actually entered. (On the basis that a race condition would explain why sometimes it wedges and sometimes it doesn't.)
Just FYI, the error in the buildbot log (http://www.python.org/dev/buildbot/all/x86%20W2k8%20trunk/builds/810/step-te...) when this occurs is as follows:
test_multiprocessing
command timed out: 1200 seconds without output SIGKILL failed to kill process using fake rc=-1 program finished with exit code -1 remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last): Failure: buildbot.slave.commands.TimeoutError: SIGKILL failed to kill process ]
(The fact it can't be killed cleanly is a bug in Twisted's signalProcess('KILL') method, which doesn't work against Python processes that have entered accept() calls on Windows (which present the 'wedged' behaviour and have to be forcibly killed with OpenProcess/TerminateProcess).)
Trent. _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/jnoller%40gmail.com
Trent, I commented out the test causing the issue for now (the entire suite is being revisited post beta-1) See r64378 -jesse
participants (7)
-
Amaury Forgeot d'Arc
-
Antoine Pitrou
-
Bill Janssen
-
Jean-Paul Calderone
-
Jesse Noller
-
Nick Coghlan
-
Trent Nelson