[IPython-dev] Before a patch for LSF support

Brian Granger ellisonbg.net at gmail.com
Wed Aug 12 05:58:13 EDT 2009

This looks like a basic TCP/IP connection issue.  It could be related to a
number of things.  One thing to keep in mind is the direction of the
connections.  The controller needs to start first - it listens on a port and
then the engines connect to it.  The host that the controller is on needs to
allow incoming TCP/IP connections.  The hosts with the engines need to allow
outgoing connections.

Have a look at the following:

* Firewall.  If a firewall is blocking the engine from connecting to the
controller you will see this type of error.  A firewall like this would be
unusual though (I have never seen one before).  To test this, start the
controller on the head node, ssh to a compute node and then just telnet to
the controller's host and port.  The telnet session itself will fail, but
you should see the connection start to happen.  You could also run ipengine
by hand on the compute node.
* If the controller hasn't been started or failed to start, you would also
see this.  Look at the controller logs to see if this is going on.
* NAT (network address translation) on the cluster.  This is pretty common.
Typically the head node has multiple network interfaces, one for the outside
world and one for talking to the compute nodes.  In this case, you will need
to use ifconfig to hunt down the right IP address.  Then you will need to use
the --engine-ip flag to ipcontroller to set the IP address that the engines
will connect to.  The engines get this from the furl file that the controller
writes.
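
A rough way to test the raw TCP connection from a compute node, without
involving ipengine at all, is sketched below.  It assumes the furl has the
usual Foolscap shape pb://<tubid>@<host:port>[,<host:port>...]/<name>, and it
is written for a current Python.  The names furl_hints and can_connect are
made up for this sketch; they are not part of IPython or Foolscap.

```python
import re
import socket


def furl_hints(furl):
    """Return (host, port) pairs from a furl string.

    Assumes the form pb://<tubid>@<host:port>[,<host:port>...]/<name>.
    """
    match = re.match(r"pb://[^@]+@([^/]+)/", furl.strip())
    if match is None:
        raise ValueError("unrecognised furl: %r" % furl)
    hints = []
    for hint in match.group(1).split(","):
        host, _, port = hint.rpartition(":")
        hints.append((host, int(port)))
    return hints


def can_connect(host, port, timeout=5.0):
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers connection refused, timeouts and name lookup failures.
        return False
    sock.close()
    return True


# Example use on a compute node (the path is the one from the engine log):
#
#   with open("/users/brucher/.ipython/security/ipcontroller-engine.furl") as f:
#       for host, port in furl_hints(f.read()):
#           print(host, port, can_connect(host, port))
```

If every address in the furl comes back False while the controller is
running, the furl is probably advertising an address that the compute nodes
cannot reach, which points at the NAT case above.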

I am betting that the 2nd or 3rd of these is going on.  Keep us posted as
these things can be pretty tough to debug because of how some clusters are
set up.  But, take heart, I have never encountered a system that we couldn't
get working - and this includes some pretty crazy systems.



On Wed, Aug 12, 2009 at 12:15 AM, Matthieu Brucher <
matthieu.brucher at gmail.com> wrote:

> 2009/8/11 Matthieu Brucher <matthieu.brucher at gmail.com>:
> >>> 4.  Possibly add logic for copying the furl files around or for setting
> >>> the command line options to point to them if they are in different
> >>> locations.
> >>
> >> This may be the only thing that I couldn't check.
> >
> > OK, I only have an issue with this at the moment. This is a log from the
> > engine:
> >
> > 2009-08-11 14:04:44+0200 [-] Log opened.
> > 2009-08-11 14:04:44+0200 [-] Using furl file:
> > /users/brucher/.ipython/security/ipcontroller-engine.furl
> > 2009-08-11 14:04:44+0200 [Uninitialized] 'EngineConnector: engine
> > registration failed:'
> > 2009-08-11 14:04:44+0200 [Uninitialized] Unhandled Error
> >        Traceback (most recent call last):
> >        Failure: twisted.internet.error.ConnectionRefusedError: Connection
> > was refused by other side: 111: Connection refused.
> >
> > 2009-08-11 14:04:44+0200 [Uninitialized] error connecting to
> > controller: Connection was refused by other side: 111: Connection
> > refused.
> > 2009-08-11 14:04:44+0200 [-] Main loop terminated.
> >
> > It is launched correctly by LSF, so it is only a matter of setting up
> > the connection correctly.
> >
> > Matthieu
> Is there a simple way to test the connections with Foolscap?
> Matthieu
> --
> Information System Engineer, Ph.D.
> Website: http://matthieu-brucher.developpez.com/
> Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
> LinkedIn: http://www.linkedin.com/in/matthieubrucher
> _______________________________________________
> IPython-dev mailing list
> IPython-dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/ipython-dev