[IPython-dev] Before a patch for LSF support

Matthieu Brucher matthieu.brucher at gmail.com
Wed Aug 12 11:06:00 EDT 2009


> * Firewall.  If a firewall is blocking the engine from connecting to the
> controller you will see this type of error.  A firewall like this would be
> unusual though (I have never seen one before).  To test this, start the
> controller on the head node, ssh to a compute node and then just telnet (it
> will fail) to the controller.  But you should see the connection start to
> happen.  You could also run ipengine by hand on the compute node.

No worries on this side. We do a lot of client/server work, and the
telnet test did work.
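
For anyone who wants to script the same check instead of using telnet, here
is a minimal sketch in Python. It only tests whether a TCP connection to the
controller can be opened from a compute node; it says nothing about whether
the engine can then register. The function name can_connect is made up for
this example, and the host and port are placeholders to be replaced with the
controller's own values.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Roughly what the manual telnet test checks: the connection either
    starts or it is refused / times out.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Run from a compute node, can_connect(controller_ip, controller_port) returning
False would point towards a firewall or routing problem, while True matches
what I see with telnet.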

> * If the controller hasn't been started or failed to start, you would also
> see this.  Look at the controller logs to see if this is going on.

It seems the controller was launched (and since I can telnet to it, I
think it is online?):

2009-08-12 16:59:52+0200 [-] Log opened.
2009-08-12 16:59:52+0200 [-] Process ['ipcontroller',
'--logfile=/users/brucher/.ipython/log/ipcontroller'] has started with
pid=5001
2009-08-12 16:59:52+0200 [-] Waiting for controller to finish starting...
2009-08-12 16:59:55+0200 [-] Controller started
2009-08-12 16:59:55+0200 [-] Using template for batch script: lsf.template
2009-08-12 16:59:55+0200 [-] Writing instantiated batch script: lsf.template-run
2009-08-12 16:59:55+0200 [-] Job started with job id: '6166'

> * If there is NAT (network address translation) on the cluster.  This is
> pretty common. Typically this would be that the head node has multiple
> network interfaces, one for the outside world and one for talking to the
> compute nodes.  In this case, you will need to use ifconfig to hunt down the
> right ip address.  Then you will need to use the --engine-ip flag to
> ipcontroller to set the ip address that the engines will connect to.  The
> engines get this from the furl file that the controller writes.

I don't think there is anything like that here. I can connect to the
LSF nodes with ssh and then telnet to the controller: it works with the
IP address indicated in the furl.
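
To make the "IP address indicated in the furl" step less manual, the address
and port can be pulled out of the furl string with a few lines of Python.
This is only a sketch: it assumes the furl has the Foolscap-style layout
pb://<id>@<host:port>[,<host:port>...]/<name>, and the function name
furl_location_hints is invented for the example. If the furl on your system
looks different, the pattern needs adjusting.

```python
import re


def furl_location_hints(furl):
    """Extract (host, port) pairs from a furl string.

    Assumed layout: pb://<id>@<host:port>[,<host:port>...]/<name>
    """
    match = re.match(r"^pb://[^@]+@([^/]+)/", furl.strip())
    if match is None:
        raise ValueError("not a recognised furl: %r" % (furl,))
    hints = []
    for hint in match.group(1).split(","):
        # Split on the last colon so the port is always the final field.
        host, sep, port = hint.rpartition(":")
        if sep and port.isdigit():
            hints.append((host, int(port)))
    return hints
```

Each (host, port) pair returned is an address the engines would try, so these
are the ones worth testing from a compute node.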

> I am betting that the 2nd or 3rd of these is going on.  Keep us posted as
> these things can be pretty tough to debug because of how some clusters are
> set up.  But, take heart, I have never encountered a system that we could get
> working - and this includes some pretty crazy systems.

I suppose you meant the contrary ;)
I still hope to get it working in the near future :D

I also have the LSF logs, but they show nothing, since all the output
goes to the ipengine logs.

Cheers,

Matthieu
-- 
Information System Engineer, Ph.D.
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher


