[IPython-dev] Before a patch for LSF support
matthieu.brucher at gmail.com
Wed Aug 12 11:06:00 EDT 2009
> * Firewall. If a firewall is blocking the engine from connecting to the
> controller you will see this type of error. A firewall like this would be
> unusual though (I have never seen one before). To test this, start the
> controller on the head node, ssh to a compute node and then just telnet (it
> will fail) to the controller. But you should see the connection start to
> happen. You could also run ipengine by hand on the compute node.
No worries on this side. We do a lot of client/server work, and the
telnet test did connect.
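
For anyone who wants to script the telnet test described above, here is a
minimal sketch in Python. It only checks that a TCP connection can be opened;
the function name can_connect and the host and port in the usage note are
hypothetical and not part of IPython.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened.

    This mirrors the manual telnet test: it only checks that the
    connection is accepted, it does not speak the controller's protocol.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or unresolvable host.
        return False
```

Run from a compute node, something like can_connect("headnode", 12345)
(placeholder values) should return True if nothing is blocking the port.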
> * If the controller hasn't been started or failed to start, you would also
> see this. Look at the controller logs to see if this is going on.
It seems the controller was launched (and since I can telnet to it, I
assume it is online?):
2009-08-12 16:59:52+0200 [-] Log opened.
2009-08-12 16:59:52+0200 [-] Process ['ipcontroller',
'--logfile=/users/brucher/.ipython/log/ipcontroller'] has started with
2009-08-12 16:59:52+0200 [-] Waiting for controller to finish starting...
2009-08-12 16:59:55+0200 [-] Controller started
2009-08-12 16:59:55+0200 [-] Using template for batch script: lsf.template
2009-08-12 16:59:55+0200 [-] Writing instantiated batch script: lsf.template-run
2009-08-12 16:59:55+0200 [-] Job started with job id: '6166'
> * If there is NAT (network address translation) on the cluster. This is
> pretty common. Typically this would be that the head node has multiple
> network interfaces, one for the outside world and one for talking to the
> compute nodes. In this case, you will need to use ifconfig to hunt down the
> right IP address. Then you will need to use the --engine-ip flag to
> ipcontroller to set the IP address that the engines will connect to. The
> engines get this from the furl file that the controller writes.
I don't think there is anything like that here. I can connect to the
LSF nodes with ssh and then telnet to the controller: it works with the
IP address indicated in the furl.
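
To read the address out of the furl without doing it by eye, here is a rough
sketch. It assumes the furl has the usual Foolscap shape
pb://TUBID@HOST:PORT[,HOST:PORT...]/NAME; that layout is an assumption on my
part, and the function name furl_locations and the sample value in the usage
note are made up for illustration.

```python
def furl_locations(furl):
    """Return the (host, port) pairs found in a furl.

    Assumes the form pb://TUBID@HOST:PORT[,HOST:PORT...]/NAME.
    """
    prefix = "pb://"
    if not furl.startswith(prefix):
        raise ValueError("not a pb:// furl")
    # Drop the scheme, then split off the tub id before the '@'.
    _tub_id, sep, rest = furl[len(prefix):].partition("@")
    if not sep:
        raise ValueError("furl has no '@' separating id and location")
    # Everything before the first '/' is the comma-separated location list.
    hints, _, _name = rest.partition("/")
    locations = []
    for hint in hints.split(","):
        host, colon, port = hint.strip().rpartition(":")
        if colon and port.isdigit():
            locations.append((host, int(port)))
    return locations
```

For a made-up furl such as "pb://abc@10.0.0.1:1234/xyz" this gives
[("10.0.0.1", 1234)], which is the address the engines would try to reach.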
> I am betting that the 2nd or 3rd of these is going on. Keep us posted as
> these things can be pretty tough to debug because of how some clusters are
> set up. But, take heart, I have never encountered a system that we could get
> working - and this includes some pretty crazy systems.
I suppose you meant the contrary ;)
I still hope to get it working in the near future :D
I also have the LSF logs, but they show nothing, since all the output
goes to the ipengine logs.
Information System Engineer, Ph.D.
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92