[IPython-dev] Before a patch for LSF support
Brian Granger
ellisonbg.net at gmail.com
Wed Aug 12 12:17:45 EDT 2009
Matthieu,
Can you do the following test:
headnode> ipcontroller
# Then copy the engine furl file to the compute node and, in a separate
terminal, run:
computenode> ipengine --furl-file=[path to the furl file]
If that doesn't work, it is likely one of:
* An IP address issue. Experiment with ifconfig and ipcontroller --engine-ip.
* A firewall. But you said this wasn't an issue.
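If the IP address turns out to be the problem, it can help to check which address the engines will actually try to reach, since they read it from the furl file the controller writes. A minimal sketch, assuming the usual Foolscap FURL layout pb://<tubid>@<host:port>,<host:port>/<swissnum> (the furl_hosts helper is hypothetical, not part of IPython):

```python
def furl_hosts(furl):
    # Assumed FURL layout: pb://<tubid>@<host1:port1>,<host2:port2>/<swissnum>.
    # Strip the pb:// scheme, then the tub id and the trailing swissnum,
    # leaving only the comma-separated connection hints.
    _, rest = furl.split("://", 1)
    hints = rest.split("@", 1)[1].rsplit("/", 1)[0]
    return [tuple(h.split(":")) for h in hints.split(",")]

# Example with a made-up furl; use the contents of your real engine furl file.
furl = "pb://abc123@192.168.0.1:10105,10.0.0.1:10105/engine"
for host, port in furl_hosts(furl):
    print(host, port)
```

If the hosts printed are not reachable from the compute nodes, that points at the --engine-ip fix above.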
Hope this helps.
Cheers,
Brian
On Wed, Aug 12, 2009 at 8:06 AM, Matthieu Brucher <
matthieu.brucher at gmail.com> wrote:
> > * Firewall. If a firewall is blocking the engine from connecting to the
> > controller, you will see this type of error. A firewall like this would be
> > unusual though (I have never seen one before). To test this, start the
> > controller on the head node, ssh to a compute node, and telnet to the
> > controller (the protocol handshake will fail, but you should still see the
> > TCP connection open). You could also run ipengine by hand on the compute
> > node.
>
> No worries on this side. We do a lot of client/server work, and it did
> work with telnet.
>
> > * If the controller hasn't been started or failed to start, you would
> > also see this. Look at the controller logs to see if this is going on.
>
> It seems the controller was launched (and since I can telnet to it, I
> think it is online?):
>
> 2009-08-12 16:59:52+0200 [-] Log opened.
> 2009-08-12 16:59:52+0200 [-] Process ['ipcontroller',
> '--logfile=/users/brucher/.ipython/log/ipcontroller'] has started with
> pid=5001
> 2009-08-12 16:59:52+0200 [-] Waiting for controller to finish starting...
> 2009-08-12 16:59:55+0200 [-] Controller started
> 2009-08-12 16:59:55+0200 [-] Using template for batch script: lsf.template
> 2009-08-12 16:59:55+0200 [-] Writing instantiated batch script:
> lsf.template-run
> 2009-08-12 16:59:55+0200 [-] Job started with job id: '6166'
>
> > * If there is NAT (network address translation) on the cluster. This is
> > pretty common. Typically this would be that the head node has multiple
> > network interfaces, one for the outside world and one for talking to the
> > compute nodes. In this case, you will need to use ifconfig to hunt down
> > the right ip address. Then you will need to use the --engine-ip flag to
> > ipcontroller to set the ip address that the engines will connect to. The
> > engines get this from the furl file that the controller writes.
>
> I don't think there is anything like that here. I can connect to the
> LSF nodes with ssh and then telnet to the controller: it works with the
> IP address indicated in the furl.
>
> > I am betting that the 2nd or 3rd of these is going on. Keep us posted, as
> > these things can be pretty tough to debug because of how some clusters
> > are set up. But, take heart, I have never encountered a system that we
> > could get working - and this includes some pretty crazy systems.
>
> I suppose you meant the contrary ;)
> I still have hope to get it working in the near future :D
>
> I also have the LSF logs, but they do not show anything, as everything
> is output in the ipengine logs.
>
> Cheers,
>
> Matthieu
> --
> Information System Engineer, Ph.D.
> Website: http://matthieu-brucher.developpez.com/
> Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
> LinkedIn: http://www.linkedin.com/in/matthieubrucher
>