[Baypiggies] clustering
Paul Marxhausen
pmarxhausen at yahoo.com
Tue Sep 12 08:39:31 CEST 2006
Hi,

See the feature articles at
http://www.linuxjournal.com/issue/149 for practical
beginner info on Beowulf, Condor, Heartbeat, and
parallel programming (plus useful references and contacts).

Cheers,
Paul Marxhausen
--- "Carl J. Van Arsdall" <cvanarsdall at mvista.com> wrote:
> Shannon -jj Behrens wrote:
> > Hey Guys,
> >
> > I need to do some data processing, and I'd like to use a cluster so
> > that I don't have to grow old waiting for my computer to finish. I'm
> > thinking about using the servers I have locally. I'm completely new
> > to clustering. I understand how to break a problem up into
> > parallelizable pieces, but I don't understand the admin side of it.
> > My current data set is about 16 gigs, and I need to do things like
> > run filters over strings, make sure strings are unique, etc. I'll be
> > using Python wherever possible.
> >
> > * Do I have to run a particular Linux distro? Do they all have to be
> > the same, or can I just set up a daemon on each machine?
> >
> From what I've seen, this can vary. For example, if you are using PVM
> then you should be able to have a heterogeneous cluster without too
> much difficulty. Personally, though, for ease of administration and
> things like that, I prefer to keep things (at least on the software
> side) as similar as I can. The reality of the cluster is what you make
> of it.
>
>
> > * What does "Beowulf" do for me?
> >
>
> Beowulf isn't so great. There are a number of "active" clustering
> technologies going on. I've seen a bit about openMosix passed around,
> although I believe it exists as kernel patches that were somewhat
> dated last time I checked (they were for 2.4 kernels). If you have a
> lot of machines, you might even want to google load balancing clusters
> and see what you get.
>
> > * How do I admin all the boxes without having to enter the same
> > command n times?
> >
> Check out dsh (dancer's shell). If you are running a Debian-based
> distro you can just apt-get it. I use it all the time; it's a really
> handy tool.
>
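For boxes where dsh isn't available, the same fan-out idea can be sketched in Python: run one command on every host over ssh, in parallel. This is only a minimal sketch, assuming key-based (passwordless) ssh is already set up. The helper names `ssh_argv` and `run_on_hosts` and the host names in the note below are made up for illustration; none of this is how dsh itself works.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_argv(host, command):
    """Build the argv for running `command` on `host` with the ssh client."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, command]


def run_on_hosts(hosts, command, make_argv=ssh_argv, max_workers=8):
    """Run `command` on every host in parallel.

    Returns a dict mapping host -> (return code, combined output text).
    """
    def run_one(host):
        proc = subprocess.run(
            make_argv(host, command),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        return host, (proc.returncode, proc.stdout)

    # Threads are fine here: each one only waits on a child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, hosts))
```

With hypothetical hosts, `run_on_hosts(["node1", "node2"], "uptime")` would return one (return code, output) pair per machine. The `make_argv` parameter is there so the transport can be swapped out, for example to try the function locally without ssh.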
>
> > * I've heard that MPI is good and standard. Should I use it? Can I
> > use it with Python programs?
> >
> As far as parallel programs go, MPI (and sometimes PVM) tends to be the
> best way to achieve maximum speed, although it tends to incur more
> development overhead. Lots of people also use combinations of MPI and
> OpenMP (or pthreads, whatever; OpenMP is nice and easy and soon to be
> standard in gcc) when they have clusters of SMP machines. In my
> experience, when you have lots of data to move around it can definitely
> be to your advantage to use MPI, as you can control specifically how
> data will be passed around and set up a network to match that. With 16
> gigs of data you will really want to look at your network topology and
> how you choose to distribute the data.
>
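One way to think about "how you choose to distribute the data" for the uniqueness check in the original question is to partition strings by a hash, so that equal strings always land on the same node and each node can deduplicate on its own. The sketch below shows only that partitioning step in plain Python; it is not MPI code, and the function names are hypothetical. With MPI, each shard would be handed to one rank.

```python
import hashlib


def shard_for(text, num_shards):
    """Map a string to a shard number that is stable across machines."""
    # Python's built-in hash() can differ between processes, so use a digest.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(strings, num_shards):
    """Split strings into num_shards lists; equal strings share a shard."""
    shards = [[] for _ in range(num_shards)]
    for text in strings:
        shards[shard_for(text, num_shards)].append(text)
    return shards


def unique_per_shard(shards):
    """Deduplicate each shard independently (the per-node work)."""
    return [sorted(set(shard)) for shard in shards]
```

Because duplicates can never end up in different shards, concatenating the per-shard results gives the globally unique set with no cross-node comparison step.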
>
>
> > * Is there anything better than NFS that I could use to access the
> > data?
> >
> I've seen a number of different ways to do this. You can google
> distributed shared file systems; I think there are a couple of projects
> out there, although I've never used any of them, and I'd be very much
> interested in anyone's stories if they have any.
>
>
> > * What's hip, slick, and cool these days?
> >
> You might even check out some grid computing stuff, kinda neat imho.
>
> Also, when you get a cluster up and running with MPI or whatever, you
> might want to go as far as to profile your code and find the serious
> bottlenecks in your application. Check out TAU (Tuning and Analysis
> Utilities); it has Python bindings as well as MPI/OpenMP support. Not
> that you will use it, that's just one of those things you can google
> should you be bored at work or interested in that type of stuff, and
> it's a good way to justify to your employer why you need to install
> InfiniBand as your network ;)
>
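Before reaching for a cluster-wide tool like TAU, a single-process profile can already point at the bottleneck. The sketch below uses Python's standard cProfile module rather than TAU (a deliberately simpler stand-in), and `keep_long` is a hypothetical filter standing in for the real string processing.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=5):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep only the `top` most expensive entries.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def keep_long(strings, min_len=3):
    """Hypothetical filter: keep strings of at least min_len characters."""
    return [s for s in strings if len(s) >= min_len]
```

Calling `profile_call(keep_long, some_strings)` returns the filter's result together with a text report listing the functions with the highest cumulative time.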
>
> > I just need you to point me in the right direction and tell me what's
> > good and what's a waste of time.
> >
> Well, as you know, you probably want to avoid Python threads, although
> I've set up a fairly primitive distributed system with Python threads
> and ssh. Everything is I/O bound for me, so it works really well,
> although I'm looking into better distributed technologies. Just more
> stuff to play with as we learn (and I'm reading all the links people
> have posted in response to your questions too, lots of good stuff)! I'd
> also be interested in the solution you choose, so if you ever want to
> post a follow-up thread I'd be happy to read the results of your
> project!
>
>
> -carl
>
>
> --
>
> Carl J. Van Arsdall
> cvanarsdall at mvista.com
> Build and Release
> MontaVista Software
>