[Baypiggies] clustering

Carl J. Van Arsdall cvanarsdall at mvista.com
Tue Sep 5 19:22:30 CEST 2006

Shannon -jj Behrens wrote:
> Hey Guys,
> I need to do some data processing, and I'd like to use a cluster so
> that I don't have to grow old waiting for my computer to finish.  I'm
> thinking about using the servers I have locally.  I'm completely new
> to clustering.  I understand how to break a problem up into
> parallelizable pieces, but I don't understand the admin side of it.  My
> current data set is about 16 gigs, and I need to do things like run
> filters over strings, make sure strings are unique, etc.  I'll be
> using Python wherever possible.
> * Do I have to run a particular Linux distro?  Do they all have to be
> the same, or can I just set up a daemon on each machine?
From what I've seen this can vary.  For example, if you are using PVM 
then you should be able to have a heterogeneous cluster without too much 
difficulty.  Personally, though, for ease of administration I prefer to 
keep things (at least on the software side) as similar as I can.  The 
cluster is what you make of it.

> * What does "Beowulf" do for me?

Beowulf isn't so great.  There are a number of "active" clustering 
technologies going on.  I've seen a bit about OpenMosix passed around, 
although I believe it exists as kernel patches that were somewhat dated 
the last time I checked (they were for 2.4 kernels).  If you have a lot 
of machines, you might even want to google load-balancing clusters and 
see what you get.
> * How do I admin all the boxes without having to enter the same command n times?
Check out dsh (dancer's shell).  If you are running a Debian-based distro 
you can just apt-get it.  I use it all the time; it's a really handy tool.

> * I've heard that MPI is good and standard.  Should I use it?  Can I
> use it with Python programs?
As far as parallel programs go, MPI (and sometimes PVM) tends to be the 
best way to achieve maximum speed, although it tends to incur more 
development overhead.  Lots of people also combine MPI with OpenMP (or 
pthreads, whatever; OpenMP is nice and easy and soon to be standard in 
gcc) when they have clusters of SMP machines.  In my experience, when 
you have lots of data to move around it can definitely be to your 
advantage to use MPI, as you can control specifically how data will be 
passed around and set up a network to match that.  With 16 gigs of data 
you will really want to look at your network topology and how you choose 
to distribute the data.
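
To make the "how you choose to distribute the data" part concrete, here 
is a minimal sketch of one possible scheme for the uniqueness check 
mentioned in the question: send each string to a node chosen by a stable 
hash, so duplicates always land on the same node and each node can 
deduplicate on its own.  The function names are made up for 
illustration, and this is plain Python with no MPI involved; it only 
shows the partitioning idea, not how the data actually gets shipped.

```python
import zlib
from collections import defaultdict


def assign_worker(record, num_workers):
    """Map a string to a worker index using a hash that is stable everywhere."""
    # zlib.crc32 returns the same value on every machine and in every process,
    # unlike the built-in hash(), which is randomized per process for strings.
    return zlib.crc32(record.encode("utf-8")) % num_workers


def partition(records, num_workers):
    """Group records into per-worker buckets (hypothetical helper)."""
    buckets = defaultdict(list)
    for record in records:
        buckets[assign_worker(record, num_workers)].append(record)
    return buckets


def unique_per_worker(buckets):
    """Deduplicate each bucket independently, as each node would do locally."""
    # Equal strings always share a bucket, so no cross-node comparison is needed.
    return {worker: sorted(set(items)) for worker, items in buckets.items()}
```

The union of the per-worker results is then the set of unique strings, 
and no node ever has to see the whole data set.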

> * Is there anything better than NFS that I could use to access the data?
I've seen a number of different ways to do this.  You can google 
distributed shared file systems; I think there are a couple of projects 
out there, although I've never used any of them and I'd be very much 
interested in anyone's stories if they have any.

> * What's hip, slick, and cool these days?
You might even check out some grid computing stuff, kinda neat imho.  
Also, when you get a cluster up and running with MPI or whatever, you 
might want to go as far as to profile your code and find the serious 
bottlenecks in your application.  Check out TAU (Tuning and Analysis 
Utilities); it has Python bindings as well as MPI/OpenMP support.  Not 
that you will use it, that's just one of those things you can google 
should you be bored at work or interested in that type of stuff, and 
it's a good way to justify to your employer why you need to install 
InfiniBand as your network ;)

> I just need you to point me in the right direction and tell me what's
> good and what's a waste of time.
Well, as you know, you probably want to avoid Python threads, although 
I've set up a fairly primitive distributed system with Python threads 
and ssh.  Everything is I/O bound for me, so it works really well, 
although I'm looking into better distributed technologies.  Just more 
stuff to play with as we learn (and I'm reading all the links people 
have posted in response to your questions too, lots of good stuff)!  I'd 
also be interested in the solution you choose, so if you ever want to 
post a follow-up thread I'd be happy to read the results of your project!
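
For the curious, the threads-plus-ssh approach can be sketched roughly 
like this.  It is not my actual setup, just a minimal illustration: the 
function names and the host names in the usage note are hypothetical, 
and it assumes passwordless (key-based) ssh is already configured on 
every box.  Threads are fine here because each one just waits on an ssh 
subprocess, which is I/O-bound work.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_ssh_command(host, remote_command):
    """Build the argv for running one command on one host over ssh."""
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so an unreachable or misconfigured host cannot hang a worker thread.
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]


def run_on_hosts(hosts, remote_command, build=build_ssh_command, max_workers=8):
    """Run the same command on every host concurrently.

    Returns a list of (host, return_code, stdout) tuples in the same
    order as `hosts`. The `build` parameter exists only so the command
    construction can be swapped out, e.g. for a dry run.
    """
    def run_one(host):
        proc = subprocess.run(
            build(host, remote_command),
            capture_output=True,
            text=True,
        )
        return host, proc.returncode, proc.stdout

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, hosts))
```

Something like run_on_hosts(["node1", "node2"], "uptime") would then 
give one result tuple per machine, which also covers the "same command 
n times" problem if dsh isn't available.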



Carl J. Van Arsdall
cvanarsdall at mvista.com
Build and Release
MontaVista Software
