[Gllug] Distributed batch processing

John Hearns john.hearns at streamline-computing.com
Sun Dec 3 15:57:17 UTC 2006


Dylan wrote:
> 
> 
> Bearing in mind that there are usually several machines around here with 
> resources to spare, I figure I could streamline the process by distributing 
> the work around whatever machines are available at the time.
> 
> Does anyone have any experience with this kind of thing? What systems are 
> there available to help in the process?

Emmmm.. you called :-)
25 hours a day, eight days a week down t' Grid mines.

I know Sun Gridengine very well. That would do the job no problem.
Make sure you have the same UIDs on all machines (OK, you can alias 
users, but no need to make it complicated). Plus some NFS shared storage 
area, or a NAS or SAN volume of course. If you don't have that, it is 
easy to 'pre-stage' and 'post-stage' the data into and out of /tmp on 
the systems.
http://gridengine.sunsource.net/
(ps, even though it is a Sun product, SGE is free as in beer and under the 
SISSL license). You really do get a very capable batch scheduling system for 
free. We run it on many, many systems including a 500 machine farm.
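
For the no-shared-storage case, here is a rough sketch of what I mean by 
pre-staging and post-staging through /tmp. The function name, the COPY 
variable and the file names are all made up for illustration; with no NFS 
you would set COPY to something like rcp or scp and give host:path names.

```shell
#!/bin/sh
# Sketch only: stage_and_run and COPY are illustrative names, not part of SGE.
# Usage: stage_and_run INPUT OUTPUT command [args...]
stage_and_run() {
    shared_in=$1      # where the input lives (local path, or host:path for rcp/scp)
    shared_out=$2     # where the result should end up
    shift 2           # everything left over is the command to run
    copy=${COPY:-cp}  # assumption: plain cp unless COPY says otherwise

    # Pre-stage: private scratch directory under /tmp on the worker
    scratch=$(mktemp -d /tmp/batch.XXXXXX) || return 1
    $copy "$shared_in" "$scratch/input" || { rm -rf "$scratch"; return 1; }

    # Run the real work against the local copy (stdin -> stdout)
    ( cd "$scratch" && "$@" < input > output )
    rc=$?

    # Post-stage: only copy the result back if the command succeeded
    if [ "$rc" -eq 0 ]; then
        $copy "$scratch/output" "$shared_out" || rc=1
    fi

    # Always clean up the scratch area
    rm -rf "$scratch"
    return "$rc"
}
```

Something like 'stage_and_run in.txt out.txt tr a-z A-Z' would then run tr 
on a private copy of in.txt and bring the result back as out.txt. It assumes 
the program reads stdin and writes stdout; adjust to taste.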

Once you follow the install instructions, you will automatically have an 
'all.q' with an instance on each machine.
To do your processing, you can open an interactive session (like an rsh) with:
qrsh my-program-to-run <arguments>
Or create a shell script and submit it with 'qsub script'.
Easy peasy - you can even get it to email you when it finishes.
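
As a rough illustration of the 'qsub script' route with the email turned on, 
the snippet below writes out a small job script. The job name, the email 
address and input.dat are placeholders; my-program-to-run is the same 
stand-in as above. The '#$' lines are options read by qsub.

```shell
#!/bin/sh
# Sketch only: myjob.sh, the job name and the address are placeholders.
cat > myjob.sh <<'EOF'
#!/bin/sh
# Job name shown in qstat
#$ -N myjob
# Run the job from the directory it was submitted from
#$ -cwd
# Merge stderr into stdout
#$ -j y
# Send mail when the job ends...
#$ -m e
# ...to this address
#$ -M you@example.com
./my-program-to-run input.dat
EOF

# Submitting needs a working Gridengine install, so it is left commented out:
# qsub myjob.sh
```

The same options can be given on the command line instead, e.g. 
'qsub -m e -M you@example.com myjob.sh'.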

An alternative would be OpenPBS/Torque http://www.openpbs.org/


For your project, also very much worth considering is Condor.
This is the classically-used model when wanting to use up spare cycles 
on a lab of computers, or spare cycles in a campus 
setup. http://www.cs.wisc.edu/condor/
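
To give a flavour of Condor, a job is described in a small submit file. 
The sketch below writes one out; the file names and the count of ten jobs 
are invented for illustration, and my-program-to-run is again a stand-in. 
The two transfer lines assume the machines do not share a filesystem.

```shell
#!/bin/sh
# Sketch only: myjob.sub and all file names in it are placeholders.
cat > myjob.sub <<'EOF'
universe                = vanilla
executable              = my-program-to-run
arguments               = input.$(Process).dat
transfer_input_files    = input.$(Process).dat
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = out.$(Process).txt
error                   = err.$(Process).txt
log                     = myjob.log
queue 10
EOF

# Submitting needs a working Condor pool, so it is left commented out:
# condor_submit myjob.sub
```

'queue 10' asks for ten jobs, with $(Process) running from 0 to 9, so each 
one picks up its own input file.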

Also look at Mosix. This checkpoints and restarts processes on machines 
which are idle. Drawback is needing the same kernel on all machines.
HOWEVER, there is a live-CD of Mosix, ClusterKnoppix
  http://clusterknoppix.sw.be/
You select one machine as the master, boot it off the live CD,
then network boot the rest of the network.



If you want a pay-for batch system, with support for many commercial 
codes, look at LSF. http://www.platform.com/
We install/configure/troubleshoot this for many commercial customers.





Finally, my advice.
Have a good look at setting up a Condor pool on your set of PCs.
Then also download/install Gridengine version 6.
The install is easy - just unpack the two tarballs,
and run an install script. Create an 'sge' user beforehand.
You should have a list of the 'worker' machines to hand. It will then 
set everything up for you. If you have rsh access set up to the worker 
nodes, it will install onto them (it's easy to install manually too - it's 
just an init script plus sharing the SGE install directory).




Commercial plug - I do this sort of thing for a living. If a company 
wants help setting these up, we can help.


-- 
      John Hearns
      Senior HPC Engineer
      Streamline Computing,
      The Innovation Centre, Warwick Technology Park,
      Gallows Hill, Warwick CV34 6UW
      Office: 01926 623130 Mobile: 07841 231235
-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug



