[Gllug] Distributed batch processing
John Hearns
john.hearns at streamline-computing.com
Sun Dec 3 15:57:17 UTC 2006
Dylan wrote:
>
>
> Bearing in mind that there are usually several machines around here with
> resources to spare, I figure I could streamline the process by distributing
> the work around whatever machines are available at the time.
>
> Does anyone have any experience with this kind of thing? What systems are
> there available to help in the process?
Emmmm.. you called :-)
25 hours a day, eight days a week down t' Grid mines.
I know Sun Gridengine very well. That would do the job no problem.
Make sure you have the same UIDs on all machines (OK, you can alias
users, but no need to make it complicated). Plus some NFS shared storage
area, or a NAS or SAN volume of course. If you don't have that, it is
easy to 'pre-stage' and 'post-stage' the data into and out of /tmp on
the systems.
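A minimal sketch of that pre-stage/post-stage pattern, as a shell wrapper
a job could call. The function name stage_and_run and the in/ and out/
directory layout are my own assumptions for illustration; nothing here is
prescribed by SGE.

```shell
#!/bin/sh
# Sketch only: stage_and_run and the in/ and out/ layout are hypothetical.
# Usage: stage_and_run SRC_DIR DEST_DIR COMMAND [ARGS...]
stage_and_run() {
    src=$1; dest=$2; shift 2

    # Private scratch directory on the local disk (under /tmp by default).
    scratch=$(mktemp -d "${TMPDIR:-/tmp}/stage.XXXXXX") || return 1

    # Pre-stage: copy the inputs from shared or remote storage to local disk.
    mkdir -p "$scratch/in" "$scratch/out" "$dest" &&
    cp -R "$src"/. "$scratch/in/" || { rm -rf "$scratch"; return 1; }

    # Run the real work from inside the scratch area.
    ( cd "$scratch" && "$@" )
    status=$?

    # Post-stage: copy whatever landed in out/ back, then clean up.
    cp -R "$scratch/out"/. "$dest/" || status=1
    rm -rf "$scratch"
    return $status
}
```

The command reads from in/ and writes to out/, for example:
stage_and_run /shared/input /shared/results sh -c 'sort in/data.txt > out/sorted.txt'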
http://gridengine.sunsource.net/
(PS: even though it is a Sun product, SGE is free as in beer, under the SISSL
license.) You really do get a very capable batch scheduling system for
free. We run it on many, many systems including a 500 machine farm.
Once you follow the install instructions, you will automatically have an
'all.q' with an instance on each machine.
To do your processing, you can open an interactive session (like an rsh) by:
qrsh my-program-to-run <arguments>
Or create a shell script and 'qsub script'
Easy peasy - you can even get it to email you when it finishes.
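As a rough sketch of such a script: lines starting with "#$" are read by
qsub as embedded options and are ordinary comments to the shell. The job
name, the email address and the process_input function are placeholders
of mine, so swap in your real program.

```shell
#!/bin/sh
# Run the job under /bin/sh, name it, and start it in the submit directory.
#$ -S /bin/sh
#$ -N batchjob
#$ -cwd
# Merge stderr into stdout, and send mail when the job ends.
#$ -j y
#$ -m e
#$ -M you@example.com

# Hypothetical workload: counts the lines in a file as a stand-in
# for the real processing step.
process_input() {
    wc -l < "$1" | tr -d ' '
}

echo "Running on $(uname -n)"

# Process every file named on the command line.
for f in "$@"; do
    echo "$f: $(process_input "$f") lines"
done
```

Saved as job.sh, it would be submitted with something like
'qsub job.sh input1.dat input2.dat'.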
An alternative would be OpenPBS/Torque http://www.openpbs.org/
For your project, also very much worth considering is Condor.
This is the classically-used model when wanting to use up spare cycles
on a lab of computers, or spare cycles in a campus setup.
http://www.cs.wisc.edu/condor/
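For comparison, Condor takes a small submit description file rather than
a script with embedded options. A minimal sketch, written out from the
shell; batch.sub, the input/output file names and the count of 10 jobs
are all assumptions for illustration, and my-program-to-run stands in for
the real executable.

```shell
#!/bin/sh
# Write a minimal Condor submit description (file names are hypothetical).
# The quoted 'EOF' stops the shell from expanding $(Process), which is a
# Condor macro giving each queued job its own number (0 to 9 here).
cat > batch.sub <<'EOF'
universe                = vanilla
executable              = my-program-to-run
arguments               = input.$(Process).dat
transfer_input_files    = input.$(Process).dat
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = job.$(Process).out
error                   = job.$(Process).err
log                     = batch.log
queue 10
EOF

# Then hand it to the pool (commented out so the sketch runs anywhere):
# condor_submit batch.sub
```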
Also look at Mosix. This checkpoints and restarts processes on machines
which are idle. Drawback is needing the same kernel on all machines.
HOWEVER, there is a live-CD of Mosix, ClusterKnoppix
http://clusterknoppix.sw.be/
You select one machine as the master, boot it off the live CD,
then network boot the rest of the network.
If you want a pay-for batch system, with support for many commercial
codes, look at LSF. http://www.platform.com/
We install/configure/troubleshoot this for many commercial customers.
Finally, my advice.
Have a good look at setting up a Condor pool on your set of PCs.
Then also download/install Gridengine version 6.
The install is easy - just unpack the two tarballs,
and run an install script. Create an 'sge' user beforehand.
You should have a list of the 'worker' machines to hand. It will then
set everything up for you. If you have rsh access set up to the worker
nodes, it will install onto them as well (it's easy to install manually
too: it's just an init script plus sharing the SGE install directory).
Commercial plug - I do this sort of thing for a living. If a company
wants help setting these up, we can help.
--
John Hearns
Senior HPC Engineer
Streamline Computing,
The Innovation Centre, Warwick Technology Park,
Gallows Hill, Warwick CV34 6UW
Office: 01926 623130 Mobile: 07841 231235
--
Gllug mailing list - Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug