[Gllug] Checkpoint/Restart

John Hearns john.hearns at streamline-computing.com
Tue Feb 26 17:10:33 UTC 2008


On Tue, 2008-02-26 at 12:48 +0000, Mick Farmer wrote:
> Dear GLLUGers,
> 
> One of our researchers needs to run a program for a long
> time (probably a week or more), so I thought it might be
> useful if we provided a checkpoint/restart facility.


We don't deal with system-level checkpointing; what we do is
more at the application level.

The particular MPI implementation we use (www.pccluster.org) has very
good checkpointing facilities, and we can checkpoint/restart parallel
jobs. I'd guess this is no use to you really.

Here's a HOWTO on checkpointing with Sun Gridengine. If you read it,
though, it is really about ways to signal your job to do its own
checkpointing (i.e. Gridengine does not have built-in checkpointing).

http://gridengine.sunsource.net/howto/checkpointing.html
It mentions using the Condor libraries, though that is something I've
never done.

http://www.cs.wisc.edu/condor/checkpointing.html
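
To make the "signal your job to do its own checkpointing" idea a bit more
concrete, here is a rough sketch in Python. It is not taken from the HOWTO:
the choice of SIGUSR1, the file name state.json and the function names are
all my own assumptions, and the loop body is just a stand-in for the real
computation. The handler only sets a flag; the main loop saves its state at
the next safe point, and on restart the program picks up from the last
saved state.

```python
import json
import os
import signal
import tempfile

CHECKPOINT_FILE = "state.json"   # hypothetical checkpoint file name
checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: only set a flag; the main loop does the actual save."""
    global checkpoint_requested
    checkpoint_requested = True


def save_state(state, path=CHECKPOINT_FILE):
    """Write to a temporary file and rename it over the old checkpoint,
    so an interrupted write never leaves a half-written file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_state(path=CHECKPOINT_FILE):
    """Return the saved state, or a fresh one if there is no checkpoint."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(n_steps, path=CHECKPOINT_FILE):
    """Resume from the last checkpoint (if any) and run up to n_steps."""
    global checkpoint_requested
    state = load_state(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]   # stand-in for the real work
        state["step"] += 1
        if checkpoint_requested:          # safe point: state is consistent
            save_state(state, path)
            checkpoint_requested = False
    save_state(state, path)               # final state, so a rerun is a no-op
    return state


if __name__ == "__main__":
    # Assumption: the batch system (or you, with kill -USR1) sends SIGUSR1.
    signal.signal(signal.SIGUSR1, request_checkpoint)
    print(run(1_000_000))
```

If the job is killed and started again, run() reloads state.json and carries
on from the recorded step. A real program would need to save whatever state
it actually needs to resume, which is the hard part.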


-- 
Gllug mailing list  -  Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug



