[Gllug] Checkpoint/Restart
John Hearns
john.hearns at streamline-computing.com
Tue Feb 26 17:10:33 UTC 2008
On Tue, 2008-02-26 at 12:48 +0000, Mick Farmer wrote:
> Dear GLLUGers,
>
> One of our researchers needs to run a program for a long
> time (probably a week or more), so I thought it might be
> useful if we provided a checkpoint/restart facility.
We don't deal with system-level checkpointing; we work more at the
application level.
The particular MPI implementation we use (www.pccluster.org) has very
good checkpointing facilities, and we can checkpoint/restart parallel
jobs. I'd guess this is no use to you really.
Here's a HOWTO on checkpointing with Sun Gridengine. If you read it,
though, it is really about ways of signalling your job to do its own
checkpointing (i.e. Gridengine does not have built-in checkpointing).
http://gridengine.sunsource.net/howto/checkpointing.html
It mentions using the Condor libraries, though that is something I've
never done.
http://www.cs.wisc.edu/condor/checkpointing.html
--
Gllug mailing list - Gllug at gllug.org.uk
http://lists.gllug.org.uk/mailman/listinfo/gllug