The art of restarting
27 April 2016
Limits on runtimes
Running jobs on a supercomputer can sometimes involve letting a job run for a long time. A long time in this instance means longer than the default maximum runtime expected for a particular type of job. At Cardiff we have set up our supercomputer for 2 types of job.
- Serial job – fairly low resource requirement (maximum of 1 compute node). We expect jobs not to last longer than 5 days.
- Parallel job – higher CPU resource requirement (multiple compute nodes required). We expect jobs not to last longer than 3 days.
These 2 types of job are catered for in our scheduler (PBS Pro) by 2 different queues to which users can submit jobs. However, users sometimes have jobs that do not fit these default runtime limits, so what can we do?
Special access queues
As a simple solution we provide some special queues with longer maximum runtime limits, which allow jobs to run for longer without being killed. This is fine if only a few extra days are required and we can plan around any maintenance sessions that require jobs to be stopped. But what about very long jobs that approach a month of runtime?
Profile and optimise
If the code can be improved, we can help researchers optimise it so that the runtime falls below the default limits. This can be tricky and requires an investment of time (so the code needs to be shown to be used a lot after the optimisation). Even then, optimisation is sometimes not enough.
Dump the model state at regular intervals
The most flexible way of achieving a long run is to allow your job to be restarted (while keeping reproducibility, as if it had never been restarted). The job can then be killed and rerun later when resources are available again, so it can survive a maintenance session where the supercomputer has to be shut down completely. Building this capability into the software itself, rather than relying on the scheduler to do some form of dumping, also makes your software more portable.
A researcher came to ARCCA and asked to run their Matlab code. It was estimated to take nearly 40 days to run (it was not parallel, and it would have been difficult to rewrite the code to make it so). After taking a look at the code and working with the researcher, we managed to create dumps of the model state so that the run could be restarted and continued over a series of jobs.
Take for example the following original Matlab code.
for s=1:nruns;
    ...computational expensive code...
end;
One quick way of achieving a dump is to save the state of all the Matlab variables, and to change the for loop to a while loop so that the loop can be continued in an obvious fashion (assigning to the index inside a for loop is bad practice, and in Matlab the assignment is overwritten at the next iteration anyway).
% Name of dumpfile
checkfile = 'checkfile.mat';
% Find out whether it is possible to restart.
restart = exist(fullfile(cd,checkfile),'file') == 2;
s = 0;
while (s < nruns)
    s = s + 1;
    % Only dump/save state on each 10th loop iteration.
    % The dump is taken before iteration s is computed.
    if (mod(s,10) == 0)
        disp('Checkpointing Program');
        save(checkfile);
    end
    % Only restart if a previous dumpfile exists.
    % Loading restores s, so the loop resumes at the saved iteration.
    if (restart)
        disp('Restarting Program');
        load(checkfile);
        restart = false;
    end
    ...computational expensive code...
end
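One weakness of saving straight into the dump file is that a job killed in the middle of a save can leave a truncated file behind. A minimal sketch of one way to guard against this is shown below: each dump is written to a scratch file and then moved over the previous one. This is a variation on the loop above, not the code used for the researcher's job. The scratch file name `checkfile_tmp.mat`, the variable `total` and the toy loop body are hypothetical placeholders, the dump is loaded before the loop rather than inside it, and whether the final move is truly atomic depends on the file system.

```matlab
% Sketch only: dump to a scratch file, then move it over the last good dump.
checkfile = 'checkfile.mat';       % dump file name, as in the example above
tmpfile   = 'checkfile_tmp.mat';   % hypothetical scratch file name
nruns     = 30;                    % hypothetical number of iterations
s         = 0;
total     = 0;                     % stand-in for the real model state

% Resume from the last complete dump if one exists (restores s and total).
if exist(fullfile(pwd, checkfile), 'file') == 2
    disp('Restarting Program');
    load(checkfile);
end

while s < nruns
    s = s + 1;
    total = total + s;             % stand-in for the computationally expensive code

    % Dump after every 10th completed iteration.
    if mod(s, 10) == 0
        disp('Checkpointing Program');
        save(tmpfile);                       % write the whole workspace to the scratch file
        movefile(tmpfile, checkfile, 'f');   % replace the previous dump in one step
    end
end
```

Because the dump here is taken after the work of iteration s, a restarted run continues with iteration s+1. If the job dies during the save, only the scratch file is affected and the previous dump remains usable.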
However, what happens if the “computational expensive code” uses random numbers? Matlab does not automatically save the state of the random number generator along with the workspace variables, so reproducibility between runs of different length is not guaranteed. We need to save the random number generator state (using rng) and restore it when reloading the saved dump.
...
    % Only dump/save state on each 10th loop iteration.
    if (mod(s,10) == 0)
        disp('Checkpointing Program');
        % Capture the generator state so that it is saved with the workspace.
        my_rng_state = rng;
        save(checkfile);
    end
    % Only restart if a previous dumpfile exists.
    if (restart)
        disp('Restarting Program');
        load(checkfile);
        restart = false;
        % Put the generator back where it was when the dump was taken.
        rng(my_rng_state);
    end
...
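A quick way to convince yourself that this restores reproducibility is to compare the random numbers drawn in an uninterrupted run with those drawn after a simulated restart. The sketch below is only an illustration: the seed, the array sizes and the scratch file created with `tempname` are arbitrary choices, and `rng('shuffle')` stands in for a fresh Matlab session whose generator is in some other state.

```matlab
% Sketch only: check that saving and restoring the rng state reproduces the stream.
rng(42);                          % arbitrary seed for the demonstration
a = rand(1, 3);                   % numbers drawn before the checkpoint

my_rng_state = rng;               % capture the generator state at the checkpoint
statefile = [tempname '.mat'];    % hypothetical scratch dump file
save(statefile, 'my_rng_state');

b_uninterrupted = rand(1, 3);     % what an uninterrupted run would draw next

% Simulate a restart in a new session with a different generator state.
clear my_rng_state;
rng('shuffle');
load(statefile);
rng(my_rng_state);                % restore the generator state from the dump
b_restarted = rand(1, 3);         % what the restarted run draws next

delete(statefile);
disp(isequal(b_uninterrupted, b_restarted));
```

If the call to `rng(my_rng_state)` is left out, the two sets of numbers will in general differ, which is exactly the kind of silent divergence described above.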
With these changes we were able to help the researcher perform a large number of these calculations and restart the job for any reason, even to continue the run on their desktop. Be careful, however, with hidden state such as random number generators, since it can cause confusing differences in results after a restart.