Hi all,
I have a simulation that I estimate will take about 8 hours on the cluster I have access to, but a single job submission is limited to 6 hours. Using a bash script, I have managed to create a checkpoint file during the first leg of the simulation. However, I am not sure how to make the second leg resume from that checkpoint. Does this happen automatically when I rerun the same submission script? Here is my script:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --mem=60000
#SBATCH --time=6:00:00
#SBATCH --job-name=position1
#SBATCH --account=free
#SBATCH --partition=batch-sky
#SBATCH --mail-user=dz322@bath.ac.uk
#SBATCH --output=StdOut.o.%j
#SBATCH --error=StdErr.e.%j
module purge
module load slurm
module load matlab
module load hdf5/intel
module load intel/compiler/64/18.5.274
module load intel/mkl/64/18.5.274
module load fftw3/intel/avx/3.3.4
module load gcc/9.2.0
export OMP_NUM_THREADS=24
export OMP_PLACES=cores
export OMP_PROC_BIND=true
../kwave/kspaceFirstOrder-OMP/skylake/kspaceFirstOrder-OMP -i position1.h5 -o pos1_out_sky_2.h5 --checkpoint_file check_pos5_sky --checkpoint_interval 20000
The manual mentions putting a loop in the bash script, but I am unsure how to implement this. Any help would be greatly appreciated. Many thanks!
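To make the question concrete, here is a minimal sketch of the kind of loop the manual may be describing. It rests on two assumptions that are not confirmed anywhere above: that the solver resumes from an existing checkpoint file when rerun with the same arguments, and that it removes the checkpoint file once the simulation has finished. The helper name `run_until_done` and the cap on the number of legs are hypothetical, chosen only for illustration.

```shell
#!/bin/bash
# Sketch: rerun a solver command until its checkpoint file disappears.
# ASSUMPTION (unconfirmed): the solver resumes from an existing checkpoint
# file and deletes that file once the whole simulation has completed.
# run_until_done is a hypothetical helper, not part of k-Wave or SLURM.
run_until_done() {
    local checkpoint="$1"   # path of the checkpoint file to watch
    local max_runs="$2"     # safety cap on the number of legs
    shift 2                 # everything left is the solver command line
    local run=1
    while [ "$run" -le "$max_runs" ]; do
        echo "Starting leg ${run}"
        "$@" || return 1                    # stop if the solver itself fails
        [ -e "$checkpoint" ] || return 0    # no checkpoint left: finished
        run=$((run + 1))
    done
    return 2    # still unfinished after max_runs legs
}

# Example call with the same solver arguments as in the script above
# (the cap of 3 legs is an arbitrary illustrative value):
# run_until_done check_pos5_sky 3 \
#     ../kwave/kspaceFirstOrder-OMP/skylake/kspaceFirstOrder-OMP \
#     -i position1.h5 -o pos1_out_sky_2.h5 \
#     --checkpoint_file check_pos5_sky --checkpoint_interval 20000
```

Even if those assumptions hold, a loop inside one job would presumably still be cut off by the 6-hour limit, so I expect each leg would need its own job submission.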
Best wishes,
Dogu Zaifoglu
PhD candidate in Mechanical Engineering
MEng, University of Bath