Commands for managing your “jobs”: Memo
Information about a job
To report information about an active or completed job:
sacct -j job-id
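A small wrapper can make sacct output easier to read by choosing the columns with --format. This is only a sketch: the function name job_report and the field list are illustrative assumptions, and some fields (MaxRSS in particular) are filled in only if your site's accounting collects them.

```shell
# Hypothetical helper: show selected accounting fields for one job.
# Accepts plain IDs (12345), array tasks (12345_7) and job steps (12345.0).
job_report() {
  case "$1" in
    '' | *[!0-9_.]*)
      echo "usage: job_report JOBID" >&2
      return 2
      ;;
  esac
  sacct -j "$1" --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS
}
```

Usage: job_report 12345 (12345 is a placeholder job ID). The ID check only rejects obviously malformed input before sacct is called.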
To submit a job
The script will typically contain one or more srun commands to launch parallel tasks.
sbatch script.slurm
sbatch -x node037 my_script.sh -> submits while excluding compute node node037 (-x is short for --exclude)
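A minimal batch script might look like the one below. It is a sketch with placeholder values: the job name, task count, time limit and output file pattern are assumptions, and many sites also require directives such as a partition or an account. The heredoc simply writes the script to script.slurm.

```shell
# Write a minimal Slurm batch script to script.slurm (placeholder values).
cat > script.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
#SBATCH --output=example-%j.out

# srun launches the 4 tasks requested above in parallel;
# each task prints the name of the node it runs on.
srun hostname
EOF
```

Submit it with sbatch script.slurm; in the output file name, %j is replaced by the job ID. Adding --parsable (sbatch --parsable script.slurm) makes sbatch print only the job ID, which is convenient for capturing it in a variable.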
To cancel a job
scancel job-id
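scancel can also select jobs by owner and state instead of by ID. The sketch below wraps scancel --user and --state; the function name cancel_my_jobs is hypothetical, and the three accepted states follow the scancel man page's description of --state (PENDING, RUNNING, SUSPENDED).

```shell
# Hypothetical helper: cancel all of your own jobs that are in a given state.
# Example: cancel_my_jobs PENDING
cancel_my_jobs() {
  case "$1" in
    PENDING | RUNNING | SUSPENDED) ;;
    *)
      echo "usage: cancel_my_jobs PENDING|RUNNING|SUSPENDED" >&2
      return 2
      ;;
  esac
  # --user restricts cancellation to your jobs; --state filters by job state
  scancel --user="$(id -un)" --state="$1"
}
```

Restricting by user is a safety choice: without it, a user with administrator rights could cancel other people's jobs.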
Information about partitions and nodes
sinfo
To list free nodes
The output also shows the partition to which the nodes belong.
sinfo --states=idle
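To count idle nodes per partition, a node-oriented listing (-N) without the header (-h) and with a custom format (-o '%N %P', node name and partition) can be piped through a small counter. In this sketch idle_per_partition is a hypothetical helper; it strips the * that sinfo appends to the default partition's name, and a node that belongs to several partitions is counted once in each.

```shell
# Hypothetical helper: read "NODE PARTITION" lines and print
# "PARTITION COUNT" for each partition, sorted by partition name.
idle_per_partition() {
  awk '{
    part = $2
    sub(/\*$/, "", part)      # drop the default-partition marker
    count[part]++
  }
  END { for (p in count) print p, count[p] }' | sort
}

# Usage on a cluster:
# sinfo -h -N --states=idle -o '%N %P' | idle_per_partition
```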
Node states
- mix : consumable resources partially allocated
- idle : no consumable resources allocated; available for new requests
- drain : unavailable for use per system administrator request
- drng : currently executing a job, but will not be allocated to additional jobs. The node will be changed to state DRAINED when the last job on it completes
- alloc : consumable resources fully allocated
- down : unavailable for use. Slurm can automatically place nodes in this state if some failure occurs.
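When reading sinfo output in a script, the compact codes listed above can be decoded with a case statement. This sketch covers only the six states in this memo; the helper name is hypothetical, and it first drops any one-character flag sinfo may append to a state (for example * for a node that is not responding).

```shell
# Hypothetical helper: print a short description of a compact node state.
explain_node_state() {
  # Drop everything from the first non-lowercase character, e.g. "idle*" -> "idle"
  state=${1%%[!a-z]*}
  case "$state" in
    mix)   echo "partially allocated" ;;
    idle)  echo "available" ;;
    drain) echo "drained by an administrator" ;;
    drng)  echo "draining: finishing current jobs, accepts no new ones" ;;
    alloc) echo "fully allocated" ;;
    down)  echo "unavailable" ;;
    *)     echo "unknown state: $1" >&2; return 1 ;;
  esac
}
```

Usage: explain_node_state 'idle*' prints "available".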
State of your jobs
squeue --me
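squeue also accepts -h (no header) and a custom format with -o, which makes its output easy to post-process. As a sketch, the hypothetical count_job_states below tallies the compact state codes (%t) of your jobs, for instance to see at a glance how many are pending and how many are running.

```shell
# Hypothetical helper: read one compact job-state code per line
# (e.g. PD, R, CG) and print "STATE COUNT", sorted by state.
count_job_states() {
  awk 'NF { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage on a cluster:
# squeue --me -h -o '%t' | count_job_states
```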
Job states
- BF BOOT_FAIL Job terminated due to launch failure.
- CA CANCELLED Job was explicitly cancelled.
- CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
- CF CONFIGURING Job has been allocated resources, but is waiting for them to become ready for use.
- CG COMPLETING Job is in the process of completing.
- F FAILED Job terminated with a non-zero exit code or other failure condition.
- NF NODE_FAIL Job terminated due to failure of one or more allocated nodes.
- OOM OUT_OF_MEMORY Job experienced out of memory error.
- PD PENDING Job is awaiting resource allocation.
- PR PREEMPTED Job terminated due to preemption.
- R RUNNING Job currently has an allocation.
- RD RESV_DEL_HOLD Job is being held after requested reservation was deleted.
- RF REQUEUE_FED Job is being requeued by a federation.
- RH REQUEUE_HOLD Held job is being requeued.
- RQ REQUEUED Completing job is being requeued.
- RS RESIZING Job is about to change size.
- SI SIGNALING Job is being signaled.
- SE SPECIAL_EXIT The job was requeued in a special state.
- SO STAGE_OUT Job is staging out files.
- ST STOPPED Job has an allocation, but execution has been stopped with the SIGSTOP signal. CPUs have been retained by this job.
- S SUSPENDED Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
- TO TIMEOUT Job terminated upon reaching its time limit.
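For scripting, it helps to know which of these states mean the job is over. In this sketch the hypothetical job_is_finished treats BF, CA, CD, F, NF, OOM, PR and TO (short codes as shown by squeue, or the full names as shown by sacct) as final; that grouping is an assumption drawn from the descriptions above, and a site that requeues preempted jobs may want to drop PR from it.

```shell
# Hypothetical helper: succeed (return 0) if a job state, given either as a
# compact code (squeue %t) or as a full name (sacct State), means the job
# has ended; fail (return 1) otherwise.
job_is_finished() {
  case "$1" in
    BF | CA | CD | F | NF | OOM | PR | TO) return 0 ;;
    BOOT_FAIL | CANCELLED | COMPLETED | FAILED) return 0 ;;
    NODE_FAIL | OUT_OF_MEMORY | PREEMPTED | TIMEOUT) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on a cluster (12345 is a placeholder job ID):
# job_is_finished "$(sacct -j 12345 -X -n -o State | awk '{print $1}')"
```

sacct is used in the usage line because squeue stops listing a job shortly after it ends, while sacct keeps the final state.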
Job in real time
To run a job in real time (in the foreground, with output shown in your terminal), use srun; it has a wide variety of options:
srun [options] command [arguments]
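A common real-time use of srun is opening an interactive shell on a compute node with --pty. The sketch below is a hypothetical wrapper; the single task and the 30-minute default are placeholder choices, and your site may also require --partition or --account.

```shell
# Hypothetical wrapper: open an interactive bash shell on a compute node.
# Optional argument: time limit in minutes (default 30).
interactive_shell() {
  minutes=${1:-30}
  case "$minutes" in
    *[!0-9]*)
      echo "usage: interactive_shell [MINUTES]" >&2
      return 2
      ;;
  esac
  # --pty attaches a pseudo-terminal so the shell behaves interactively
  srun --ntasks=1 --time="$minutes" --pty bash -i
}
```

Usage: interactive_shell 60 requests a one-hour session; leaving the shell with exit ends the job. For a non-interactive run, srun --ntasks=4 hostname runs four copies of hostname and prints their output directly.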