Submitting jobs with Slurm

Resource sharing on a high-performance cluster dedicated to scientific computing is organised by a piece of software called a resource manager or job scheduler. Users do not run software on such a cluster like they would on a workstation; instead, they must submit jobs, which are run unattended, at the time and on the resources decided by the scheduler's algorithm.

Slurm is a resource manager and job scheduler designed to do just that, and much more. It was originally created by people at the Livermore Computing Center, and has grown into a full-fledged open-source project backed by a large community, commercially supported by the original developers, and installed on many of the Top500 supercomputers.

Anatomy of a cluster

A high-performance compute cluster (HPC cluster) is made of a set of powerful computers, called nodes in this context, linked together through a fast network. Multiple form factors exist, but a common one is the “pizza box” shape, depicted in the image below showing multiple such servers stacked in a data center rack.

[Image: rack-mounted “pizza box” servers in a data center]

A cluster typically hosts multiple types of nodes:

  • management nodes: these computers run the management services like Slurm, but also the user identity database, the monitoring and reporting applications, etc.;
  • login nodes: these are the computers users log into through SSH to install software and submit jobs;
  • storage nodes: these computers host user files in possibly multiple filesystems; and
  • compute nodes: these are the computers where jobs run and where the computations actually take place. They are often organised into partitions.

Alongside these most common types, a cluster can also comprise

  • visualization nodes: computers dedicated to data visualization with specific hardware;
  • I/O nodes: computers dedicated to data ingress and egress to and from the cluster.

The typical usage therefore involves connecting to a login node, making sure data are stored on the right filesystems and the software is available, and then submitting jobs. The jobs are dispatched to the compute nodes by the resource manager based on the match between the resources requested by the job and those offered by the compute nodes.

Warning

On a cluster, you do not run computations on the login node. Computations belong on the compute nodes, when and where the scheduler decides.

Inside a compute node, the computation is carried out by the processors. A processor looks like this:

[Image: a multi-core processor]

A typical compute node has one, two, or four sockets on the motherboard, each of which can host a processor. The processors are made of multiple cores, which can be thought of as independent computing units, each one in turn possibly “divided” into two hardware threads.

Hardware threads are a technology designed to hide the latencies of the memory and feed the compute units fast enough to keep them busy all the time. They can be activated or deactivated depending on the cluster. Depending on the workload, they can benefit or slightly harm computation times.

Compute nodes also provide volatile working memory, or RAM, where data are stored while they are being processed or produced, before they are eventually saved to permanent storage (disks).

Compute nodes can also offer accelerators, such as GPUs, which are pieces of hardware to which heavy computations can be offloaded. GPUs are often much more powerful than regular processors, but do not offer the same flexibility and require specific programming skills.

Gathering cluster information

Slurm offers the sinfo command to get an overview of the resources offered by the cluster. By default, sinfo lists the partitions that are available. A partition is a set of compute nodes (computers dedicated to... computing) grouped logically based on either physical properties of the hardware or job scheduling policies. Typical examples include partitions dedicated to debugging, where only small and short jobs can be scheduled, or partitions dedicated to visualization, with nodes equipped with specific graphics cards.

# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE  NODELIST
batch     up     infinite     2 alloc  node[08-09]
batch     up     infinite     7 idle   node[10-16]
debug*    up        30:00     7 idle   node[01-07]

In the above example, we see two partitions, named batch and debug. The latter is the default partition, as it is marked with an asterisk. All nodes of the debug partition are idle, while two nodes of the batch partition are being used. The nodes in this example are named node01 to node16.

The sinfo command also lists the time limit (column TIMELIMIT) to which jobs are subject. On every cluster, jobs are limited to a maximum run time, to allow job rotation and give every user a chance to see their job start. Generally, the larger the cluster, the shorter the maximum allowed time.

If the output of the sinfo command is organised differently from the above, it probably means default options are set through environment variables, or that wrappers are used at your site to provide more suitable defaults. You could also be running a different Slurm version.

The sinfo command can also output the information in a node-oriented fashion, with the -N argument.

# sinfo -N -l
NODELIST        NODES PARTITION     STATE  CPUS  S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
node[01-02]         2    debug*      idle  32    2:8:2   3448    38536     16 Intel   (null)
node[03,05-07]      4    debug*      idle  32    2:8:2   3384    38536     16 Intel   (null)
node04              1    debug*      down  32    2:8:2   3394    38536     16 Intel   "Disk replacement"
node[08-09]         2     batch allocated  32    2:8:2    246    82306     16 AMD     (null)
node[10-16]         7     batch      idle  32    2:8:2    246    82306     16 AMD     (null)

With the -l argument, more information about the nodes is provided, among which the number of “CPUs” (CPUS), which is the number of processing units that jobs can use. It generally corresponds to the number of sockets (S) times the number of cores per socket (C) times the number of hardware threads per core (T in the S:C:T column), but can be lower if some CPUs are reserved for system use.

The other columns report the volatile working memory (RAM – MEMORY), the size of the local temporary disk, also called local scratch space (TMP_DISK), and the node “weight” (an internal parameter specifying preferences among nodes for allocations when there are multiple possibilities). The second-to-last column (AVAIL_FE) shows so-called features of the nodes, which are set by the administrator and can refer to a processor vendor or family, a specific network equipment, or any desirable feature of the node that can be used to prefer one node type over another.

The last column (REASON), if not null, describes why a node is not available.

Note

You can actually specify precisely what information you would like sinfo to output by using its --format argument. For more details, have a look at the command manpage with man sinfo.
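
For instance, the following illustrative invocation reproduces roughly the default columns; the %P, %a, %l, %D, %t and %N specifiers stand for partition, availability, time limit, node count, state and node list respectively:

sinfo --format="%P %a %l %D %t %N"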

Anatomy of a job

Jobs are made of one or multiple sequential steps, each consisting of one or multiple parallel tasks that will be dispatched to possibly distinct nodes.

Each task will be allocated CPUs, memory, and possibly other generic resources in an exclusive manner by Slurm. Two jobs cannot share the same resource unless explicitly forced by the admins, but that is generally not the case. Therefore, a job can only start when all the resources it needs are free and not claimed by another, higher-priority job. Jobs are indeed assigned a priority when they are submitted, which can depend upon multiple factors. More on that later.

For the scheduling process to work properly, you will need to describe your job before you submit it:

  • what the steps are (i.e. which program must be run and how);
  • how many tasks there will be;
  • what resources each task needs (CPU, memory, etc.); and
  • for how long the job is supposed to run.

All of these, along with potentially additional submission options, can be described in a submission script.

A submission script is a shell script, e.g. a Bash script, whose comments, if they are prefixed with #SBATCH, are understood by Slurm as parameters describing resource requests and other submission options. You can get the complete list of parameters from the sbatch manpage (man sbatch).

Important

The #SBATCH directives must appear at the top of the submission file, before any other line except for the very first line which should be the shebang (e.g. #!/bin/bash).

The script itself is a job step. Other job steps are created with the srun command, which takes as arguments the name and options of the program to run.

Note

If there is only one step, the srun command can be omitted, but doing so has consequences for how signals are handled and how accurate the reporting is. It is often best to use it in all cases.

For instance, the following script, hypothetically named submit.sh,

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#SBATCH --partition=debug
#
#SBATCH --time=10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=100

srun hostname
srun sleep 60

describes a job made of 2+1 steps (the submission script itself plus two explicit steps created by calling srun twice), each step consisting of only one task that needs one CPU and 100MB of RAM. The first step will run the hostname command, and the second one the useless sleep command. The job is supposed to run for 10 minutes on the debug partition, be named test, and create an output file named res.txt.

Important

It is important to note that, as the job runs unattended, it is not attached to a terminal (screen), so everything the program would normally write to the terminal is redirected by Slurm to a file whose name is specified by the --output parameter. (For the same reason, the program cannot expect input from the user through the keyboard to run properly.)

Once the submission script is written properly, you need to submit it to Slurm through the sbatch command, which, upon success, responds with the jobid assigned to the job. (The dollar sign below is the shell prompt.)

$ sbatch submit.sh
sbatch: Submitted batch job 12321

Warning

Make sure to submit the job with sbatch and not with bash; also, do not execute it directly. Doing so would ignore all resource requests, and your job would run with minimal resources, or could possibly run on the frontend rather than on a compute node.

The job then enters the queue in the PENDING state. You can verify this with

$ squeue --me

Once resources become available and the job has the highest priority, an allocation is created for it and it goes to the RUNNING state. If the job completes correctly, it goes to the COMPLETED state; otherwise, it is set to the FAILED state.

Why is my job not starting?

A job can stay in the PENDING state for a long time, especially if the maximum allowed time on the cluster is large. It is always important, though, to check the reason why the job is pending. If the reason is Priority, then there is not much you can do, but there are other possible reasons, some of which require action on your part.

The reason is listed in the last column of the default output of squeue and the description of the reason codes is available from the documentation.
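
As an illustration, the following command restricts the output to your own pending jobs and prints the reason in the last column (%i, %P, %j, %t and %R stand for jobid, partition, name, state and reason):

squeue --me --states=PENDING --format="%.10i %.9P %.8j %.2t %R"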

Gathering job information

The squeue command shows the list of jobs which are currently running (they are in the RUNNING state, noted as ‘R’) or waiting for resources (noted as ‘PD’, short for PENDING).

# squeue
JOBID PARTITION NAME USER ST  TIME  NODES NODELIST(REASON)
12345     debug job1 dave  R   0:21     4 node[09-12]
12346     debug job2 dave PD   0:00     8 (Resources)
12348     debug job3 ed   PD   0:00     4 (Priority)

The above output shows that one job is running, whose name is job1 and whose jobid is 12345. The jobid is a unique identifier used by many Slurm commands when actions must be taken about one particular job. For instance, to cancel job job1, you would use scancel 12345. TIME is the time the job has been running so far. NODES is the number of nodes allocated to the job, while the NODELIST(REASON) column lists the nodes allocated to running jobs. For pending jobs, that column gives the reason why the job is pending. In the example, job 12346 is pending because the requested resources (CPUs, or other) are not available in sufficient amounts, while job 12348 is waiting for job 12346, whose priority is higher, to run. Each job is indeed assigned a priority depending on several parameters whose details are explained in section Slurm priorities. Note that the priority of pending jobs can be obtained with the sprio command. Other reasons exist; they are listed in the Slurm documentation.

There are many switches you can use to filter the output: by user (--user), by partition (--partition), by state (--state), etc. As with the sinfo command, you can choose what you want squeue to output with the --format parameter.

Interestingly, you can get near-realtime information about your running job (memory consumption, etc.) with the sstat command, by running sstat -j jobid. You can select what you want sstat to output with the --format parameter. Refer to the manpage for more information (man sstat).
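
For instance, a command along these lines reports the average CPU time and peak memory use of the steps of the job submitted earlier (12321); depending on how the steps were launched, you may need to target a specific step such as 12321.0 or 12321.batch:

sstat -j 12321 --format=JobID,AveCPU,MaxRSS,MaxDiskWrite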

Upon completion, the output file contains the result of the commands run in the script file. In the above example, you can see it with the cat res.txt command.

The example shown above illustrates a serial job, which uses a single CPU on a single node. It does not take advantage of multi-processor nodes or of the multiple compute nodes available in a cluster, in contrast to the example jobs in the squeue output, which use multiple nodes.

The next sections explain how to create parallel jobs.

Going parallel

There are several ways a parallel job, one that leverages multiple compute units (CPUs) at the same time, can be created.

Warning

The type of job that can be created depends on the capabilities of the software being used. Parallel computing requires specific programming techniques; if they are not used, the same computation is simply performed multiple times, with no actual benefit.

If the resource request fits the capabilities of the program, Slurm will be able to allocate every parallel sequence of computing instructions, called a thread in programming, to a different CPU in a precise one-to-one mapping. This process is called “binding”, or “pinning”. If the program creates more threads than there are CPUs available, the operating system will iteratively start and stop threads to allocate CPU time to each of them, in a procedure called “context switching”. That procedure consumes resources in itself and incurs enough overhead to be considered undesirable for high-performance computing.
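
As a hypothetical illustration, binding can also be requested explicitly when launching a step; whether this is needed, and what the default behaviour is, depend on the cluster configuration:

srun --cpu-bind=cores ./myprog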

Single-node parallelism

In the Linux operating system, running a program initiates a process, that is, a running copy of the program in main memory. A process initially consists of one single thread, but it can clone itself into multiple threads that all share the same memory space (shared-memory programming) and can perform computations in parallel; it can also spawn (“fork” in Linux terms) other processes, whose memory spaces are independent and with which it can communicate.

Multithreaded programs can be written using the pthreads library, but most multithreaded scientific software is written with OpenMP. OpenMP allows creating Single Program, Multiple Data (SPMD) programs: the same program is run multiple times, but each instance is identified by a thread ID, or rank, that can be used to differentiate the behaviour of each instance, for instance by working on a separate subset of the data. Optimised linear algebra or signal processing libraries use OpenMP a lot for parallelism.

Note

Even if the software you write is purely sequential and does not use OpenMP itself, if it links to optimized scientific libraries (as is the case for interpreted languages like Python, R, or Julia), those libraries will use OpenMP behind the scenes, so you should request at least 4 or 8 CPUs.
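
A minimal sketch of that situation, assuming a hypothetical script my_script.py that relies on a threaded library honouring OMP_NUM_THREADS:

#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match the number of threads to the allocated CPUs
srun python my_script.py                      # my_script.py is a hypothetical example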

Multi-node parallelism

Multi-node parallelism assumes distributed-memory programming, with either
  • no communication between the processes on the different nodes;
  • message passing through the network; or
  • communication through a common filesystem.

The first case, called embarrassingly parallel, simply relies on multiple instances of the same program being started with a different environment and, possibly, different arguments.

The second case is typical of high-performance computing and requires specific large-bandwidth, low-latency network hardware. Most message-passing scientific software uses MPI, a library that takes care of instantiating multiple instances of the same program on different nodes and allows them to send and receive messages through the network. This is another example of SPMD programming.

The last case is mostly encountered in Master/Worker setups (a specific case of Multiple Program, Multiple Data) where one program (the master) is designed to prepare and distribute work, while another (the worker) is designed to perform it. Only one master is run, but there can be as many worker instances as needed.

Important

The above taxonomy must be taken with a grain of salt. Some Master/Worker programs use the network rather than the disk for communication, or use interprocess communication; in the latter case, they can only do single-node parallelism. As for message passing, it can perfectly be done within a single node. Shared-memory programming can even be done across multiple nodes if specific libraries are used.

Slurm tasks

In the Slurm context, a task is to be understood as a running instance of a program, which matches the definition of a “process” seen earlier. Tasks for each step are spawned with the srun command, as explained earlier.

Note

MPI programs can also be started with the mpirun command instead. The mpirun command is provided by the MPI library, while srun is provided by Slurm. There are pros and cons to using one or the other, so the choice basically depends on the specific parameters that are needed for starting the program.

The number of tasks (processes) that srun must spawn is decided by the value of the --ntasks option. Each process will be allocated the number of CPUs dictated by the value of the --cpus-per-task option, and the amount of memory obtained by multiplying the value of the --mem-per-cpu option by the number of CPUs allocated per task.
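
As a quick illustration of that arithmetic (the values are arbitrary):

#SBATCH --ntasks=4          # srun will spawn 4 processes
#SBATCH --cpus-per-task=2   # each gets 2 CPUs: 8 CPUs in total
#SBATCH --mem-per-cpu=500   # 500 MB per CPU: 1000 MB per task, 4000 MB in total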

Tasks cannot be split across several compute nodes, so all CPUs requested with the --cpus-per-task option will be located on the same compute node. By contrast, CPUs allocated to distinct tasks can end up on distinct nodes.

Each task has an environment variable called SLURM_PROCID set by Slurm to a distinct value from 0 to the number of tasks minus one.
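
A quick, illustrative way to see it in action, from inside a job or allocation that provides at least four tasks:

srun --ntasks=4 bash -c 'echo "Task $SLURM_PROCID runs on $(hostname)"'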

More submission script examples

Here are some quick sample submission scripts. For more detailed information, make sure to have a look at the Slurm FAQ and to follow our training sessions or watch the training video. There is also an interactive Script Generation Wizard you can use to help you create submission scripts.

Message passing example (MPI)

#!/bin/bash
#
#SBATCH --job-name=test_mpi
#SBATCH --output=res_mpi.txt
#
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

module load OpenMPI
srun hello.mpi

Request four cores on the cluster for 10 minutes, using 100 MB of RAM per core. Assuming hello.mpi was compiled with MPI support, srun will create four instances of it, on the nodes allocated by Slurm.

You can try the above example by downloading the example hello world program from Wikipedia (name it for instance wiki_mpi_example.c), and compiling it with

module load OpenMPI
mpicc wiki_mpi_example.c -o hello.mpi

The res_mpi.txt file should contain something like

We have 4 processors
Hello 1! Processor 1 reporting for duty
Hello 2! Processor 2 reporting for duty
Hello 3! Processor 3 reporting for duty

Shared memory example (OpenMP)

#!/bin/bash
#
#SBATCH --job-name=test_omp
#SBATCH --output=res_omp.txt
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hello.omp

The job will be run in an allocation where four cores have been reserved on the same compute node.

You can try it by using the hello world program from Wikipedia (name it for instance wiki_omp_example.c) and compiling it with

gcc -fopenmp wiki_omp_example.c -o hello.omp

The res_omp.txt file should contain something like

Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
Hello World from thread 2
There are 4 threads

Embarrassingly parallel workload example (job array)

This setup is useful for problems based on random draws (e.g. Monte Carlo simulations). In such cases, if you have four programs each drawing 1000 random samples and you combine their output afterwards (with another program), you get the equivalent of drawing 4000 samples.

Another typical use of this setting is a parameter sweep: the same computation is carried out several times by a given code, each run differing only in the initial value of some high-level parameter. An example could be the optimisation of an integer-valued parameter through range scanning in a job array:

#!/bin/bash
#
#SBATCH --job-name=test_emb_arr
#SBATCH --output=res_emb_arr.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
#
#SBATCH --array=1-8

srun ./my_program.exe $SLURM_ARRAY_TASK_ID

In that configuration, the my_program.exe command will be run eight times, in eight distinct jobs, each time with a different argument passed through the SLURM_ARRAY_TASK_ID environment variable defined by Slurm, ranging from 1 to 8, as specified by the --array parameter.

The same idea can be used to process several data files: each instance of the program is given a different file to read, based upon the value of the SLURM_ARRAY_TASK_ID environment variable. For instance, assuming there are exactly eight files in /path/to/data, we can create the following script:

#!/bin/bash
#
#SBATCH --job-name=test_emb_arr
#SBATCH --output=res_emb_arr.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
#
#SBATCH --array=0-7

FILES=(/path/to/data/*)

srun ./my_program.exe ${FILES[$SLURM_ARRAY_TASK_ID]}

In this case, eight jobs will be submitted, each with a different filename from the Bash array FILES[] given as an argument to my_program.exe. As the FILES[] Bash array is zero-indexed, the Slurm job array IDs must also start at 0, hence --array=0-7. One pain point is that the number of files in the directory must match the number of jobs in the array.
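
One hypothetical way to ease that pain point is to size the array at submission time, since options passed on the sbatch command line override the #SBATCH directives in the script (submit_array.sh is a hypothetical name for the script above):

N=$(ls /path/to/data | wc -l)                 # count the files to process
sbatch --array=0-$(( N - 1 )) submit_array.sh # size the job array accordingly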

Note that the same recipe can be used with a numerical argument that is not simply an integer sequence, by defining a Bash array ARGS[] containing the desired values:

ARGS=(0.05 0.25 0.5 1 2 5 100)

srun ./my_program.exe ${ARGS[$SLURM_ARRAY_TASK_ID]}

Here again, the Slurm job array numbering must start at 0 to make sure all items in the ARGS[] Bash array are processed; with the seven values above, that means --array=0-6.

Warning

If the running time of your program is small, say ten minutes or less, creating a job array will incur a lot of overhead and you should consider packing your jobs.

Packed jobs example

By default, the srun command in a submission script inherits all non-GRES resources allocated to the job, but with the --exact parameter, you can split the resources and allocate them to multiple steps running in parallel.

As an example, the following job submission script asks Slurm for 8 CPUs and then runs the myprog program 1000 times with arguments from 1 to 1000. Thanks to the -N1 -n1 -c1 --exact options, only 8 instances will effectively be running at any point in time, each allocated one CPU. You can at this point decide to allocate several CPUs or tasks per instance by adapting the corresponding parameters.

#! /bin/bash
#
#SBATCH --ntasks=8
for i in {1..1000}
do
   srun -N1 -n1 -c1 --exact ./myprog $i &
done
wait

The for-loop can be replaced with GNU parallel if installed on your system:

parallel -P $SLURM_NTASKS srun -N1 -c1 -n1 --exact ./myprog ::: {1..1000}

Similarly, many files can be processed with one job submission script. The following script will run myprog for every file in /path/to/data, but at most 8 at a time, using one CPU per task.

#! /bin/bash
#
#SBATCH --ntasks=8
for file in /path/to/data/*
do
   srun -N1 -n1 -c1 --exact ./myprog $file &
done
wait

Here again the for-loop can be replaced with another command, xargs:

find /path/to/data -print0 | xargs -0 -n1 -P $SLURM_NTASKS srun -n1 --exclusive ./myprog

Master/worker program example

#!/bin/bash
#
#SBATCH --job-name=test_ms
#SBATCH --output=res_ms.txt
#
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

srun --multi-prog multi.conf

With file multi.conf being, for example, as follows

0      echo I am the Master
1-3    echo I am worker %t

The above instructs Slurm to create four tasks (or processes), one running echo I am the Master, and the other three running echo I am worker %t. The %t placeholder will be replaced with the task ID. This is typically used in a producer/consumer setup where one program (the master) creates computing tasks for the other program (the workers) to perform.

Upon completion of the above job, file res_ms.txt will contain

I am worker 2
I am worker 3
I am worker 1
I am the Master

though not necessarily in the same order.

Hybrid jobs

You can mix multi-processing (MPI) and multi-threading (OpenMP) in the same job, simply like this:

#! /bin/bash
#
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
module load OpenMPI
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./myprog

or even a job array of hybrid jobs:

#! /bin/bash
#
#SBATCH --array=1-10
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
module load OpenMPI
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./myprog $SLURM_ARRAY_TASK_ID

GPU jobs

If you want to claim a GPU for your job, you need to specify the GRES (Generic Resource Scheduling) parameter in your job script. Please note that GPUs are only available in a specific partition whose name depends on the cluster.

#SBATCH --partition=PostP
#SBATCH --gres=gpu:1

A sample job file requesting a node with a GPU could look like this:

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=1000
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1

module load CUDA

srun ./my_cuda_program

Interactive jobs

Slurm jobs are normally batch jobs in the sense that they are run unattended. If you want to have a direct view on your job, for tests or debugging, you have two options.

If you simply need an interactive Bash session on a compute node, with the same environment set as for batch jobs, run the following command:

srun --pty bash -l

Doing that, you are submitting a 1-CPU, default memory, default duration job that will return a Bash prompt when it starts.
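
If the defaults are not enough, the usual resource options can be added to that command; the values below are purely illustrative:

srun --ntasks=1 --cpus-per-task=4 --mem-per-cpu=1000 --time=1:00:00 --pty bash -l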

If you need more flexibility, you will need to use the salloc command. The salloc command accepts the same parameters as sbatch as far as resource requests are concerned. Once the allocation is granted, you can use srun the same way you would in a submission script.
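
A minimal sketch, with illustrative resource values; depending on the cluster configuration, salloc typically gives you a shell on the login node with the Slurm environment set, from which srun launches steps on the allocated nodes:

salloc --ntasks=4 --time=30:00
srun hostname        # runs on the allocated compute node(s)
exit                 # releases the allocation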

Further readings

Now that you know how to submit a job, you might be interested in learning more about the following topics: