SLURM-on-demand

Contents:

  1. General information

  2. Storage

  3. Scheduler - general considerations

  4. Submitting batch jobs

  5. Monitoring/control of jobs and queues

  6. Running interactive jobs

  7. Array jobs

  8. Docker containers in SLURM jobs

  9. Other considerations

 

General information

A SLURM instance started using the manage_slurm tool (a SLURM-on-demand instance) runs on machine(s) within the BioHPC Cloud. Therefore, all software and tools available in the Cloud are also available to any SLURM job. For a list of available software and information on how to run it, refer to the BioHPC Cloud software page.

To use SLURM on a machine where it is running, the user must have an active reservation on this machine or, in the case of hosted servers, be a member of the group authorized to use the server.

Once activated, the SLURM job scheduler helps streamline and prioritize jobs from multiple users, while shielding the machine from CPU and memory oversubscription. All computationally intensive activities should be run through SLURM.

 

Storage (recap of BioHPC storage policies)

Several tiers of storage are available to the users of BioHPC Cloud:

Home directories

Home directories (the /home file system) are located on BioHPC Cloud networked storage. Please do not, under any circumstances, run any computations which repeatedly write or read large amounts of data to/from your home directory. Home directory storage space is limited; the limit can be increased for a fee.

Local scratch space

The local scratch directory is mounted as /local/workdir (also available as /workdir) and is the space that should be used for computing, i.e., for storing input, output, and intermediate files of running jobs. Besides /workdir, the /local file system also hosts the directory /local/storage, intended for more permanent (i.e., non-scratch) data, as well as some smaller directories. The capacity of /local (machine-dependent) is shared between all users and jobs on a first-come, first-served basis. SLURM does not control the scratch space, and there is no automatic clean-up after jobs (as long as the reservation is active). It is therefore important that users keep this space clean, in particular by frequently removing any files they no longer need.

Since your SLURM jobs may not be the only ones using the scratch space, it is good practice to create a subdirectory within the scratch space for your job's files. A good convention is to have your script create a scratch subdirectory for each job, named after your username and job ID, e.g.,

mkdir -p /workdir/$USER/$SLURM_JOB_ID

and use this subdirectory to store all of the job's files. It is also good practice to delete your scratch files after you are done with them. Preferably, this should be done at the end of the script, after all results have been saved to permanent storage (such as the home directory).
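
For example, a minimal sketch of this pattern (the input file, program name, and home-directory paths are hypothetical) could look as follows:

# create a per-job scratch directory and work there
mkdir -p /workdir/$USER/$SLURM_JOB_ID
cd /workdir/$USER/$SLURM_JOB_ID

# copy input from permanent storage, run the computation, save the results
cp /home/$USER/project/input.dat .
my_program input.dat > output.dat
cp output.dat /home/$USER/project/

# clean up the scratch space once the results are safely stored
cd /home/$USER
rm -rf /workdir/$USER/$SLURM_JOB_ID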

On hosted servers, the directory /local/storage is exported via NFS and can be mounted on any other BioHPC machine using the utility /programs/bin/labutils/mount_server as follows:

/programs/bin/labutils/mount_server mysrvr /storage

(replace mysrvr with the name of your actual machine). This will make /local/storage from mysrvr available on the machine where the command was issued, under /fs/mysrvr/storage. The command can be executed by any user. As is the case with /home storage, the mounts under /fs are accessed over the network and must not be used directly by I/O-heavy jobs.

 

Scheduler - general considerations

SLURM Workload Manager  (SLURM = Simple Linux Utility for Resource Management)  provides tools for job submission and control.

Detailed documentation for SLURM can be found at https://slurm.schedmd.com/overview.html and the multiple links there. In particular, https://slurm.schedmd.com/documentation.html contains a thorough explanation of various aspects of SLURM from both the user and admin perspectives, https://slurm.schedmd.com/man_index.html provides detailed descriptions and syntax of all SLURM commands, and a short, two-page summary is available at https://slurm.schedmd.com/pdfs/summary.pdf

The official documentation is thorough but complicated, and its structure will make you dizzy. This document is intended to be a succinct extract relevant to SLURM-on-demand users.

Each job should be submitted to SLURM with specific requests for slots (the number of threads to be allocated), RAM, and (optionally) maximum execution time. Based on these requests, SLURM will launch the job at the appropriate time, when resources become available. Until then, the job will wait in the queue for its turn to run. If not specified otherwise at job submission, one slot and 4 GB of RAM will be allocated, with unlimited wall-clock time. The job will be confined to the requested number of slots regardless of how many threads it actually attempts to spawn. If the job attempts to allocate more RAM than granted by SLURM, it will most likely crash. The default 4 GB of RAM may be too small or too large - please know the memory requirements of your jobs and adjust this request accordingly (see also 'How do I know memory needs of my job' later in this document).

The SLURM configuration features a single queue (referred to in SLURM jargon as a partition), called regular, to which jobs are submitted. There are no CPU or RAM constraints on jobs submitted to this partition other than the actual physical limits of the machine.

In the case of multiple users and jobs competing for cluster resources at the same time, job priorities are decided using a "fair share" policy. This means that the scheduler will consider current and past usage by each user in an effort to give all users fair access to the cluster.

Submitting batch jobs

SLURM offers several mechanisms for job submission. Perhaps the most popular one is batch job submission. Create a shell script, say submit.sh, containing the commands involved in executing your job. In the header of the script, you can specify various options to inform SLURM of the resources the job will need and how it should be treated. The SLURM submission options are preceded by the keyword #SBATCH. There are plenty of options to choose from; the example script header below contains the most useful ones:

#!/bin/bash -l                   (change the default shell to bash; '-l' ensures your .bashrc will be sourced in, thus setting the login environment)
#SBATCH --nodes=1                (number of nodes, i.e., machines; will be 1 by default on single-node clusters)
#SBATCH --ntasks=8               (number of tasks; by default, 1 task=1 slot=1 thread)
#SBATCH --mem=8000               (request 8 GB of memory for this job; default is typically 1GB per job; here: 8)
#SBATCH --time=1-20:00:00        (wall-time limit for job; here: 1 day and 20 hours)
#SBATCH --partition=regular      (request partition(s) a job can run in; it defaults to 'regular')
#SBATCH --chdir=/home/bukowski/slurm   (start job in specified directory; default is the directory in which sbatch was invoked)
#SBATCH --job-name=jobname             (change name of job)
#SBATCH --output=jobname.out.%j  (write stdout+stderr to this file; %j will be replaced by the job ID)
#SBATCH --mail-user=email@address.com          (set your email address)
#SBATCH --mail-type=ALL          (send email at job start, end or crash - do not use if this is going to generate thousands of e-mails!)

NOTE: the parenthesized explanations after each option are for illustration only - you have to delete them from your actual script.

Once the script is ready, cd to the directory where it is located and execute the sbatch command to submit the job:

sbatch submit.sh

Instead of, or in addition to, the script header, submission options may also be specified directly on the sbatch command line, e.g.,

sbatch --job-name=somename --nodes=1 --ntasks=6 --mem=4000 submit.sh

If an option is specified both on the command line and in the script header, the command-line specification takes precedence.
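
Putting the pieces together, a complete submit.sh might look like the sketch below (my_program and the input file are hypothetical; adjust the resource requests to your own job):

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8000
#SBATCH --job-name=example
#SBATCH --output=example.out.%j

# create and move to a per-job scratch directory (see the Storage section above)
mkdir -p /workdir/$USER/$SLURM_JOB_ID
cd /workdir/$USER/$SLURM_JOB_ID

# hypothetical analysis using the 4 requested slots
cp /home/$USER/project/input.dat .
my_program --threads 4 input.dat > results.txt

# save results to permanent storage and clean up the scratch space
cp results.txt /home/$USER/project/
cd /home/$USER
rm -rf /workdir/$USER/$SLURM_JOB_ID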

 

Important notes about SBATCH options:

Shorthand notation: Most (although not all) options can be specified (in both the script header and the command line) using shorthand notation. For example, the options given in the header of the script above could be requested as follows:

sbatch -N 1 -n 8 --mem=8000 -t 1-20:00:00 -p regular -D /home/bukowski/slurm -J jobname -o jobname.out.%j --mail-user=email@address.com --mail-type=ALL  

 

Startup and output directories: the directories specified by the --chdir and --output options (or, if these are not specified, the directory where the sbatch command was run) must be mounted and visible on the node(s) where the job runs. Avoid using directories network-mounted from compute nodes (such as, for example, /fs/cbsubscb09/storage), since these are not guaranteed to be available at the very beginning of a job (even if the /programs/bin/labutils/mount_server command is issued later in the script). The /home file system is always mounted on all compute nodes, so it is a good idea to have --chdir and --output point to somewhere within your home directory (or to submit your job from there). Remember, though, that all input and output files accessed by the job must be located within the /workdir (and/or /SSD) directory, local to each node.

 

Inheriting the user's environment: If you want the job to run in the same environment as your login session, make sure that the first line of the script invokes the shell with the -l (login) option (#!/bin/bash -l). This will make the script read your .bashrc file and properly set up all environment variables. This may be necessary if your job requires a lot of environment customization, typically done through .bashrc.

 

Parallel jobs: If the program you are about to submit can spawn multiple processes or threads and you intend to use N>1 such processes or threads, request that many slots for your job using the option --ntasks N. Different programs are parallelized in different ways - one of them involves a library called OpenMP. To enable OpenMP multithreading, insert the following line near the beginning of your script (substitute an actual number for N):

export OMP_NUM_THREADS=N

If you are not sure whether your program actually uses OpenMP for parallelization, insert that line anyway. However, beware of programs that use both types of parallelization simultaneously, i.e., that create multiple processes, each of them multithreaded with OpenMP.
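
Instead of hard-coding N, one convenient option (a sketch, not a requirement) is to derive it from the number of slots SLURM actually allocated, e.g.:

# match the number of OpenMP threads to the number of slots requested via --ntasks
export OMP_NUM_THREADS=${SLURM_NTASKS:-1}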

Number of nodes: if your SLURM cluster consists of only one machine (node), this will, by default, be equal to 1. In general, on a multi-node cluster, if the program you are running does not use MPI (most software used at BioHPC is in this category), you should specify the option --nodes=1 (or -N 1). This will ensure that all threads spawned by that program run on the same machine, chosen automatically by SLURM. If you prefer a specific machine, you may request it using the --nodelist option, e.g., --nodelist=cbsubscb12 will make sure your job is sent to cbsubscb12. Only one machine can be specified - otherwise the request would be inconsistent with -N 1 and an error would occur.

If your program does use MPI, so that its processes can be spread out across multiple nodes, you can skip the --nodes option altogether and let SLURM decide how many nodes to use and which ones. Or you can explicitly request a number of nodes, e.g., --nodes=2 will spread your MPI processes over two nodes selected by SLURM. If you want specific nodes, request them using the --nodelist option, e.g., --nodelist=cbsubscb11,cbsubscb12 will make sure that your job uses both of these nodes (plus possibly some others if needed to satisfy the number of tasks requested via --ntasks).
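
As an illustration (a sketch only; my_mpi_program is hypothetical and the exact launcher depends on the MPI implementation your software was built with), an MPI job spread over two nodes could be submitted with

sbatch --nodes=2 --ntasks=32 --mem=16000 submit_mpi.sh

where submit_mpi.sh launches the MPI processes, for example via

mpirun -np $SLURM_NTASKS ./my_mpi_program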

Other options:

Project to charge a job to (in SLURM this is referred to as an 'account'): in SLURM it is possible to define different projects that users may belong to, in order to allocate charges for the use of the machine. This is currently not implemented on SLURM instances started by manage_slurm.

Excluding nodes: If you do not want your job to end up running on certain machines, you can exclude them using the --exclude (or -x) option. For example, --exclude=cbsubscb10,cbsubscbgpu01 will prevent your job from being sent to either of the two specified nodes.

 

 

How do I know memory needs of my job?

If you do not know the answer from studying the program's manual and related publications, you can run one or more jobs with increasing amounts of requested memory and record the job ID of the first job that did not crash (while the job is running, the job ID can be found in the first column of the output of the squeue command). After this job finishes, run the command

sacct -j <jobID> --format=JobID,User,ReqMem,MaxRSS,MaxVMSize,NCPUS,Start,TotalCPU,UserCPU,Elapsed,State%20

or

sacct_l -j <jobID>

and look for the value printed in the MaxRSS column - this is the maximum memory actually consumed by the job. Request a slightly larger value for subsequent jobs running the same code with similar input data and parameters.

If the number in the MaxRSS column is very close to the memory requested at submission, chances are the job is actually trying to allocate more physical memory, using swap space on disk. The resulting frequent communication with swap (referred to as thrashing) slows the job down and may manifest itself in UserCPU (time spent on the actual work) being much smaller than TotalCPU (the product of elapsed time and the number of slots reserved). If this happens, you may find it useful to limit the amount of virtual memory a job can allocate to the physical memory request by including the following statement at the beginning of the submitted script (but after all the #SBATCH header lines, if present):

ulimit -v $(ulimit -m)

Since the physical memory used cannot be larger than the allocated virtual memory, the statement above will prevent the job from 'spilling over' to swap and cause it to crash (rather than thrash) if it attempts to exceed the physical memory requested at submission, giving you an immediate signal that more memory is needed. Note, however, that memory needs established this way may be overblown, since most programs allocate more virtual memory than the physical RAM they ever use. Moreover, restricting virtual memory may have a detrimental effect on performance. Thus, the true memory need of your job should be found by examining MaxRSS; once it is known and requested through the --mem option, the ulimit statement above can be removed.

To learn how much of the requested resources your jobs actually consume, run the command get_slurm_usage.pl on one of the cluster nodes. For example,

get_slurm_usage.pl mysrvr  01/20/20  3

will produce per-job averages (along with standard deviations) of the requested and actually used numbers of slots and memory, computed over all your jobs longer than 1 minute which completed between Jan 18 (00:00) and Jan 21 (00:00), 2020. Of particular interest here are the discrepancies between the requested memory (avReqMem) and the memory actually used (avUsedMem). If the latter is much smaller than the former, you have been requesting too much memory for your jobs (thus making it unavailable to other jobs).

Alternative specifications of the number of slots and RAM

In SLURM, the option --ntasks (or -n in shorthand syntax) denotes the number of tasks. By default, each task is assigned one thread and therefore this is also equal to the number of threads. The default setting may, however, be modified to request multiple threads per task instead of one. For example,

sbatch --ntasks=8 --cpus-per-task=3 [other options]

will request 8*3=24 slots in total. From the point of view of most (if not all) jobs, this would just be equivalent to --ntasks=24.

Similarly, instead of requesting the total job memory via the --mem option, one could specify certain amount of memory per thread (cpu), using --mem-per-cpu option. For example,

sbatch -n 8 --cpus-per-task=3 --mem-per-cpu=2G [other options]

will request 8*3*2G = 48 GB of RAM for the job. For most jobs, this would be equivalent to the command

sbatch -n24 --mem=48G [other options]

 

Requesting GPUs (available only on clusters containing GPU-equipped servers)

To check whether GPUs are available on any machines in your cluster, run the following command on any of the cluster nodes:

scontrol show nodes | grep Gres | grep gpu

GPU(s) are available if the output contains one or more lines similar to

Gres=gpu:tP100:2

In this example, there are two GPUs of type tP100 installed on one of the servers (the GPU types are generally different on different servers). To request one of these GPUs for your SLURM job, add

--gres=gpu:tP100:1

 to your sbatch options. To grab both GPUs, use 

--gres=gpu:tP100:2

The comma-delimited list of IDs of the GPUs reserved for the job will be available (within the job script) as $CUDA_VISIBLE_DEVICES. Of course, for this to make sense, your code needs to be GPU-aware (i.e., written and compiled to use GPUs).
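
For example, a GPU job script could look like the sketch below (the GPU type tP100 comes from the example above, and my_gpu_program is a hypothetical GPU-aware application):

#!/bin/bash -l
#SBATCH --ntasks=2
#SBATCH --mem=8000
#SBATCH --gres=gpu:tP100:1

# IDs of the GPU(s) assigned to this job by SLURM
echo "Allocated GPU(s): $CUDA_VISIBLE_DEVICES"

# hypothetical GPU-aware application
my_gpu_program input.dat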

Environment variables available to SLURM jobs

For the complete list of environment variables provided by SLURM, see section 'OUTPUT ENVIRONMENT VARIABLES' of https://slurm.schedmd.com/sbatch.html. Here we quote only the ones that seem most important:

SLURM_JOB_CPUS_PER_NODE : number of CPUs (threads) allocated to this job

SLURM_NTASKS : number of tasks, or slots, for this job (as given by --ntasks option)

SLURM_MEM_PER_NODE : memory requested with --mem option

SLURM_CPUS_ON_NODE : total number of CPUs on the node (not only the allocated ones)

SLURM_JOB_ID : job ID of this job; may be used, for example, to name a scratch directory (a subdirectory of /workdir) or output files for the job. For array jobs, each array element will have a separate SLURM_JOB_ID

SLURM_ARRAY_JOB_ID : job ID of the array master job (see section 'Array jobs' later in this document)

SLURM_ARRAY_TASK_ID : task index of a task within a job array

SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX : minimum and maximum index of jobs within the array
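
As a quick illustration of how these can be used, the lines below (a sketch that could be placed near the top of a job script) record the allocation the job actually received:

# log the resources SLURM granted to this job
echo "Job $SLURM_JOB_ID: $SLURM_NTASKS task(s), requested memory: $SLURM_MEM_PER_NODE MB"
echo "CPUs allocated on this node: $SLURM_JOB_CPUS_PER_NODE (node total: $SLURM_CPUS_ON_NODE)"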

Monitoring/controlling jobs and queues

Useful commands (to be executed on any cluster node or on one of the login nodes):

sinfo : report the overall state of the cluster and queues

scontrol show nodes : report detailed information about the cluster nodes, including current usage

scontrol show partitions : report detailed information about the queues (partitions)

squeue : show jobs running and waiting in queues

squeue -u abc123 : show jobs belonging to user abc123

squeue_l, squeue_l -u abc123: versions of the above giving some more information

scancel 1564 : cancel job with jobID 1564. All processes associated with the job will be killed

slurm_stat.pl mysrvr: summarize current usage of nodes, partitions, and slots, and number of jobs per user (run on one of the cluster nodes)

get_slurm_usage.pl: generate information about average duration, CPU, and memory usage of your recent jobs (run the command without arguments to see usage) - this may help assess real memory needs of your jobs and show whether all requested CPUs are actually used.

The node on which a job is running writes the STDOUT and STDERR (screen output) generated by the job script to the file(s) given by the --output option. Unless specified otherwise, these files will show up, during run time, in the directory from which the job was started. This means that you can monitor the progress of jobs by watching these files (of course, if the output of individual commands run from within the script is redirected to other files, it will not show up as screen output from the job script).
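
For example, assuming the job was submitted with --output=jobname.out.%j and was assigned job ID 1564 (a hypothetical value), its progress could be followed with:

tail -f jobname.out.1564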

Running interactive jobs

An interactive shell can be requested using the srun command with the --pty bash option. This command accepts the same submission options as sbatch. In particular, the partition, requested number of slots, and amount of RAM should be specified, e.g.:

srun -n 2 -N1 --mem=8G -p regular --pty bash -l

By default, you get 1 slot and 1 GB of RAM in the regular partition. The srun command will wait until the requested resources become available and an interactive shell can be opened. The option -l ensures your .bashrc script will be executed (i.e., your standard login environment will be set up in the interactive session). The wait time can be limited; e.g., adding the option --immediate=60 will cause the command to quit if the allocation cannot be obtained within 60 seconds.

 

Running interactive jobs using SCREEN (here we assume you are familiar with the screen program)

A SCREEN session can be opened on the cluster by submitting (using sbatch) a special script:

sbatch -N 1 <other_options>  /programs/bin/slurm_screen.sh

Once started, the job will create a SCREEN session on one of the nodes. This session will be subject to all CPU, memory, and timing restrictions imposed by the partition and any sbatch options you specify. Use squeue to find the node where the job has started and ssh to that node. Once on the node, use screen -ls to verify that the SCREEN session has been started there. Attach to this session using screen -r. Within the session, you can start any number of shells and run any processes you need. All the processes together will share the CPUs and memory specified at job submission. You can detach from the SCREEN session at any time and log out of the node - the session will keep running as long as the job is not terminated. When the SLURM job hosting your SCREEN session terminates (because it exceeds the time limit or is canceled), the session and all processes within it will also be terminated. However, if you terminate the SCREEN session in any other way (e.g., by closing all the shells or sending the quit signal), the corresponding SLURM job will not end automatically - you will need to cancel it 'manually' using the scancel command to free up the resources and stop being charged for them.
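
A typical sequence might look like the sketch below (the node name cbsubscb12 and the job ID 1564 are hypothetical):

sbatch -n 4 --mem=8000 /programs/bin/slurm_screen.sh   # submit the SCREEN job
squeue -u $USER                                        # find the node the job landed on
ssh cbsubscb12                                         # log in to that node
screen -ls                                             # verify the SCREEN session is running
screen -r                                              # attach (detach later with Ctrl-a d)
scancel 1564                                           # when completely done, free the resources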

Array jobs

Array jobs can be submitted by adding the --array option to the sbatch command. For example,

sbatch --array=0-30 myscript.sh

will effectively submit 31 independent jobs, each with the SLURM submission parameters specified in the header of myscript.sh (or via command-line options). Each such job will have the environment variable SLURM_ARRAY_TASK_ID set to its own unique value between 0 and 30. This variable can be accessed from within myscript.sh and used to differentiate between the input/output files and/or parameters of the different jobs in the array. The maximum index of an array job (i.e., the second number in the --array specification) is set to 10000.

An array job will be given a single ID, common to all elements of the array, available as the value of the environment variable SLURM_ARRAY_JOB_ID. Each individual job in the array will be assigned its own job ID unique within the cluster and available as the value of SLURM_JOB_ID. Individual jobs within the array will also be referred to by various tools (for example, in the output from squeue command) using a concatenation of SLURM_ARRAY_JOB_ID and the value of  SLURM_ARRAY_TASK_ID, for example: 1801_20 will correspond to job 20 of the array job 1801.
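
For example, a sketch of myscript.sh that uses the task index to select its input file (the file naming scheme and my_program are hypothetical) could look like this:

#!/bin/bash -l
#SBATCH --ntasks=1
#SBATCH --mem=4000

# work in a per-job scratch directory (each array element has its own SLURM_JOB_ID)
mkdir -p /workdir/$USER/$SLURM_JOB_ID
cd /workdir/$USER/$SLURM_JOB_ID

# each array element processes its own input file: sample_0.fa, sample_1.fa, ...
cp /home/$USER/data/sample_${SLURM_ARRAY_TASK_ID}.fa .
my_program sample_${SLURM_ARRAY_TASK_ID}.fa > sample_${SLURM_ARRAY_TASK_ID}.out
cp sample_${SLURM_ARRAY_TASK_ID}.out /home/$USER/results/
cd /home/$USER
rm -rf /workdir/$USER/$SLURM_JOB_ID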

The number of jobs of the array running simultaneously can be restricted to N using the %N construct in the --array option. For example, the command

sbatch --array=0-30%4 myscript.sh

will submit a 31-element array of jobs, but only 4 of them will be allowed to run simultaneously even if there are unused resources on the cluster.

Before you submit a large number of jobs (as a job array or individually), make sure that each of them is long enough to be treated as a single job. Remember there is a time overhead associated with SLURM's handling of each job (it needs to be registered, scheduled, registered as completed, etc.). While it is not completely clear how large this overhead is, it is good practice to avoid submitting multiple jobs shorter than a couple of minutes. If you do have a lot of such short tasks to process, bundle them together, so that a single SLURM job runs several of them in a loop, one after another, as in the sketch below.
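
A minimal sketch of such bundling (the input files and my_program are hypothetical) might be:

#!/bin/bash -l
#SBATCH --ntasks=1
#SBATCH --mem=4000

# process many small inputs one after another within a single SLURM job
for f in /home/$USER/data/small_input_*.txt; do
    my_program "$f" > "${f%.txt}.out"
done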

Docker containers in SLURM jobs

Docker containers can be started from within a SLURM job using the docker1 command. However, such containers will not automatically obey the CPU, memory, or time allocations granted to the job by SLURM. Therefore, these limits have to be imposed explicitly on the container. For example, if a job is submitted with the SLURM options --mem=42G -n 4, these restrictions must be passed on to the Docker container via the docker1 command as follows:

docker1 run  --memory="40g" --cpus=4  <image_name> <command> 

(note that the memory made available to the container should be somewhat smaller than the amount requested from SLURM). If neither the --mem nor the -n (same as --ntasks) SLURM option is explicitly specified at job submission, the defaults are --mem=1G and -n 1, respectively, and these values should be used in the docker1 command above.
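
Rather than hard-coding these numbers, one option (a sketch only; it assumes the job's memory request is available in MB via SLURM_MEM_PER_NODE and subtracts a small safety margin) is to derive the docker1 limits from the SLURM environment variables described earlier:

# derive container limits from the SLURM allocation (SLURM_MEM_PER_NODE is in MB)
MEM_MB=${SLURM_MEM_PER_NODE:-1024}
docker1 run --memory="$(( MEM_MB - 512 ))m" --cpus=${SLURM_NTASKS:-1} <image_name> <command>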

 There is currently no tested way of imposing a time limit on docker containers.

Other considerations

Backups

Unless arranged on an individual or per group basis, there are no default backups on BioHPC. Beware!

File permissions

The default file permissions are very restrictive. To make your files group-readable by default, do the following:

  echo "umask 022" >> ~/.bashrc

To fix permissions on already-existing files, do something like this:

  chown -R username:bscb07 $HOME
 find $HOME -type f -exec chmod 644 {} \;
 find $HOME -type d -exec chmod 755 {} \;

Passwordless ssh

As always, you can set up ssh keys to log into the cluster machines without a password. From the login machine, do the following:

 cd                          
ssh-keygen -t rsa            # press enter a few times to skip over questions
cat .ssh/id_rsa.pub >> .ssh/authorized_keys  
echo Host \* >> .ssh/config
echo StrictHostKeyChecking no >> .ssh/config

chmod 700 .ssh
chmod 600 .ssh/authorized_keys .ssh/config

You can also append the contents of .ssh/id_rsa.pub to .ssh/authorized_keys on other machines (and vice-versa) to authenticate logins between the cluster and your other workstations.

Software

All the software installed on BioHPC Cloud nodes is available on the cluster. A complete list of installed software, with versions and other information, is on the BioHPC Cloud software page. If you need something else installed, you can install it in a local directory or put in a request to support@biohpc.cornell.edu.

 

In case of questions or problems

Reports of technical problems and requests related to your SLURM instance should be sent to support@biohpc.cornell.edu, which will open a trackable service ticket.