BSCB SLURM cluster at BioHPC

General information

The cluster consists of 16 compute nodes purchased by faculty of the BSCB department. All these nodes are part of the BioHPC Cloud, with all the software and tools available in the Cloud also available on the cluster nodes. General BioHPC Cloud information (connecting, file transfer, using storage, etc.) applies to the BSCB cluster as well - please check BioHPC Cloud home page and BioHPC Cloud User Guide. For list of availbale software and information on how to run it, refer to the BioHPC Cloud software page.

What distinguishes the BSCB cluster from other BioHPC Cloud resources is the way these resources are alloated to users. While the Cloud machines are reserved using a calendar scheduler, the resources of the BSCB cluster are managed by the SLURM job scheduler.

To use the cluster, each user must have the BioHPC Cloud account and belong to at least one of the following groups:

bscb01 (Adam Boyko Lab)
bscb02 (Andrew Clark Lab)
bscb03 (Jason Mezey Lab)
ak735_0001 (Alon Keinan Lab)
bscb09 (Amy Williams Lab)
bscb10 (Philipp Messer Lab)
hy299_0001 (Haiyuan Yu Lab)
danko_0001 (Charles Danko Lab)
nt246_0001 (Nina Therkildsen Lab)
aweilab (Xinzhu Wei Lab)
pennell_lab (Matt Pennell Lab)

If you don't have the BioHPC Cloud account, please contact us at support@biohpc.cornell.edu. Once the account is established, ask your PI to add you to the appropriate lab group.

Technical problems or requests related to the cluster should be sent to support@biohpc.cornell.edu, which will open a trackable service ticket.

Compute nodes

The BSCB SLURM cluster consists of the following nodes:

Machine	CPU cores (physical/multithreaded)	RAM [GB]	RAM/real core [GB]	Local storage [TB]	Scratch storage [TB] (/workdir)
cbsubscb01	32/64	256	6	5.6	2
cbsubscb02	32/64	256	6	5.6	in Local
cbsubscb03	32/64	256	6	5.6	2
cbsubscb04	32/64	256	8	5.6	2
cbsubscb05	32/64	512	16	5.6	2
cbsubscb06	32/64	256	8	5.6	2
cbsubscb07	32/64	256	8	5.6	2
cbsubscb08	32/64	256	8	15	in Local
cbsubscb09	32/64	512	16	17	2
cbsubscb10	32/64	512	16	6	2
cbsubscb11	32/64	256	8	15	1
cbsubscb12	32/64	256	8	15	1
cbsubscb13	56/112 (AVX2)	512	9	10	1
cbsubscb14	56/112 (AVX2)	512	9	20	2
cbsubscb15	56/112 (AVX2)	512	9	26	2
cbsubscb16	64/128(AVX2)	1003	15.7	146	6
cbsubscb17	64/128(AVX2)	1003	15.7	146	6
cbsubscb18	32/64(AVX2)	500	15.7	146	6
cbsubscb19	128/256(AVX2)	2000	15.7	219	10
cbsubscb20	64/128(AVX2)	1003	15.7	14	14
cbsubscbgpu01	16/32 (AVX2)	512	32 + 2 GPUs	15	2

The CPUs in nodes marked "AVX2" feature the AVX2 instruction set required by a lot of modern programs. In addition to the compute nodes listed above, three login nodes are also available and can be used for job submission and access to home directories for file browsing or file transfer. These are cbsulogin, cbsulogin2, and cbsulogin3. All BioHPC Cloud machines are in biohpc.cornell.edu domain, so that the fully qualified name of, say, cbsubscb13 is cbsubscb13.biohpc.cornell.edu.

Additionally, three storage servers (cbsubscbfs1, cbsubscbfs2, and cbsubscbfs3) provide the total capacity of 281 TB of permanent storage in a Gluster file system network-mounted on all compute nodes as /bscb. These servers are just hosting storage, they cannot be accessed directly or used for any computations.

The hardware details for all nodes are given at the end of this document.

All compute nodes are available for direct ssh access (from within the Cornell network) to all cluster users to facilitate easier job monitoring (e.g., look at files in local storage, monitor CPU utilization with top, etc.). However, this ssh privilege must not be used to run any computationally intensive tasks outside of the scheduler. If detected, such tasks will be immediately killed.

Storage

Several tiers of storage are available to the users of the BSCB cluster.

Home directories

Home directories (the /home file system) are located on BioHPC Cloud networked storage. Please do not, under any circumstances, run any computations which repeatedly write or read large amounts of data to/from your home directory. Home directory storage space is limited, and should not be used for permanent storage, unless your Lab rents CBSU storage.

Gluster storage (/bscb)

281 TB of permanent storage space hosted on the three dedicated file servers (cbsubscbfs1, cbsubscbfs2, and cbsubscbfs3) and available network-mounted as /bscb on all compute nodes. Each group has a subfolder under /bscb to use for storage. Similarly as /home, the /bscb file system should not be used directly in any I/O-intensive computations.

Local scratch space on compute nodes

Local scratch directories are mounted as /workdir and /SSD (if present), and this is the space that should be used for computing, i.e., for storing input, output, and intermediate files for running jobs. Depending on the node, the scratch space amounts to 1-2TB and is shared between all running jobs. It is good practice to create a subdirectory within the scratch space for your job's files. A good convention is to make your script create a scratch sub-directory named after your username and job ID, e.g.,

mkdir -p /workdir/$USER/$SLURM_JOB_ID

and use this subdirectory to store all files the job uses as input or creates as output. It is also good practice to delete your scratch files after you are done with them. Prefereably, this should be done at the end of the script, after all results have been saved to permanent storage. The scratch files will eventully be automatically deleted, but removing files you no longer need ensures that there is space for other jobs. If space becomes an issue, then scratch files will be deleted with no grace period.

Left-over scratch files are automatically removed according to the following rules:

The cleaning script is run at the start of each job.
Directory belonging to a no longer active job, named after user and job ID, /workdir/$USER/$SLURM_JOB_ID, (for example, /workdir/abc123/46462, or for an element of a job array: /workdir/abc123/57542_45 or /workdir/abc123/57542-45) will be reccursively removed if all files within this directory are older than 2 hours OR if the overall scratch space occupancy on the node exceeds 75%.
A scratch file located in a directory other than mentioned above (i.e., not named after user and job ID) will be removed only if the owner of the file has no jobs currently running on the node and the file is older than the grace period of 2 hours. However, the grace period will be ignored if the scratch space on the node is occupied in more than 75%.

Local permanent storage space on compute nodes

Parts of the local storage space on most nodes are used as additional permanent storage for individual groups as follows:

cbsubscb01 (Adam Boyko Lab)
cbsubscb02 (Andrew Clark Lab)
cbsubscb03 (Jason Mezey Lab)
cbsubscb04 (Alon Keinan Lab)
cbsubscb05 (Amy WIlliams Lab)
cbsubscb06 (Philipp Messer Lab)
cbsubscb09 (Amy Williams Lab)
cbsubscb10 (Philipp Messer Lab)
cbsubscb11 (Andy Clark Lab)
cbsubscb12 (Jason Mezey Lab)
cbsubscb13 (Alon Keinan Lab)
cbsubscb14 (Andrew Clark Lab)
cbsubscb15 (Amy Williams Lab)
cbsubscb16 (Nina Therkildsen Lab)
cbsubscb17 (Charles Danko Lab)
cbsubscb18 (Xinzhu Wei Lab)
cbsubscb19 (Pennell Lab)
cbsubscbgpu01 (Haiyuan Yu Lab)

The group's files are loated in subfolders of the directory /local/storage. On each node, this directory is exported via NFS, so that it can be mounted on other nodes (including the login nodes) using the utility /programs/bin/labutils/mount_server as follows

/programs/bin/labutils/mount_server <lab_machine> /storage

So, for example,

/programs/bin/labutils/mount_server cbsubscb04 /storage

will make /local/storage from cbsuscb04 availbale on the machine where the command was issues under /fs/cbsubscb04/storage. The command can be executed by any user and it should be called in the beginning of the job script if that job assumes the presence of the mount. This is important since the mounts of /local/storage are not created or maintained automatically. You should not unmount the storage since it can interfere with other users' jobs - unmounting is therefore not supported on cbsubscb machines.

As it is the case with /home and /bscb storage, the mounts under /fs are accessed over the network and must not be used directly by jobs heavy in I/O.

Scheduler - general considerations

As of Feb. 12 2020, the obsolete SGE job scheduler previously running on the BSCB cluster is replaced by SLURM Workload Manager (SLURM = Simple Linux Utility for Resource Management). The functionality of the new scheduler is the same as that of SGE, but unlike the latter, SLURM is under active development and much better suited for the infrastructure consisting of multi-CPU shared-memory compute nodes and GPU machines.

SLURM provides tools for job submission and control ver similar to those offered by SGE, but the syntax of these tools is different and jobs scripts written for SGE have to be slightly modified to work with SLURM.

Detailed documentation for SLURM can be found at https://slurm.schedmd.com/overview.html and multiple links there. In particular, https://slurm.schedmd.com/documentation.html contains thorough explanation of various aspects of SLURM from both the users and admin perspective, and https://slurm.schedmd.com/man_index.html provides detailed description and syntax of all SLURM commands, while a short, two-page summary is available at https://slurm.schedmd.com/pdfs/summary.pdf.

The official documentation is thorough but complicated and the structuring of it will make you dizzy. This document is intended to be a succinct extract relevant to BSCB users transitioning from SGE.

Each job should be submitted to the cluster via SLURM with specific requests of slots (number of threads to be used), RAM memory, and maximum execution time. Based on these requests, SLURM will launch the job on an appropriate node(s) at the appropriate time, when resources become available. Until then, the job will be waiting in queue for its turn to run. Jobs can be submitted to SLURM from any node of the cluster as well as from BioHPC login nodes, cbsulogin.biohpc.cornell.edu, cbsulogin2.biohpc.cornell.edu, and cbsulogin3.biohpc.cornell.edu.

The basic CPU allocation unit is one slot, which corresponds to one thread executed full-time by one of the CPU cores. Note that since on all cluster machines hyperthreading is turned on, the number of threads that can be run concurrently full-time (i.e., the number of SLURM slots) is equal to twice the number of physical CPU cores. If not specified otherwise at job submission, one slot will be allocated. The job will be confined to the requested number of slots regardless of how many threads it actually attempts to spawn.

If not specified otherwise, each job will be allocated 1 GB of RAM by default. This value may be too small or too large - please know memory requirements of your jobs and adjust this requirement accordingly (see also 'How do I know memory needs of my job' later on in this document).

The cluster features 5 queues, referred to in SLURM jargon as partitions, as summarized in the following table:

Partition	Job Time Limit	Nodes	Slots
short	4 hours	cbsubscb[01-19]	1744
regular	24 hours	cbsubscb[01-19]	560
long7	7 days	cbsubscb[01-19]	560
long30	30 days	cbsubscb[01-19]	560 (limit 330 per user)
gpu	3 days	cbsubscbgpu01	32 + 2 GPUs

The partition gpu consists of the only node in the cluster equipped with GPUs: two Tesla P100 PCIe 16 GB cards.

As shown by the table above, all non-GPU nodes and slots on them are shared by all partitions. All 1744 (non-GPU) slots are availbale to jobs shorter than 4 hours. Limits imposed on other partitons ensure that the number of availbale slots decreases with the length of a job so that the cluster canot be flooded with long-term jobs and ample number of slots are alwas available for short ones. For example, 64 slots are always for jobs shorter than 4 hours, jobs longer than 4 hours but shorter than one week can only occupy up to 1120 slots, and jobs longer than 7 days are limited to 560 slots (with an additional limit of 330 slots per user).

If not specified otherwise, a job will be submitted to the short partition. NOTE: unlike SGE, SLURM will not attempt to automatically select a partition that best fits your job. It is your responsibility to know which partitions are available and which will satisfy your job's resource needs. See paragraph Which partition(s) to submit to in the following section.

In the case of multiple users and jobs competing for cluster resources at the same time, job priorities are decided using a "fair share" policy. This means that the scheduler will consider current and past usage by the group a user belongs to as well as group contributions to the cluster purchase in an effort to give all users fair access to the cluster. The fair-share configuration is intended to mimic the arrangement existing on the retired SGE scheduler and it will be fine-tuned as the transition progresses and more user experience is accumulated.

Running batch jobs

SLURM offers several mechanism for job submission. The one most familiar to SGE users is batch job submission. Create a shell script, say submit.sh, containing commands involved in executing your job. In the header of the script, you can specify various options to inform SLURM of the resources the job will need and how it should be treated. The SLURM submission options are preceded by the keyword #SBATCH. There are plenty of options to use. The example script header below contains the most useful ones:

#!/bin/bash -l                   (change the default shell to bash; '-l' ensures your .bashrc will be sourced in, thus setting the login environment)
#SBATCH --nodes=1                (number of nodes, i.e., machines; all non-MPI jobs *must* run on a single node, i.e., '--nodes=1' must be given here)
#SBATCH --ntasks=8               (number of tasks; by default, 1 task=1 slot=1 thread)
#SBATCH --mem=8000               (request 8 GB of memory for this job; default is 1GB per job; here: 8)
#SBATCH --time=1-20:00:00        (wall-time limit for job; here: 1 day and 20 hours)
#SBATCH --partition=long7,long30  (request partition(s) a job can run in; here: long7 and lon30 partition)
#SBATCH --account=bscb09         (project to charge the job to; you should be a member of at least one of 9 projects: ak735_0001,bscb01,bscb02,bscb03,bscb09,bscb10,danko_0001,hy299_0001,nt246_0001)
#SBATCH --chdir=/home/bukowski/slurm   (start job in specified directory; default is the directory in which sbatch was invoked)
#SBATCH --job-name=jobname             (change name of job)
#SBATCH --output=jobname.out.%j  (write stdout+stderr to this file; %j willbe replaced by job ID)
#SBATCH --mail-user=email@address.com          (set your email address)
#SBATCH --mail-type=ALL          (send email at job start, end or crash - do not use if this is going to generate thousands of e-mails!)

NOTE: You have to delete the explanation of each command falling inside the parentheses.

Once the script is ready, cd to the directory where it is located and execute the sbatch command to submit the job:

sbatch submit.sh

Instead of/in addition to the script header, submission options may also be specified directly on the sbatch command line, e.g.,

sbatch --job-name=somename --nodes=1 --ntasks=6 --mem=4000 submit.sh

If an option is specified both in the command line and in the script header, specification from the command line takes precedence. The sbatch command may be run on any node of the cluster as well as on BioHPC login nodes, cbsulogin.biohpc.cornell.edu, cbsulogin2.biohpc.cornell.edu, and cbsulogin3.biohpc.cornell.edu.

Important notes about SBATCH options:

Shorthand notation: Most (although not all) options can be specified (in both the script header and command line) using short-hand notation. For example, options given in the example script above could be requested as follows:

sbatch -N 1 -n 8 --mem=8000 -t 1-20:00:00 -p long_term -A bscb09 -D /home/bukowski/slurm -J jobname -o jobname.out.%j --mail-user=email@address.com --mail-type=ALL

Project to charge a job to (in SLURM this is referred to as 'account'): users belonging to multiple projects should use the -A (or --account) option to charge the job to the appropriate project. If this option is not specified, the job will be charged to the user's default project. To see all projects a user is a member of, run on a login node or any cluster node

sacctmgr show assoc user=abc123

(replace abc123 with your user ID). To see the user's default project, use the command

sacctmgr show user abc123.

Number of nodes: If the program you are running does not use MPI (most software used at BSCB is in this category), you must specify the option --nodes=1 (or -N 1). This will ensure that all threads spawned by that program will run on the same machine chosen automatically by SLURM. If you prefer a specific machine, you may request it using the --nodelist option, e.g., --nodelist=cbsubscb12 will make sure your job is sent to cbsubscb12. Only one machine can be specified - otherwise the request would be inconsistent with -N 1 and an error would occur.

If your program does use MPI so that its threads can be spread out across multiple nodes, you can skip the --nodes option altogether to let SLURM decide on how many and on which nodes to run your job. Or you can explicitly request the number of nodes, e.g., --nodes=2 will spread your MPI processes over two nodes selected by SLURM. If you want specific nodes, request them using --nodelist option, e.g., --nodelist=cbsubscb11,cbsubscb12 will make sure that your job uses both these nodes (plus possibly some other ones if needed to satisfy the number of tasks requested via --ntasks).

Excluding nodes: If you do not want your job to end up running on certain machines, you can exclude such machines using the --exclude (or -x) option. For example, --exclude=cbsubscb10,cbsubscbgpu01 will prevent your job from being sent to any of the two specified nodes.

Which partition(s) to submit to: you can specify a comma-delimited list of partitions in which your job can run and let SLURM select one of them. For example, with sbatch option --partition=regular,long7, your job may end up running in either of the two partitions. The selected partition will be such that any constraints imposed on the job (like number of slots, memory, or run time limit) will be satisfied. If none of the requested partitions can satisfy these constraints, the sbatch (or salloc) command will fail and the job will not be scheduled. Attention: if no partition is specified, SLURM will attempt to submit your job to short partition (i.e., it will not automatically select the best available queue). This behavior is different from that of SGE. In summary, given the current configuration, selection of partitions that gives your job the best chance of being scheduled depends on the intended run time of your job:

Intended job duration	Partition specification	Slots available
up to 4 hours	--partition=short	1392
4 - 24 hours	--partition=regular,long7,long30	1372
24 hours - 7 days	--partition=long7,long30	937
7 days - 30 days	--partition=long30	500
GPU, up to 36 hours	--partiton=gpu --gres=gpu:tP100:1	32

Startup and output directories: directories specified by the --chdir and --output options (or - if these are not specified - the directory where sbatch command was run) must be mounted and visible on the node(s) where the job runs. Avoid using directories network-mounted from compute nodes (such as, for example, /fs/cbsubscb09/storage), since these are not guaranteed to be available in the very beginning of a job (even if the /programs/bin/labutils/mount_server command is issued later in the script). The /home file system is always mounted on all compute nodes, so it is a good idea to have --chdir and --output point to somewhere within your home directory (or submit your job from there). Remember though, that all input and output files accessed by the job must be located within the /workdir (and/or /SSD) directory, local to each node.

Inheriting user's environment: If you want the job to run in the same environment as your login session, make sure that the first line of the script invokes the shell with the -l (login) option (#!/bin/bash -l). This will make the script read your .bashrc file and properly set up all environment variables. This may be necessary if your jobs requires a lot of environment customization, typically done through .bashrc.

Parallel jobs: If the program you are about to submit can spawn multiple processes or threads and you intend to use N>1 such processes or threads, request that many slots for your job using option --ntasks N. Different programs are parallelized in different ways - one of them involves a library called OpenMP. To enable OpenMP multithreading, insert the following line somewhere in the beginning of your script (substitute an actual number for N):

export OMP_NUM_THREADS=N

If you are not sure whether your program actually uses OpenMP for parallelization - insert that line anyway. However, beware of programs that use both types of parallelization simultaneously, i.e., created multiple processes - each of the multithreaded with OpnMP.

Defaults: Unless requested otherwise using --ntasks and/or --mem options, each job is allocated one slot, i.e., enough CPU cycles to run one thread, and 1GB of RAM memory. Your job will be confined to these requested resources regardless of how many threads your program attempts to spawn. For example, if only one slot is allocated and your program launches 4 threads, each of these threads will be working at 25% of CPU. If the granted memory is not sufficient for your program, it will most likely crash, in which case you will have to request more memory upon the next submission attempt.

How do I know memory needs of my job?

If you do not know the answer based on studying the program's manual and related publications, you can run one or more jobs with increasing amounts of requested memory and record the job ID of the first job which did not crash (while the job is running, the job ID can be found in the first column of the output of the squeue command). After this job finishes, run the command

sacct -j <jobID> --format=JobID,User,ReqMem,MaxRSS,MaxVMSize,NCPUS,Start,TotalCPU,UserCPU,Elapsed,State%20 -j <jobID>

and look for the value printed in the MaxRSS column - this is the maximum memory actually consumed by the job. Request a slightly larger value for the subsequent jobs running the same code with similar input data and parameters.

If the number in the MaxRSS column is very close to memory requested at submisison, chances are the job is actually trying to allocate more physcial memory using swap space on disk. The resulting frequent communication with swap (referred to as thrashing) slows the job down and may manifest itself in UserCPU (time spent on the actulat work) being much smaller than TotalCPU (product of elapsed time and the number of slots reserved). If this happens, you may find it useful to limit the amont of virtual memory a job can allocate to the physcial memory request by including the following statement in the beginning of the submitted script (but after all the SLURM header lines, if present):

ulimit -v $(ulimit -m)

Since the physical memory used cannot be larger than allocated virtual memory, the statement above will prevent the job from 'spilling over' to swap and cause it to crash (rather than thrash) if it attemps to exceed physical memory requested at submission, giving you an immediate signal that more such memory is needed. Note however, that memory needs established this way may be overblown, since most programs allocate more virtual memory than physical RAM they ever use. Moreover, restricting virtual memory may have detrimental effect on performance. Thus, the true memory need of your job should be found by examining MaxRSS and once known and requested through the --mem option, the ulimit statement above can be removed.

To learn how much of the requested resources your jobs actuall consume, run the command get_slurm_usage.pl on one of the login nodes. For example,

get_slurm_usage.pl cbsubscb 01/20/20 3

will produce per job averages (along with standard deviatons) of requested and actually used number of slots and memory, computed over all your jobs longer than 1 minute which completed between Jan 18 (00:00) and Jan 21 (00:00) 2020. Of particular interest here would be the discrepancies between the requested memory (avReqMem) and memory that has actually been used (avUsedMem). If the latter number is much smaller than the former, your have been requesting too much memory for your jobs.

Alternative specifications of the number of slots and RAM

In SLURM, the option --ntasks (or -n in shorthand syntax) denotes the number of tasks. By default, each task is assigned one thread and therefore this is also equal to the number of threads. The default setting may, however, be modified to request multiple threads per task instead of one. For example,

sbatch --ntasks=8 --cpus-per-task=3 [other options]

will request 8*3=24 slots in total. From the point of view of most (if not all) BSCB jobs, this would just be equivalent to --ntasks=24.

Similarly, instead of requesting the total job memory via the --mem option, one could specify certain amount of memory per thread (cpu), using --mem-per-cpu option. For example,

sbatch -n 8 --cpus-per-task=3 --mem-per-cpu=2G [other options]

will request 8*3*2G = 48 GB of RAM for the job. For most BSCB jobs, this would be equivalent to the command

sbatch -n24 --mem=48G [other options]

Porting SGE scripts to SLURM

To port an existing SGE submission script to SLURM, the header lines starting with #$ have to be replaced with their SLURM counterparts starting with #SBATCH, as in the example above.

Furthermore, any environment variables defined by SGE and referenced in the job script have to be replaced by their SLURM counterparts. For the complete list of environment variables provided by SLURM, see section 'OUTPUT ENVIRONMENT VARIABLES' of https://slurm.schedmd.com/sbatch.html. Here we quote only the ones that seem most important:

SLURM_JOB_CPUS_PER_NODE : number of CPUs (threads) allocated to this job

SLURM_NTASKS : number of tasks, or slots, for this job (as given by --ntasks option)

SLURM_MEM_PER_NODE : memory requested with --mem option

SLURM_CPUS_ON_NODE : total number of CPUs on the node (not only the allocated ones)

SLURM_JOB_ID : job ID of this job; may be used, for example, to name a scratch directory (subdirectory of /workdir, or output files) for the job. For array jobs, each array element will have a separate SLURM_JOB_ID

SLURM_ARRAY_JOB_ID : job ID of the array master job (see section 'Array jobs' later in tis document)

SLURM_ARRAY_TASK_ID : task index of a task within a job array

SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_MAX : minimum and maximum index of jobs within the array

Monitoring/controlling jobs and queues

Useful commands (to be executed on any cluster node or on one of the login nodes):

sinfo : report the overall state of the cluster and queues

scontrol show nodes : report detailed information about the cluster nodes, including current usage

scontrol show partitions : report detailed information about the queues (partitions)

squeue : show jobs running and waiting in queues

squeue -u abc123 : show jobs belonging to user abc123

scancel 1564 : cancel job with jobID 1564. All processes associated with the job will be killed

slurm_stat.pl cbsubscb: summarize current usage of nodes, partitions, and slots, and number of jobs per user (run on one of the login nodes)

get_slurm_usage.pl: generate information about average duration, CPU, and memory usage of your recent jobs (run the command without arguments to see usage) - this may help assess real memory needs of your jobs and show whether all requested CPUs are actually used.

The node on which a job is running writes STDOUT and STDERR (screen output) generated by the job script to file(s) given by the --output option. Unless specified otherwise, these files will show up in the directory from which the job was started during run time. This means that you can monitor the progress of jobs by watching these files (of course, if output from individual commands run from within the script is redirected to other files, it will not show up as screen output from the job script).

Running interactive jobs

An interactive shell can be requested using the srun command with the --pty bash option. This command accepts the same submission options as sbatch. In particular, the partitions, requested number of slots, and amount of RAM should be specified, e.g.:

srun -n 2 -N1 --mem=8G -p short --pty bash -l

By default, you get 1 slot and 1 GB of RAM in regular partition. The srun command will wait until the requested resources become available and an interactive shell can be opened. Option -l ensures your .bashrc script will be executed (i.e., your standard login environment will be set in the interactve session).The wait time can be limited, e.g., adding the option --immediate=60 will cause the command to quit if the allocation cannot be obtained within 60 seconds.

Running interactive jobs using SCREEN (here we assume you are familiar with the screen program)

A SCREEN session can be opened on the cluster by submitting (using sbatch) a special script:

sbatch -N 1 <other_options> /programs/bin/slurm_screen.sh

Once started, the job will create a SCREEN session on one of the nodes. This session will be subject to all CPU, memory, and timing restrictions imposed by the partition and any sbatch options you specify. Use squeue to find the node where the job has started and ssh to that node. Once on the node, use screen -ls to verify the SCREEN session has been started there. Attach to this session using screen -r. Within the session, you can start any number of shells you need and run any processes you need. All the processes together will share the CPUs and memory specified at job submision. You can detach from the SCREEN session at any time and log out of the node - the session will keep running as long as the job is not terminated. When the SLURM job hosting you SCREEN sesison terminates (because it exceeds the time limit or is canceled), the session and all processes within it will also be terminated. However, if you terminate the SCREEN session in any other way (e.g., by closing all the shells or sending the quit signal), the corresponding SLURM job will not end automatically - you will need to cancel it 'manually' using scancel command to free up the resources and stop being charged for them.

Running jobs requiring graphics (under construction)

Make sure that the X-windows manager is running on your laptop. Log in via ssh to one of the cluster nodes or a login node making sure that the X11 forwarding is enabled. In that ssh sessions, run

salloc --x11

(you can specify other salloc options as well). In a shell that opens on an allocated node, launch a graphical application you want to use. The graphics generated by this application will be rendered on your laptop.

Running jobs requiring GPUs (somewhat experimental)

To use one GPU on the cbsubscbgpu01 node, add --partition=gpu --gres=gpu:tP100:1 to your sbatch options. To grab both GPUs, use --partition=gpu --gres=gpu:tP100:2. Of course, for this to make sense, your code needs to be GPU-aware (i.e., written and compiled to use GPUs).

Jobs requirung AVX2 instruction set

Such jobs can only be run on the six nodes which support this instruction set: cbsubscb[13-17], cbsubscbgpu01. To submit a job to one of these nodes, add an sbatch option excluding all other nodes: --exclude=cbsubscb[01-12].

Docker containers in SLURM jobs

Docker containers can be started from within a SLURM job using the docker1 command. However, such containers will not automatically obey the CPU, memory, or time allocations granted to a job by SLURM. Therefore, these limits have to be imposed explicitly on the container. For example, if a job is submitted with SLURM options --mem=42G -n 4, these restrictions must be passed on to the docker container via the docker1 command as follows:

docker1 run --memory="40g" --cpus=4 <image_name> <command>

(note that the memory made available to the conatiner should be somewhat smaller than the amount requested from SLURM). If neither of the --mem or -n (same as --ntasks) SLURM options are explicitly specified at job submittion, the defaults are --mem=1G and -n 1, respectively, and these should be used in docker1 command above.

There is currently no tested way of imposing a time limit on docker containers.

Array jobs

Array jobs can be submitted by adding the --array option to the sbatch command. For example,

sbatch --array=0-30 myscript.sh

will effectively submit 31 independent jobs, each with SLURM submission parameters specified in the header of myscript.sh (or via command line options). Each such job will have the environment variable SLURM_ARRAY_TASK_ID set to its own unique value between 0 and 30. This variable can be accessed from within myscript.sh and used to differentiate between the input/output files and/or parameters of different jobs in the array. The maximum index of an array job (i.e., the second number in the --array statement) is set to 10000.

An array job will be given a single ID, common to all elements of the array, available as the value of the environment variable SLURM_ARRAY_JOB_ID. Each individual job in the array will be assigned its own job ID unique within the cluster and available as the value of SLURM_JOB_ID. Individual jobs within the array will also be referred to by various tools (for example, in the output from squeue command) using a concatenation of SLURM_ARRAY_JOB_ID and the value of SLURM_ARRAY_TASK_ID, for example: 1801_20 will correspond to job 20 of the array job 1801.

The number of jobs of the array running simultaneously can be restricted to N using the %N construct in the --array option. For example, the command

sbatch --array=0-30%4 myscript.sh

will submit a 31-element array of jobs, but only 4 of them will be allowed to run simultaneously even if there are unused resources on the cluster.

Before you submit a large number of jobs (as a job array or individually), make sure that each of them is long enough to be treated as a single job. Remember there is a time overhead associated with SLURM handling of each job (it needs to be registered, scheduled, registered as completed, etc). While it is not completely clear how large this overhead is, it is a good practice to avoid submitting multiple jobs shorter than a couple of minutes. If you do have a lot of such short tasks to process, bundle them together, so that a single SLURM job runs a few of them in a loop, one after another.

Other considerations

Backups

Unless arranged on an individual or per group basis, there are no default backups on the cluster. Beware!

File permissions

The default file permissions are very restrictive. To make your files group-readable by default, do the following:

echo "umask 022" >> ~/.bashrc

To fix permissions on already-existing files, do something like this:

chown -R username:bscb07 $HOME
find $HOME -type f -exec chmod 644 {} \;
find $HOME -type d -exec chmod 755 {} \;

Passwordless ssh

As always, you can set up ssh keys to log into the cluster machines without a password. From the login machine, do the following:

cd
ssh-keygen -t rsa            # press enter a few times to skip over questions
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
echo 'Host *' >> .ssh/config
echo StrictHostKeyChecking no >> .ssh/config

chmod 700 .ssh
chmod 600 .ssh/authorized_keys .ssh/config

You can also append the contents of .ssh/id_rsa.pub to .ssh/authorized_keys on other machines (and vice-versa) to authenticate logins between the cluster and your other workstations.

Software

All the software installed on BioHPC Cloud nodes is available on the cluster, complete list of installed software with versions and other information is on BioHPC Cloud software page. If you need something else installed, you can install it in a local directory, or put in a request to support@biohpc.cornell.edu.

Hardware details

cbsubscb01, cbsubscb02, cbsubscb03, cbsubscb04, cbsubscb05, cbsubscb06, cbsubscb07

4x Intel Xeon E5 4620 2.20GHz (32 regular cores, 64 hyperthreaded cores)
256GB RAM (512 GB for cbsubscb05)
2x500GB of SSD storage (RAID0, /SSD)
4x3TB of SATA storage (RAID5, /local and /workdir, 9TB total accessible)

cbsubscb09, cbsubscb10

4x Intel Xeon E5 4620 v2 2.60GHz (32 regular cores, 64 hyperthreaded cores)
512GB RAM
2x500GB of SSD storage (RAID0, /SSD)
4x3TB of SATA storage (RAID5, /local and /workdir, 9TB total accessible)

cbsubscb08, cbsubscb11, cbsubscb12

4x Intel Xeon E5 4620 v2 2.60GHz (32 regular cores, 64 hyperthreaded cores)
256GB RAM
2x500GB of SSD storage (RAID0, /SSD)
6x4TB of SATA storage (RAID6, /local and /workdir, 16TB total accessible)

cbsubscb13

4x Intel Xeon E7 4850 v3 2.20GHz (56 regular cores, 112 hyperthreaded cores, AVX2)
512GB RAM
2x500GB of SSD storage (RAID0, /SSD)
6x6TB of SATA storage (RAID10, /local and /workdir, 18TB total accessible)

cbsubscb14

4x Intel Xeon E7 4830 v3 2.0GHz (56 regular cores, 112 hyperthreaded cores, AVX2)
512GB RAM
4x8TB of SAS storage (RAID5, /local and /workdir, 22TB total accessible)

cbsubscb15

4x Intel Xeon E7 4830 v3 2.0GHz (56 regular cores, 112 hyperthreaded cores AVX2)
512GB RAM
6x6TB of SAS storage (RAID5, /local and /workdir, 28TB total accessible)

cbsubscb16, cbsubscb17

1x AMD EPYC 7702 64-Core Processor (64 regular cores, 128 hyperthreaded cores, AVX2)
1 TB RAM
12x 16TB SAS storage (RAID6, /local/ and /workdir, 146 TB accessible)
3.5 TB SSD

cbsubscb18

2 x AMD EPYC 7532 32-Core Processor (64 regular cores, 128 hyperthreaded cores, AVX2)
1 TB RAM
RAID6, /local/ and /workdir, 146 TB accessible
3.5 TB SSD

cbsubscb19

2x AMD EPYC 9554 64-Core Processor (64 regular cores, 128 hyperthreaded cores, AVX2)
2 TB RAM
12x 24TB SAS storage (RAID6, /local/ and /workdir, 219 TB accessible)

cbsubscb20

1x AMD EPYC 9554 64-Core Processor (64 regular cores, 128 hyperthreaded cores, AVX2)
1 TB RAM
3 x 7.45TB NVME storage (RAID5, /workdir, 214 TB accessible)

cbsubscbgpu01

2x Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz (16 regular cores, 32 hyperthreaded cores, AVX2)
512GB RAM
6x4TB of SATA storage (RAID6, /local and /workdir, 15TB total accessible)
2x NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]

storage cluster

Gluster cluster consisting of 3 component servers. Total available storage 280TB, mounted as /bscb.

Contents: