## BioHPC Environment

### System Overview

- This system is part of the Cornell BioHPC Cloud environment.
- Each user's home directory (`/home/$USER`) is located on a shared network-mounted Lustre filesystem.
- Each server provides a local `/workdir` filesystem that is physically attached to that server and is generally preferred for active computation, temporary files, intermediate results, and large datasets.
- Unless otherwise specified, assume users connect to the server via SSH and run software directly on the server.

### Long-Running Jobs

- For interactive work and short analyses, users may run commands directly from an SSH session.
- For long-running jobs executed directly on the server (without Slurm), recommend using a persistent terminal session such as `screen`.
- If a command is expected to run longer than the user's SSH session, suggest creating a `screen` session before starting the job.
- Do not assume that a long-running command will survive an SSH disconnect unless it is running inside `screen`, `tmux`, or a similar session manager.
- A `screen` tutorial is available at https://biohpc.cornell.edu/lab/doc/Linux_exercise_part2.pdf

## Working Directory

- Store all working files, intermediate results, and large datasets under `/workdir/$USER`. Do not use your home directory (`/home/$USER`) for compute-intensive workflows or large files.
- If software requires a large temporary directory, do not use `/tmp`, as it may have limited capacity and is shared across users.
- Instead, create a dedicated temporary directory under your work area, for example:

  ```bash
  mkdir -p /workdir/$USER/tmp
  ```

## Docker Environment

### Docker Command Requirements

- Do not use the `docker` command directly on this server.
- Use `docker1` for all Docker operations.
- `docker1` is a site-specific wrapper that executes Docker commands with the required privileges (equivalent to `sudo docker`).
- For any Docker-related task, always substitute `docker1` for `docker`.
- Examples:
  - `docker1 images`
  - `docker1 ps`
  - `docker1 build ...`
  - `docker1 run ...`

### Container Filesystem Restrictions

- Containers may only mount directories under `/workdir/$USER`.
- For this account, paths under `/workdir/qisun` are permitted.
- Never mount directories from `/home`, `/tmp`, `/usr`, `/etc`, or any location outside `/workdir/$USER` unless explicitly instructed by the user.
- Example:
  - `docker1 run -v /workdir/$USER/project:/workspace ...`

## Installed Software Lookup

### General Software Information

- Before installing software, check this web page first: https://biohpc.cornell.edu/lab/userguide.aspx?a=software
- This page lists all installed software by name. For each software package, follow the linked documentation page for instructions on loading modules and running the software.

### Nextflow

- Before running any Nextflow workflow, load an appropriate Nextflow module, for example `module load nextflow/25.4.3`.
- Available Nextflow versions include:
  - 25.4.3 (preferred)
  - 24.10.1
  - 23.10.1
  - 22.10.7
  - 19.04.1
- When possible, run Nextflow workflows using Apptainer or Singularity containers.
- Singularity 1.4.0-1.el9 is already installed. Do not spend time checking whether Singularity or Apptainer is available.
- Unless the user explicitly requests Slurm execution, run Nextflow workflows locally on the server rather than submitting jobs through Slurm.

### Conda

- General Guidance
  - Conda should not be the default method for installing software on BioHPC.
  - When possible, prefer:
    - Software already installed on BioHPC
    - Docker or Apptainer/Singularity containers
- Conda Installation Location
  - For users with access to a hosted server:
    - Install Miniconda and create the Conda base environment under `/workdir/$USER`.
  - For users without access to a hosted server:
    - Install Miniconda and create the Conda base environment in the default location under the home directory.
- Recommended Distribution
  - Miniconda is the recommended Conda distribution.
  - Avoid installing the full Anaconda distribution unless specifically required.
- BioHPC Conda installation and usage instructions: https://biohpc.cornell.edu/lab/userguide.aspx?a=software&i=574#c

### Apptainer

- Apptainer is installed and available in the default PATH.
- Singularity compatibility is provided, and the `singularity` command is also available in the default PATH.

### Python

- Default Python Environment
  - The default Python installation is Python 3.9.25.
  - The following commands are available in the default PATH and use Python 3.9.25:
    - `python`
    - `python3`
    - `pip`
    - `pip3`
  - Unless otherwise specified, assume that Python-related commands, package installations, and virtual environments use the default Python 3.9.25 installation.
- Switching Python Versions
  - To use a different Python version, load the appropriate module before running Python commands.
  - Example:
    - `module load python/3.12.7`
  - After loading a Python module, use the corresponding `python`, `python3`, `pip`, and `pip3` commands from that environment.
- Available Python Versions
  - 3.12.7
  - 3.10.6-r9
  - 3.9.25 (default)
  - 3.6.15-r9
  - 2.7.15
  - 2.7.5
- Python Package Installation
  - Before installing Python packages, determine whether the required package is already available through the system Python installation or a module.
  - Users can install Python packages in the default per-user location under `~/.local`.

### R

- Default R Environment
  - The default R installation is R 4.0.5.
  - The following commands are available in the default PATH and use R 4.0.5:
    - `R`
    - `Rscript`
- Switching R Versions
  - To use a different R version, load the appropriate module before running R commands.
  - Example:
    - `module load R/4.5.2`
  - After loading an R module, use the corresponding `R` and `Rscript` commands from that environment.
- Available R Versions
  - 4.0.5-r9
  - 4.1.3-r9
  - 4.2.3
  - 4.3.3
  - 4.4.2
  - 4.4.3
  - 4.5.2

### RStudio

- There are three supported ways to start RStudio on BioHPC:
  - RStudio Server directly on the host computer. This is the default method.
  - RStudio Server through Docker. This is recommended only for experienced users.
  - RStudio Desktop through VNC. This method is not recommended unless specifically needed.
- Instructions for all three methods are available here:
  - https://biohpc.cornell.edu/lab/userguide.aspx?a=software&i=266#c
- R Versions:
  - Multiple R versions are supported.
  - Check the BioHPC RStudio instruction page for the currently supported R versions and how to select them.
- Required Setup for Host RStudio Server
  - When using RStudio Server directly on the host computer, each user should run the following command before starting or troubleshooting RStudio:
    - `/programs/rstudio_server/mv_dir`
  - This command performs two setup actions:
    - Removes cached RStudio session files.
    - Moves the RStudio cache directory under `/workdir/$USER` and creates a symbolic link from `~/.local/share/rstudio`.
- Troubleshooting
  - For RStudio troubleshooting, refer to the BioHPC FAQ: https://biohpc.cornell.edu/doc/UsingRstudioServer.html

### Java

- Default Java Environment
  - The default Java installation is OpenJDK 13.0.2.
  - The following commands are available in the default PATH and use OpenJDK 13.0.2:
    - `java`
- Switching Java Versions
  - To use a different Java version, load the appropriate module before running Java commands.
  - Example:
    - `module load java/21.0.1`
  - After loading a Java module, use the corresponding `java` command.
- Available Java Versions
  - 13.0.2 (default)
  - 1.7.0
  - 1.8.0
  - 21.0.1

## Parallelization

- Determine whether the user intends to run jobs through Slurm or directly on a login/workstation server.
- If the user is using Slurm:
  - Prefer Slurm-native job parallelization methods (job arrays, multiple jobs, workflow managers, etc.).
  - Do not recommend GNU Parallel unless the user specifically requests it.
- If the user is running jobs locally on a server without Slurm:
  - Prefer GNU Parallel for command-line job parallelization.
  - GNU Parallel is installed and available in the default PATH.
  - Do not suggest installing alternative parallelization tools unless specifically requested.
- If the execution environment is unclear, ask whether the workload will be run through Slurm or directly on the server before proposing a parallelization strategy.

## AI Agent behavior

- Do not automatically execute pipelines, workflows, batch jobs, or long-running computational tasks.
- Instead, generate a shell script containing the required commands and save the workflow in that script.
- Present the script to the user and instruct them to review it before execution.
- Assume that all substantial compute jobs should be launched manually by the user in a separate terminal session.
- Do not start Nextflow workflows, Slurm jobs, Docker containers, Apptainer containers, or other long-running processes unless the user explicitly requests execution.
- When possible, separate workflow preparation (performed by the agent) from workflow execution (performed by the user).
- For potentially expensive jobs, provide the exact command or script needed and explain how the user can run it.
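
The prepare-then-hand-off behavior described above could look roughly like the following sketch, which writes a run script and prints it for review without executing it. The file name `run_workflow.sh`, the pipeline file `main.nf`, the `singularity` profile, and the `nf_work` directory are hypothetical placeholders, and the sketch assumes the `module` command is available in non-interactive shells on the server.

```shell
#!/usr/bin/env bash
# Sketch: prepare a run script for the user to review; nothing is executed here.
set -euo pipefail

# Hypothetical output name. The current directory is assumed to be a project
# directory under /workdir/$USER.
script="run_workflow.sh"

# The quoted 'EOF' keeps $USER and $TMPDIR unexpanded, so they are resolved
# only when the user runs the generated script.
cat > "$script" <<'EOF'
#!/usr/bin/env bash
set -e

# Keep temporary files off /tmp and out of the home directory.
export TMPDIR="/workdir/$USER/tmp"
mkdir -p "$TMPDIR"

# Load the preferred Nextflow version before running any workflow.
module load nextflow/25.4.3

# Hypothetical pipeline file; the "singularity" profile is assumed to be
# defined by the pipeline's own configuration.
nextflow run main.nf -profile singularity -work-dir "/workdir/$USER/nf_work"
EOF

chmod +x "$script"

# Show the script for review instead of running it.
cat "$script"
echo "Review $script, then start it yourself inside a screen session."
```

The user would then open a `screen` session in a separate terminal and launch `./run_workflow.sh` there, which keeps workflow preparation and workflow execution separate.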