 

BioHPC Cloud: User Guide

 

 


BioHPC Storage


 

On the Linux workstations, you have access to two kinds of storage:

  • Networked storage is mounted on all BioHPC servers under /home, /home2, /programs, and /shared_data. Each user is given a modest amount of free network storage to use in /home and /home2, with additional network storage available for purchase.
  • Local storage is storage physically attached to a specific BioHPC server, and is mounted at /workdir and /SSD. Local storage is available to a user for the duration of a reservation (or permanently for group members of hosted machines).

BioHPC storage (network or local) is not backed up by default. We strongly encourage all users to develop and implement a backup plan. BioHPC provides backups as a separate service; details are available on this page.


Networked Storage Overview

BioHPC Cloud currently has a network storage system with a total capacity of 2,766TB (2.7PB). The storage is implemented as two Lustre clusters: the cluster mounted at /home has a capacity of 2,012TB, and the cluster mounted at /home2 has a capacity of 754TB. A limited amount of free storage is available to each BioHPC user (see "Network storage free allocation and quotas" below). Any user can purchase extra storage for $105.00 per TB per year; this is one of the lowest storage prices available anywhere. We can offer this low price because we buy storage in bulk and charge only enough to recover our costs: hardware, computer room, and maintenance.


Home and Home2

Your home directory is part of the network storage system and is located at /home/username, e.g., /home/abc123. This is where you land right after you log into a machine, and where you should keep your configuration files. For every top-level directory in /home, there is a directory in /home2 with the same name and permissions. For example, in addition to user abc123's home directory /home/abc123, there is also a directory /home2/abc123, ready to be used by anyone with sufficient access rights. Likewise, for each group directory /home/mylabgroup there is a /home2/mylabgroup directory with the same permissions. Both /home and /home2 are network-mounted high-performance file systems that can be accessed from all BioHPC machines, but /home has been optimized for safety, whereas /home2 has been optimized for speed.

The differences between the directories located in /home and /home2 are:

  • You may NOT run computations directly on data residing in /home. However, you may do computations on files in /home2. Alternatively, you can copy files to local storage before computing.
  • /home has been configured for data safety, whereas /home2 has been configured for maximum access speed. The difference in safety is fairly small overall. On /home2, large files are spread out among multiple storage units, allowing for fast parallel data transfers, whereas each file in /home is localized on one storage unit.
  • Your home directory resides in /home. This is where you start when you log in, and where you should keep user configuration files.

Quotas apply to the combined usage of your /home and /home2 directories. If you have paid storage, charges are likewise based on the combined usage of both directories. /home2 directories start out empty, so they do not affect your balance until you put data in /home2. Data located in /home2 directories can be backed up using the standard BioHPC backup system, but it is not backed up by default.

Our suggested strategy for utilizing both storage systems is to use /home as long-term storage, and use /home2 for computing and short-term storage. The advantage of using /home2 for computing is that you don't need to copy data to /workdir and then copy the output back. Please note that local storage (/workdir or /local/storage) is still faster than /home2 storage, especially on modern servers equipped with SSD and NVMe devices.


Local storage

Each machine has local storage located in the directory /workdir (regular disks) or /SSD (SSD storage on selected workstations). The actual amount of storage available is listed on the Reservations page under the machine name (e.g. "4TB HDD; 1TB SSD" means 4TB regular disk storage under /workdir and 1TB of SSD storage under /SSD). When logged into a machine, you can check the amount of used/available storage on the command-line by executing a command like "df -h /workdir".

After logging in, each user should create their own subdirectory under /workdir or /SSD (e.g., /workdir/abc123) and put all the files to be processed in that subdirectory rather than in the home directory. When launching an application, make sure that it always reads and writes files on the local disk. This is usually accomplished by executing a "change directory" command such as "cd /workdir/abc123" and starting the application from there. Note: The directory /workdir (and its subdirectories) is local to each machine, i.e., /workdir on cbsuwrkst2 is not accessible from cbsuwrkst3, etc.
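
As a minimal sketch of the workflow described above, the shell functions below create a per-user scratch directory, copy input data into it, run a command from inside it, and copy the results back to networked storage. The function names (scratch_dir, run_in_scratch), the SCRATCH_ROOT variable, and the input/output directory layout are illustrative assumptions, not BioHPC tools.

```shell
# Sketch only: function names, SCRATCH_ROOT, and the input/output layout are assumptions.
SCRATCH_ROOT="${SCRATCH_ROOT:-/workdir}"    # set to /SSD on machines that have it

# Create (if needed) your own subdirectory under the scratch root and print its path.
scratch_dir() (
    dir="$SCRATCH_ROOT/$(id -un)"
    mkdir -p "$dir" && printf '%s\n' "$dir"
)

# Copy inputs to local storage, run a command there, then copy the results back.
# $1: input directory on networked storage; $2: directory for results; rest: command
run_in_scratch() (
    input=$1; results=$2; shift 2
    base=$(scratch_dir) || exit 1
    work=$(mktemp -d "$base/run.XXXXXX") || exit 1
    cp -r "$input" "$work/input" || exit 1
    mkdir -p "$work/output" "$results" || exit 1
    cd "$work" || exit 1
    "$@" || exit 1                  # the command reads ./input and writes ./output
    cp -r "$work/output/." "$results/" || exit 1
    cd / && rm -rf "$work"          # free the local disk once results are saved
)
```

For example, run_in_scratch /home/abc123/data /home/abc123/results sh -c 'sort input/reads.txt > output/sorted.txt' would leave sorted.txt in /home/abc123/results (the paths and file names are hypothetical). If the command fails, the run directory is left in place so that partial output can be inspected.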

Local storage automatic deletion (rental servers only)

When your reservation ends, the contents of /workdir and /SSD may be wiped automatically to make space for the next user's data. Therefore, any important files (calculation results, for example) need to be transferred to your home directory before you log out. Once your computations are done, files in /workdir or /SSD should be transferred off the workstation and deleted. Unfortunately, many users leave these files behind, creating disk space problems for later users. To prevent this, we have implemented an automated cleaning procedure that removes old files from /workdir at 3:00am every day. The rules for removing old files are:

  • Files of the current reservation are NEVER deleted.

  • Files of reservations that ended more than 7 days ago are ALWAYS deleted.

  • If there is more than 50% of free disk space available, files of the 2 most recent previous reservations are not deleted in addition to the current reservation (if any).

  • If there is less than 50% but more than 30% of free disk space available, files of the 2 most recent reservations are not deleted (including the current reservation, if any).

  • If there is less than 30% of free disk space available, files of the most recent reservation are not deleted (including current reservation, if any).

If you need to clean /workdir at a time other than 3:00am, you can run the script /programs/config/clean_workdir. It starts the same procedure that runs daily at 3:00am.
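
The deletion rules above can be summarized as a small decision procedure. The sketch below only illustrates those rules and is not the actual clean_workdir script; the function names are hypothetical, and the treatment of exactly 50% or 30% free space is an assumption because the rules do not specify it.

```shell
# Approximate free space on a filesystem, in whole percent (100 minus df's "used" column).
free_percent() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Print how many *previous* reservations keep their files.
# $1: free disk space in whole percent; $2: 1 if a reservation is active now, else 0
# Assumption: exactly 50% and exactly 30% fall into the lower band.
kept_previous_reservations() (
    free_pct=$1
    has_current=$2
    if [ "$free_pct" -gt 50 ]; then
        echo 2                          # two previous, in addition to the current one
    elif [ "$free_pct" -gt 30 ]; then
        echo $((2 - has_current))       # two in total, counting the current one
    else
        echo $((1 - has_current))       # one in total, counting the current one
    fi
)

# Decide whether the files of one previous reservation would be removed.
# $1: rank among previous reservations (1 = most recently ended)
# $2: whole days since it ended; $3: free percent; $4: 1 if a reservation is active
would_delete() (
    rank=$1; days_ended=$2
    keep=$(kept_previous_reservations "$3" "$4")
    if [ "$days_ended" -gt 7 ]; then
        echo yes                        # ended more than 7 days ago: always deleted
    elif [ "$rank" -le "$keep" ]; then
        echo no
    else
        echo yes
    fi
)
```

A call such as would_delete 2 3 "$(free_percent /workdir)" 1 asks whether the second most recent previous reservation, which ended 3 days ago, would lose its files while another reservation is active. Files of the current reservation are never deleted, so they are not modeled here.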


Transferring and sharing data

Networked storage (/home and /home2) is available on all workstations and login nodes. Data can be transferred to and from networked storage without a reservation; all you need is an active BioHPC user account. The best way to transfer data is with the scp or sftp protocol (FileZilla is a common Windows client). For a step-by-step explanation of data transfer, please refer to "Access".

You can share your files with other BioHPC Cloud users by setting file/directory permissions. For external users (without a BioHPC account), you can share data via Globus (see "Using Globus to Share Data"), or you can create a temporary guest account.
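
As one possible illustration of permission-based sharing, the sketch below hands a directory tree to a group and grants that group read access. The function name share_with_group and the names in the usage line are hypothetical; adapt them to your own group. Every parent directory must also be searchable by the group for the shared path to be reachable.

```shell
# Give members of a group read access to a directory tree.
# $1: directory to share; $2: name of a group you belong to
share_with_group() (
    dir=$1; group=$2
    chgrp -R "$group" "$dir" || exit 1
    # g+rX: read on everything, plus search (x) on directories so they can be entered;
    # files gain group execute only if they were already executable by someone.
    chmod -R g+rX "$dir"
)
```

A call might look like share_with_group /home/abc123Lab/shared_results abc123Lab. This grants read access only; group members cannot modify the files unless write permission is added separately.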

Members of the Molecular Biology & Genetics and Entomology Departments have access to networked storage space paid for by their respective Department. Other departments and groups have hosted file servers with mountable network filesystems. Instructions on how to mount (map) this storage to your PC or Mac are available in this document.



Network storage free allocation and quotas

All users are granted a modest amount of free storage for their home/home2 directories, to ensure that they can perform basic operations without needing to purchase storage. The amount of free storage is determined as follows:

  1. Users who are associated with an active BioHPC credit account, who have a current reservation (including on hosted servers), or who have purchased additional paid storage receive 200GB of free storage for their home directory.
  2. All other users receive 20GB of free storage.

Quotas are set for home and home2 directories and for paid storage directories. All quotas are soft: you are not prevented from writing to the filesystem when the quota is exceeded.

  • For unpaid home directories, your quota is either 200GB or 20GB (according to your free storage allocation, see above). If you exceed the quota, you will receive frequent email notifications and are expected to address the issue promptly. If you do not remove the excess data or purchase storage credits, after some time your account will be locked, and eventually your data will be deleted.
  • For paid storage, you can set your own quota. These quotas are informational only. When you exceed the quota, you will receive a notification by email, and your quota will then be increased automatically. In this case, the quota can be thought of as a 'warning threshold': it is designed to keep you aware of how much storage you are paying for.
    • To change your warning threshold, go to the My Storage page and click the 'Purchase or modify storage' button under the appropriate directory. You can choose to purchase 0 units and change the Warning Threshold only. The threshold must be higher than the amount of storage you currently have.
    • By default, quota warning emails for group storage are sent to all group members. To change this, contact BioHPC staff.

Checking storage usage

There are several tools available to see how much storage you are using:

  • On the webpage, you can check your total network storage usage, broken down by user and group storage: My Storage page. For purchased storage, it will also give your storage credit balance, as well as an 'expiration date': this is just a calculation of when you will run out of storage credits assuming no change to your current usage.
  • lfs-du-project command: The command lfs-du-project /home/abc123 will (instantaneously) give you the total size of the directory /home/abc123. However, it can only give total sizes for entire home or group storage directories. If you try to use this command to get the size of a sub-directory within your home, it will still return the size of your entire home directory. This command also works for top-level group/user directories in /home2.
  • du command: The command du -sh /home/abc123/directory will give the total size of the directory /home/abc123/directory and its contents (including sub-directories). Or, a command like du -ha --max-depth=1 /home/abc123 | sort -h will give the total size of all files/directories within /home/abc123, sorted by size. However, if there are a large number of files contained in the directory, the du command can be very slow.
  • ncdu command: If you need to call du repeatedly, ncdu is a better choice. The command ncdu /home/abc123 will scan the entire directory /home/abc123 once (which takes about as long as a single du call), and then open a text-based, interactive browser where you can explore the contents of the directory. Once the scan is finished, you will see all the contents of the directory, sorted by size. Any entry beginning with a / is a directory. You can use the right arrow (or enter) to go inside a directory and explore its contents, and the left arrow to return to the parent directory. Up and down arrows navigate through the files/directories, and you can press d to delete files (or entire sub-directories). (The delete command will ask for confirmation, but once you confirm, the files are permanently deleted!) You can also press q to quit or ? for help. Because the scan can take some time, we recommend running this command inside a persistent session (such as screen or VNC), so that the results of the scan are not lost if your connection to the server is interrupted.
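
The du invocation described above can be wrapped in small helpers for repeated use. This is a sketch with hypothetical function names; it relies on the --max-depth option of du and the -h option of sort exactly as they appear in the command above.

```shell
# List the immediate contents of a directory by size, largest last.
# $1: directory to inspect (default: your home directory)
usage_by_size() {
    du -ha --max-depth=1 "${1:-$HOME}" | sort -h
}

# Show only the N largest entries.
# $1: directory; $2: how many entries to show (default 10)
largest_entries() {
    usage_by_size "$1" | tail -n "${2:-10}"
}
```

For example, largest_entries /home/abc123 5 prints the five largest entries, and the last line is normally the total for /home/abc123 itself. As with any du call, this can be slow on directories that contain a large number of files.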

Elevated permissions for managing storage

We have some tools that allow designated users to manage their group's files with root privileges. There are two levels of access possible, providing the following commands:


  • read-only access: sudo-ls, sudo-du
  • write access: sudo-chown, sudo-chgrp, sudo-chmod, sudo-ncdu

The above commands are root-access versions of their counterpart commands (without the sudo- prefix), with the same syntax/options, but with some limitations. They first check that any files accessed by the commands belong to the group where the user has authorized access (i.e., on a hosted server in /local/storage, /workdir, or /SSD, or in /home/groupName, /home2/groupName, or the user's own home directories). Additionally, some options to the above commands have been disabled (for example, options that allow following symbolic links, or entering a shell in ncdu).

Access to these commands can be granted to any user, at the request of the PI/owner of a hosted machine or BioHPC storage group, at either read- or write-levels. The PI should email support@biohpc.cornell.edu to request access for anyone in their group.

Note that a command like sudo-ncdu /workdir or sudo-ncdu /local/storage can be very useful for storage management on hosted servers, as you can interactively browse contents of directories and delete unnecessary files.


Purchasing storage

If you need more storage than the free allocation, you can purchase storage credits. The credits can be applied to your home directory, or to a shared group directory.

  • Storage rates and charging

    Storage may be purchased on the My Storage page using a Cornell account, a credit card, or a purchase order (for pre-authorized users only). Storage credits are purchased in units of 'Terabyte-years', at a cost of $105.00 per TB-year (sold in whole units only). At BioHPC, you only pay for storage you actually use. For example, 1 TB-year of storage credit pays for 1 TB of storage for one year, 2 TB for half a year, or 0.5 TB for 2 years. Your storage credit balance is updated every day based on a daily snapshot of your actual usage.

    If you run out of storage credits, you will be informed by email and asked to purchase additional credits. You will continue to accrue a negative balance, unless your storage usage falls below your free storage allocation. If you do not address a negative balance promptly, your account will be locked, and eventually your data will be deleted.

  • Purchased storage directory

    There are two options to consider when you purchase storage:

    1. Purchase storage for a home/home2 directory: This can be done by navigating to My Storage Page, and clicking on 'Purchase or modify home directory storage'. With this option, you will only be charged for usage in your home directory that exceeds your free storage allocation.
    2. Purchase storage for a shared directory: This is a good option for shared storage space within a lab, or for a group project. To get started with this option, you need to contact BioHPC and request the creation of a storage group. You will need to choose a group name and a group owner. If the name of the group is abc123Lab, then the storage directory will be /home/abc123Lab (and /home2/abc123Lab), and the group owner will be able to add or remove group members by navigating to the My Groups page. All group members will see the directory listed on the My Storage Page and can purchase storage credits for it, using the link 'Add or modify abc123Lab storage'.

      Home directories in shared directories: It can be convenient to move the home directories of users to a shared directory with purchased storage. This way, only one storage purchase is necessary to account for the storage space used by multiple users. Additionally, it can keep home directories of group members organized in a single location. Moving home directory applies both to /home and /home2 directories.

      • Group owners can move a group member's home to group storage by going to My Groups and clicking 'Group users'; each user listed there has a link to move their home to group storage (or to remove it).
      • Home directories are moved by creating a symbolic link from /home/username to /home/abc123Lab/username. In this way, the move is fairly invisible to the user, and the user can access their home at either of these paths.
      • When a user's home is moved to paid group storage, they still receive up to 200GB of free storage for their home directory. The size of the user's home, up to the user's free storage allocation, is subtracted from total group storage usage each day before the group storage credit balance is updated.
      • Note: it is not possible to move a home directory to group storage if the home directory contains purchased storage. In this case, contact BioHPC staff to help combine the storage accounts first.

      Permissions in shared directories: See this document for instructions on giving all group members read permission within the shared directory.
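
The charging rules in this section reduce to simple arithmetic, sketched below with awk. This is an illustrative estimate rather than the BioHPC billing code: the function names are hypothetical, a 365-day year is assumed, and billable_group_gb assumes that every moved home directory has the same free allocation.

```shell
PRICE_PER_TB_YEAR=105    # USD per TB-year, the rate quoted in this guide

# Whole TB-year units needed to cover a constant usage level for some period.
# $1: usage in TB; $2: free allocation in TB (0 if none applies); $3: years
units_needed() {
    awk -v use="$1" -v free="$2" -v yrs="$3" 'BEGIN {
        billable = use - free; if (billable < 0) billable = 0
        need = billable * yrs
        units = int(need); if (units < need) units++   # sold in whole units only
        print units
    }'
}

# Cost in USD of a whole number of TB-year units.
cost_usd() { echo $(( $1 * PRICE_PER_TB_YEAR )); }

# Days until a credit balance runs out if usage stays constant.
# $1: balance in TB-years; $2: usage in TB; $3: free allocation in TB
days_remaining() {
    awk -v bal="$1" -v use="$2" -v free="$3" 'BEGIN {
        billable = use - free
        if (billable <= 0) { print "unlimited"; exit }
        print int(bal / billable * 365)
    }'
}

# Billable usage of a group directory, in GB, after crediting moved home directories.
# stdin: one "username home_size_GB" line per home moved into the group directory
# $1: total group directory usage in GB; $2: per-user free allocation in GB
billable_group_gb() {
    awk -v total="$1" -v free="$2" '
        { credit += ($2 < free) ? $2 : free }   # credit each home up to its allocation
        END { b = total - credit; print (b > 0 ? b : 0) }'
}
```

For instance, units_needed 2.5 0.2 1 reports 3 units for a 2.5 TB home directory with a 0.2 TB free allocation over one year, and cost_usd 3 prices those units at 315 dollars. Likewise, days_remaining 1 2 0 gives 182, matching the statement that 1 TB-year pays for 2 TB for half a year.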


Data Safety and back-ups

BioHPC storage is NOT automatically backed up; only one copy of each file is stored. It is each user's responsibility to make sure critical or irreplaceable data is mirrored or backed up to another physical location. Keeping two copies of the same data on the same networked storage is NOT a proper backup. Backups are available as a separate service; details are available on this page.

Each storage array component of our network storage cluster is raidz3 (the ZFS equivalent of RAID7). On /home each file is localized, i.e., stored on one component server, so the data safety of a file on /home is equivalent to the safety level of a single RAID7 storage array. On /home2 large files are spread among multiple component servers, so the data safety of a large file on /home2 is equivalent to the combined safety of several RAID7 storage arrays. For both /home and /home2, in practical terms, this means that a simultaneous failure of three hard drives in each of the component servers will NOT cause any data loss, and in fact will not even disrupt data access. The health of the disks is monitored constantly by BioHPC staff, and periodic scans are carried out to find and correct bit rot and other problems.

While this arrangement sounds safe, we cannot guarantee the safety of your data. We have not yet experienced a fourth disk failure leading to data loss on a Lustre server, but the probability of it occurring is not negligible. However, a much more common scenario of data loss is that a user deletes their own data by mistake. We therefore strongly recommend implementing a backup policy for important data.

 


 
