BioHPC Storage
On the Linux workstations, you have access to two kinds
of storage:
-
Networked storage is mounted on all BioHPC servers
under /home, /home2, /programs, and /shared_data.
Each user is given a modest amount of free network storage to use in /home and /home2,
with additional network storage available for purchase.
-
Local storage is storage physically attached to
a specific BioHPC server, and is mounted at /workdir and /SSD.
Local storage is available to a user for the
duration of a reservation (or permanently for group members of hosted machines).
BioHPC storage (network or local) is not backed up by default.
We strongly encourage all users to develop and
implement a backup plan. BioHPC does provide backups as a separate
service; details are available on this page.
Networked Storage Overview
BioHPC Cloud currently has a network storage system with a total capacity of 2,766TB (2.7PB).
The storage is implemented as two Lustre clusters: the cluster mounted at /home has a capacity of 2,012TB,
and the cluster mounted at /home2 has a capacity of 754TB. A limited amount of free storage is available to each BioHPC
user (see "Network storage free allocation and quotas" below).
Any user can purchase extra storage for $105.00 per TB per
year, which is one of the lowest storage prices available anywhere. We
can offer this low price because we buy storage in bulk and
charge only enough to recover our actual costs, i.e. hardware, computer room, and
maintenance costs.
Home and Home2
Your home directory is part of the network storage system and is located at /home/username,
e.g., /home/abc123. This is where you end up right after you log into a machine, and where you should keep your configuration files.
For every top-level directory in /home, there is a
directory in /home2 with the same name and permissions. For
example, in addition to user abc123's home directory /home/abc123, there
is also a directory /home2/abc123, ready to be used by anyone with
sufficient access rights. Likewise, for each group directory /home/mylabgroup
there is a /home2/mylabgroup directory with the same permissions, ready to be used.
Both /home and /home2 are network-mounted high-performance
file systems, but /home has been optimized for safety, whereas /home2 has been optimized for speed.
Both /home and /home2 can be accessed from all BioHPC machines.
The differences between the directories located in /home and /home2 are:
-
You may NOT run computations directly on data residing in /home.
However, you may run computations on files in /home2. Alternatively,
you can copy files to local storage before computing.
-
/home has been configured for data safety, whereas /home2 has been
configured for maximum access speed. The difference in safety is fairly
small overall. On /home2, large files are spread out among multiple
storage units, allowing for fast parallel data transfers, whereas each
file in /home is localized on one storage unit.
-
Your home directory resides in /home. This is where you start when you
log in, and you should keep user configuration files there.
Any quota applies to the combined storage of both /home and /home2
directories. If you have paid storage, your usage is likewise calculated
from the combined storage of both /home and /home2 directories. Initially,
/home2 directories are empty, so they do not affect your balance until you
place data in /home2. Data located in /home2 directories can be backed up
using the standard BioHPC backup system but is not backed up by default.
Our suggested strategy for using both storage systems is to use /home
for long-term storage, and /home2 for computing and short-term storage.
The advantage of using /home2 for computing is that you don't need to
copy data to /workdir and then copy the output back. Please note that
local storage (/workdir or /local/storage) is still faster than /home2
storage, especially on modern servers equipped with SSD and NVMe devices.
Local storage
Each machine has local storage located
in the directory /workdir (regular disks) or /SSD (SSD storage on selected
workstations). The actual amount of storage available is listed on the
Reservations page under the machine name (e.g. "4TB HDD; 1TB SSD" means
4TB regular disk storage under /workdir and 1TB of SSD storage under
/SSD). When logged into a machine, you can check the amount of
used/available storage on the command-line by executing a command like
"df -h /workdir".
After logging in, each user should create their own subdirectory
under /workdir or /SSD (e.g., /workdir/abc123) and put all the files to be
processed in that subdirectory rather than in the home directory. When
launching an application, make sure that it always reads and writes files
on local disk. This is usually accomplished by executing a "change
directory" command such as "cd /workdir/abc123" and starting the
application from there.
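The setup described above could be sketched in shell as follows. This is a minimal sketch, not an official BioHPC script: the function name stage_in and the project path in the usage comment are assumptions made for illustration, and the local storage root (normally /workdir or /SSD) is passed in as a parameter.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a BioHPC tool): create a per-user directory
# under a local storage root and copy the input data into it.
stage_in() {
    local local_root="$1"   # e.g. /workdir or /SSD
    local user_name="$2"    # e.g. abc123
    local input_dir="$3"    # directory holding the input files
    local work_dir="${local_root}/${user_name}"

    mkdir -p "${work_dir}" || return 1
    # -a preserves timestamps and permissions; the directory itself is
    # copied, so the data ends up in ${work_dir}/<name of input_dir>.
    cp -a "${input_dir}" "${work_dir}/" || return 1
    # Print the working directory so that the caller can cd into it.
    echo "${work_dir}"
}

# Example usage on a reserved machine (commented out so that sourcing
# this file has no side effects; /home/abc123/project1 is a made-up path):
#   cd "$(stage_in /workdir abc123 /home/abc123/project1)"
#   df -h /workdir    # check used/available local space
```

Starting the application from the printed directory keeps its relative reads and writes on local disk.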
Note: The directory /workdir (and its subdirectories) is local to each
machine, i.e., /workdir on cbsuwrkst2 is not accessible from cbsuwrkst3,
etc.
Local storage automatic deletion (rental servers only)
When your reservation ends, the contents of /workdir and /SSD
may be wiped out automatically to make space for the next user's data.
Therefore, any important files (calculation results, for example) need to
be transferred to your home directory before you log out.
Files in /workdir or /SSD should be transferred off the workstation
and deleted after the computations are done. Unfortunately, many users
leave these files behind, creating disk space problems for later users.
To prevent this, we have implemented an automated cleaning procedure
that removes old files from /workdir at 3:00am every day. The rules for
removing old files are:
-
Files of the current reservation are NEVER deleted.
-
Files of reservations that ended more than 7 days
ago are ALWAYS deleted.
-
If more than 50% of disk space is free, files of the 2 most recent
previous reservations are kept, in addition to those of the current
reservation (if any).
-
If less than 50% but more than 10% of disk space is free, files of
the 2 most recent reservations are kept (including the current
reservation, if any).
-
If less than 10% of disk space is free, only files of the most
recent reservation are kept (including the current reservation, if any).
If you need to clean /workdir at a time other than 3:00am, you can run
the script /programs/config/clean_workdir, which starts the same
procedure that runs automatically at 3:00am.
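The retention rules can be modelled in a few lines of shell. This is only an illustrative sketch of one literal reading of the rules above, not the actual /programs/config/clean_workdir script. The function names are hypothetical, and because the rules do not say what happens at exactly 50% or 10% free space, the sketch assumes those boundary values fall into the stricter tier.

```shell
#!/usr/bin/env bash
# Illustrative model of the stated cleaning rules (assumption-laden sketch,
# NOT the real clean_workdir implementation).

# Number of PREVIOUS (already ended) reservations whose files are kept.
previous_kept() {
    local free_pct="$1"      # integer percentage of free disk space, 0-100
    local has_current="$2"   # 1 if a reservation is currently active, else 0
    if [ "${free_pct}" -gt 50 ]; then
        echo 2                        # 2 previous, in addition to the current
    elif [ "${free_pct}" -gt 10 ]; then
        echo $(( 2 - has_current ))   # 2 most recent, counting the current
    else
        echo $(( 1 - has_current ))   # 1 most recent, counting the current
    fi
}

# Would files of one previous reservation survive the nightly cleanup?
# rank: 1 = most recently ended reservation, 2 = the one before it, ...
previous_files_kept() {
    local rank="$1" days_since_end="$2" free_pct="$3" has_current="$4"
    if [ "${days_since_end}" -gt 7 ]; then
        echo no    # ended more than 7 days ago: always deleted
    elif [ "${rank}" -le "$(previous_kept "${free_pct}" "${has_current}")" ]; then
        echo yes
    else
        echo no
    fi
}
```

Files of the current reservation are not modelled here because they are never deleted.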
Transferring and sharing data
Networked storage (/home and /home2) is available on all workstations
and login nodes. Data can be transferred to and from networked storage
without a reservation; all you need is an active BioHPC user account.
The best way to transfer data is to use the scp or sftp protocol (a
common Windows client is FileZilla). For a step-by-step explanation of
data transfer, please refer to "Access".
You can share your files with other BioHPC Cloud users
by setting file/directory permissions. For external users (without a
BioHPC account), you can share data via Globus (see "Using
Globus to Share Data"), or you can create a temporary guest account.
Members of the Molecular Biology & Genetics and Entomology
Departments have access to networked storage space paid for by their
respective Department. Other departments and groups have hosted file servers with mountable network filesystems.
Instructions for how to mount (map) this storage
to your PC or Mac are available in this document.
Network storage free allocation and quotas
All users are granted a modest amount of free storage
for their home/home2 directories, to ensure that they can perform basic
operations without needing to purchase storage. The amount of free
storage is determined as follows:
-
Users who are associated with active BioHPC credit accounts or hosted servers,
or who have purchased additional paid storage, receive 200GB of free
storage for their home directory.
- All other users receive 20GB of free storage.
Quotas are set for home and home2 directories and for paid storage
directories. All quotas are soft: you are not
prevented from writing to the filesystem when the quota is exceeded.
-
For unpaid home directories, your quota is either 200GB or
20GB (according to your free storage allocation, see above). If you
exceed the quota, you will receive frequent email notifications and are
expected to address the issue promptly. If you do not remove the excess
data or purchase storage credits, your account will be locked after some
time, and eventually your data will be deleted.
-
For paid storage, you can set your own quota. These quotas are
informational only. When you exceed the quota, you will receive a
notification by email, and your quota will then be increased
automatically. In this case, the quota can be thought of as a 'warning
threshold': it is designed to help keep you aware of how much storage
you are paying for.
-
To change your warning threshold, go to the My Storage page and click
the 'Add or modify storage' button under the appropriate directory. You
can choose to purchase 0 units and change the Warning Threshold only.
The threshold must be higher than the amount of storage you currently have.
-
By default, quota warning emails for group storage are sent to all
group members. To modify this, contact BioHPC staff.
Checking storage usage
You can check your network storage usage and storage credit balance (for
purchased storage) on the My Storage page; balances are updated once
daily. For purchased storage, you will also see an 'expiration date':
this is just a calculation of when you will run out of credits if your
storage usage does not change.
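The arithmetic behind such an expiration estimate can be sketched as follows. This is an illustration, not the calculation BioHPC actually runs. It assumes that credits are counted in terabyte-years (as described under "Purchasing storage"), that a year has 365 days, that usage stays constant, and that only usage above the free allocation is billed. The function name days_until_empty is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical estimate of how many days a storage credit balance lasts.
# Usage: days_until_empty CREDITS_TB_YEARS USAGE_TB FREE_TB
days_until_empty() {
    awk -v credits="$1" -v usage="$2" -v free="$3" 'BEGIN {
        billable = usage - free              # TB that actually consume credits
        if (billable <= 0) { print "never"; exit }
        # credits (TB-years) * 365 (days/year) / billable (TB) = days
        printf "%d\n", credits * 365 / billable
    }'
}
```

For example, `days_until_empty 1 2 0` prints 182, in line with the statement that one terabyte-year covers 2 TB for half a year. Fractional days are truncated.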
You can also check your storage from the Linux command line while
logged into any BioHPC machine. While the traditional command "du" may
be too slow for networked directories with many files, the in-house
command "lfs-du" (Large-File-System du) is much faster and can be used
to see the size of any files or directories owned by members of your
group. The command simply takes a list of files or directories and
returns the total size of each argument. Unlike the My Storage page,
where results are updated daily, the lfs-du command provides almost
real-time results (it may take several seconds for file system changes
to be reflected in lfs-du results).
For local storage, the size of a directory can be assessed with the 'du' command. For example,
'du -sh /workdir/abc123/mydir' will give the total size of the directory /workdir/abc123/mydir. A command like:
'du -sh /workdir/abc123/mydir/* | sort -h' will get the size of each file/directory inside /workdir/abc123/mydir, and sort
results by size.
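When a directory has many entries, it can help to list only the largest ones. The following is a small sketch along the same lines as the commands above. The function name largest_entries is hypothetical, and it uses "du -sk" with a numeric sort instead of "sort -h" so that it does not depend on GNU-specific options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the N largest entries directly inside a directory.
# Sizes are printed by du in kilobytes (-k). Hidden entries (names starting
# with a dot) are skipped because the * glob does not match them.
largest_entries() {
    local dir="$1"
    local count="${2:-10}"   # default: show the 10 largest entries
    du -sk -- "${dir}"/* | sort -rn | head -n "${count}"
}

# Example (commented out; the path is the one used in the text above):
#   largest_entries /workdir/abc123/mydir 5
```

Each output line contains the size in kilobytes, a tab, and the path, largest first.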
Purchasing storage
If you need more storage than the free allocation, you
can purchase storage credits. The credits can be applied to your home directory, or to a shared group directory.
-
Storage rates and charging
Storage may be purchased on the My Storage page using a Cornell
Account, Credit Card, or Purchase Order (for pre-authorized users
only). Storage credits are purchased in units of 'Terabyte-years', at a
cost of $105.00 per TB-year (sold in whole units only). At
BioHPC, you only pay for storage you actually use. For
example, 1 TB-year of storage will purchase 1 TB of storage for one
year, or 2 TB for half a year, or 0.5 TB for 2 years. Your storage
credit balance is updated every day based on a daily snapshot of
your actual usage.
If you run out of storage credits, you will be informed by email
and asked to purchase additional credits. You will continue to
accrue a negative balance, unless your storage usage falls below
your free storage allocation. If you do not address a negative
balance promptly, your account will be locked, and eventually your
data will be deleted.
-
Purchased storage directory
There are two options to consider when you purchase storage:
-
Purchase storage for a home/home2 directory: This can
be done by navigating to My Storage Page,
and clicking on 'Add or modify home directory storage'. With this
option, you will only be charged for usage in your home directory
that exceeds your free storage allocation.
-
Purchase storage for a shared directory: This is a
good option for shared storage space within a lab, or for a
group project. To get started with this option, you need to contact BioHPC
and request the creation of a storage group. You will need to
choose a group name and a group owner. If the name of the group
is abc123Lab, then the storage directory will be /home/abc123Lab
(and /home2/abc123Lab), and the group owner will be able to add
or remove group members by navigating to the My Groups page. All
group members will see the directory listed on the My Storage Page and
can purchase storage credits for it, using the link 'Add or
modify abc123Lab storage'.
Home directories in shared directories: It can be
convenient to move the home directories of users to a shared
directory with purchased storage. This way, only one storage
purchase is necessary to account for the storage space used by
multiple users. Additionally, it can keep home directories of
group members organized in a single location. Moving a home
directory applies to both the /home and /home2 directories.
-
Group owners can move a group member's home to group
storage by going to My Groups and clicking 'Group users'. Each user
listed there has a link to move their home to (or remove it from)
group storage.
-
Home directories are moved by creating a symbolic link from
/home/username to /home/abc123Lab/username. In this way, the
move is fairly invisible to the user, and the user can access
their home at either of these paths.
-
When a user's home is moved to paid group storage, they
still receive up to 200GB of free storage for their home
directory. The size of the user's home, up to the user's free
storage allocation, is subtracted from total group storage
usage each day before the group storage credit balance is
updated.
-
Note: it is not possible to move a home directory to group
storage if the home directory contains purchased storage. In
this case, contact BioHPC staff to help combine the storage accounts first.
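The symbolic-link mechanism used for moved home directories can be illustrated with a small sketch. The move itself is performed by BioHPC through the My Groups page, so users do not run anything like this themselves. The function name and the parameterized root directory are assumptions made only to show why both paths keep working.

```shell
#!/usr/bin/env bash
# Illustration only (hypothetical helper, not the BioHPC procedure):
# move ROOT/USER into ROOT/GROUP/USER and leave a symbolic link behind.
# Assumes ROOT/GROUP already exists and ROOT/GROUP/USER does not.
move_home_to_group() {
    local root="$1" user_name="$2" group_name="$3"
    local old_path="${root}/${user_name}"
    local new_path="${root}/${group_name}/${user_name}"

    mv "${old_path}" "${new_path}" || return 1
    # The link keeps the old path valid, so logins and configuration
    # files that refer to ROOT/USER continue to work.
    ln -s "${new_path}" "${old_path}"
}

# Example in a scratch directory (commented out; paths are made up):
#   move_home_to_group /tmp/demo/home abc123 abc123Lab
#   ls -l /tmp/demo/home/abc123    # shows the link target
```

After the move, a file can be reached either through the link or through its path in the group directory.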
Permissions in shared directories: See this document for instructions
on giving all group members read permission within the shared directory.
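As a general illustration of what granting group read permission involves, the sketch below uses the standard chmod symbolic mode g+rX. This is a generic approach and not necessarily the procedure given in the linked document. The function name and the example path are hypothetical.

```shell
#!/usr/bin/env bash
# Generic sketch (not necessarily the BioHPC-recommended procedure):
# give the owning group read access to a whole directory tree.
grant_group_read() {
    local dir="$1"
    # g+r : group may read files and list directories
    # g+X : group may enter directories (and run files that are already
    #       executable by someone); plain data files stay non-executable
    chmod -R g+rX -- "${dir}"
}

# Example (commented out; hypothetical path inside a group directory):
#   grant_group_read /home/abc123Lab/shared_results
```

This only has the intended effect if the files belong to the lab's group and the parent directories are themselves accessible to group members.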
Data Safety and back-ups
BioHPC storage is NOT automatically backed up; only one copy of each
file is stored. It is each user's responsibility to make sure critical
or irreplaceable data is mirrored or backed up to another physical
location. Keeping two copies of the same data on the same networked
storage is NOT a proper backup. Backups are available as a separate
service; details are available on this page.
Each storage array component of our network storage cluster is
raidz3 (RAID7 equivalent in ZFS). On /home each file is localized, i.e.
stored on one component server; therefore the data safety of a
file on /home is equivalent to that of a single RAID7 storage array.
On /home2 large files are spread among multiple component
servers; therefore the data safety of a large file on /home2 is equivalent
to the combined safety of several RAID7 storage arrays. For both
/home and /home2, in practical terms, this means that a simultaneous
failure of three hard drives in each of the component servers will NOT
cause any data loss, and in fact will not even disrupt data access.
The health of the disks is monitored constantly by BioHPC staff,
and periodic scans are carried out to find and correct bit rot and
other problems.
While this arrangement sounds safe, we cannot guarantee the safety of
your data. We have not yet experienced a fourth disk failure leading to
data loss on a Lustre server, but the probability of it occurring is
not negligible. However, a much more common cause of data loss is a
user deleting their own data by mistake. We therefore strongly
recommend implementing a backup policy for important data.