Organize the storage on the cluster #77

Open
gkaf89 opened this issue Jul 25, 2024 · 1 comment · May be fixed by #96
gkaf89 commented Jul 25, 2024

Storage tiers

The storage is organized across multiple tiers. The distinguishing characteristics for the tiers are:

  • speed (throughput and latency),
  • size,
  • accessibility (temporal and locational persistency), and
  • robustness (redundancy and back-ups).

Usually

  • speed is inversely proportional to size, robustness, and accessibility, and
  • size, robustness, and accessibility are proportional to each other.

Only low speed storage (i.e. the Isilon NFS mount) will be accessible to all clusters in the future. Thus, Isilon will become crucial in the future in maintaining uniform data access across all clusters.

File systems accessible through the HPC Infiniband network

The HPC file systems are meant to store working data, not for long-term storage. The scratch file system and project directories store large temporary input/output files, and the home directory is meant for working storage. Local file systems are also accessible through /tmp (local persistent memory) and /dev/shm (virtual memory); they are fast, available in jobs, and wiped when the job finishes. Finally, project storage is meant to store finalized input and output files.

However, there are file systems that are accessible through slower network connections and offer different kinds of features.

File systems not accessible through Infiniband

The central university storage is slower, but snapshotted and backed up much more regularly. Therefore, users should transfer their data to the central systems for long-term storage.

However, there are multiple options for accessing the central university storage: the systems Atlas, Ebenezer, Isilon-DMZi, and Isilon-DMZe.

  • What is the difference between Atlas, Ebenezer, and Isilon?
  • What is the difference between Isilon-DMZi and Isilon-DMZe?
  • How are user quota managed in central storage systems, and how can users see the usage limits?

The Isilon file system

Isilon is actually the name of the technical solution: https://www.dell.com/fr-fr/dt/storage/isilon/isilon-h5600-hybrid-nas-storage.htm#scroll=off

To Hyacithe's knowledge, there are 2 central storage servers operated by the SIU: "isilon-prod" and "isilon-drs" (an off-site replica of "isilon-prod", in case of disaster on "isilon-prod").

The isilon-prod is split into (at least) two zones:

  • the SIU zone, which is accessed using SMB via atlas.uni.lux, and
  • the HPC zone, that is mounted in the clusters with NFS and can be accessed in /mnt/isilon.

For the HPC side, we are only interested in the NFS mounted file system. Documentation about Isilon: https://hpc-git.uni.lu/ulhpc/sysadmins/-/wikis/storage/isilon

  • The processes for the HPC zone are not well defined or documented. We can set up quota per project directory, but there's no way to show this information to the users. We are working on providing users with access to this information and setting up a policy for assigning quota.

  • We share the Isilon system with the SIU. There is a "fair use agreement" in place which allocates 2PB to the HPC zone, currently used at 88% of the full capacity. Maintaining access to the Isilon system is important moving forward, as the Isilon file system will be the only system unifying data access across our future clusters. We should participate in any future calls and coordinate with SIU.

  • Performance is abysmal for small random I/O, for instance small files, metadata operations, etc. The Isilon NFS mount works well for administrative needs, like archiving and occasional data transfers, and even for big file I/O. But don't try to perform any compute-driven operation on the NFS mounted Isilon, like compiling software on it, or anything similar.
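Since the Isilon NFS mount copes well with big files but poorly with many small ones, one way to move a directory of small files there is to pack it into a single archive first. The following is only a minimal sketch, assuming tar with gzip support is available; the function name archive_dir is hypothetical and not part of any cluster tooling.

```shell
#!/bin/sh
# Hypothetical helper: pack a directory into one compressed archive so that
# the transfer to the NFS mount is a single large sequential write.
archive_dir() {
    src_dir=$1    # directory containing many small files
    archive=$2    # destination archive, e.g. somewhere under /mnt/isilon
    # -C switches to the parent directory so that the archive stores
    # paths relative to it instead of absolute paths.
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}
```

Called as archive_dir followed by the source directory and the archive path, it writes one file at the destination; tar -tzf on the archive lists the contents without extracting them.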

The Atlas file system

The SMB protocol allows for easy mounting of file systems on personal computers, including Windows machines.

The HPC team is not managing the file system exported through SMB from Atlas (atlas.uni.lux). However, the HPC team maintains the smb-storage script (under active development) that allows mounting SMB shares on the login nodes of our clusters.

Fun fact: you can access the HPC zone via Samba on your workstation using your Active Directory credentials. This works via a fragile script that maps Windows/POSIX permissions and user accounts from the HPC-IPA to the SIU Active Directory. This was requested by LCSB Bio-core in 2014. The system still works but it is no longer supported. Honestly, if you are using Linux you can get the performance of SMB with SSHFS: https://blog.ja-ke.tech/2019/08/27/nas-performance-sshfs-nfs-smb.html

Add some instruction on how to fix errors in access permissions

The discussion of data management is a bit unorganized. We should probably reorganize the sections and add some information on how users can fix their projects when errors occur.

To fix access permissions in a project directory,

  1. change ownership,
chown -R :<project name> /work/projects/<project name>
  2. and then change access rights:
find /work/projects/<project name> -type d -exec chmod g=rxs {} +

Also, add a link with more resources: https://www.redhat.com/sysadmin/suid-sgid-sticky-bit
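To check which directories still need the fix, or to verify the result afterwards, something along these lines could be used. This is a sketch only: the function name list_dirs_without_setgid is hypothetical, and it relies on the standard find test -perm -2000, which matches entries that have the setgid bit set.

```shell
#!/bin/sh
# Hypothetical helper: print every directory under the given path that does
# NOT have the setgid bit (octal 2000) set.
list_dirs_without_setgid() {
    project_dir=$1
    find "$project_dir" -type d ! -perm -2000 -print
}
```

An empty output means every directory under the given path already has the bit set.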

@gkaf89 gkaf89 added the enhancement New feature or request label Jul 25, 2024
@gkaf89 gkaf89 self-assigned this Jul 25, 2024
@gkaf89 gkaf89 changed the title from "Add some instruction on how to fix errors in access permissions" to "organize the storage on the cluster" Aug 11, 2024

gkaf89 commented Aug 11, 2024

Excerpt from ticket:

You also need to make sure that all files created in your directories will have the correct permissions. Try this command:

find /work/projects/covalux/scratch_lschramm -type d | xargs chmod g=rxs

This command sets the setgid bit (https://www.redhat.com/sysadmin/suid-sgid-sticky-bit) on your directories. In the output of the ls -la command you will see the directories change from drwxr-xr-x to drwxr-sr-x.

Setting the setgid bit on a directory ensures that all files created in the directory will inherit the group of the directory. See: https://www.redhat.com/sysadmin/suid-sgid-sticky-bit
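The effect of chmod g=rxs can be tried safely on a throwaway directory before touching a project. The s that appears in the group triplet is the setgid bit. A minimal sketch, assuming a Linux system; the directory comes from mktemp and is not a real project path, and the variable names are illustrative only.

```shell
#!/bin/sh
# Show the permission string of a scratch directory before and after
# applying the same chmod expression used on the project directories.
demo_dir=$(mktemp -d)                       # throwaway directory
chmod 755 "$demo_dir"                       # start from rwxr-xr-x
before=$(ls -ld "$demo_dir" | cut -c1-10)   # keep only the mode string
chmod g=rxs "$demo_dir"                     # group: read, execute, setgid
after=$(ls -ld "$demo_dir" | cut -c1-10)
echo "before: $before"
echo "after:  $after"
rm -rf "$demo_dir"                          # clean up
```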

In the future, make sure that the files and directories you create in projects have the correct permissions. Remember, in project directories the quota are computed per project group (covalux in your case). Cluster users (clusterusers) have 0 quota in the project directory, so any complaints about insufficient storage may also be caused by incorrect user groups.
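Files that ended up in the wrong group can be located with find before changing anything. The following is a sketch under the assumption that the project group name (or numeric group ID) is known; the function name list_wrong_group is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every file and directory under the given path
# whose group differs from the expected project group.
list_wrong_group() {
    dir=$1
    group=$2    # group name or numeric group ID
    find "$dir" ! -group "$group" -print
}
```

For the case in the ticket, the call would be list_wrong_group /work/projects/covalux covalux; any path it prints belongs to another group, such as clusterusers, and counts against that group's quota instead.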

@gkaf89 gkaf89 changed the title from "organize the storage on the cluster" to "Organize the storage on the cluster" Aug 11, 2024
@gkaf89 gkaf89 linked a pull request Dec 5, 2024 that will close this issue