Survey 2022: Summary of findings

In mid-December 2022, CÉCI users were invited to respond to an anonymous satisfaction survey.

The text below offers a summary of all comments and suggestions made in the responses. This other document offers a condensed overview of the responses.


Documentation

One respondent suggested that typical submission scripts be available to copy from some storage place.

Some are already available in /CECI/proj/training/slurm/ on all CÉCI clusters; we will mention this directory in the documentation and add some more examples.
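
For illustration, a minimal submission script in the spirit of those examples could look like the sketch below (the module name, program name and resource values are hypothetical placeholders):

    #!/bin/bash
    # Minimal illustrative Slurm submission script; all values are placeholders.
    #SBATCH --job-name=my_test          # name shown in the queue
    #SBATCH --ntasks=1                  # a single task
    #SBATCH --cpus-per-task=4           # four cores for that task
    #SBATCH --mem-per-cpu=2048          # memory per core, in megabytes
    #SBATCH --time=01:00:00             # wall-clock time limit (hh:mm:ss)

    module load releases/2021b          # hypothetical software release module
    srun ./my_program                   # replace with the actual program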

One respondent reported that the documentation for launching a job for the first time was not very helpful.

We updated that part of the documentation and added a few paragraphs at the beginning to better set the context and introduce the key notions.

Jobs policies

One respondent would like to be able to submit resource requests that adapt to the available resources.

Slurm offers some flexibility in the number of nodes, for instance: #SBATCH --nodes=4-8 requests between 4 and 8 nodes, and Slurm will allocate as many nodes as are available (but no more than 8) as soon as at least 4 nodes are free. Unfortunately, that does not work at the CPU level.
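
As a sketch, such a flexible node request would appear in a submission script as follows (the program name and the other values are illustrative):

    #!/bin/bash
    #SBATCH --job-name=flexible_nodes
    #SBATCH --nodes=4-8            # accept anywhere between 4 and 8 nodes
    #SBATCH --ntasks-per-node=16   # illustrative value
    #SBATCH --time=02:00:00

    # Slurm exposes the number of nodes actually granted to the job
    echo "Running on ${SLURM_JOB_NUM_NODES} nodes"
    srun ./my_mpi_program          # hypothetical program name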

One respondent asked to have direct access to their data storage.

Access to data stored elsewhere than on the CÉCI clusters does not depend on the CÉCI cluster configuration; all frontends have access to outside networks, as do most compute nodes.

Data stored on the clusters can be accessed directly using any SSH-based software tool.
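
For instance (the user name, host name and paths below are hypothetical), standard SSH-based tools can be used from any machine that can reach the front-ends:

    # Copy a single file from a cluster front-end to the local machine
    scp myuser@cluster-frontend:results/output.tar.gz .

    # Synchronise a whole directory, resuming interrupted transfers
    rsync -av myuser@cluster-frontend:results/ ./results/

    # Mount a remote directory locally over SSH (requires sshfs)
    sshfs myuser@cluster-frontend:data ~/mnt/cluster-data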

One respondent complained that the maximum number of jobs per user forced them to submit multiple job arrays where a single one would have been sufficient.

The limits set on the number of jobs per user are chosen to make sure the job scheduler (Slurm) can allocate resources fairly without consuming too many resources itself. A solution to alleviate those restrictions is to use a workflow manager. Guidance on how to choose and use a workflow manager can be found in the documentation.
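
In the meantime, one pattern that reduces the number of submitted jobs, shown here as a sketch with hypothetical file and program names, is to pack several work items into a single job rather than submitting one job per item:

    #!/bin/bash
    #SBATCH --job-name=packed_tasks
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --time=06:00:00

    # Process a whole batch of inputs inside one job instead of one job per input
    for input in inputs/*.dat; do
        ./my_program "${input}" > "outputs/$(basename "${input}" .dat).out"
    done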

One respondent complained about not having sufficient space in their home directory to run all the jobs they would like to run.

The quotas on the home directories are limited because home directories are designed to hold software, configuration files, and small data files, but not large input files for jobs to consume or large output files produced by jobs.

Such large files belong on the global scratch filesystem, where there is no such restriction. Note that they should reside there only temporarily; when the jobs are done, they should be removed.

More details can be found in the documentation.
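
As an illustrative sketch (the $GLOBALSCRATCH variable and the file names are assumptions; the storage documentation gives the actual paths for each cluster), a job would typically stage its large files on the scratch filesystem and clean up afterwards:

    #!/bin/bash
    #SBATCH --job-name=scratch_example
    #SBATCH --time=04:00:00
    #SBATCH --mem-per-cpu=2048

    # $GLOBALSCRATCH is assumed to point to the global scratch filesystem;
    # check the storage documentation for the variable used on your cluster.
    WORKDIR="${GLOBALSCRATCH}/${SLURM_JOB_ID}"
    mkdir -p "${WORKDIR}"
    cd "${WORKDIR}"

    cp ~/inputs/config.dat .                # small input kept in the home directory
    ./my_program config.dat > result.dat    # large output written on scratch

    cp result.dat ~/results/                # keep only what is really needed
    cd && rm -rf "${WORKDIR}"               # clean up the scratch space when done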

Some respondents hoped to have access to more GPUs.

Fortunately, cluster users can now get access to generous amounts of GPU resources by submitting projects to use the Tier-0 Lumi or Tier-1 Lucia infrastructures. In the coming months, the future Tier-2 CÉCI cluster 'Lyra' will also offer a large number of GPUs.

Storage

One respondent complained that when a file is removed from one cluster, on the central CÉCI storage, it takes time for it to disappear from the other clusters.

That is a direct consequence of the design of the central storage. The positive side of that design is that write operations from a cluster are very fast and do not depend on the performance of the network connecting the universities. At the time that setup was provisioned, the existence and performance of the dedicated network we currently enjoy were uncertain, which prompted the choice of an asynchronous replication solution.

That system has reached its end of life, and we are in the process of procuring a new one that will enable synchronous replication, so that a file removed on one cluster instantly disappears as seen from the other clusters. Write performance will suffer a bit, but we anticipate the difference with the current system to be limited.

One respondent asked whether the common storage could be organised like the home directories rather than allowing everyone to create files at its root.

The CÉCI home directories are currently organised like the local home directories on the clusters, with one directory per user. That was not the case for the TRSF directory; we recently set up a similar organisation on that partition too.

Suggestions

One respondent reported that the software they must use is available only with Docker and that the container could not be 'translated' to Singularity for some reason.

Docker is indeed not available on the CÉCI clusters, but upcoming versions of the scheduler installed on the clusters (Slurm) will offer increasing interoperability with all sorts of containers (Docker, Podman, etc.) and be compatible with Kubernetes- or Docker Compose-style workloads.
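
For images where the usual conversion does work, the typical approach with Singularity (or its successor Apptainer) looks like the following sketch (the image name is hypothetical):

    # Build a Singularity/Apptainer image directly from a Docker registry
    # (the image name is hypothetical; private registries require authentication)
    singularity pull mytool.sif docker://someuser/mytool:latest

    # Run the containerised tool, for instance inside a job script
    singularity exec mytool.sif mytool --help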

One respondent requested that users be able to change their login shell (e.g. to zsh).

The typical command to change the default shell, chsh, is not effective on the CÉCI clusters because the login shell is part of the user record that is centrally managed in the common user directory (LDAP). Enabling users to choose a login shell would require a modification of the login management system, and more administration to make sure all the proposed shells are indeed available on all the clusters, and that they are compatible with and properly configured for lmod and the use of software modules.

The solution we suggest is described in the documentation. It involves adding two lines to the .bashrc file and is transparent to the user.
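
As a rough sketch of that idea (the exact lines given in the documentation may differ), the .bashrc can hand over to zsh when it is available and the session is interactive:

    # Sketch only; see the documentation for the recommended lines.
    if [[ $- == *i* ]] && command -v zsh >/dev/null 2>&1; then
        exec zsh -l
    fi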

One respondent reported that sometimes the upgrade of software such as Python on the clusters may require users to re-run the installation/compilation of their software for their jobs to run as before, which can be a lengthy operation.

The policy at CÉCI is to never "uninstall" software; whenever a new Python version is installed, for instance, the previous "modules" still exist and can be used, unless they contain critical security issues. Every year, the default release changes, meaning that you get a different, updated module when you load a module without specifying the version, but the previous releases remain available.
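
In practice, pinning the version in the job script keeps the environment stable across release changes (the module names below are illustrative; the actual names depend on the cluster):

    # Loads whatever the current default release provides; may change every year
    module load Python

    # Loads a specific version, which remains available after the default changes
    module load Python/3.9.6-GCCcore-11.2.0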

One user complained that the scheduled downtimes sometimes fall at unfortunate times with respect to the deadlines users might face for projects or publications.

We try to organise the maintenance periods at times when they have the least impact on users, but with around 500 active CÉCI users, it is unfortunately impossible to find a schedule that suits everyone.

One respondent suggested that we offer not only introductory training sessions but also advanced sessions, on the use of GPUs for machine learning, for instance.

Unfortunately, organising advanced training sessions requires resources (time and expertise) that we might not have. Fortunately though, CÉCI users have access to training sessions organised at the national (VSC) and European levels (PRACE and EuroCC), often free of charge:

  • https://www.vscentrum.be/vsctraining
  • https://training.prace-ri.eu
  • https://www.eurocc-access.eu/services/training/