Survey 2021: Summary of findings

In mid-December 2021, CÉCI users were invited to respond to an anonymous satisfaction survey.

More than 95 users responded to the survey, out of approximately 750 users active in the past few months. They came from all CÉCI universities and have very diverse research interests in science and engineering. We thank those users for the time they took to fill in the survey.

The text below offers a summary of all comments and suggestions made in the responses. This other document offers a synthetic view of the responses.

Account and connection

Multiple users requested to be removed from the CECI mailing list.

The CECI mailing list is used sparingly to announce disruptions of service on the CECI infrastructure and CECI-organised events (training sessions, user meetings, surveys, etc.). On average, we send a dozen emails per year to that list, that is roughly one a month. It is very important that users are aware of such announcements, so the policy is that users who request to be removed from the mailing list have their accounts cancelled.

One respondent complained that the rules for creating an account are not clear enough, especially when there are costs involved.

Some universities charge for the funding of the local team and local hardware, but that is independent of CECI. We can only advise reading the local websites and/or contacting the local sysadmin for further information.

Several respondents mentioned finding it hard to create an account and connect

We have tried to make the documentation as thorough as possible, and have made the videos of the training sessions available on YouTube. If you find parts of them that are not clear, do not hesitate to contact us so that we can further improve the documentation. If you try to find your way without reading the documentation and face issues, please read the documentation.

One respondent complained that access to the cluster was denied during the time it took to renew their account.

Accounts expire every year; a warning email is sent one month in advance. That is plenty of time to renew an account. If you run into a problem, such as not being able to reach the website or not getting the SSH key file, you can always contact the system administrators for help.

Several respondents complained that it is not possible to renew a CECI account from home.

Since the beginning of the Covid-19 pandemic, we have made the CECI account management website available from all of Belgium rather than only from the university networks. This is still the case at the time of writing. Opening it more widely, outside of Belgium, is not advisable for security reasons.

One respondent complained that the policy regarding SSH is strange, forcing users to type the passphrase twice (once for the gateway, once for the cluster), while SSH keys were supposedly made to avoid typing passwords.

First off, SSH keys were always meant to be protected with a passphrase. Users who generate passphrase-less keys unnecessarily weaken the SSH ecosystem. But the fact that the SSH key is protected by a passphrase does not mean you have to type the passphrase every time you connect, let alone twice. The written documentation and the training session videos explain this thoroughly. It is just a matter of configuring your SSH agent and storing the passphrase in your key ring.

Support and Documentation

Some users suggested improvements to the documentation: adding the names of the working nodes on the clusters, creating a lexicon, etc.

We will try to incorporate those suggestions in the documentation.

One respondent complained that the submission script wizard is sometimes out of sync with the actual cluster configurations.

The people maintaining this website are not always notified of changes in the clusters. Do not hesitate to report any discrepancy you might find.

One respondent would like to have a support team for new users

The local system administrators can be reached for help with onboarding new users. But such activities only benefit the group that made the request, and they take a lot of time. It is far more efficient for new users to read the documentation from beginning to end, and/or to watch the videos of the previous training sessions. That can be done without the need for a support team. Then, if there are questions, do not hesitate to send them by email. This is a much more efficient use of everyone's time.

One respondent suggested we create documentation and an introductory lecture about basic Linux commands

Every year we organise a training session introducing Linux and the command line. As for documentation, there are plenty of Linux tutorials available on the web, and writing our own would bring no added value.

One respondent suggested organising advanced training sessions after the introductory ones that take place in the fall every year, or giving advanced tutorials on multi-GPU TensorFlow

The more advanced a tutorial becomes, the more time and expertise it requires to organise it, and the smaller the potential audience. The goal of the training sessions has always been to offer introductions so that participants can further learn by themselves more advanced aspects of the topics that they are more interested in.

Hardware

Several respondents requested to have more CPUs, less waiting time, longer walltimes, etc.

All this comes down to securing more budget for the CECI, which you can act on by including HPC budget in your project proposals. Given the resources we have, and the number of users that benefit from the clusters, we have no other option than to share the CPUs, and hence to make turnover possible by setting short maximum wall times.

One user would like access to a Tier-0 machine

Even though it is not in the plans of CECI to host such a machine, we have, jointly with VSC, a share of LUMI that we can allocate to our users. Watch the CECI mailing list for details on how to request access.

Two respondents asked for more GPUs

Procurement for Vega 2 (Lyra) will start in 2022; we got confirmation at the beginning of 2022 that the funding was allocated.

Scheduling

One respondent complained that waiting times on Lemaitre3 are too long; another asked for a way to get an estimate of the queueing time

The Lemaitre3 cluster has a priority formula that really favours large jobs. Small jobs are encouraged to be submitted to the debug partition. The squeue --start command gives the current estimated start time of your jobs, but it cannot predict if a user with a more favorable share will submit a job in the meantime.

Containers

Several respondents complained that Singularity is not easy to use on the clusters

Singularity is a great tool, but not an easy one to use. Configuring the clusters so that users could build their images on the clusters directly would make the configuration too complex to manage for many users. Using Singularity with MPI and Slurm is still a challenge nowadays, and even though we try to provide ready-made images for some of the clusters, there is still much work to do. As for the problem of transferring large images from home to the clusters, one option is to build the image on the Singularity cloud, as explained during the training sessions.

Job policies

One respondent suggested making the job durations differ more between Lemaitre3 and NIC5

This is something that we can discuss among system administrators and members of the CECI bureau.

Software

A user requested Crystal to be installed on NIC5

Please contact the cluster manager directly for such requests.

Storage

One respondent asked for the common storage to be available on the replacement for Zenobe

That is expected. The project for the new infrastructure includes hardware and network connection to enable that.

Several respondents asked for the limit on the number of files on NIC5 to be increased

NIC5 has a parallel file system that performs well for heavy writing but not for zillions of tiny files. Actually, no file system does. That is why databases, object stores, archives, etc. exist. Small files really hinder the performance of the file system, and all users suffer from it.
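As an illustration of the archive approach mentioned above, here is a minimal Python sketch that bundles a directory of many small files into a single tar archive and later reads one member back without unpacking the rest. It uses only the standard library; the function names, directory layout and file names are hypothetical, chosen for the example, and are not part of any CECI tool. Whether an archive, a database or another container format suits your workflow best depends on how your code accesses the data.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_directory(src_dir: Path, archive_path: Path) -> int:
    """Bundle every regular file under src_dir into one tar archive.

    Returns the number of files stored. The archive should live
    outside src_dir so that it is not added to itself.
    """
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so the archive is relocatable.
                tar.add(path, arcname=path.relative_to(src_dir).as_posix())
                count += 1
    return count


def read_member(archive_path: Path, member: str) -> bytes:
    """Read one file from the archive without unpacking the others."""
    with tarfile.open(archive_path, "r") as tar:
        extracted = tar.extractfile(member)
        if extracted is None:
            raise ValueError(f"{member} is not a regular file in the archive")
        with extracted:
            return extracted.read()


if __name__ == "__main__":
    # Hypothetical demo: 1000 tiny result files become a single file on disk.
    with tempfile.TemporaryDirectory() as tmp:
        results = Path(tmp) / "results"
        results.mkdir()
        for i in range(1000):
            (results / f"sample_{i:04d}.txt").write_text(f"value {i}\n")

        archive = Path(tmp) / "results.tar"
        n = pack_directory(results, archive)
        print(n, "files packed into", archive.name)
        print(read_member(archive, "sample_0042.txt"))
```

With this pattern the file system has one large file to handle instead of a thousand tiny ones, which is the access pattern a parallel file system is designed for.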

One user dreamed of clusters with infinite storage

There is a lot of space available on the global storage of each cluster, several hundred TB. But that is temporary storage. We cannot offer a long-term storage solution because, contrary to computing, storage cannot be time-shared.

Misc

A respondent complained that there was a timeout for the completion of the survey

This is the default configuration of the LimeSurvey software. We will look into it for next time.

© CÉCI.