Survey 2024: Summary of findings
At the end of year 2024, CÉCI users were invited to respond to an anonymous satisfaction survey.
The text below offers a summary of all comments and suggestions made in the responses. This other document offers a synthetic view of the responses.
Documentation
Some users commented that finding information in the documentation can be difficult.
The CÉCI has revamped its documentation framework. It now includes a more dynamic search engine that hopefully helps users find answers. Also, the documentation tries to cover what relates to common CÉCI usage or to cluster specificities. If the cluster uses external libraries, you may refer to its associated libraries.
Two respondents complained that the information is not targeted at beginners.
The documentation tries to maintain the right balance between the amount of detail and clarity. When the documentation contains too many details, people who are already knowledgeable find it difficult to retrieve the pieces of information that they need in large texts. By contrast, when the documentation only focusses on the necessary information, people with a less technical background complain that they do not understand it.
There is unfortunately no one-size-fits-all solution to that problem.
Three respondents are confused about the various documentation websites and why certain documentation does not mention certain clusters.
The CÉCI consortium gives users in all universities access to all the CÉCI clusters of each university, but there are resources in some universities (e.g. Manneback in UCLouvain), or other entities (e.g. Lucia in Cenaero) that are not available to all CÉCI users but still use the CÉCI login and SSH keys. These infrastructures have their own documentation websites, distinct from the CÉCI documentation.
As a user, it is important to know and understand which entity manages which resource, because it helps find the right documentation and support teams, and it also enables proper acknowledgement in publications.
One respondent suggested that the documentation explain how to install the dos2unix command, which is recommended to solve the problem most users face with their SSH key being tampered with by Outlook/Exchange.
The correct procedure to install dos2unix depends on the operating system and, in the case of Linux, even on the distribution. There is no single universal method. So rather than adding yet more complexity to the documentation, we have decided to offer multiple options based on commands that are often pre-installed, in the hope that at least one of them is already available on the user's laptop. At the time of writing, we offer three alternative solutions based on col, tr and sed.
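As an illustration of the tr and sed alternatives, the sketch below strips the carriage returns that a mail client may have added to a file. It works on a throwaway file created only for the demonstration; the file names are placeholders, and the `\r` escape in the sed expression assumes GNU sed. The exact commands given in the documentation remain the reference.

```shell
#!/bin/bash
# Demonstration on a throwaway file; with a real key, use the path of the
# file received by email instead (file names here are hypothetical).
workdir="$(mktemp -d)"
damaged="$workdir/key_with_crlf"
printf 'line one\r\nline two\r\n' > "$damaged"

# Option 1: delete every carriage return with tr.
tr -d '\r' < "$damaged" > "$workdir/key_fixed_tr"

# Option 2: remove the trailing carriage return of each line with sed
# (GNU sed syntax; other sed implementations may not understand \r).
sed 's/\r$//' "$damaged" > "$workdir/key_fixed_sed"

# Both approaches should produce byte-identical results.
cmp -s "$workdir/key_fixed_tr" "$workdir/key_fixed_sed" && echo "files match"
```

Both commands write to a new file rather than modifying the original, so the damaged file is still available if something goes wrong.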
Several respondents suggested that we offer a guide to determine what resources to ask for in a job submission script.
Which resources to request is an important aspect of a submission script. Unfortunately, it is also the most challenging one. It depends on the software, its parameters, and possibly the data it is applied to. It is therefore impossible to write generic recommendations that would be valid in all cases. The Slurm training sessions offer hints and tips, which we will try to incorporate into the documentation in the near future.
Connecting
One respondent suggested that the "project" field be more clearly marked as optional in the form for account creation.
The tooltip currently reads "Projects allow access to supplementary resources", but we will try to make it explicit that not everyone needs access to supplementary resources.
One respondent reported that the SSH configuration wizard is not easy to find.
The SSH client configuration Generation Wizard can be found in the "quick links" section of the website and from the page in the documentation that explains how to connect to the clusters.
One respondent explained that it is difficult to follow the instructions when something does not go according to plan.
It is unfortunately true that the documentation cannot be precise about every problem that users can face, even though we list the most common ones in the documentation. This is why we still have training sessions every year where the system admins are there to help set up the laptops of the participants.
Multiple respondents reported that Outlook damaged the SSH key.
This is a problem that is now taken into account in the account management web application so it should not happen anymore.
One respondent complained that at every renewal, one month is lost when the account is renewed at the time the notification is sent.
This aspect is indeed weird, but stems from the way the account management application was written. This is different in the current version, where the validity period is not reduced, and might still change in the future to better keep the "anniversary date" of the account.
One respondent complained about the need to be in Belgium to be able to renew their account rather than using, in this case, the ULB VPN.
The requirement to be in Belgium to be able to access the CÉCI account management website is not to be taken literally. If you are using a VPN, you will be able to access it. If it is not the case, please contact the CÉCI support so we can investigate.
A respondent complained that, upon account renewal, which is notified to the user one month prior to expiration, they did not receive the full key pair, only the private part, while at sign up they did receive both.
This is both correct and incorrect: the process is the same for the initial account creation and for the subsequent renewals (for which the user is warned a month in advance); only a file with the (encrypted) private key is sent by email. However, the public key can be derived from the private key, as explained in the documentation here (step 3).
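For illustration, the `-y` option of OpenSSH's `ssh-keygen` prints the public key that corresponds to a private key file. The sketch below applies it to a throwaway, passphrase-less key generated in a temporary directory; the file names are placeholders, and with the real (encrypted) CÉCI key the command would additionally ask for the passphrase. The procedure in the documentation remains the reference.

```shell
#!/bin/bash
# Throwaway key pair for the demonstration only (hypothetical file names).
workdir="$(mktemp -d)"
private_key="$workdir/demo_key"
ssh-keygen -q -t ed25519 -N "" -f "$private_key"

# Derive the public key from the private key alone.
ssh-keygen -y -f "$private_key" > "$workdir/derived.pub"

# Compare key type and key material (the first two fields) with the
# public key file that ssh-keygen created alongside the private key.
derived="$(cut -d' ' -f1,2 "$workdir/derived.pub")"
original="$(cut -d' ' -f1,2 "$private_key.pub")"
[ "$derived" = "$original" ] && echo "derived public key matches"
```

Only the first two fields are compared because, depending on the OpenSSH version, the trailing comment may or may not be printed.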
One respondent complained that on some clusters, if you stop using a session for a few minutes and then return to it, it is locked.
If you find your terminal frozen when you come back to it after a while, it is often because of a loss of network connectivity, which can be caused, for instance, by the WiFi connection of the laptop entering sleep mode to save battery.
Jobs
A respondent deplored that they were unable to install custom software.
Installing custom software on the clusters is typically done by recompiling the software. Recompiling is important because it is often the only way to make optimal use of the hardware and reach the highest performance, and it is often also the only option for regular users without root or sudo privileges. Installing software by yourself is explained over the course of multiple training sessions and is addressed in the documentation.
Multiple respondents requested that longer runs be allowed on the clusters, while multiple others hoped that their jobs would spend less time in the queue.
There is unfortunately a direct relationship between the waiting time and the maximum allowed time; increasing the latter automatically increases the former. Fortunately, there are tools and techniques to 'split' a long job into a series of smaller jobs, leading to a better sharing of the resources in the long run. They are summarized in this video.
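As one possible illustration of splitting a long run (not necessarily the approach presented in the video), the sketch below uses a Slurm job array whose tasks run one at a time, each resuming from a checkpoint left by the previous one. It assumes that the application is able to save and restore its own state; the checkpoint file name, the time limit, the number of segments and the `my_simulation` program are all hypothetical.

```shell
#!/bin/bash
#SBATCH --job-name=long-run
#SBATCH --time=02:00:00      # per-segment time limit (assumed value)
#SBATCH --array=1-5%1        # 5 segments, at most one running at a time

# Outside Slurm the variable is unset, so default to 1 to allow a dry run.
segment="${SLURM_ARRAY_TASK_ID:-1}"
checkpoint="checkpoint.dat"  # hypothetical file written by the application

if [ -f "$checkpoint" ]; then
    echo "segment $segment: resuming from $checkpoint"
else
    echo "segment $segment: starting from scratch"
fi

# Placeholder for the real application, which must save its state to
# $checkpoint before the time limit is reached, for instance:
#   srun ./my_simulation --restart-from "$checkpoint"
```

Because every segment simply resumes from the latest checkpoint, the result does not depend on the order in which the array tasks start.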
Two respondents requested the ability to submit many more jobs at once, while another respondent complained that some users submit too many jobs at once.
To the former, we can respond that there are solutions to manage large numbers of jobs without overloading the Slurm scheduler, as described in the Workflow management software section of the documentation.
We encourage the latter to trust the system and still submit their jobs. The fairshare of users with many jobs decreases rapidly, so other users' jobs should be able to start within a reasonable time.
One respondent requested that more GPUs be available on the clusters.
The survey was run in December 2024; since then, the Lyra cluster has entered production, offering more than 40 GPUs. Please also note that the Tier-1 Lucia cluster is equipped with GPUs as well. Users who have a use case for large GPU usage should therefore consider submitting a Tier-1 project to get access to its 200 Nvidia A100 GPUs.
One respondent requested that a few, powerful, nodes be dedicated to visualisation.
Visualisation nodes often require large memory, powerful GPUs, and a fast network. Often, it is not reasonable to include them in the budget for a Tier-2 cluster.
One respondent requested more cores per CPU.
The compute nodes of Lemaitre4 already offer 128 real cores per node, which is already a large number. Unfortunately, even though CPUs with even more cores are available on the market, they are neither cost-effective nor suitable for the kind of hosting that the universities can currently provide. In the future, though, once the datacenters have migrated to Direct Liquid Cooling, such very-high-density CPUs could be installed in CÉCI clusters.
Storage
The storage quota applied to the home directory is not sufficient for some users.
Quotas ensure a reasonable usage of the cluster storage capacity. However, each researcher has their own needs, and upon request, we can adjust the quota for an individual. It is also important that users use the appropriate space to store data; the home directory is not designed to store data.
The quota on the number of files applied to the home directory is not sufficient for some users.
Quotas on the number of files help preserve the responsiveness of the cluster. However, each researcher has their own needs, and upon request, we can adjust the quota for an individual.
For large numbers of files related to Python environments, good practices exist to limit the number of files. This implies loading the appropriate modules before installing missing libraries (see here and here). Also, we strongly discourage the use of Anaconda/Conda.
One respondent complained that the common storage does not include the Tier-1 cluster LUCIA.
Since XX/XX/XX, the Tier-1 cluster LUCIA has been connected to the common storage, enabling easy access from all Tier-2 clusters to LUCIA.
One respondent reported that transferring files to localscratch is slow.
The local scratch filesystems can be based on different technologies from cluster to cluster so timing can really differ from one cluster to another. Make sure to choose the cluster that best suits your needs.
Some users reported that transferring data using the older common file system (e.g. $CECITRSF) was slow and asynchronous.
There is no longer any transfer space (/CECI/trsf) on the new CÉCI Common Storage. The new storage is not asynchronous like the old one. All files and their content are immediately visible on all clusters. More info on this page.
One respondent complained that the quotas are tight in the home directories and that keeping files in the scratch space is not ideal.
Indeed, the scratch spaces are not meant to store files in the long term, and the home directories are not designed to store large files. For those use cases, the CÉCI common storage might be used, depending on the specific use case. See more information in the documentation.
Suggestions
Two respondents requested that all nodes have outbound access to the internet
Some clusters allow outbound access to the internet from the compute nodes, others do not. This policy is decided locally in each university.
Two respondents were unsure about how to ask for help
This topic is important, as the way help requests are written has a huge impact on the time and energy needed to answer them. The topic is covered in the documentation, in a short video on YouTube, and every year in a specific training session.
One respondent requested that some commercial debugging software be installed on the CÉCI clusters
Unfortunately, the CÉCI has no budget for software at all. The clusters are operated under the assumption of "bring your own licence". Should any user be willing to pay for such software, we would be able to install it.
Multiple respondents requested that some software be installed on the clusters
We take this opportunity to remind users that they are most of the time able to install software by themselves, but of course the sysadmins are always there to help if needed. Just make sure to contact the correct team for the cluster you are using.
One respondent suggested that we offer training on bioinformatics analyses
This is unfortunately out of the scope of the CÉCI; we focus on topics that enable working on the clusters, but the actual science done on the clusters is outside our remit.
Multiple respondents requested that some web-based software be installed on the clusters and be easy to connect to.
The documentation explains how to start such software on the clusters and connect to it using SOCKS proxies or sshuttle. However, the clusters are not designed to offer web-based interfaces, as those are really meant for interactive usage, while the clusters are meant for batch use. Interactive use of the clusters will always lead to resources being used sub-optimally, which is really detrimental for such equipment. Interactive computing is better suited to other types of infrastructure.
One respondent suggested that professors be more involved in the dissemination of information about the CÉCI and introduce the use of clusters in their classes
From our contacts with the professors, it appears that many of them who have master students that need the cluster do so. For the others, we have a set of slides ready for them to use if they wish to.
One respondent suggested that CÉCI publishes a newsletter every quarter or so to inform the users about the new (and old) possibilities.
A newsletter is a good way to distribute information to the users, announce new changes, etc. One drawback though is that many users are already complaining that too many emails are sent to the mailing list, even though we only send emails to announce the training sessions, new clusters, and the User days. So we fear the newsletter might just be ignored, and that even more users would request to unregister from the mailing list.
CÉCI Training
Multiple respondents stressed the fact that practical sessions are more useful than purely theoretical ones.
People often understand and remember better when doing rather than being shown. That is why we are currently working to develop exercises for most of the sessions in the form of text adventure games, based on the GameShell engine.
One respondent noted that it is not easy to know which training sessions would be most relevant for their needs.
It is indeed not easy from the title only to know whether or not a training session will be useful given one's background and prior knowledge. That is why we try to include slides from previous years in the description of the sessions on the Indico website where participants register. This way one can have a glimpse of the contents at the time of registration.
One respondent reported that the videos in the playlist on YouTube are not in the natural order.
This is something we will try and fix.
One respondent suggested that there be a session about what CÉCI can and cannot do, and things important to know
That is indeed important, and we try to convey that information during the very first two training sessions, and also during the CÉCI User days.