Survey 2019: Summary of findings

From September 3rd to 30th, 2019, CÉCI users were invited to respond to an anonymous satisfaction survey.

The main questions were:

  • How did you learn about us?
  • Was it easy to create an account, to connect?
  • In how many publications have you already acknowledged the use of the CÉCI clusters?
  • What was the main problem you faced when you renewed your account?
  • How interesting to you was the reply to the previous survey posted on the CÉCI website?
  • What do you need? (hardware/software/policy)
  • What is your typical job?
  • What would your dream job be?

The form ended with a free-text field where users could leave suggestions or remarks. We received more than 30 comments, questions, or suggestions.

More than 80 users responded to the survey, out of approximately 422 users active in the past few months. They came from all CÉCI universities, with very diverse research interests in science and engineering. We thank those users for the time they took to fill in the survey.

The text below offers a summary of all comments and suggestions made in the responses. This other document offers a synthetic view of the responses.

Documentation

A respondent requested that a --output option be added to the Slurm script generation Wizard

Done.
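
For reference, the corresponding directive in a submission script looks like the following; the file name is only an example, and %j is replaced by the job ID:

    #SBATCH --output=myjob-%j.out    # file receiving the job's standard output (and, by default, its standard error)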

A respondent stated that the common storage should be better explained in the documentation, in particular how to launch jobs from there

The documentation has a specific section about the central storage. We will augment it with submission script examples, and add that as a feature to the job submission script wizard.
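
As a minimal sketch, assuming $CECIHOME points to your directory on the central storage and $GLOBALSCRATCH to the cluster's scratch space (the program and file names are purely illustrative), such a script could look like:

    #!/bin/bash
    #SBATCH --job-name=from-central-storage
    #SBATCH --time=01:00:00
    #SBATCH --mem-per-cpu=2048

    # Copy the input files from the central storage to the cluster's scratch space
    cp $CECIHOME/myproject/input.dat $GLOBALSCRATCH/
    cd $GLOBALSCRATCH

    # Run the computation and perform all I/O on the scratch space
    ./my_program input.dat

    # Copy the useful results back to the central storage
    cp output.dat $CECIHOME/myproject/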

A participant suggested offering users guidance on which cluster to use for a given software package

The user documentation already contains a list of the software installed on each cluster. The CÉCI clusters page has a Preferred jobs column indicating which kind of job each cluster is most appropriate for; it also gives a detailed description of the hardware, which should help you infer which cluster is best suited for your code. For additional questions, do not hesitate to contact us through the support page.

A user mentioned the difficulty of finding which information is relevant to their use case

There are so many different use cases that it is impossible to describe them all in the documentation. The best option is then to contact the local support team. We also recall that the documentation has a search field at the top of the left menu; it can be used to look for keywords on the topic you need information about.

A respondent asked for a better explanation of how to request more memory and time for jobs. Another suggested mentioning the sreport command in the documentation

This is explained in the Slurm documentation and we feel it would be redundant to copy that documentation on our pages.
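
For convenience, the corresponding Slurm directives look like the following in a submission script (the values are only examples), and sreport can summarise your past usage:

    #SBATCH --time=2-00:00:00      # maximum run time, here 2 days (days-hours:minutes:seconds)
    #SBATCH --mem-per-cpu=4096     # memory per allocated CPU, in megabytes

    # Report your CPU time usage on the current cluster since a given date
    sreport cluster AccountUtilizationByUser Users=$USER Start=2019-01-01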

A respondent stated that the FAQ and the Documentation would benefit from being in the same place on the website

Actually, they were written for two distinct audiences: the FAQ is meant for non-users (possible future users, but also people with a management role rather than a scientific one) and contains mostly administrative and high-level information, while the documentation is meant for users and contains detailed technical information.

We noticed however that an old FAQ about Slurm was still available on the website; while it is not linked from the main site, it is still referenced by Google and people still find it. It has now been replaced with a redirection to the main documentation site.

A respondent stated that there was some overlap between the global CÉCI documentation and the local documentation specific to sites and that links to local documentation should be included in the main CÉCI documentation

A big effort has been put into the main CÉCI documentation, which is a common effort from all CÉCI sysadmins; the local documentation only contains information specific to the local infrastructure. Links to local documentation from the CÉCI documentation have been reported many times to be confusing for users who do not have access to the local infrastructures.

A respondent requested that the scheduling policies on each cluster be more explicit

Scheduling is based on priorities. This page in the documentation already explains how to query the exact configuration of the clusters and links to the proper documentation on the Slurm website.
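
As an illustration (standard Slurm commands; the output depends on each cluster's configuration), you can inspect the scheduling parameters yourself:

    # Show the priority-related parameters of the cluster you are logged in to
    scontrol show config | grep -i "^priority"

    # Show how the priority of your pending jobs is currently computed
    sprio -u $USER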

A respondent requested a small cheat sheet with the most common Slurm commands

A cheat sheet can be found on the Slurm website: https://slurm.schedmd.com/pdfs/summary.pdf.
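
In the meantime, the commands used most often on the clusters are the following (<jobid> is a placeholder for an actual job ID):

    sbatch job.sh       # submit a job script
    squeue -u $USER     # list your pending and running jobs
    scancel <jobid>     # cancel a job
    sinfo               # show the state of the partitions and nodes
    sacct -j <jobid>    # show accounting information about a (finished) job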

Account creation and connection to the clusters

A respondent wrote that, as an alumnus, it was not easy to connect to the ULiège VPN

Unfortunately this is beyond the reach of the CÉCI system administrators. Help with resources other than the HPC clusters must be provided by the local IT helpdesk at your university.

Two respondents mentioned that they had issues renewing their account due to administrative matters related to their guest status or to billing

Such issues must be addressed with the local administrators.

A respondent stated that they did not understand at all how to connect, despite extensive prior experience with other clusters around the world

We can only encourage users to attend the training sessions and/or to contact us for help.

A respondent drew our attention to the fact that renewing a CÉCI account from abroad is not easy because of firewall restrictions

It is true that renewing from abroad takes additional steps, which differ from university to university. The solutions are often to use a VPN or an SSH gateway. We have no plans to reduce the security provided by the firewall to work around such issues. The best option is to renew your account as soon as you receive the expiration notice (one month in advance). And as always, do not hesitate to contact your local system administrators if you are stuck.
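
As an illustration of the SSH gateway approach (the gateway host name below is purely hypothetical; ask your local system administrators for the actual one):

    # Open a SOCKS proxy on local port 1080 through a university SSH gateway
    ssh -N -D 1080 mylogin@gateway.example-university.be

    # Then configure your browser to use localhost:1080 as a SOCKS5 proxy
    # and open the account management page as usual.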

Job scheduling and management

Some respondents stated that they would like to run jobs for very long periods

Very long maximum allowed run times are a blocker for cluster sharing: they increase the expected waiting time in direct proportion. Users are encouraged to investigate checkpointing options.
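
One possible pattern, sketched below under the assumption that your program can write restart files read by a (hypothetical) chunk.sh script, is to split a long run into chained shorter jobs using job dependencies:

    # Submit the first chunk of the computation and capture its job ID
    jobid=$(sbatch --parsable chunk.sh)

    # Submit the next chunk, to start only once the first one has completed successfully
    sbatch --dependency=afterok:$jobid chunk.sh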

One respondent would like to have better estimates of the starting times of jobs

Unfortunately the estimate can only be computed based on known information. Estimates given by the squeue --start command are based on the requested resources; for them to be precise, we need all users to be precise in their --time requirements. Jobs that crash can also make estimates far from reality, but these only lead to an overestimation of the waiting time.
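
For reference, that standard Slurm command looks like this:

    # Show Slurm's current estimate of the start time of your pending jobs
    squeue --start -u $USER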

Slurm has no way of knowing when users intend to submit jobs. If your job is due to start soon, but another job is submitted right now by a user with a better fairshare (because that user has not been using the cluster for a long time), your job will be delayed, and there is no way for Slurm to anticipate this until it receives the other submission.

Some respondents complained about the queueing time on Lemaitre3 when testing large jobs

The priority setting on Lemaitre3 is specifically set up to favour massively parallel short jobs: priority is directly proportional to the number of CPUs requested and inversely proportional to the requested duration.

One respondent would like to have guidance about how to submit their jobs

Users are always welcome to contact their local system administrators and ask for help. Just make sure beforehand to read the whole documentation and prepare precise questions.

One respondent noted that the features associated with the CPUs, like "Intel,Skylake5000,5118", are difficult to interpret, and would prefer an option like a minimum clock rate

Slurm does not offer a minimum clock rate option for jobs. As for the feature tags, we will try to think about less technical information, such as, for instance, the release date.
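
Meanwhile, the feature tags can be used to target specific CPU types with the --constraint option (the tag below is only an example; see the clusters page for the exact tags available on each cluster):

    #SBATCH --constraint=Skylake    # only run on nodes advertising the 'Skylake' feature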

Software

A respondent requested Gaussian16 to be installed instead of Gaussian09

We can only restate the fact that CÉCI only has funding for hardware. It is the responsibility of the user to provide licences for any software that requires them. If some users bring along a Gaussian16 licence, we will then be able to install it on some clusters, depending on the terms of the licence.

A respondent suggested having cluster-wide Anaconda installations and asked what the general guidelines are

The general guideline for Anaconda is: please do not use it on the clusters. It is meant for easy installations on a single-user laptop and is not well suited for multi-user clusters. But, more importantly, Anaconda distributes pre-compiled binaries in many cases, and otherwise uses pre-configured compilation options which are often not optimal for the CPUs we have on the clusters. Some experiments show a factor of 2 in performance between the FFTW library installed with Conda and the same library compiled with the right options.

Using software installed by Conda rather than properly compiled software can lead to wasted CPU time and longer waiting queues.
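
As a sketch of the recommended alternative for Python users, assuming a Python module is available through the module system (the module name below is only an example and varies from cluster to cluster):

    # Load a Python version built for the cluster (check 'module av Python' for the exact name)
    module load Python/3.7.4-GCCcore-8.3.0

    # Create an isolated environment in your home directory and install packages with pip
    python -m venv ~/venvs/myproject
    source ~/venvs/myproject/bin/activate
    pip install numpy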

A respondent requested that CÉCI offer more tools to improve performance of hybrid jobs

The CÉCI does not have any budget for software so this is a matter that must be discussed with the local system administrators.

A respondent asked about running VirtualBox VMs on the clusters

The clusters are not meant to run virtual machines, for both performance and security reasons. The closest thing to a virtual machine that can run on a cluster is a Singularity container. Some sites might offer private clouds in the future, where virtual machines could be started, but that is not the case at the time of writing.
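
For illustration, running a program inside a container looks like the following on clusters where Singularity is installed (the image and program names are purely examples):

    # Fetch a container image, e.g. from Docker Hub, and run a command inside it
    singularity pull docker://ubuntu:18.04
    singularity exec ubuntu_18.04.sif ./my_program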

Several respondents requested that Singularity be installed on every cluster

This is the goal we are working towards. Currently, if Singularity is not installed on a cluster, it means it is not compatible with that cluster: Singularity requires version 7 of CentOS (the operating system) and cannot run on clusters installed with CentOS 6. As upgrading a full cluster is too cumbersome to be done in practice, we are installing Singularity only as newer clusters are deployed.

A respondent asked for a 'transparent' way to run R and Python scripts on the clusters

For the future solution replacing the Vega cluster at ULB, we are exploring a platform where R and Python code could be submitted and run on a cluster through a simple web interface such as Jupyter notebooks, without requiring any SSH access or terminal interaction. This will help lower the barrier between your local R or Python workflow on your workstation and the more powerful hardware available on a cluster.

Available hardware

A respondent requested that CÉCI buy more GPUs

Very high-end GPUs (worth 7 to 8k€ each) have been installed in Dragon2, and some more are planned in the upgrade of Vega.

A respondent complained that Lemaitre3 is often unstable

The Lemaitre3 cluster is indeed sometimes unstable, and the causes of the instability are very difficult to pinpoint. The team is constantly working to improve the situation. The latest attempt in that direction is to repurpose one of the compute nodes as a login node, to avoid interference between user sessions and the cluster management software.

Data management

Some respondents requested that the CÉCI offer long-term storage

Long-term storage requires constant and recurring funding, which CÉCI does not have. Long-term storage is thus the responsibility of each member university.

A respondent pinpointed the importance of local disks for some software

We try to have a local disk on each compute node (check the CÉCI clusters page for the presence of a $LOCALSCRATCH area in the cluster description), but its size might be small for budgetary reasons.

A respondent stated that the CÉCI home should be increased in size and that it is sometimes slow.

The common CÉCI storage capacity is limited by the amount of funding we can get, and there is currently no plan to fund this system further. As for its speed, it cannot be as performant as a local home directory, given that it is replicated across Wallonia on all clusters. Like any other home directory, its usage must be limited to storing scripts and input files; those should be copied to the GLOBALSCRATCH/LOCALSCRATCH of the cluster where the job is submitted, all the job's I/O should be performed there, and the useful output files can be copied back once the job is done. As for Manneback being particularly slow, that can indeed be the case, as it is connected through a less performant network interface than those of Lemaitre3, Hmem or Nic4.

One respondent requested that the message of the day (greeting upon login) should display more useful and accurate quota information.

This is something we are working on and some issues have been fixed already. Please feel free to contact us through the support page to report any specific inconsistency you might notice on a cluster.

A respondent asked how to sftp to CECIHOME and how scratch space works

Simply using the URI sftp://<any cluster>///CECI/home/university/department/username should work (note the triple /). As for scratch spaces, local scratch spaces are often cleaned after the job finishes, so data created there should be moved to the global scratch and then copied from there to its final destination (user laptop, mass storage, etc.).
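
For example, from a terminal (the cluster name, user name, path segments and file name are placeholders, as above):

    sftp myusername@<any cluster>
    sftp> cd /CECI/home/university/department/username
    sftp> get results.tar.gz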

A respondent mentioned that the policies for the GLOBALSCRATCH/LOCALSCRATCH spaces are not clear

The policies are outlined in point 3.3 of the FAQ and in the Disk space section of the documentation, but we acknowledge they are not fully clear and we will work on clarifying the statements.

As a general rule, data stored in the $GLOBALSCRATCH areas is persistent across jobs, but it can be deleted during maintenance periods or when the free space becomes low. In the latter case, we will ask you to clean up your area if you are using too much.

CECI Day

One respondent wrote that the scientific presentations at the CÉCI day were nice but too narrowly focused to be understandable by the whole audience.

Every year we try to encourage the speakers not to simply recycle one of their conference presentations but to produce material aimed at the whole CÉCI audience. We are also incorporating more and more technical presentations that are of general interest.

Trainings

A user requested that some training sessions be organised more than once a year

We currently do not have enough resources for that, but we are working towards integrating the CÉCI and VSC training sessions into a single offer, so that multiple sessions on the same subject might be offered starting in 2020.

A respondent asked when one should add a 'wait' keyword at the end of their submission script

The answer is: whenever one of the commands in the script ends with an ampersand (&), i.e. is started in the background.
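
A minimal sketch of that situation (the srun options and program name are only illustrative):

    #!/bin/bash
    #SBATCH --ntasks=4

    # Start four independent tasks in the background on the resources allocated to the job
    for i in 1 2 3 4; do
        srun --exclusive -n1 ./my_program input$i.dat &
    done

    # Without 'wait', the script (and thus the job) would end here,
    # killing the background tasks before they finish.
    wait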

A respondent suggested making the training sessions mandatory for future users

We do our best to encourage users to attend the training sessions: time is much better spent explaining a concept to 20 people at once than repeating the same explanation 20 times in 1-to-1 coaching sessions. But we cannot make them mandatory, as many new CÉCI users are already experienced HPC users.
