Survey 2016: Summary of findings

From June 27th to July 12th, 2016, CÉCI users were invited to respond to an anonymous satisfaction survey.

The main questions were:

How did you learn about us?
Was it easy to create an account, to connect?
In how many publications have you already acknowledged the use of CÉCI clusters?
What was the main problem you faced when you renewed your account?
How interesting to you was the reply to the previous survey posted on the CÉCI
What do you need? (hardware/software/policy)
What is your typical job?
What would your dream job be?

The form ended with a free text field where users could leave suggestions or remarks. We received 17 comments, questions, or suggestions, and those who left an email address were contacted.

More than 80 users responded to the survey, out of approximately 470 users active in the past few months. They came from all CÉCI universities, with very diverse research interests in science and engineering. We thank those users for the time they took to fill in the survey.

The present document offers a summary of all comments and suggestions made in the responses. This other document offers a synthetic view of the responses.

Acknowledgement in publications
Documentation
Home directories
Support
Resources
Job scheduling

Acknowledgement in publications

This question was new in this year's survey. All respondents have cited the CÉCI in at least one publication, for a total of 143 acknowledgements.

Acknowledgements are the most direct way to show the usefulness of the clusters. These testimonials are very useful for obtaining the funding that ensures the continuity of the project and gives researchers access to computing power.

Documentation

Some respondents complained that it is difficult to find information about modules and to interpret the meaning of the toolchains.

The basic notions of modules and toolchains are introduced in the CÉCI FAQ, section 2.7, "Which software is available?". More details can be found in the tutorial "Installing software on the CÉCI clusters".

The CÉCI cluster administrators install programs with the EasyBuild framework whenever possible. EasyBuild uses toolchains to tag the way a program was compiled. To get the list of toolchains with a short description of each, use the EasyBuild command:

 eb --list-toolchains

Note that on certain clusters, the EasyBuild module must be loaded for the eb command to be available:

 module load EasyBuild

Users can also use EasyBuild to install programs locally.
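To illustrate how a toolchain tag typically shows up in a module name, here is a minimal sketch. It assumes module names of the form software/version-toolchain, which is a common EasyBuild convention but may not match every module on every cluster. The function name describe_module and the module name HDF5/1.8.14-foss-2015a are illustrative only and are not taken from the CÉCI documentation.

```shell
# Split a module name, assumed to look like <software>/<version>-<toolchain>,
# into its parts. A version suffix, if present, stays attached to the
# toolchain part, and software versions containing '-' are not handled.
describe_module() {
    full="$1"
    software="${full%%/*}"       # text before the first '/'
    rest="${full#*/}"            # text after the first '/'
    version="${rest%%-*}"        # text before the first '-'
    if [ "$rest" = "$version" ]; then
        toolchain="(none)"       # no '-' found: no toolchain tag in the name
    else
        toolchain="${rest#*-}"   # everything after the first '-'
    fi
    echo "software=$software version=$version toolchain=$toolchain"
}

# Hypothetical module name, for illustration only.
describe_module "HDF5/1.8.14-foss-2015a"
# prints: software=HDF5 version=1.8.14 toolchain=foss-2015a
```

The toolchain part can then be looked up in the output of eb --list-toolchains to see which compiler and libraries it stands for.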

Some changes will be made to the CÉCI web site to highlight the tutorials already referenced in the FAQ.

One respondent asked us to document which compute nodes have direct Internet access.

This has been added to the resources section of each cluster on the CÉCI clusters page.

Several respondents asked for training, support and tutorials for new users.

Each year, around October, the CÉCI organizes a training session at the UCL for all users, and especially for beginners. The FAQ, and the tutorials referred to therein, are designed to help users during their first steps as cluster users.

Home directories

One respondent asked for a common HOME folder feature.

The tender for common storage is in its last stage. A bidder has been selected and a purchase order will be sent. The system is expected to be installed at the end of this year.

Support

One respondent complained that some cluster support teams are understaffed.

Users must keep in mind that some system administrators have many other duties besides taking care of the CÉCI clusters and are not backed up by a team.

To overcome this problem, we are developing a centralized system to handle user requests so that users always get a response as fast as possible.

Resources

Several respondents complained about the differences in library configuration between clusters.

Making the software modules uniform is indeed something we are working towards. We will make progress in that direction thanks to the future common storage that will be installed at CÉCI. We will use it to store all modules and have an identical configuration on all the clusters.

One respondent asked for an archiving facility.

The CÉCI does not have the equipment to provide an archiving facility. Users should contact their local university IT staff to find out what kind of archival storage they can access.

Job scheduling

One respondent suggested using a fairshare common to all groups: users who run on a single cluster should have a higher priority on that cluster than anyone who runs on more than one.

At present, the fairshare is indeed calculated cluster by cluster. One of the objectives behind the common filesystem that will be installed is to be able to work with a single Slurm installation governing all the clusters. With that single Slurm instance, the fairshare will be computed across all clusters.

One respondent asked for example job scripts for each of the clusters in a central place.

There is a submission script wizard that helps users generate a script for each cluster. It takes the capabilities of each cluster into account and helps you choose a cluster based on your needs.

One respondent wanted to know how to get the priority of their jobs in a queue and how the priority system works.

There is a dedicated FAQ about Slurm priorities on the CÉCI web site. Here are some excerpts:

The priority configuration on all clusters is based on the multifactor plugin, which depends on five elements:

Job age: how long the job has been waiting in the queue;
User fairshare: a measure of past usage of the cluster by the user;
Job size: the number of CPUs a job requests;
Partition: the partition to which a job is submitted, specified with the --partition submission parameter;
QOS: a quality of service associated with the job, specified with the --qos submission parameter.

All these are combined in a weighted sum to form the priority. The weights can be found by running

sprio -w
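As a minimal sketch of how such a combination works, the function below multiplies each factor by its weight and adds up the results. It assumes that each factor is a number between 0 and 1, as in Slurm's multifactor scheme. The function name weighted_priority and all the numbers are made up for illustration; the real weights are site-specific and are the ones reported by sprio -w.

```shell
# Weighted sum of (weight, factor) pairs, given as alternating arguments.
# All values below are hypothetical; real weights come from `sprio -w`.
weighted_priority() {
    echo "$@" | awk '{
        p = 0
        # Walk through the arguments two by two: weight, then factor.
        for (i = 1; i < NF; i += 2) p += $i * $(i + 1)
        printf "%d\n", p
    }'
}

#                 age       fairshare   job size  partition  QOS
weighted_priority 1000 0.5  10000 0.25  100 0.5   1000 1.0   0 0.0
# prints 4050
```

With these made-up weights, the fairshare term (10000 x 0.25 = 2500) dominates the result, which shows why past usage can matter more than waiting time when the fairshare weight is large.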

The priority given to a job can be obtained either with squeue:

squeue -o %Q -j jobid

or with the sprio command which gives the details of the computation.

One respondent asked for longer time limits for the jobs (for NIC4).

This is a frequently asked question. NIC4 is configured to favour jobs that scale well, i.e. jobs for which you can trade wall time for number of CPUs, because a lot of money has been spent on a very fast interconnect (InfiniBand). Users must also keep in mind that long jobs are incompatible with short waiting times. Users are encouraged to try checkpointing software such as DMTCP (http://dmtcp.sourceforge.net). A training session for CÉCI users is dedicated to it; the slides are available at http://www.cism.ucl.ac.be/Services/Formations/. Keep an eye on the CÉCI training page, http://www.ceci-hpc.be/training.html, for the 2016 session. An alternative is to use a cluster where the maximum allowed running time is longer; see the cluster list.
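As a rough illustration of trading wall time for CPUs, the sketch below computes the wall time of a job under the assumption of perfect scaling, that is, a fixed amount of work divided evenly among the CPUs. Real programs rarely scale perfectly, so these figures are a best case. The function name ideal_walltime_hours and the figure of 960 CPU-hours are hypothetical.

```shell
# Best-case wall time in hours, assuming perfect scaling.
# Arguments: total work in CPU-hours, number of CPUs.
ideal_walltime_hours() {
    awk -v work="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", work / cpus }'
}

# Hypothetical job needing 960 CPU-hours in total.
for cpus in 16 64 256; do
    echo "$cpus CPUs: $(ideal_walltime_hours 960 "$cpus") h"
done
# prints:
# 16 CPUs: 60.00 h
# 64 CPUs: 15.00 h
# 256 CPUs: 3.75 h
```

A job that scales this way can fit within a shorter time limit simply by requesting more CPUs, which is the usage pattern the NIC4 configuration favours.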