Survey 2018: Summary of findings

From June 11th to June 28th, 2018, CÉCI users were invited to respond to an anonymous satisfaction survey.

The main questions were:

  • How did you learn about us?
  • Was it easy to create an account, to connect?
  • In how many publications have you already acknowledged the use of CÉCI clusters?
  • What was the main problem you faced when you renewed your account?
  • How interesting to you was the reply to the previous survey posted on the CÉCI website?
  • What do you need? (hardware/software/policy)
  • What is your typical job?
  • What would your dream job be?

The form ended with a free text field where users could leave suggestions or remarks. We received 30 comments, questions, or suggestions.

More than 80 users responded to the survey, out of approximately 460 users active in the past few months. They originated from all CÉCI universities, with very diverse research interests in science and engineering. We thank those users for the time they took to fill in the survey.

The text below offers a summary of all comments and suggestions made in the responses. A separate document offers a condensed overview of the responses.

Account creation and connection to the clusters

Several respondents complained that the instructions on how to connect to the clusters were complex and did not include screenshots, and that the prerequisites were not made explicit

We can only encourage users to read the tutorial again: the Windows part contains 10 screenshots -- one per step -- and the Linux/Mac part contains examples that you can copy and paste into your terminal (mutatis mutandis, of course).

We took note of some specific suggestions provided to improve these guides, but we cannot make the login process simpler, as we believe that using SSH keys is the best compromise between security and ease of use. Note that many large HPC centers are moving to two-factor authentication (2FA), a process that is more complex than simply using SSH keys. As for the prerequisites, they are listed in the quickstart section of the documentation. Also, please take advantage of the training sessions that are organised every year in the fall.

One respondent mentioned the difficulty of remembering to change the permissions on the SSH private key file before using it

This is a frequent mistake, one that even seasoned sysadmins sometimes make, but the error message from the SSH client is clear: Permissions 0644 for '/home/.../.ssh/id_rsa.ceci' are too open. The way to solve this is the first command mentioned in the guide to connect from a Unix-like environment.
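
For reference, the standard fix on a Unix-like system is to restrict the key file to its owner only; a minimal example, assuming the key was saved as ~/.ssh/id_rsa.ceci as in the error message above:

chmod 600 ~/.ssh/id_rsa.ceci

After that, the SSH client no longer refuses to use the key.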

Two respondents noticed it was not clear when the new key is active on the clusters during the account renewal process

Indeed, once you receive the private key, the corresponding public key is pushed to the central key repository. That central repository (an LDAP server) is replicated on all clusters every 10 minutes. Once the modification has propagated to the local cluster repositories, a service activates the new SSH key. That service also runs every 10 minutes. This does not happen at the same time on all clusters so there is a small time window during which one cluster might hold the former version of the key while another might already be updated.

The best course of action is to wait until the cluster that you use the most rejects your former key before setting up the new key on your laptop. Once you have done that, you should regain access to your favorite cluster. The others could still reject the new key for up to 30 minutes but, most of the time, we see less than a 10-minute delay between clusters for the key update.
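
If you want to check whether a given cluster has already picked up the new key, you can test it explicitly; a quick sketch, assuming the new private key was saved as ~/.ssh/id_rsa.ceci.new (the file name, user name and host are placeholders):

ssh -i ~/.ssh/id_rsa.ceci.new -o IdentitiesOnly=yes myceciuser@cluster-frontend hostname

If the command prints the frontend's host name, the new key is active there; a "Permission denied (publickey)" error means it has not propagated yet.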

One respondent wrote that installing the VPN from outside the university was not easy and another complained that it was not easy to connect from outside the universities

Unfortunately, VPNs are outside the scope of CÉCI and are managed locally by each university. In the case of a planned stay abroad, it is of course important to test that everything works, either with the VPN or with the SSH gateways, before leaving. Note that if you follow the tutorials on connecting to the clusters to the end, your SSH client (MobaXterm or the command-line client) should be properly configured for using a gateway.

It is important to note that you do not need to be outside the university network to test the SSH gateways! Going through a gateway works just as well from within the university network as from outside it. So be sure to test everything before you leave.
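
For the command-line client, a gateway setup in ~/.ssh/config typically looks like the sketch below; the host names and user name are placeholders (the actual values are given in the connection tutorial), and ProxyJump requires a recent OpenSSH version:

# the SSH gateway of your university
Host cecigw
    HostName gateway.example.be
    User myceciuser
    IdentityFile ~/.ssh/id_rsa.ceci

# a cluster frontend reached through the gateway
Host mycluster
    HostName frontend.example.be
    User myceciuser
    IdentityFile ~/.ssh/id_rsa.ceci
    ProxyJump cecigw

With such a configuration, ssh mycluster works the same way from inside and from outside the university network.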

Job scheduling and management

One respondent asked for local scratch on Lemaitre3 and Zenobe

As detailed in the Disk space section of the documentation, every CÉCI cluster has a local scratch space. As for Zenobe, it is too late to introduce that now, but we will escalate the request to the Cenaero team for the next version of the Tier-1 cluster.

The local scratch space is accessed with the $LOCALSCRATCH environment variable. On most configurations, the local scratch space is job-dependent and cleaned after the job has completed. The $LOCALSCRATCH variable is therefore often only defined in the scope of the job: you can use it in your submission script, but not in your interactive session on the frontends. One notable exception is Hmem, because on that cluster the local scratch spaces are aggregated into a global scratch space while keeping local access to the scratch; basically, writes are local and reads are global.
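
As an illustration, a submission script using the local scratch typically copies the input data in, runs from there, and copies the results back before the job ends; a minimal sketch, in which the program and file names are placeholders:

#!/bin/bash
#SBATCH --job-name=localscratch-demo
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=2048

# copy the input to the job-specific local scratch space
cp $HOME/inputs/data.in $LOCALSCRATCH/

# run from the local scratch to benefit from fast local I/O
cd $LOCALSCRATCH
$HOME/bin/my_program data.in > data.out

# copy the results back before the job ends and the scratch is cleaned
cp data.out $HOME/results/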

One respondent expressed the wish to be able to run Apache Spark-like jobs

The plan for the renewal of the Vega cluster is to have something that is more oriented towards BigData-like workflows, depending on the output of the ongoing survey about the use of BigData/MachineLearning tools on the CÉCI clusters.

One respondent asked for the possibility to increase the time limit on a job in case the estimation was too short

Even if you cannot change that by yourself after your job has started, for specific cases you can always contact the local sysadmins and ask them to increase the maximum wall time of your job beyond the limit you set at submission. This can be done even if the extra time exceeds the maximum of the cluster partition in which the job started.
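
For reference, this extension is done on the administrators' side with a command along the lines of the following (the job id and the new limit are of course just examples):

scontrol update JobId=123456 TimeLimit=48:00:00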

One respondent complained that not all clusters send email job notifications

All clusters are configured to send emails, even if the contents of the emails might differ from one cluster to another. The key is that you need to tell Slurm to send emails with the --mail-user and --mail-type submission parameters. Remember to check the spam folder to verify that they are not being filtered there. If you notice that, on a specific cluster, emails are not being sent even though these options are in place, don't hesitate to contact the local sysadmins or use the CÉCI Support Wizard to report the issue.
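
For example, adding the following two lines to a submission script (with your own address) asks Slurm to send a message when the job starts, ends or fails:

#SBATCH --mail-user=your.name@university.example
#SBATCH --mail-type=BEGIN,END,FAIL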

Two respondents complained that sometimes they only have a small job to run and the queue is flooded with thousands of jobs

We cannot stress this enough: the waiting time of a job is not related to the length of the queue; it is related to the user's fairshare and to the size of the job. There is a dedicated section in the user docs discussing how the job scheduler handles priorities.

Even if the queue has 1000 jobs pending, your job could start right away if your fairshare is favorable, or if it is so small that it can be backfilled and scheduled in the shadow of a larger, higher-priority job. The scheduling policy is not first come, first served! But of course, if you do not submit, your job has zero chance of starting... Some clusters furthermore have special partitions with a small maximum time that allow for fast job turnaround. On those partitions, your small jobs are even more likely to start soon. To view the partitions on the clusters, use the sinfo command, or consult the cluster details on the CÉCI website.
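
The following commands, run on a cluster frontend, can help understand why a job is waiting (assuming the usual Slurm priority and fairshare configuration of the CÉCI clusters):

sinfo -s          # summary of the partitions and their time limits
sprio -u $USER    # priority components of your pending jobs
sshare -u $USER   # your current fairshare usage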

One respondent hoped to be able to store 25TB of data in scratch space, while another complained that only on Vega can they store 1TB worth of data

Those are large amounts of data, but they can be handled by the global scratch spaces available on Lemaitre3, NIC4 and Vega without any problem. Of course, such amounts of data are not meant to be stored in the home directory. You can access the scratch space on Lemaitre3 and NIC4 through the $GLOBALSCRATCH variable; Vega is a notable exception as the home and scratch spaces are a single entity. On the other clusters, this is more difficult, so you will need to contact the local sysadmins to try and find a solution that works for you given the available hardware.

One respondent indicated having difficulties estimating the needed time and memory for their job

That is indeed not trivial and can, in most cases, only be dealt with by trial and error, as suggested in the dedicated entry of the Slurm FAQ. You could run one job asking for the maximum possible, take note of the job id and, once it is finished, look at the accounting report with the command

sacct --format JobName,MaxVMSize,MaxRSS,MaxRSSNode,Elapsed -j jobid

and try to make the margins tighter for the next jobs that run a similar program. You are interested in the values of MaxVMSize and MaxRSS, the maximum amounts of virtual and resident memory used during your job, and Elapsed, the total elapsed time.
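
For instance, if the report shows a maximum memory usage of around 3GB and an elapsed time of five hours, the next similar job could be submitted with limits just above those values; the numbers below are purely illustrative:

#SBATCH --time=06:00:00
#SBATCH --mem-per-cpu=4096

Keeping the requested resources close to the real usage helps the scheduler backfill your jobs sooner.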

One respondent asked to increase the maximum wall time for jobs on the clusters

This has been discussed extensively in previous surveys and we refer the reader to the previous answers for the details (see the Scheduling sections for 2014 and 2016), but the short response is that the current limits represent the best compromise among the needs of all users.

One respondent stated that it was not easy to access information on their job statuses, and that they were looking forward to having a graphical dashboard for all their jobs

One thing that will help users manage jobs across clusters is the Slurm Federation that we are setting up, starting with Lemaitre3. Once the Federation is in place and all clusters have joined, it will be possible to manage and monitor all CÉCI jobs from a single cluster of your choice.
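
Once that is the case, Slurm's multi-cluster options should make commands such as the ones below possible; this is only a sketch, assuming the Federation is fully deployed, and the cluster name is a placeholder:

squeue --federation -u $USER    # list your jobs across all federated clusters
sbatch -M somecluster job.sh    # submit a job to another cluster of the Federation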

Also, please note that the training session titled Using a workflow manager to handle large amounts of jobs presents tools that help manage a large number of jobs running on several clusters.

One respondent regretted that there is no CÉCI cluster offering the possibility to submit jobs using up to 10000 cores

Indeed, CÉCI clusters are located at the Tier-2 level of the PRACE pyramid. They offer O(1000) cores per cluster. Above that, the Walloon Tier-1 system (Zenobe) offers O(10000) cores and can be accessed with your CÉCI account, provided you submit a project proposal that justifies its use.

For the larger Tier-0 clusters, which offer O(1000000) cores, you should apply for PRACE projects.

Available hardware

Two respondents requested more GPUs and more up-to-date generations

It is planned to have a couple of latest generation Nvidia GPUs in the cluster that will replace Dragon1, to be installed early next year. More GPUs might also be installed in the new Vega cluster depending on the output of the ongoing survey about the use of BigData/MachineLearning tools on the CÉCI clusters.

One respondent complained that 25GB for the home directory was too small

The common CÉCI home directory has a quota of 100GB per user. The same holds for Lemaitre3. On NIC4, Vega and Hercules, it is 200GB. Only Hmem and Dragon1 have less than 100GB, but Hmem is reaching end of life and Dragon1 will be upgraded early next year.

One respondent noted that the separation between HOME and CECIHOME, and the fact that $CECIHOME does not point to the same path on all clusters, can be annoying

We plan on having the HOME and CECIHOME directories merged on newer clusters, but before doing so we must be confident that access to the common storage hosting CECIHOME is robust and reliable enough to be set up as the default home area for each user on every cluster.

So far, we have rolled out an initial phase providing the solution as an area separate from the local HOMEs on each cluster. As it relies on new technologies, both hardware and software, this allowed us to pinpoint several limitations that must be sorted out before we move to a complete merge of these areas. As we work towards making it more robust, we will keep moving towards the goal of making it the main login folder on all clusters.

As for the fact that $CECIHOME does not point to the same path on all clusters, the only exception is Hercules. We take note of this and will fix it on the new Hercules2, to be installed early next year.

One respondent complained that there were too many interruptions of the production due to maintenance or power cuts

Most of the time, no more than one maintenance period is planned per year per cluster. The other interruptions are mostly unplanned and outside the control of the cluster sysadmins; they are due to work required on the hosting infrastructure (building, power, cooling, etc.) or even to external maintenance of the power lines in the city.

One respondent asked to improve the speed of data transfer to the clusters

Transfers between clusters can be very fast when using the TRSF partition on the shared CÉCI storage, which uses a dedicated 10Gbps network connecting the clusters. As for transfers from your laptop or local workstation to the clusters, those depend on the quality of the connection in your office, which is often limited to 100Mbps, and that is unfortunately entirely out of the hands of CÉCI.
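
For transfers from a workstation, a resumable tool such as rsync is usually the most convenient option; a typical invocation, in which the user name, host and paths are placeholders:

rsync -avP results/ myceciuser@cluster-frontend.example.be:/scratch/myceciuser/results/

The -P option keeps partially transferred files, so an interrupted transfer can be resumed by simply running the same command again.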

For specific cases in which you are required to copy or retrieve a huge amount of data from the clusters and you find that your connection is a big limitation, you can contact the local sysadmins at your university to see if they can offer you a solution.

One respondent noted that they preferred clusters where there is only one type of CPU over clusters where there are several

We remind users that they can choose the CPU type they want with the --constraint Slurm option; e.g., if you want to select only the Skylake nodes on Lemaitre3, you add

#SBATCH --constraint=Skylake5000

to your submission script. You can check the features available to choose on each node by running sinfo -o "%15N %10c %10m %25f %10G".

Software

One respondent complained that loading a module often leads to many more modules being loaded, sometimes conflicting with other modules that they wish to load

Once a module is loaded, all its dependencies are loaded along with it, including the GCC version with which it was compiled. The reason is that once a program is compiled with GCC, it is dynamically linked to the compiler's runtime libraries, which are needed at run time. Loading two versions of the GCC module at the same time would mean the runtime of one of those versions is shadowed by the other, leading to software that might crash due to undefined symbols or other incompatibilities. To cope with that problem, we are working towards organising the modules strictly by toolchain and release, so that all the software is compiled at least once with the same version of the compiler.

This will already be in place as the new generation of clusters starts rolling out: after Lemaitre3 this year, it will be Hercules2 and Dragon2, to be deployed early next year, that share the same modules organisation.
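
In the meantime, the usual way to avoid module conflicts is to start from a clean environment and stick to a single toolchain; the module name below is only illustrative, the actual names depend on the cluster:

module purge            # unload everything, including implicit dependencies
module avail            # see what is available on this cluster
module load foss/2018b  # load a single, consistent toolchain (example name)
module list             # check what ended up being loaded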

Several respondents requested that the same software packages be installed on all the clusters, with the same versions available everywhere

That is something we are working towards, by (1) formalizing the process for software installation and standardization among the universities, and (2) developing tools that ease the compilation process for all the cluster architectures (in terms of CPU family, operating system, and interconnect) that CÉCI must deal with. The process of standardizing the modules will go hand in hand with the creation of the Federation, for which it is of course of crucial importance.

Someone also requested information on how to compile software in one's home directory. This is covered in the documentation, in the Compiling software from sources section, and in the corresponding training session.
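
For a classic autotools-based package, installing into your home directory typically boils down to the following steps; the package name is a placeholder and the appropriate compiler module should be loaded first:

tar xf mypackage-1.0.tar.gz
cd mypackage-1.0
./configure --prefix=$HOME/.local
make -j 4
make install

# make the installed binaries visible in your sessions
export PATH=$HOME/.local/bin:$PATH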

One respondent complained that the compilation is not always easy

Compiling software on the cluster itself is important to take advantage of all its power. The difference between the same piece of software compiled with and without the proper optimizations on modern processors can be up to 16-fold! Not optimizing the compilation for the cluster where you are going to run your code is thus often a terrible waste of resources.

It is so important that, every year, two training sessions are dedicated to compilers (1, 2). So do not hesitate to ask the sysadmins for help, and to give feedback to the developers of the software you use about your experience compiling it.
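
As a simple illustration, enabling architecture-specific optimizations when compiling on the target cluster can already make a large difference compared to a generic build; the file names are placeholders:

# generic build: runs everywhere, but leaves performance on the table
gcc -O2 -o my_code my_code.c

# optimized build targeting the CPUs of the machine where the compilation runs
gcc -O3 -march=native -o my_code my_code.c

Note that -march=native targets the CPU of the machine on which the compilation takes place, so compile on the same node type as the one your job will run on (the --constraint option mentioned above can help with that).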

Misc.

One respondent remembered that, in the past, CÉCI issued newsletters, and that those contained useful information

CÉCI tries to send a newsletter for every important event in the life of the consortium. We try not to spam and to keep the number of emails to a minimum, but typically one newsletter is sent per year. The next one should come soon.
