Survey 2015: Summary of findings

From June 15th to June 26th, 2015, CÉCI users were invited to respond to an anonymous satisfaction survey.

The main questions were:

  • How did you learn about us?
  • Was it easy to create an account, to connect?
  • What was the main problem you faced when you renewed your account?
  • How interesting to you was the reply to the previous survey posted on the CÉCI website?
  • What do you need? (hardware/software/policy)
  • What is your typical job?
  • What would your dream job be?

The form ended with a free text field where users could leave suggestions or remarks.

Nearly 100 users responded to the survey, out of the approximately 480 active on the clusters earlier that year. They originated from all CÉCI universities, with very diverse research interests.

The present document offers a summary of all comments and suggestions made in the responses. A separate document offers a synthetic view of the responses.

Connecting to the CÉCI clusters

Some respondents found it difficult to find a clear procedure or to understand the SSH connection

The mail that is sent to users along with the private key contains a link to the corresponding entry in the FAQ, which in turn links to a very detailed Linux tutorial and a Windows tutorial. Admittedly, the Windows tutorial is a bit old and its screen captures may not reflect the current situation.

Furthermore, a training session is dedicated every year to SSH connections, but it is usually exclusively Linux-related.

Consequently we will:

  • Update the Windows-related tutorial on the website
  • Adapt the corresponding training session to focus on Windows too

One respondent complained that it is not possible to renew one's SSH key if not connected to a university network

To renew one's key from outside a university network, several options are available, all based on SSH:

  • either run Firefox on the gateway that you use to access the clusters (make sure to enable X11 forwarding with the -X SSH option), with something like: ssh -f -X GATEWAY firefox -no-remote https://login.ceci-hpc.be (GATEWAY being the gateway you use to SSH to the clusters, e.g. hall.cism.ucl.ac.be, hal.fundp.ac.be, etc.);

  • or create an SSH tunnel on the HTTPS port with ssh -L1234:www.ceci-hpc.be:443 GATEWAY and point your browser to https://localhost:1234/ (you will need to copy/paste and adapt the link you get by email);

  • or create a SOCKS proxy by running ssh -D1234 GATEWAY and configure your browser to use it; a dynamic proxy can also be set up on Windows with PuTTY (a configuration sketch follows this list);

  • or even create an SSH pseudo-VPN with SSHuttle: sshuttle -r GATEWAY $(host www.ceci-hpc.be) and then use your browser with its default settings (make sure to replace GATEWAY with the proper address of the gateway you will use).
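To avoid retyping the tunnel or proxy command each time, the corresponding options can be recorded in the SSH client configuration. The following is only a sketch, assuming OpenSSH on your machine; the host alias ceci-gw and the user name are hypothetical, and hall.cism.ucl.ac.be is just one example gateway:

# in ~/.ssh/config on your laptop
Host ceci-gw
    HostName hall.cism.ucl.ac.be
    User myceciuser
    # HTTPS tunnel: browse to https://localhost:1234/ while connected
    LocalForward 1234 www.ceci-hpc.be:443
    # alternatively, uncomment the next line for a SOCKS proxy on port 1234
    # DynamicForward 1234

With that in place, running ssh ceci-gw sets up the forwarding automatically.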

Note that if your university offers VPN access (notably ULg, UMons, and UNamur) you can use it to access the CÉCI website.

It seems those possibilities are not known to most CÉCI users.

Consequently we will:

  • Write a tutorial on how to renew one's account from outside the university networks.

Some respondents found it tedious to copy the private key to the servers (in order to copy data from one cluster to another)

With a properly configured SSH agent, there is no need to copy an SSH private key to the clusters to make inter-cluster file transfers or to hop from one cluster to another.

You can use the -A option to allow the agent running on your laptop to perform the authentication when you SSH (or SCP) from one cluster to another.

ssh -A hmem

The only caveat of this approach is that your local agent must be running for the whole duration of the transfer, which can be problematic for very large files.
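For reference, a typical session with agent forwarding could look like the following sketch, assuming your CÉCI key is stored as ~/.ssh/id_rsa.ceci and that hmem and lemaitre2 are host aliases defined in your SSH client configuration:

eval "$(ssh-agent -s)"         # start an agent if none is running yet
ssh-add ~/.ssh/id_rsa.ceci     # load the CÉCI key into the agent
ssh -A hmem                    # connect with agent forwarding enabled
# then, from hmem, copy data to another cluster without any key stored on the cluster:
scp -r results/ lemaitre2: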

If you really want to copy your SSH key to all clusters, simply run the following command:

for cluster in hmem lemaitre2 dragon1 hercules vega nic4; do  scp ~/.ssh/id_rsa.ceci $cluster:.ssh/ ; done

with a properly configured SSH client.

In the future, transfers will be made much simpler thanks to a single shared directory for all clusters (See below.)

Home directories

Several respondents asked for a common home directory across the clusters to ease data transfers

That topic arose in last year's survey and was also discussed at the CÉCI Bureau meetings. Last summer, a 'Grand Equipement' project was submitted to the FNRS to fund a global shared file storage for the CÉCI that will host project/group directories in addition to user home directories. Part of the funding was granted by the FNRS and the remainder was requested from the universities. Meanwhile, all vendors that could offer a solid solution were met and their solutions were evaluated. In parallel, discussions were (and still are) being held with Belnet to get a fast, dedicated network linking all clusters and the future shared storage. At present, a European Request for Proposals (RFP) is being drafted, to be issued in autumn 2015. The chosen solution should be operational in early 2016.

One respondent asked whether the clusters' home directories could be remotely mounted with NFS or Samba/CIFS

Remotely mounting the clusters' home directories is interesting when using an integrated development environment (IDE) that does not support transparent deployment (most do), or to use locally installed software.

Unfortunately, NFS and Samba/CIFS are not considered secure enough to be setup through public networks.

Fortunately, an SSH-based solution exists: sshfs. Once sshfs is installed on your laptop, simply create a mount point (e.g. mkdir /mnt/hmem) and run the following command: sshfs hmem: /mnt/hmem. You can then access your remote files through that local directory.
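For instance, on a Linux laptop, a session could look like the following sketch (sshfs must be installed, hmem is assumed to be a host alias in your SSH configuration, and the mount point is placed in your home directory so that no root access is needed):

mkdir -p ~/mnt/hmem            # create a local mount point
sshfs hmem: ~/mnt/hmem         # mount your remote home directory
ls ~/mnt/hmem                  # remote files now appear as local ones
fusermount -u ~/mnt/hmem       # unmount when done (use umount on macOS)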

Consequently we will:

  • introduce sshfs in the relevant training session
  • add a section about sshfs to the CÉCI website

Documentation

One respondent complained that the way to use the Slurm utility is not easy to take up without the assistance of a CÉCI administrator

Some users who do not have much experience with scientific computing may find using a resource manager difficult. That is why a training session is dedicated to Slurm every year. Furthermore, the Slurm tutorial has been rewritten based on the comments collected during last year's survey, with many workable examples, and a submission script wizard has been developed to help users cope with the various Slurm options and the differences between the clusters.
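As a starting point, a minimal submission script could look like the following sketch; myprog is a hypothetical program, and the resource limits and available modules differ from one cluster to another:

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00        # maximum run time, hh:mm:ss
#SBATCH --mem-per-cpu=2048     # memory per CPU, in MB

# module load ...              # load the modules your program needs, if any
srun ./myprog

Such a script, saved for example as submit.sh, is submitted with sbatch submit.sh.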

Never hesitate to contact the system administrators to get help using Slurm.

A respondent asked for a simpler tutorial, with more basic notions introduced

The CÉCI FAQ is written so as to address the basic notions in the very first questions, and it links to external documentation. For basic notions, a rewritten explanation would add very little value compared to what Wikipedia, for instance, can offer.

One user requested a single CÉCI website with documentation for all clusters

That is a concern that all system administrators share, but there is unfortunately no easy solution that would ensure:

  • up-to-date and relevant content
  • an easy-to-find entry point
  • no work duplication

At the time of writing, we have tried to make sure all local documentation links to the central one at http://www.ceci-hpc.be, and we have added links to the local documentation in the Cluster section.

We will keep on working towards a unified documentation.

Support

One respondent complained that the contact person in their university never replied to an email, another observed that the system administrators must be 'a bit overwhelmed'

In the event the system administrator of one university does not promptly respond to a request (which can happen because of vacation, illness, or an urgent situation to deal with -- keep in mind that some system administrators have many other duties besides taking care of the CÉCI clusters), you can seek further help by emailing the CÉCI logisticien.

One suggested a central ticketing system

A central issue management system would indeed benefit the users, we are certainly aware of that, and it has been on our minds since the beginning. The main issue is the integration with the existing ticketing systems in the five universities.

We will go forward with that project, taking advantage of the increased workforce following the hiring of Juan.

Resources

Several respondents requested shared directories for groups

Once the common storage is operational, its capacity will be used to host the individual home directories, a common central repository for software available on all clusters, and also a group share where users in a group can work together.

In the meantime, a small tutorial on how to share data with colleagues on the CÉCI clusters is available.
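As an illustration, sharing a directory with a colleague could look like the following sketch, assuming the file system supports POSIX ACLs and using a hypothetical colleague login jdoe:

mkdir ~/shared_project                       # directory to be shared
setfacl -m u:jdoe:x ~                        # let jdoe traverse your home directory
setfacl -R -m u:jdoe:rwX ~/shared_project    # give access to existing files
setfacl -d -m u:jdoe:rwX ~/shared_project    # and to files created later on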

There are few GPUs available. Cheap NVIDIA gaming GPUs such as the GTX 980 or Titan could be used instead of NVIDIA Tesla GPUs.

Unfortunately, cheap gaming GPU cards are not compatible with the cluster hardware: firstly, they are not built for intensive 24/7 usage; secondly, they are designed with an on-board cooling system that is not compatible with the rack-level cooling system: the GPU fan would interfere with the cold air flow through the compute nodes or, worse, would be blocked by a side of the chassis.

Many respondents requested more resources

Many respondents requested more CPUs, faster CPUs, more RAM and/or more disk space. Others wished their jobs waited less in the queue, which is basically an equivalent request.

Fortunately, that is a domain where you, the user, have the power: make sure to express your needs to the authorities so that the CÉCI gets the funding it needs to meet its users' expectations.

Scheduling

A few respondents asked for week-long jobs

As discussed in the responses to previous surveys, long jobs are incompatible with short waiting times unless resources are infinite. See the item above for a solution. In the meantime, please try to use checkpointing software such as DMTCP (a training session has been dedicated to it for the past two years).
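As a rough illustration of that approach (the exact options and checkpoint file names may differ depending on the DMTCP version installed on the cluster, and my_program is a placeholder):

dmtcp_launch --interval 3600 ./my_program    # run under DMTCP, checkpoint every hour
# when the job is killed at the time limit, resubmit a job that restarts from the last checkpoint:
dmtcp_restart ckpt_*.dmtcp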

One respondent complained that when he/she submits large numbers of jobs that require very little CPU time, they sometimes get stuck in the queue for hours or days, while an MPI job with large resource requirements runs almost immediately

On most clusters, the job priority is proportional to the size of the job, because large jobs are more difficult to schedule: they require more resources to be freed before they can start. Such a setting also gives priority to jobs that could not run on a typical computer and that really need the clusters.

Submitting a large number of jobs with very small run times is a bad idea; it puts a lot of burden on the scheduler, and the overhead of resource management becomes as CPU-consuming as the jobs themselves, which is senseless. Make sure to group small tasks into a single job so that each job runs for at least a few hours.
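For example, instead of submitting hundreds of one-minute jobs, a single job can loop over the inputs. The following is only a sketch; my_program and the input file pattern are placeholders:

#!/bin/bash
#SBATCH --job-name=grouped_tasks
#SBATCH --ntasks=1
#SBATCH --time=03:00:00        # long enough for the whole batch of small tasks

# run the short tasks one after the other within a single allocation
for input in input_*.dat; do
    ./my_program "$input"
done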

One respondent asked how the priority queue works and how priorities are calculated.

A page on the CÉCI website is dedicated to priority computations: see it here. It appears that some elements have changed depending on the Slurm version, and the page might not be up to date.

Consequently we will:

  • update that page to reflect the new vocabulary and extend it with more focused information

Several respondents explained that it is difficult for them to predict the quantity of RAM or the time needed for a job to complete.

We agree that, especially when using closed-source software, this is not an easy task. Often, the only thing to do is to observe the memory and time needed for a completed job and transpose that to future jobs.

Being precise to the megabyte is not necessary. What is important is to make sure the request is in the right order of magnitude. This helps your job get scheduled sooner and helps maximize the utilization of the cluster.
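For instance, the actual run time and peak memory usage of a finished job can be retrieved with sacct (JOBID is a placeholder, and the available fields depend on the accounting configuration):

sacct -j JOBID --format=JobID,JobName,Elapsed,MaxRSS,ReqMem,State

The MaxRSS and Elapsed values of past jobs give a good basis for the requests of future, similar jobs.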

It is a bit difficult to interpret the memory usage reported in the automatic email or in the output of sacct

We agree that the notion of 'memory' in Linux is multi-faceted and might be difficult to grasp.

Consequently we will:

  • dedicate a few slides to the issue in the Slurm training session

Software

One respondent requested PGI to be installed with OpenACC features on the clusters with GPU

As the PGI compiler is not free and the CÉCI currently has no budget for software, the cost of that software must be borne by the users, and all cost-related issues are handled at the university level rather than the CÉCI level. This might not be clear to users.

Consequently we will:

  • add a section about commercial software to the FAQ.

Two respondents requested that more documentation be available on the CÉCI website on how to launch a computation using their favorite software (namely Gaussian and Matlab).

As far as Matlab is concerned, a training session is dedicated to using Matlab on the clusters by means of the Matlab Compiler, and has been organized for five years now.

More generally, for commercial software that is not even accessible to all CÉCI users, it is the responsibility of the users to learn how to use the software they need. If someone is willing to share their knowledge about such software with their colleagues, though, we will be more than happy to make the information available through the CÉCI website.

As for Gaussian, the license is very restrictive and requires interaction between the system administrators, the licence owner, and the potential user before access can be granted. Such information cannot be made publicly available.

Misc

The survey assumes there is only one type of job, while needs vary over time

Indeed, to avoid drowning respondents in information, we ask about 'the' typical job in the survey. If you feel your needs are distinct over time, or depend on the topics you are working on, feel free to answer the survey several times.

Consequently we will:

  • Highlight this fact in the next survey
© CÉCI.