Survey 2013: Summary of findings

From May 8th to May 22nd, 2013, CÉCI users were invited to respond to an anonymous satisfaction survey.

The main questions were:

  • How did you learn about us?
  • Was it easy to create an account, to connect?
  • What do you need? (hardware/software/policy)
  • What is your typical job?
  • What would your dream job be?

The form ended with a free text field where users could leave suggestions or remarks.

Some 60 users responded to the survey, out of the approximately 130 active on the clusters earlier that year. They came from all CÉCI universities. Half of them are Linux users; the other half are Windows or Mac users, with twice as many Windows users as Mac users. Unsurprisingly, the majority of respondents had a physics or chemistry background, or were active in a field which has the word 'computational' in its name.

The present document offers a response to all comments and suggestions made in the survey. A companion document offers a synthetic view of the responses.

Connecting to the CÉCI clusters

Several respondents complained that SSH keys are difficult to use, especially when it comes to transferring data from one cluster to another.

Using encrypted SSH keys rather than passwords may seem more tedious to someone used to typing passwords. However, with the right tools (an SSH agent and, optionally, a password manager, all readily available in any Linux distribution and in Mac OS X, or installable alongside PuTTY), one never has to worry about keys once they are properly set up on one's personal computer.
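
As an illustration, on Linux or Mac OS X a minimal session using an agent and agent forwarding could look like the sketch below; the user and host names are placeholders, and PuTTY users would use Pageant instead of ssh-agent:

  # Start an agent if the desktop environment does not already run one
  eval $(ssh-agent)
  # Load the private key; the passphrase is asked only once per session
  ssh-add ~/.ssh/id_rsa
  # Connect with agent forwarding (-A) so the same key can also be used
  # from that cluster to reach another one
  ssh -A myuser@cluster1.example.org
  # Once on cluster1, copy data directly to cluster2 without typing anything again
  scp -r results/ myuser@cluster2.example.org: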

Consequently, we will:

  • Add a reference to agent forwarding in the "How to use CÉCI clusters from a terminal" document.
  • Add a paragraph about SSH agents and agent forwarding in the "How to use CÉCI clusters from Windows" document.

Some respondents wondered why there are so many constraints on connecting (passphrases, firewall)

Computers that are open to public access are continuously attacked by unauthorized users trying to gain access, be it to organize DDoS attacks, store illegal copies of DRM-protected material, mine bitcoins, etc.

If one account on one CÉCI cluster is compromised, all clusters from all CÉCI universities are at risk.

The attacks, most of the time, look for weak passwords or SSH server vulnerabilities. The most efficient way of dealing with attacks coming from the outside is simply to prevent outside attackers from reaching the machines. This is why the clusters are only reachable from inside the university networks. It furthermore confines the burden of responding to zero-day vulnerabilities to a small set of virtual machines that can be deactivated without impacting the jobs.

Attacks from the inside are less frequent but often more dangerous, because they tend to be more focused and targeted; they are, however, easier to deal with since they come from known and controlled administrative domains.

Using passphrase-protected SSH keys is the most secure way of using SSH: it offers two-factor authentication (ownership of the key and knowledge of the passphrase); it avoids storing sensitive information (password hashes) in a central location, which makes global brute-force attacks impossible; it prevents the password reuse and weak passwords that lead to password guessing; and it makes password sharing more complicated. Note that many large computing facilities use hardware-based authentication (like online banking systems), or require you to physically show your ID on site before you get SSH access.
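
For reference, a passphrase-protected key pair can be generated, and the passphrase of an existing private key changed, with the standard OpenSSH tool; the file name below is just an example:

  # Generate a new RSA key pair protected by a passphrase
  ssh-keygen -t rsa -b 2048 -f ~/.ssh/id_rsa_ceci
  # Add or change the passphrase of an existing private key
  ssh-keygen -p -f ~/.ssh/id_rsa_ceci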

Consequently, we will:

  • Add question to FAQ: How do I connect from outside the allowed networks?
  • Add question to FAQ: How can I avoid typing my passphrase continuously?

Some respondents reported issues with specific situations when creating an account (UCL-Mons, ULg VPN, etc.)

We indeed experienced some hiccups when the system was first developed and deployed, but the concerns expressed here were addressed as soon as the users reported them (as was acknowledged by the respondents).

Consequently, we will:

  • Keep on making the system as user-friendly as possible and available to all CÉCI users

Getting help

Several respondents noted that help is sometimes difficult to find

Help can be found on the CÉCI website, through the FAQ. The last questions give directions on how to find the right person to contact for information or requests about a specific cluster. All emails we send are signed by a real person and can be replied to. Still, it appears some users do not find the help they need. Any concrete suggestion is welcome.

Consequently, we will:

  • Add a 'contact' page on the main menu
  • Make sure the old website redirects to the new one
  • Try and make the CÉCI website more visible

Many respondents noted that the CÉCI lacks visibility: they learned about it incidentally, it is difficult to find with Google, etc.

The CÉCI is a relatively young organization and definitely could use some promotion.

Consequently, we will:

  • Identify proper communication channels through which the CÉCI could advertise its actions

One respondent asked for more support for compiling, parallelization, etc.

The CÉCI logisticians were recruited with the skills required to offer such a service to the users. When all the CÉCI clusters are installed and their position is confirmed, the logisticians will have more time to organize that offer. In the meantime, they respond to individual requests, together with the system administrators, as best they can.

Consequently, we will:

  • Advertise the new 'support for compiling, developing and optimizing' service when it is ready

Many respondents asked for training aimed at beginners

Training sessions aimed at beginners are organised every Fall in Louvain-la-Neuve. They are open to all CÉCI members at no cost and range from basic Linux command-line usage up to programming accelerators. They are listed on the CÉCI website under the 'Training' tab. It seems they are not visible enough.

Consequently, we will:

  • Make sure announcement emails are properly forwarded (this year should be easier with the ceci-users mailing list)
  • Add some advertisement for the training sessions to the sign-up email
  • Better specify the prerequisites for each training session on the website

Jobs priorities

Several commented that the way priorities are computed is difficult to understand.

The way the priorities are computed is completely transparent to any user who reads the Slurm documentation. Nothing is hidden from the users; the formulas, parameters and configuration are all available.
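
For instance, the standard Slurm commands below show the weighted priority factors of pending jobs, the current fairshare values, and the priority-related configuration of a cluster; the exact output depends on each cluster's configuration:

  # Weighted factors (age, fairshare, job size, ...) of pending jobs
  sprio -l
  # Current fairshare values for your user and account
  sshare
  # Priority-related configuration parameters of the cluster
  scontrol show config | grep -i priority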

Consequently, we will:

  • Write a new document "Slurm priorities" with links to very detailed information, along with instructions on how to gather the information specific to a cluster (whether or not job age is taken into account, etc.)

One respondent expressed the concern that a user's past usage (e.g. a year ago) negatively impacts their current priority and suggested resetting the usage every year.

This is not the case: past usage of the cluster is exponentially decayed before it enters the fairshare computation, which amounts to a smoothed version of usage resetting.
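
Schematically, with H the configured half-life (Slurm's PriorityDecayHalfLife parameter), usage recorded a time t in the past only contributes

  decayed_usage = raw_usage * 2^(-t / H)

to the fairshare computation, so old usage fades away gradually instead of being forgotten abruptly at a fixed date.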

Consequently, we will:

  • Focus on that aspect in the new document mentioned above

One respondent wondered why large jobs are given a higher priority on Lemaitre2

Large jobs are more difficult to schedule than one-CPU jobs, as they require large amounts of resources to be available at the same time. Slurm reserves CPUs in advance for large jobs when they have a high priority, but that is not sufficient to ensure large jobs are scheduled as quickly as other jobs. Because Lemaitre2 was designed for large parallel jobs, the policy favors large jobs over small ones in the priority computation.

One respondent expressed the concern that high-memory jobs should be accounted for relative to the memory consumed rather than the number of CPUs, which is often no more than one. The rationale is that a job consuming all the memory prevents any other job from running on that machine, and is thus equivalent to using all the CPUs, and that this should be reflected in the priorities of that user's future jobs.

Even though the latter statement holds when all the memory is used, it does not hold in the situation currently observed, for two reasons. First, users who do want all the memory are very few, and they very often use the --exclusive option, effectively reserving all the CPUs on the node. Second, many jobs, on each cluster, request as little as one or two hundred MB of RAM per CPU. Those jobs are able to start even when another job uses 80%, or even 90%, of the memory on the node. These assertions are backed up by the distribution of memory usage as reported by the sacct command. Furthermore, if large-memory jobs with one CPU did indeed prevent other jobs from using the available cores, the global cluster load could never be as high as 85% to 90%, which is currently the case on Hmem and Lemaitre2.
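
Any user can check such figures for their own jobs; as a sketch, the sacct invocation below lists the requested memory, the peak memory actually used and the number of CPUs of past jobs (the start date is just an example, and field names may vary slightly with the Slurm version):

  # Requested memory, peak memory used and number of CPUs of my jobs since May 1st
  sacct --starttime=2013-05-01 --format=JobID,ReqMem,MaxRSS,NCPUS,State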

One respondent thinks 'tails could be a bit thicker' in the exponential decay of past usage in the priority computation, because it decays so quickly that whether you used 100 CPUs or 10, you end up, so to speak, with the same rounded value

The current half-life is two weeks; we can increase it.

Consequently, we will:

  • Run simulations and experiments, and possibly increase the half-life of the fairshare decay

Job policies

It appears that 30% of users run mono-CPU jobs, often submitted in batches of 100+ jobs. Several other users, who submit parallel jobs, feel their jobs are penalized by such mono-CPU jobs.

That sentiment often arises when users see the queue full of jobs from the same user running at the same moment, with many more following in the pending state at a high priority. First, it is important to note that the CÉCI clusters are Tier-2 clusters, dedicated to generic research rather than to a specific type of job or workload; everyone is entitled to run on the CÉCI clusters. Then, whenever the queue was full of 1-CPU jobs, the fairshare mechanism has always been observed to keep its promises: the pending 1-CPU jobs were delayed to let other jobs run, even if those pending jobs had a high priority at some point, because their priority dropped as a consequence of that user's other jobs being in the running state.

Consequently, we will:

  • Write a document explaining how to choose a cluster wisely

One respondent remarked that the information about the policies should be explained somewhere

The information is at the moment incomplete and scattered across several reference websites.

Consequently, we will:

  • Add information about policies and queues in the cluster page

Some respondents would like to be able to launch very large jobs (1000+ cpus)

The largest of the CÉCI clusters at the time of the survey is Lemaitre2, with 1344 cores. If one user submits a 1000+ core job, they will first have to wait a very long time (although the 24h maximum run time keeps the waiting time in check), and they will then nearly monopolize the whole cluster, which other users never witness happily. When larger clusters are set up, the maximum number of CPUs that can be requested will be adjusted accordingly. For users who need to run a 1000+ core job, for instance for scaling analysis or debugging, dedicated debugging sessions will be organized in the future. Those users can apply for exclusive access to the whole cluster for a limited period of time.

Consequently, we will:

  • Organize debugging sessions for advanced users with specific needs
  • Make sure limitations (policies) are explained on the CÉCI website or the local wikis

Queue maximum times are problematic. Several respondents complained that their jobs are neither checkpointable nor parallelizable, so they need a long maximum time; yet at the same time many of them complain about long waiting times. Others prefer a small maximum time, which brings faster rotation, less waiting, and the possibility for larger jobs to start sooner. 14 of the 60 respondents are fine with 24h and a majority are fine with less than a week, but the others feel the resources are not fairly shared if they cannot run for long times.

The fact that some users run programs which are designed neither to checkpoint/restart nor to parallelize is a real problem for all users, especially those who make the effort of working towards checkpointing and parallelization. Requiring long queues increases the waiting time for all jobs in those queues. Dedicating nodes to such queues makes them unavailable for large parallel jobs and dramatically increases the maintenance time, hence the node unavailability, whenever a shutdown is required. Resource allocation is also seen as unfair by other users, who see nodes being allocated to the same user for weeks or even months.
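
To fix ideas, here is a minimal sketch of the kind of self-resubmitting job script that application-level checkpointing makes possible; 'myprog', its option, 'finished.flag' and the script file name are entirely hypothetical and must be replaced by the mechanism offered by the actual application:

  #!/bin/bash
  #SBATCH --job-name=ckpt_example
  #SBATCH --ntasks=1
  #SBATCH --time=24:00:00

  # Hypothetical program that writes checkpoint.dat periodically,
  # restarts from it when present, and creates finished.flag when done.
  if [ -f checkpoint.dat ]; then
      ./myprog --restart-from checkpoint.dat
  else
      ./myprog
  fi

  # If the computation is not finished, resubmit this same script;
  # the new job will pick up from the last checkpoint.
  if [ ! -f finished.flag ]; then
      sbatch job_ckpt.sh   # placeholder for this script's file name
  fi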

Consequently, we will:

  • Create a training session dedicated to checkpoint/restart
  • Set up tools on the clusters to help users checkpoint/restart programs which do not include that feature
  • Continue to have long queues on dedicated clusters (Mons, Namur)
  • Contact users who submit to long queues to identify the obstacles to checkpointing and parallelisation
  • Study the relevance of adding a long queue on Vega

It appears only 25% of the respondents sometimes use a 'fast' queue for debugging.

It is difficult to say whether this is due to lack of need or lack of awareness.

Consequently, we will:

  • Make sure information about those queues is easily discovered
  • Consider setting a uniform naming scheme across all clusters

Slurm features

Several respondents lament the lack of job arrays in Slurm.

Job arrays are incorporated in recent versions of Slurm. We will do our best to upgrade Slurm on Lemaitre2 and Hmem to such a version during the planned maintenance sessions.
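
As an illustration, once such a version is installed, a batch of 100 similar mono-CPU jobs could be submitted as a single array job along these lines (program and input file names are placeholders):

  #!/bin/bash
  #SBATCH --job-name=param_sweep
  #SBATCH --array=1-100        # 100 array tasks, indices 1 to 100
  #SBATCH --ntasks=1
  #SBATCH --time=24:00:00

  # Each task of the array receives its own index
  ./myprog input_${SLURM_ARRAY_TASK_ID}.dat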

Consequently, we will:

  • Plan Slurm upgrade on CÉCI clusters where possible

Several complained that their MPI processes were scattered over several nodes.

Slurm is configured to try to minimize the number of distinct nodes on which CPUs are allocated for an MPI job, but it will not delay the start of a job to find all requested CPUs on the same node unless explicitly instructed to do so by the user, with the --nodes, --ntasks-per-node and/or --exclusive options. Those options let the user specify a maximum number of distinct nodes to use: the lower that number, the faster the computation, but the longer the waiting time.
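
For instance, here is a sketch of a submission script forcing a 24-process MPI job onto exactly two nodes (the program name is a placeholder; the trade-off is a possibly longer waiting time):

  #!/bin/bash
  #SBATCH --ntasks=24
  #SBATCH --nodes=2              # use exactly 2 distinct nodes
  #SBATCH --ntasks-per-node=12   # 12 MPI processes on each node
  #SBATCH --time=12:00:00

  # srun starts the MPI processes on the allocated CPUs
  # (mpirun can be used instead, depending on the MPI library)
  srun ./my_mpi_program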

Consequently, we will:

  • Add question to FAQ: How do I prevent my processes from being scattered across too many nodes?

One respondent commented that many Slurm features are unknown to the users. They suggested that some time be dedicated to Slurm during the CÉCI scientific days.

It appears many Slurm features are indeed unknown to many users. We furthermore believe that many presentations at the scientific days could focus less on the scientific aspects of the research (which can be presented at any other meeting or conference) and more on the computational challenges involved, the Slurm tricks used, etc.

Consequently, we will:

  • Try to give a short Slurm highlight (e.g. new features, good practice, etc.) at each scientific day
  • Write specific guidelines for speakers at CÉCI scientific days

One respondent expressed doubts about how to know when a job will start, with respect to the trade-off between the number of cores and the waiting time

It is indeed very difficult to know which would be faster: requesting more cores, at the cost of possibly having to wait longer, or requesting fewer cores, leading to longer computations but with a job that might start sooner. One hint: the --test-only option of srun gives an estimate of when a job would start, which allows comparing both strategies. But let us remember one key aspect: short maximum running times ensure shorter waiting times.
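
For instance, the two strategies can be compared without submitting anything; the commands below only print an estimate of the start time given the current queue (program name and numbers are placeholders):

  # Estimated start time for 48 cores during 6 hours
  srun --test-only --ntasks=48 --time=06:00:00 ./myprog
  # Estimated start time for 12 cores during 24 hours
  srun --test-only --ntasks=12 --time=24:00:00 ./myprog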

Consequently, we will:

  • Add that question to the Slurm FAQ

Miscellaneous

One respondent noted that the Python environment was a bit poor

It appears most users tend to install additional modules in their home directories, which can be tedious for some of them.
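
For reference, most Python modules can already be installed in a home directory without administrator rights; the module name below is only an example:

  # With pip, when it is available on the cluster
  pip install --user mpmath
  # Or from an unpacked source archive
  python setup.py install --user
  # Either way, the module lands under ~/.local/, which Python searches automatically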

Consequently, we will:

  • Try to make the Python install uniform across the CÉCI clusters with standard modules (numpy, scipy, matplotlib, etc.)
  • Add information on how to install a Python module in a home directory on the website

One respondent asked about having the same NFS share on each cluster

Sharing the home directories among clusters is a big milestone towards which the CÉCI is working. But it requires a very fast network (10Gb/s) and a distributed, location-aware, redundant file system. We are investigating both.

Consequently, we will:

  • Work towards the next step, which is to connect two CÉCI clusters with a 10Gb/s network

One respondent wants to hear the Star Wars theme song at the end of their job.

One solution is to use a mail program that can play a specific sound based on the email contents, through custom filter rules for instance.
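
As a starting point, Slurm can already send an email when a job ends; the mail client's filter rules can then match the subject of that notification and play the desired sound (the address below is a placeholder):

  #SBATCH --mail-type=END              # send an email when the job completes
  #SBATCH --mail-user=user@example.org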

© CÉCI.