Tier-1 Zenobe quickstart guide

Note: this information is obsolete. Zenobe has been replaced by Lucia.

Access

Access to Zenobe is only granted to users who have submitted a project as described here.

Zenobe is accessible directly only from a CÉCI university network:

$ ssh -i ~/.ssh/id_rsa.ceci  my_ceci_login@zenobe.hpc.cenaero.be

Make sure to configure your SSH client as for the other CÉCI clusters to avoid the burden of specifying your SSH key and login each time, so that you can simply type:

$ ssh zenobe

From other locations, access to Zenobe is possible through a gateway named hpc.cenaero.be. Your CÉCI key is actually installed on the gateway, and access from the gateway to Zenobe itself is automatic.

Manual process

The two-step process is typically as follows:

$ ssh -i ~/.ssh/id_rsa.ceci  my_ceci_login@hpc.cenaero.be

which leads to

================================================================================
 You're connected to hpc.cenaero.be - Authorized access only! 
================================================================================
 Dear HPC user,

 You're connected to the new hpc.cenaero.be gateway, as always, please limit
 file storage on this server to transient transfer only though this is enforced
 by user quota (soft limit: 10GB - hard limit: 40GB - grace period: 7 days).

 Thank you for you collaboration,

 The Cenaero HPC & Infrastructure team.
--------------------------------------------------------------------------------
                                                  Last updated on Feb. 21, 2014
================================================================================
$

and then

ssh zenobe

so that you arrive on Zenobe

Last login: Wed Jan 14 14:17:48 2015 from hades.cenaero.be
================================================================================

                   dMMMMMP dMMMMMP dMMMMb  .aMMMb  dMMMMb  dMMMMMP
                    .dMP" dMP     dMP dMP dMP dMP dMP dMP dMP
                  .dMP"  dMMMP   dMP dMP dMP dMP dMMMMK´ dMMMP
                .dMP"   dMP     dMP dMP dMP aMP dMP aMF dMP
               dMMMMMP dMMMMMP dMP dMP  VMMMP" dMMMMP" dMMMMMP

             dMMMMMMMMMMMMMMMMMMMMMMP" frontal node dMMMMMP"

            Warning : Only authorized users may access this system.

================================================================================

Automated process

The process can be automated by properly configuring your SSH client. Simply add the following to your ~/.ssh/config file:

# Configuration for the gateway access using the CÉCI key
Host hades
    User YOURCECILOGIN
    ForwardX11 yes
    ForwardAgent yes
    Hostname hpc.cenaero.be
    IdentityFile ~/.ssh/id_rsa.ceci

# Configuration for the transparent access to zenobe through hades
Host zenobe
    User YOURCECILOGIN
    Hostname zenobe.hpc.cenaero.be
    ForwardX11 yes
    ForwardAgent yes
    IdentityFile ~/.ssh/id_rsa.ceci
    ProxyCommand ssh -q hades nc %h %p

Make sure to replace YOURCECILOGIN with your actual CÉCI login (twice). Once done, you can simply issue the command

ssh zenobe

to connect to the supercomputer. Make sure your SSH agent is correctly set up (see here for Linux, or here for Windows) to avoid having to enter your passphrase each time you connect.
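For instance, on Linux, loading the key into the agent for the current session typically looks like this (on most desktop environments an agent is already running, so only the ssh-add step is needed):

$ eval $(ssh-agent)
$ ssh-add ~/.ssh/id_rsa.ceci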

With such a configuration, you can also simply transfer files with scp or rsync:

$ echo 'test file' > ./myfile
$ scp myfile zenobe:
myfile                                                                                                                                                                  100%   10     0.0KB/s   00:00    
$ ssh zenobe
Last login: Fri Jan 16 11:59:54 2015 from hades.cenaero.be
[...]
$ ls myfile
myfile
$

or run commands directly:

$ ssh zenobe hostname
frontal1

Disk space

You have read/write access to four directories, in the following filesystems:

  • /home: for users' personal codes, scripts, configuration files, and small datasets (quota 50GB)
  • /project: for data and results that are stored for the whole project duration (for current usage and/or quota: contact the support team -- see below )
  • /SCRATCH: for temporary results during the course of one job. Users have access to /SCRATCH/acad/projectname and /SCRATCH/primarygroup (See your primary group with groups.) You can get your current usage and quota with mmlsquota -g projectname or mmlsquota -g primarygroup.

Note that the setgid bit is set on the project and scratch directories. This ensures that data you place in those directories is owned by the project group rather than your own personal group. In ls -l listings, such directories appear with an s rather than an x in the group permissions. This bit needs to be set on the subdirectories too. If you have removed it, you can set it back with chmod g+s <dir_name>. You can also use the newgrp <group_name> command to set the default group for all files you create in the current session.
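For instance, assuming an illustrative subdirectory /SCRATCH/acad/myproject/results and project group myproject (replace with your own), you can restore the setgid bit and check it as follows:

$ chmod g+s /SCRATCH/acad/myproject/results
$ ls -ld /SCRATCH/acad/myproject/results
drwxrwsr-x 2 my_ceci_login myproject 4096 Jan 16 12:00 /SCRATCH/acad/myproject/results
$ newgrp myproject

The s in the group permission field confirms that the setgid bit is set; the newgrp command makes myproject the default group for files created in the current session.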

Job preparation

Zenobe does not use Slurm as its resource manager; jobs are orchestrated by PBSPro version 13. The main differences are listed below. Note that what Slurm calls a 'partition' is called a 'queue' in PBS.

Commands

Getting general information about the queues is done with either qstat -q (resources and limits) or qstat -Q (jobs.)

All jobs are listed with qstat; to see only your own jobs, add the -u $USER parameter. To see full information about a specific job, use qstat -f jobid, and use qstat -awT to get an estimate of when a pending job should start (equivalent to squeue --start).

All nodes and their features can be listed with pbsnodes -a, and information about down nodes is available through pbsnodes -l.

Jobs are submitted with qsub and canceled with qdel.
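A typical submission session then looks like this (the job identifier returned by qsub, and therefore the argument passed to qdel, is purely illustrative):

$ qsub job.pbs
123456.frontal1
$ qstat -u $USER
[...]
$ qdel 123456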

To compile your program, use preferably the following modules:

module load compiler/intel/2015.5.223
module load mkl/lp64/11.2.4.223
module load intelmpi/5.0.3.049/64

First compile with -O1 and check the results; only then switch to -O2 or -O3 and make sure there is no regression.
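For instance, using the Intel MPI compiler wrapper provided by those modules (the source and executable names are illustrative):

$ mpiicc -O1 -o mycode mycode.c
$ mpiicc -O3 -o mycode mycode.c

Run your test cases after the first build, then again after the second one, and compare the results.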

Queues

Zenobe offers two queues for CÉCI users:

  • large: this is the queue with the largest number of Ivybridge nodes (8208 cores in total, 24 cores per node). Jobs there are limited to a 24-hour walltime, and must use at least 96 CPUs and at most 4320. Nodes are allocated exclusively to jobs on that queue (no node sharing). Select it with #PBS -q large. Make sure to use at most 2625MB of memory per core to respect the RAM/core ratio of the nodes in that queue.
  • main: this is the queue with 5760 last-generation Haswell cores (24 cores per node), with no walltime limit. Select it with #PBS -q main.

The scheduling policy is a fairshare per category, then per project. The category fairshare is set to the agreed-on distribution of compute time among the categories (see this document in French). All jobs from the same project have the same priority, which is based on the project's past usage of the cluster. The fairshare is configured with a decrease factor of 0.5 and a period of one week.

Scripts

PBS scripts are very similar to Slurm scripts in that most Slurm parameters have a direct PBS equivalent. Still, there are some differences that must be taken into account.

Chunks

PBS resources are allocated (and thus requested) by chunks. A chunk is an allocation unit defined in terms of a number of CPUs and an amount of memory that is always allocated on a single node (it cannot be split across nodes). Chunks are requested with the #PBS -l select= construct. For instance:

#PBS -l select=4:ncpus=1:mem=512mb

requests 4 chunks of one CPU each with 512MB of RAM. It is equivalent to

#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=512

The CPUs will be allocated freely by PBS so you could end up with 4 CPUs on the same node, 4 CPUs on distinct nodes, or any combination on two or three nodes.

To have all CPUs allocated on a single node, you will request one chunk with 4 CPUs and 2GB of memory:

#PBS -l select=1:ncpus=4:mem=2048mb

which corresponds to

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=512

You can also define the number of MPI processes and OpenMP threads to create in a chunk like this:

#PBS -l select=4:ncpus=24:mpiprocs=2:ompthreads=12:mem=63000mb

so that PBS knows you want a total of 8 MPI processes, each with 12 threads, allocated two per chunk. For more complete examples, refer to the script generation wizard.
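As a rough illustration only (the wizard remains the reference), a minimal hybrid MPI/OpenMP script for the request above could look like the following; the executable name is a placeholder and the mpirun line may need adapting to your code and to how Intel MPI is set up on the cluster:

#!/bin/bash
#PBS -q large
#PBS -l select=4:ncpus=24:mpiprocs=2:ompthreads=12:mem=63000mb
#PBS -l walltime=12:00:00

cd ${PBS_O_WORKDIR}
module load compiler/intel/2015.5.223
module load intelmpi/5.0.3.049/64
export OMP_NUM_THREADS=12
mpirun -np 8 ./mycode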

Working directory

By contrast with Slurm, which starts your job in the directory it was submitted from (i.e. the current directory when you typed the sbatch command), PBS executes your script in your home directory by default. It is therefore common to start each job script with a cd ${PBS_O_WORKDIR} command.

Standard I/Os

Similarly, the output of your job is redirected to a file that only becomes available in the working directory at the end of the job. To monitor a running job, you consequently need to redirect the output yourself to a specific file in your home or working directory. You can do this once for all the commands in your script by starting it with

exec > filename

Again, see the script generation wizard for a practical example.
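For instance, a script that runs in the submission directory and writes all its output to a log file there (the file name is illustrative) could start with:

#!/bin/bash
#PBS -q main
#PBS -l select=1:ncpus=4:mem=2048mb

cd ${PBS_O_WORKDIR}
exec > ${PBS_O_WORKDIR}/job_${PBS_JOBID}.log 2>&1

./mycode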

Important advice and common mistakes

Use queue 'large' rather than 'main'

You should mainly use the 'large' queue: that is where the majority of the compute nodes, and the newest ones, are located. The 'main' queue should be used for tests and smaller jobs.

Do not run jobs on the frontend

Make sure to submit your jobs to PBS properly. If they end up running on the frontend, they will be killed. The frontend is only there to submit and manage jobs, handle files, and compile your code.

Use 24 CPUs per node on queue 'large'

As nodes on the 'large' queue are allocated exclusively to one job, it is best to use all 24 cores of each node, since any cores you leave idle cannot be used by other jobs.

Stay as close as possible to 2625 MB/core on queue 'large'

If your jobs scale properly, choose the parameters so as to use at most 2625MB per CPU to ensure optimal usage of the compute nodes.

Use job arrays when you have many similar jobs

When you have a large number of jobs doing nearly identical computations, you should use the job array capabilities of PBS. See Chapter 9 of the PBSPro manual.
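As an illustration, a PBS job array of 100 sub-jobs, each processing its own input file (the file naming and executable are placeholders), can be written as:

#!/bin/bash
#PBS -q main
#PBS -J 1-100
#PBS -l select=1:ncpus=1:mem=1024mb

cd ${PBS_O_WORKDIR}
./mycode input_${PBS_ARRAY_INDEX}.dat

PBS sets PBS_ARRAY_INDEX to the index of each sub-job, so a single script and a single qsub cover all 100 computations.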

Make sure to use your canonical email address in #PBS -M

Alias email addresses may not work with PBS, so make sure to give only your main email address with the -M PBS parameter. Otherwise, your email will most probably not be delivered to you, and the system administrator will receive an error email instead.
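For instance (the address is a placeholder; the -m abe option asks for a mail when the job begins, ends, or is aborted):

#PBS -M firstname.lastname@uni.be
#PBS -m abe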

Be concerned with how much memory you request

If you request more memory than you really need, you prevent other users from using CPUs that could be used if you had estimated the memory requirements of your job properly. Your jobs are then scheduled later than when they actually could run because the scheduler is waiting for the resources you have reserved to become available. The larger the resource, the longer the average waiting time will be. The whole throughput of jobs is slower than it could be, and the (costly) resources are wasted.

To estimate how much memory your job needs, you can run test jobs and connect with SSH to the compute node allocated to your job (which you can find with qstat -f JOBID | grep exec_host) and use the top command for instance to get that information. If you notice you have requested too much memory, you can also reduce it with the qalter command.
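An illustrative session could look like this (job ids, node name, and memory value are made up): first locate the node of a running test job (123456) and inspect it with top, then lower the memory request of a similar job still waiting in the queue (123457) with qalter:

$ qstat -f 123456 | grep exec_host
    exec_host = node0042/0*24
$ ssh node0042
$ top -u $USER
$ qalter -l select=1:ncpus=24:mem=48000mb 123457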

On queue 'large', the optimal memory usage is 2625 MB per core. Please use this value as upper bound on your memory request.

Make some effort estimating the running time of your jobs

If you request more time than you really need, you prevent other users from using CPUs that could be used if you had estimated the time requirements of your job properly. Your jobs are then scheduled later than they actually could run because the scheduler does not consider them for backfilling. The longer the running time, the longer the average waiting time will be. The whole throughput of jobs is slower than it could be, and the (costly) resources are wasted.

To estimate how much time your job needs, you can run test jobs and have a look at the information sent by email by the system at the end of the job. If you notice you have requested too much time, you can also reduce it with the qalter command.
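For instance, to lower the requested walltime of a job that is still in the queue (the job id is illustrative):

$ qalter -l walltime=06:00:00 123457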

On queue 'large', the maximum allowed running time is 24 hours. Please favor underestimating the running time (and use checkpointing) over overestimating it.

Do not use SSH to launch jobs on the compute nodes

All compute processes should be managed by PBS, so that PBS is able to manage them if needed. SSH to the compute nodes should only be used to monitor your jobs. Launching processes manually with SSH can leave many ghost processes (processes that do not belong to any PBS job) that must be cleaned up manually by the sysadmins.

Do not use qstat too much

The qstat command imposes some load on the scheduler, and users who run watch to monitor the output of qstat every second put such a large load on the scheduler that the proper scheduling of jobs is affected. Running qstat repeatedly to see when your job starts actually delays the start of your job...

Reservations

Reservations to meet deadlines or to run debugging jobs can be requested by email to ceci-logist@lists.ulg.ac.be.

The following rules apply:

  • maximum 311040 core-hours per reservation (e.g. 4320 cores for 3 days)
  • maximum wall time of 10 days per reservation
  • maximum 4320 cores reserved at any time

Reservations are granted so as to best organize the load of the machine and be fair to all CÉCI users. See also https://tier1.cenaero.be/en/reservations.

More information ...

© CÉCI.