Clusters at CÉCI

The aim of the Consortium is to provide researchers with access to powerful computing equipment (clusters). Clusters are installed and managed locally at the different sites of the universities taking part in the Consortium, but they are accessible by all researchers from the member universities. A single login/passphrase is used to access all clusters through SSH.
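
For instance, the connection with the CÉCI key (the id_rsa.ceci file mentioned in the per-cluster sections below) can be simplified with an entry in ~/.ssh/config. This is only a sketch: the host alias and login are placeholders, and the actual hostnames are given per cluster further down.

    # ~/.ssh/config -- sketch; replace the alias, hostname and login as appropriate
    Host nic4
        Hostname nic4.segi.ulg.ac.be
        User my_ceci_login               # placeholder for your CECI login
        IdentityFile ~/.ssh/id_rsa.ceci  # the CECI private key file

    # then simply:  ssh nic4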

All of them run Linux, and use Slurm as the job manager. Basic parallel computing libraries (OpenMP, MPI, etc.) are installed, as well as optimized computing subroutines (e.g. BLAS, LAPACK). Common interpreters such as R, Octave, Python, etc. are also installed. See each cluster's FAQ for more details.
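
As a minimal sketch of what job submission with Slurm looks like (the resource values, module name and script name below are illustrative assumptions; consult each cluster's FAQ for the exact environment):

    #!/bin/bash
    #SBATCH --job-name=serial_test
    #SBATCH --ntasks=1                # a single sequential task
    #SBATCH --time=01:00:00           # one hour of walltime
    #SBATCH --mem-per-cpu=2048        # memory in MB

    # Load an interpreter or library if needed; module names differ per cluster.
    # module load Python

    python my_script.py               # hypothetical user script

Such a script is then submitted with sbatch, and pending or running jobs can be inspected with squeue.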

Cluster   | Host   | CPU type                                | CPU count*               | RAM/node    | Network | Filesystem**  | Accelerator     | Max time | Preferred jobs***
NIC4      | ULg    | SandyBridge 2.0 GHz, IvyBridge 2.0 GHz  | 2048 (120 x 16 + 8 x 16) | 64 GB       | QDR Ib  | FHGFS 144 TB  | None            | 2 days   | MPI
Vega      | ULB    | Bulldozer 2.1 GHz                       | 2752 (43 x 64)           | 256 GB      | QDR Ib  | GPFS 70 TB    | 2x Tesla M2090  | 14 days  | serial / SMP / MPI
Hercules  | UNamur | SandyBridge 2.20 GHz, Westmere 2.66 GHz | 896 (32 x 16 + 32 x 12)  | 36..128 GB  | GbE     | NFS 20 TB     | 3x Tesla S2050  | 63 days  | serial / SMP
Dragon1   | UMons  | SandyBridge 2.60 GHz                    | 416 (26 x 16)            | 128 GB      | GbE     | RAID0 1.1 TB  | 4x Tesla C2075  | 21 days  | serial / SMP
Lemaitre2 | UCL    | Westmere 2.53 GHz                       | 1380 (115 x 12)          | 48 GB       | QDR Ib  | Lustre 120 TB | 3x Quadro Q4000 | 3 days   | MPI
Hmem      | UCL    | MagnyCours 2.2 GHz                      | 816 (17 x 48)            | 128..512 GB | QDR Ib  | FHGFS 30 TB   | None            | 15 days  | SMP

The Consortium also provides its users with access to Tier-1 facilities that are not operated by the universities.

Cluster | Host    | CPU type          | CPU count*      | RAM/node   | Network         | Filesystem** | Accelerator | Max time | Preferred jobs***
Zenobe  | Cenaero | Haswell 2.50 GHz  | 5760 (240 x 24) | 64..256 GB | QDR Ib          | GPFS 350 TB  | t.b.a.      | 24 hours | MPI
        |         | IvyBridge 2.7 GHz | 8208 (342 x 24) | 64 GB      | FDR Ib + QDR Ib |              |             |          |
* In this context, a CPU is to be understood as a core or a hardware thread | count = #nodes x CPU/node
** Filesystem = global scratch space (other than /home) | RAID is a filesystem local to the nodes
*** SMP = all processes/threads on the same node | MPI = multi-node

Visual representation of the clusters' features.

The CÉCI clusters have been designed to accommodate the large diversity of workloads and needs of the researchers from the five universities.

On one end of the spectrum is the sequential workload (Northern part of the figure). That type of workload needs very fast CPUs and often a large maximum job time, which in turn requires limiting the number of jobs a user can run simultaneously to ensure fair sharing of the cluster.

On the other end of the spectrum is the massively parallel workload (Southern part of the figure). For such workloads, individual core performance is less crucial, as long as many cores are available. A job is allowed to use a very large number of CPUs at a time, but only for a limited period of time, to ensure fair sharing of the cluster. The maximum walltime must be even shorter for researchers engaged in development activities (to keep waiting times in the queue to a minimum), while those mainly concerned with production will prefer somewhat larger maximum times (to avoid the unnecessary overhead of the checkpointing tools needed when maximum times are short). Parallel workloads also, of course, require a fast, low-latency network.
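
As an illustration, a production MPI job on such a cluster would request many tasks for a bounded walltime; the values and binary name below are assumptions:

    #!/bin/bash
    #SBATCH --job-name=mpi_production
    #SBATCH --ntasks=64               # 64 MPI ranks, possibly spread over several nodes
    #SBATCH --time=2-00:00:00         # 2 days; must stay below the cluster's max time
    #SBATCH --mem-per-cpu=2048        # memory per rank, in MB

    # srun launches one MPI rank per allocated task
    srun ./my_mpi_app                 # hypothetical MPI binary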

Some of the 'parallel' clusters are made of fat nodes (South-West), meaning that the number of cores per node is large (e.g. 48 or even 64), while others rely on a large number of smaller (thin) nodes (North-East). Fat nodes are more suitable for shared-memory work, for instance with OpenMP or pthreads, and they can host jobs with very large shared-memory requirements (up to half a terabyte of RAM). By contrast, thin nodes require the use of message-passing libraries such as MPI or PVM. They offer a better "network bandwidth vs. number of cores" ratio, which makes them more suitable for jobs issuing lots of I/O operations, e.g. jobs that put a heavy load on the centralized scratch filesystem.
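
A shared-memory job targeting a fat node would instead keep all threads on a single node; again, a sketch with assumed values:

    #!/bin/bash
    #SBATCH --job-name=smp_job
    #SBATCH --nodes=1                 # SMP: everything on one (fat) node
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=48        # e.g. one thread per core of a 48-core node
    #SBATCH --time=1-00:00:00

    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
    ./my_openmp_app                   # hypothetical OpenMP binary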

The clusters have been installed gradually since early 2011, first at UCL, with HMEM being a proof of concept. At that time, the whole account infrastructure was designed and deployed so that every researcher from any university was able to create an account and log in to HMEM. Then, LEMAITRE2 was set up as the first cluster entirely funded by the F.N.R.S. for the CÉCI. DRAGON1, HERCULES, VEGA and NIC4 have followed, in that order, as shown in the timeline below.

Thanks to a private, dedicated, 10 Gbps network connecting all CÉCI sites, all the CÉCI clusters share a common storage space in addition to their local spaces. That CÉCI shared storage is based on two main storage systems hosted in Liège and Louvain-la-Neuve. Those storage systems are synchronously replicated, meaning that any file written to one of them is automatically written to the other. They are connected, through the dedicated 10 Gbps network, to five smaller storage systems that serve as buffers/caches. Those caches are located on each site and are tightly connected to the cluster compute nodes.

NIC4

Hosted at the University of Liège (SEGI facility), it features 128 compute nodes with two 8-core Intel E5-2650 processors at 2.0 GHz and 64 GB of RAM (4 GB/core), interconnected with a QDR Infiniband network, and having exclusive access to a fast 144 TB FHGFS parallel filesystem.

Suitable for:

Massively parallel jobs (MPI, several dozens of cores) with many communications and/or a lot of parallel disk I/O, 2 days max.

Resources

  • Home directory (20GB quota per user)
  • Working directory /scratch ($GLOBALSCRATCH)
  • Nodes have access to internet
  • Default queue* (2 days, 448 cores max per user, 64 jobs max per user, among which max 32 running, 256 CPUs max per job)
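
A typical pattern, given the quotas above, is to run from the parallel /scratch space rather than from the home directory; the directory and file names below are placeholders:

    # Prepare a run directory on the global scratch and submit from there
    mkdir -p $GLOBALSCRATCH/my_run
    cp ~/input.dat $GLOBALSCRATCH/my_run/
    cd $GLOBALSCRATCH/my_run
    sbatch ~/jobs/mpi_job.sh          # hypothetical job script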

Access/Support:

SSH to nic4.segi.ulg.ac.be (port 22) with the appropriate login and id_rsa.ceci file.

FAQ: http://www.ulg.ac.be/nic4

SUPPORT: nicadm@segi.ulg.ac.be

Server SSH key fingerprint:
MD5: 25:17:ae:23:ac:35:65:e7:11:c2:78:a7:b8:76:44:e0
SHA256: HGoO1ycMf16AwaZDJY3WQdON7wtAD0m4qZq6IfvUQeQ
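
To compare these values with what the server actually presents, a recent OpenSSH client can be used as follows:

    # Fingerprint of the host key offered by the server (SHA256 by default)
    ssh-keyscan nic4.segi.ulg.ac.be 2>/dev/null | ssh-keygen -lf -
    # Add '-E md5' to ssh-keygen to obtain the MD5 form instead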

VEGA

Hosted at the University of Brussels, it features 44 fat compute nodes with 64 cores each (four 16-core AMD Bulldozer 6272 processors at 2.1 GHz) and 256 GB of RAM, interconnected with a QDR Infiniband network, and 70 TB of high-performance GPFS storage.

Suitable for:

Many-core jobs (SMP and MPI) and large numbers of single-core jobs, 14 days max.

Resources

  • Home/Working directory /home ($GLOBALSCRATCH=$HOME)
  • Nodes have access to internet
  • Def queue* (Max 7 days, 1024 cpus/user, 150 jobs/user)
  • Generic resource* : gpu
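
A job using the gpu generic resource listed above could be requested as follows (the GPU count, walltime and binary name are assumptions):

    #!/bin/bash
    #SBATCH --job-name=gpu_test
    #SBATCH --ntasks=1
    #SBATCH --gres=gpu:1              # one GPU through the 'gpu' generic resource
    #SBATCH --time=12:00:00

    ./my_cuda_app                     # hypothetical GPU-enabled binary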

Access/Support:

SSH to vega.ulb.ac.be (port 22) with the appropriate login and id_rsa.ceci file.

SUPPORT: http://vega.ulb.ac.be or ceci-support@ulb.ac.be

Server SSH key fingerprint:
MD5: e5:67:3d:0e:e1:1b:01:7b:48:de:bb:42:9c:76:d8:9a
SHA256: u6qRmUvO/dzdAFUJz6hD27wJTPufBQPQLpe8LRCO+bA

HERCULES

Hosted at the University of Namur, this system currently consists of approximately 900 cores spread across 65 compute nodes. It mainly comprises 32 Intel Sandy Bridge nodes, each with two 8-core E5-2660 processors at 2.2 GHz and 64 or 128 GB (8 nodes) of RAM, and 32 Intel Westmere compute nodes, each with two 6-core X5650 processors at 2.66 GHz and 36 GB, 72 GB (5 nodes) or 24 GB (5 nodes) of RAM. All the nodes are interconnected by a Gigabit Ethernet network and have access to three NFS file systems for a total capacity of 98 TB.

Suitable for:

Long (max. 63 days) shared-memory parallel jobs (OpenMP or Pthreads), or resource-intensive sequential jobs.

Resources

  • Home directory (200 GB quota per user; check usage with hc_diskquota)
  • Working directory /workdir ($WORKDIR, 400 GB quota per user; check usage with hc_diskquota)
  • Local working directory /scratch ($TMPDIR) dynamically defined in jobs
  • No internet access from nodes
  • cpu queue* (Max 63 days, 48 cpus/user)
  • gpu queue* (Max 63 days, 48 cpus/user)
  • Generic resource*: gpu
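
The quotas above can be checked with the hc_diskquota tool mentioned in the list, and computations are best run from the working directory; the project directory name below is a placeholder:

    # Check /home and /workdir usage against the quotas
    hc_diskquota

    # Work from $WORKDIR rather than from the home directory
    mkdir -p $WORKDIR/my_project
    cd $WORKDIR/my_project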

Access/Support:

SSH to hercules.ptci.unamur.be (port 22) with the appropriate login and id_rsa.ceci file.

FAQ: https://www.ptci.unamur.be

SUPPORT: support.ptci@unamur.be

Server SSH key fingerprint:
Either
MD5: 66:50:e1:67:91:d8:17:1e:b7:be:48:00:e2:2c:7a:9f
SHA256: SyLaaBe7CuO7Dpa6vJa0vbAUxnYSpl30xaJo5yBF//c
or
MD5: 8c:09:1a:10:ad:32:87:af:82:52:33:0f:03:d1:5e:d2
SHA256: LzByp8XBhpgy+2lB1DZcpieYUCSq8FEfLBLPm+WB8xg

DRAGON1

Hosted at the University of Mons, this cluster is made of 26 compute nodes, each with two 8-core Intel Sandy Bridge E5-2670 processors at 2.6 GHz, 128 GB of RAM and 1.1 TB of local scratch disk space. The compute nodes are interconnected with a Gigabit Ethernet network (10 Gigabit for the 36 TB NFS file server). Two additional nodes each have two high-end NVIDIA Tesla C2075 GPUs (448 CUDA cores / 6 GB GDDR5 / 515 Gflops double precision).

Suitable for:

Long (max. 21 days) shared-memory parallel jobs (OpenMP or Pthreads), or resource-intensive (cpu speed and memory) sequential jobs.

Resources

  • Home directory (20GB quota per user)
  • Local working directory /scratch ($LOCALSCRATCH)
  • No internet access from nodes
  • Long queue* (Max 21 days, 40 cpus/user, 500 jobs/user)
  • Def queue* (Max 5 days, 40 cpus/user, 500 jobs/user)
  • Generic resource*: gpu
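
Since the scratch space is local to each node, a common pattern is to stage data in and copy results back inside the job script; a sketch with placeholder file names:

    #!/bin/bash
    #SBATCH --job-name=local_io
    #SBATCH --ntasks=1
    #SBATCH --time=1-00:00:00

    # Stage input onto the node-local scratch, run there, retrieve the results
    cp ~/input.dat $LOCALSCRATCH/
    cd $LOCALSCRATCH
    ./my_app input.dat > output.dat   # hypothetical application
    cp output.dat ~/results/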

Access/Support:

SSH to dragon1.umons.ac.be (port 22) with the appropriate login and id_rsa.ceci file.

FAQ: http://dragon1.umons.ac.be/

SUPPORT: either sebastien.kozlowskyj@umons.ac.be or alain.buys@umons.ac.be

Server SSH key fingerprint:
MD5: 9a:e8:e8:56:57:80:87:05:0d:55:c7:b8:5b:ba:48:b5
SHA256: EsGHFYSG2g1a0FzCaAohAbp859f3R++QtwEeeg4Zp4w

LEMAITRE2

Hosted at the Université catholique de Louvain, it comprises 112 compute nodes with two 6-core Intel E5649 processors at 2.53 GHz and 48 GB of RAM (4 GB/core). The cluster has exclusive access to a fast 120 TB Lustre parallel filesystem. All compute nodes and management servers (NFS, Lustre, frontend, etc.) are interconnected with a fast QDR Infiniband network.

Suitable for:

Massively parallel jobs (MPI, several dozens of cores) with many communications and/or a lot of parallel disk I/O, 3 days max.

Resources

  • Home directory (50GB quota per user)
  • Working directory /scratch ($GLOBALSCRATCH)
  • Nodes have access to internet
  • Default queue* (3 days, max 50 running jobs/user)
  • PostP queue* with GPUs (6 hours)
  • Generic resource*: gpu
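
To stay within the 50-running-jobs limit when submitting many similar jobs, a throttled job array can be used (provided the installed Slurm version supports the % syntax); the indices, walltime and application below are assumptions:

    #!/bin/bash
    #SBATCH --job-name=param_sweep
    #SBATCH --ntasks=1
    #SBATCH --time=06:00:00
    #SBATCH --array=1-200%50          # 200 tasks, at most 50 running at once

    ./my_app --index ${SLURM_ARRAY_TASK_ID}   # hypothetical application and option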

Access/Support:

SSH to lemaitre2.cism.ucl.ac.be (port 22) with the appropriate login and id_rsa.ceci file.

Documentation: http://www.cism.ucl.ac.be/doc

SUPPORT: egs-cism@listes.uclouvain.be

Server SSH key fingerprint:
MD5: 9c:36:f7:cc:15:66:b3:95:c2:f3:9b:42:20:b1:a4:6d
SHA256: TUgon9zeJVGcNZe76OAxYoHoakyofkdqeYf0GOEJOYA

HMEM

Hosted at the Université catholique de Louvain, it mainly comprises 17 fat nodes with 48 cores each (four 12-core AMD Opteron 6174 processors at 2.2 GHz). Two nodes have 512 GB of RAM, 7 nodes have 256 GB and 7 nodes have 128 GB. All the nodes are interconnected with a fast QDR Infiniband network and have a 1.7 TB fast RAID setup for scratch disk space. All the local disks are furthermore gathered in a global 31 TB Fraunhofer filesystem (FHGFS).

Suitable for:

Large shared-memory jobs (100+GB of RAM and 24+ cores), 5 days max.

Resources

  • Home directory (50GB quota per user)
  • Working directory /globalfs ($GLOBALSCRATCH)
  • Local working directory /scratch ($LOCALSCRATCH)
  • Nodes have access to internet
  • Low, Middle, High queues* (15 days max, 40 running jobs per user max)
  • Fast queue* (24 hours, no access to $GLOBALSCRATCH)
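
A large shared-memory job on those fat nodes would explicitly request the memory it needs, so that it lands on a node with enough RAM; the values below are assumptions:

    #!/bin/bash
    #SBATCH --job-name=bigmem
    #SBATCH --nodes=1                 # keep all threads on one fat node
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=24
    #SBATCH --mem=200000              # ~200 GB, in MB: only the 256/512 GB nodes qualify
    #SBATCH --time=5-00:00:00

    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
    ./my_large_memory_app             # hypothetical shared-memory binary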

Access/Support:

SSH to hmem.cism.ucl.ac.be (port 22) with the appropriate login and id_rsa.ceci file.

Documentation: http://www.cism.ucl.ac.be/doc

SUPPORT: egs-cism@listes.uclouvain.be

Server SSH key fingerprint:
MD5: 06:54:39:a0:5c:b5:56:b3:29:9e:96:67:a0:4a:c1:ff
SHA256: Xi4r0aNViNgg9KjnENiUFkEWPwnJGAjbknlX+m7CIm0

ZENOBE

Hosted at, and operated by, Cenaero, it features a total of 13,536 cores (Haswell and IvyBridge) with up to 64 GB of RAM, interconnected with a mixed QDR/FDR Infiniband network, and having access to a fast 350 TB GPFS parallel filesystem.

Suitable for:

Massively parallel jobs (MPI, several hundreds of cores) with many communications and/or a lot of parallel disk I/O, 1 day max.

Resources

  • Home directory (50 GB quota per user)
  • Working directory /SCRATCH
  • Project directory /projects
  • Large queue (1 day max walltime, 96 CPUs minimum and 4320 CPUs maximum per job, whole-node allocation)
  • Default queue (no time limit but jobs must be restartable)

Access/Support:

SSH to the gateway hpc.cenaero.be (port 22) with the appropriate login and id_rsa.ceci file. From there, SSH to zenobe.
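
The two-hop connection can be automated with a ProxyCommand entry in ~/.ssh/config; the alias and login below are placeholders:

    # ~/.ssh/config -- connect to zenobe through the hpc.cenaero.be gateway
    Host zenobe
        Hostname zenobe
        User my_ceci_login            # placeholder for your CECI login
        IdentityFile ~/.ssh/id_rsa.ceci
        ProxyCommand ssh -i ~/.ssh/id_rsa.ceci -W %h:%p my_ceci_login@hpc.cenaero.be

    # then simply:  ssh zenobe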

QUICKSTART: www.ceci-hpc.be/zenobe.html
DOC: tier1.cenaero.be/en/faq-page
ABOUT: tier1.cenaero.be

SUPPORT: it@cenaero.be

Gateway server SSH key fingerprint:
Either
MD5: c8:b8:14:e0:b7:cd:f7:01:88:f3:0b:af:7c:2a:1d:15
SHA256: pX/VUcbicyMTZWH0ph+XTjiaKNldovIyDaZsXFmhaCs
or
MD5: 10:17:41:a3:f3:87:a2:17:66:91:a9:af:d9:b1:cc:12
SHA256: OaIhBXyLQKvt3rNWnieDy2GtiE/LM3sCn5AowEx9Gh0

© CÉCI.