
High Performance Computing, Semester II 2025
Special Workshop: Cluster Installation
Sep 12, 2025
Prof. Francisco Hidrobo
Group: Edwin Quizpe, Lander Lliguicota, Wilfrido Idrovo

Steps for installing the cluster

Prerequisites, equipment, and materials


• 5 PCs, each with 8–16 GB RAM (optional: NVIDIA GPUs in some compute nodes)
• Network cable (Cat 6) and connectors
• Ethernet switch
• Router
• Pick one “admin” user for everyone, e.g., hpc (UID/GID consistent across nodes).

Note: our group ran this lab from the c01 node.
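If the installer-created users end up with different UIDs/GIDs across nodes, the admin user can be created (or fixed) explicitly. A minimal sketch, run on every node; the UID/GID value 1001 is an assumption, pick one value and keep it identical everywhere:
sudo groupadd -g 1001 hpc
sudo useradd -m -u 1001 -g 1001 -s /bin/bash hpc
sudo passwd hpc
sudo usermod -aG sudo hpc   # optional: give the admin user sudo rights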

1) Install Ubuntu 22.04 (all nodes)


On each node:
1. Install Ubuntu Server 22.04 LTS (minimal)
2. Create the same user on all nodes (e.g., hpc) with the same password.
3. Set hostname appropriately during install (e.g., head, c01, …).
4. Configure static IPs (or set later via Netplan).

After the first boot on each node, run:


sudo apt update && sudo apt -y full-upgrade
sudo apt -y install build-essential git cmake htop tmux bash-completion wget curl vim \
  net-tools iproute2 openssh-server nfs-common chrony

# Optional: enable vi key bindings in bash system-wide
echo 'set -o vi' | sudo tee -a /etc/bash.bashrc >/dev/null
(Optional) Netplan static IP example
On each node, edit /etc/netplan/*.yaml (adjust the NIC name and IPs):

network:
  version: 2
  ethernets:
    enp3s0:
      dhcp4: no
      addresses: [[Link]/24]   # head; use .11, .12, .13, .14 on c01..c04
      gateway4: [Link]          # set if this LAN has internet access
      nameservers:
        addresses: [[Link], [Link]]

Apply: sudo netplan apply.
If the nodes were installed with cloud-init, disable its network management so the Netplan changes persist, then reboot:
sudo nano /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
network: {config: disabled}
sudo reboot

/etc/hosts (all nodes)


Add the cluster host mappings on every node:
[Link] head
[Link] c01
[Link] c02
[Link] c03
[Link] c04
Time sync (all nodes)
Keep clocks aligned (critical for MUNGE/SLURM):
sudo sed -i 's/^pool .*/pool [Link] iburst/' /etc/chrony/chrony.conf
sudo systemctl enable --now chrony
chronyc sources -v

2) Passwordless SSH (from head → compute nodes)


On head (as hpc):
ssh-keygen -t ed25519 # press enter to accept defaults
ssh-copy-id hpc@c01
ssh-copy-id hpc@c02
ssh-copy-id hpc@c03
ssh-copy-id hpc@c04
Test: ssh c01 'hostname && whoami'.
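To check all compute nodes in one pass, a small loop in the same style used later in this guide works:
for n in c01 c02 c03 c04; do ssh $n 'hostname && whoami'; done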

3) NFS: shared home/apps (head exports, compute mounts)


On head:
sudo apt -y install nfs-kernel-server
# Create shares
sudo mkdir -p /home # already exists; ensure enough space
sudo mkdir -p /opt/apps /scratch
sudo chown -R root:root /opt/apps /scratch
sudo chmod 1777 /scratch
Edit /etc/exports (on head):
/home [Link]/24(rw,sync,no_subtree_check,no_root_squash)
#/opt/apps [Link]/24(rw,sync,no_subtree_check,no_root_squash)
#/scratch [Link]/24(rw,sync,no_subtree_check,no_root_squash)
Export & enable:
sudo exportfs -ra
sudo systemctl enable --now nfs-server
On each compute node add mounts to /etc/fstab:
head:/home /home nfs defaults,_netdev 0 0
#head:/opt/apps /opt/apps nfs defaults,_netdev 0 0
#head:/scratch /scratch nfs defaults,_netdev 0 0
Mount now:
sudo systemctl daemon-reload
sudo mount -a
Test: on a compute node, create a file in your home and see it on head.
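A concrete version of that check (the file name is just an example):
touch ~/nfs_test_from_c01     # on c01
ls -l ~/nfs_test_from_c01     # on head: the file should appear in the shared /home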
Tip (workshop resilience): if you worry about NFS hiccups, you can mount /home with
nofail,x-systemd.automount,_netdev to avoid boot stalls.
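A sketch of such an fstab line, matching the head:/home export used above:
head:/home /home nfs defaults,nofail,x-systemd.automount,_netdev 0 0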

Illustration 1: Creating files and directories example

4) MUNGE (auth for SLURM)


Install on all nodes:
sudo apt -y install munge libmunge-dev
On head, generate key and distribute:
sudo /usr/sbin/create-munge-key
# permissions are set by the tool: /etc/munge/munge.key (600, owner munge)

# copy key securely to each compute node
for n in c01 c02 c03 c04; do
  sudo cp /etc/munge/munge.key /tmp/
  sudo chmod 644 /tmp/munge.key   # readable so hpc can scp it; tightened again on the destination
  scp /tmp/munge.key $n:/tmp/
  ssh -t $n "sudo mv /tmp/munge.key /etc/munge/munge.key && \
             sudo chown munge:munge /etc/munge/munge.key && \
             sudo chmod 600 /etc/munge/munge.key"
done
sudo rm -f /tmp/munge.key   # clean up the temporary copy
Enable on all nodes:
sudo systemctl enable --now munge
munge -n | unmunge | head
The unmunge output should show a valid credential.
Illustration 2: MUNGE service active

Illustration 3: Successful MUNGE test (credential validation)

5) SLURM (controller on head, daemons on compute)


Install packages:
• On head: sudo apt -y install slurmctld slurm-client
• On each compute node: sudo apt -y install slurmd slurm-client
5.1 Generate a base slurm.conf
On head, install helper and create config:
sudo apt -y install slurm-wlm-basic-plugins # often pulled in already
# Determine resources on each node:
for n in head c01 c02 c03 c04; do
  echo "=== $n ==="
  ssh -t $n "hostname; lscpu | egrep 'Model name|Socket|Core|Thread'; \
             free -m | awk '/Mem:/{print \"MemMB:\", \$2}'"
done
Use the counts you see (cores, threads, MemMB) to fill the template below.
Create /etc/slurm/slurm.conf on head (then copy to compute nodes):
# /etc/slurm/slurm.conf
ClusterName=hpcworkshop
SlurmctldHost=head
MpiDefault=pmix
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# Logging
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log

# Auth
AuthType=auth/munge
CryptoType=crypto/munge

# Accounting (simple; skip SlurmDBD)


AccountingStorageType=accounting_storage/none
JobAcctGatherType=jobacct_gather/linux

# Schedulers
SchedulerType=sched/backfill
SlurmctldTimeout=120
SlurmdTimeout=300

# Node definitions (edit CPUs and RealMemory for your hardware)


NodeName=c01 CPUs=8 RealMemory=15000
NodeName=c02 CPUs=8 RealMemory=15000
NodeName=c03 CPUs=8 RealMemory=15000
NodeName=c04 CPUs=8 RealMemory=15000
PartitionName=main Nodes=c01,c02,c03,c04 Default=YES MaxTime=2-00:00 State=UP
Replace CPUs with the usable logical cores per node; set RealMemory to roughly 90–95% of the MemMB value reported by free -m.
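For example, a node like our c01 (1 socket, 4 cores, 2 threads per core, roughly 7800 MB of RAM according to the scontrol output shown later) could also be described more explicitly; the RealMemory figure here is an assumption following the 90–95% rule:
NodeName=c01 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=7200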
Create /etc/slurm/cgroup.conf on all nodes:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
Distribute configs from head:
for n in c01 c02 c03 c04; do
  ssh -t $n "sudo mkdir -p /etc/slurm && mkdir -p /tmp/slurm"
  scp /etc/slurm/slurm.conf $n:/tmp/slurm/
  scp /etc/slurm/cgroup.conf $n:/tmp/slurm/
  ssh -t $n "sudo mv /tmp/slurm/* /etc/slurm/"
done
Enable services:
• On head:
sudo systemctl enable --now slurmctld
sudo systemctl status slurmctld --no-pager
• On compute nodes:
for n in c01 c02 c03 c04; do
  ssh -t $n "sudo systemctl enable --now slurmd && sudo systemctl status slurmd --no-pager"
done
Basic checks:

Service status:

Illustration 4: SLURM service ON

Run sinfo:

Illustration 5: sinfo capture


scontrol show nodes
NodeName=c01 Arch=x86_64 CoresPerSocket=4
CPUAlloc=0 CPUTot=8 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=c01 NodeHostName=c01 Version=21.08.5
OS=Linux 5.15.0-153-generic #163-Ubuntu SMP Thu Aug 7 [Link] UTC 2025
RealMemory=7829 AllocMem=0 FreeMem=6767 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=main
BootTime=2025-09-12T[Link] SlurmdStartTime=2025-09-12T[Link]
LastBusyTime=2025-09-12T[Link]
CfgTRES=cpu=8,mem=7829M,billing=8
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Example node information reported by scontrol for one of the compute nodes.

You should see the nodes in the IDLE or UNKNOWN state (UNKNOWN becomes IDLE after the first contact).
6) OpenMPI (all nodes)
sudo apt -y install openmpi-bin libopenmpi-dev
mpirun --version
Quick MPI Hello World:
// hello.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv){
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    char name[MPI_MAX_PROCESSOR_NAME]; int len;
    MPI_Get_processor_name(name, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
Compile in the shared /home:

mpicc hello.c -o hello


Run through SLURM (preferred):
srun -N 2 -n 4 ./hello # 2 nodes, 4 tasks total

In our setup this srun command failed because SLURM was configured for PMIx while the Ubuntu Open MPI packages were built with PMI2 support only (or none), so srun could not bootstrap the MPI ranks. Using salloc with mpirun --mca plm slurm works instead, because Open MPI then handles process launching itself over the allocated nodes.
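To see which PMI interfaces are actually available (a quick diagnostic, not required for the lab):
srun --mpi=list            # PMI plugin types srun can offer
ompi_info | grep -i pmi    # PMI/PMIx support compiled into Open MPI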

Output from c01:

hpc@c01:~/will-lan$ salloc -N 4 -n 4 bash -lc 'mpirun --mca plm slurm --map-by ppr:1:node -np
4 ./hello'
salloc: Granted job allocation 111
Hello from rank 0 of 4 on c01
Hello from rank 1 of 4 on c02
Hello from rank 2 of 4 on c03
Hello from rank 3 of 4 on c04
salloc: Relinquishing job allocation 111
hpc@c01:~/will-lan$

# or:
sbatch <<'EOF'
#!/bin/bash
#SBATCH -J mpitest
#SBATCH -N 2
#SBATCH -n 8
#SBATCH -t [Link]
srun ./hello
EOF

9) Sanity tests (end-to-end)


1. ssh c01 'hostname' (from head) works without password.
2. NFS: touch ~/testfile on c01 appears on head and c02.
3. MUNGE: munge -n | ssh c01 unmunge | head shows a valid credential.
4. SLURM: sinfo, scontrol show nodes, srun -N1 -n2 hostname.
5. MPI job via srun prints messages from multiple nodes.

10) Test

10.1) Test SLURM partition assignment


One-liner (interactive)
srun -p main -N1 -n1 -t 2:00 bash -c 'hostname; scontrol show job $SLURM_JOB_ID'
Minimal batch script
Create part_check.sh:
#!/bin/bash
#SBATCH -J partcheck
#SBATCH -p main
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t [Link]

echo "JobID: $SLURM_JOBID"


echo "Partition: $SLURM_JOB_PARTITION"
echo "Node(s): $SLURM_NODELIST"
echo "CPUs: $SLURM_CPUS_ON_NODE"
hostname
Submit:
sbatch part_check.sh
squeue -u $USER

10.2) Test an MPI program


Code: hello.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv){
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    char name[MPI_MAX_PROCESSOR_NAME]; int len;
    MPI_Get_processor_name(name, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
Compile once (on shared /home):
mpicc hello.c -O2 -o hello
Run via SLURM across two nodes
Create mpi_test.sh:
#!/bin/bash
#SBATCH -J mpihello
#SBATCH -p main
#SBATCH -N 2 # two nodes
#SBATCH -n 8 # 8 MPI ranks total
#SBATCH -t [Link]

srun ./hello
Submit:
sbatch mpi_test.sh
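Once the job finishes, SLURM writes its output to slurm-<jobid>.out in the submission directory by default; something like this can be used to inspect it (replace <jobid> with the ID printed by sbatch):
squeue -u $USER
cat slurm-<jobid>.out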

Troubleshooting quick hits


• Nodes not showing in sinfo: check sudo systemctl status slurmd on the compute nodes; verify that /etc/slurm/slurm.conf matches the head node's copy and that clocks are in sync (chronyc tracking).
• slurmctld won't start: run sudo tail -n 100 /var/log/slurm/slurmctld.log; typical issues are a bad NodeName stanza or wrong hostnames.
• MUNGE failures: ensure the same /etc/munge/munge.key on ALL nodes, chmod 600, owner munge:munge, and synced clocks.
• NFS hangs: check showmount -e head and sudo exportfs -v. Try _netdev,x-systemd.automount in /etc/fstab.
• MPI can't launch on multiple nodes: prefer srun over raw mpirun (it lets SLURM handle the allocation). Example: srun -N2 -n8 ./hello.
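If a node stays drained or down after the underlying problem is fixed, it can usually be returned to service from the head node with a standard SLURM admin command, e.g.:
sudo scontrol update NodeName=c01 State=RESUME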
