SLURM-HPC

SLURM is a resource manager and job scheduler for Linux clusters, designed to execute parallel jobs, allocate resources, and manage job scheduling using complex algorithms. It is open-source, fault-tolerant, and highly scalable, with a wide range of plugins for various functionalities. Key commands include sbatch for job submission, salloc for job allocation, and sinfo for system status reporting.

Uploaded by

Patron Sane
Copyright
© All Rights Reserved

HPC

SLURM
Resource manager and job scheduler
LEARNING

DR@B. DIOP
Role of resource manager

Execute parallel jobs
Allocate resources within a cluster
Launch and manage jobs
Schedule work by managing queues with complex scheduling algorithms
What is SLURM

Simple Linux Utility for Resource Management
Started in 2002 as a simple resource manager for Linux clusters
Used on many of the world's largest computers
More than 500,000 lines of code today

Small and simple
Open source (GPL v2)
Fault tolerant and secure
Portable
System administrator friendly
Highly scalable

No kernel modifications required
Written in C
Skeleton functionality can be extended with plugins
Various system-specific plugins available
Plugins

70 plugins
Storage: MySQL, PostgreSQL
Network topology: 3D-torus, tree
MPI: Open MPI, MPICH1, MVAPICH, MPICH2, and others
Plugin development

Job submit plugin
Called for each job submission or modification
Can be used to set default values
Two functions:
job_submit()
job_modify()
SLURM design and architecture

Cluster architecture

Daemons

slurmctld: central controller
slurmd: compute node daemon
slurmdbd: database daemon
Exercise: describe in detail the role of each of these daemons
Daemon command line options

-c: clear previous state
-D: run in foreground
-v: verbose (repeat for more detail)
Example:
slurmctld -Dcvvvv
slurmd -Dcvvv
Compute node config

Execute slurmd with the -C option to print the node's current configuration and exit
The output can be used as input to the SLURM configuration file
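
A minimal sketch of that workflow, assuming slurmd is installed on the node. The helper name node_config_line and the output file nodes.conf.snippet are hypothetical. slurmd -C prints a NodeName=... line followed by an UpTime=... line, and only the first one is meant for the configuration file.

```shell
# Keep only the NodeName=... line from `slurmd -C` output (read on stdin).
node_config_line() {
  grep '^NodeName='
}

if command -v slurmd >/dev/null 2>&1; then
  # Print the detected hardware configuration and save the usable line.
  # nodes.conf.snippet is a hypothetical file name.
  slurmd -C | node_config_line > nodes.conf.snippet
  cat nodes.conf.snippet
else
  echo "slurmd not found; run this on a compute node"
fi
```

The saved line can then be reviewed and pasted into the node section of the SLURM configuration file.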
Shepherd a job step

One slurmstepd per job step
Spawned by slurmd at job step initiation
Manages the job step and handles its I/O
Runs only while the job step is active
SLURM build and configuration

SLURM commands: job/step allocation

sbatch - submit a script for later execution
salloc - create a job allocation and start a shell
srun - create a job allocation (if needed) and launch a job step
sattach - connect stdin/stdout/stderr to an existing job or job step
Job/step allocation examples
Submit a sequence of three batch jobs
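
The original slide's command listing is not available here, so the following is only a sketch of one way to submit such a sequence. It chains the jobs with sbatch --dependency=afterok so that each one starts only after the previous one succeeded. The function name submit_chain and the three script names are hypothetical.

```shell
# Submit each script so that it starts only after the previous job succeeded.
# Prints one job ID per line. Script names are passed as arguments.
submit_chain() {
  prev=""
  for script in "$@"; do
    if [ -z "$prev" ]; then
      id=$(sbatch --parsable "$script") || return 1
    else
      id=$(sbatch --parsable --dependency=afterok:"$prev" "$script") || return 1
    fi
    prev=${id%%;*}   # --parsable may print "jobid;cluster"
    echo "$prev"
  done
}

if command -v sbatch >/dev/null 2>&1; then
  # Hypothetical script names; replace with real batch scripts.
  submit_chain pre_process.sh do_work.sh post_process.sh
else
  echo "sbatch not found; run this on a Slurm login node"
fi
```

With afterok, a failed job leaves its successors pending with an unsatisfied dependency, so the chain stops instead of running on bad input.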
Create allocation for 2 tasks then launch "hostname" on the allocation
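
As a sketch of this case (the slide's own listing is missing), srun can do both parts in one call: it requests the allocation and runs the program once per task. The wrapper name launch_hostname is hypothetical and exists only so the snippet degrades gracefully where Slurm is not installed.

```shell
# srun creates a 2-task allocation, runs `hostname` once per task, then
# releases the allocation. --label prefixes each output line with its task ID.
launch_hostname() {
  srun --ntasks=2 --label hostname
}

if command -v srun >/dev/null 2>&1; then
  launch_hostname
else
  echo "srun not found; run this on a Slurm login node"
fi
```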
Create a 10-minute allocation for 8 tasks and start a bash shell in it
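
A possible form of this request, given as a sketch because the slide's listing is missing. salloc takes the time limit in minutes when given a bare number. The wrapper name start_session is hypothetical.

```shell
# Request 8 tasks for 10 minutes and start bash inside the allocation.
# Commands typed in that shell (for example `srun hostname`) use the
# allocated resources; exiting the shell releases them.
start_session() {
  salloc --ntasks=8 --time=10 bash
}

if command -v salloc >/dev/null 2>&1; then
  start_session
else
  echo "salloc not found; run this on a Slurm login node"
fi
```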
Job execution sequence

1a- srun sends a job allocation request to slurmctld
1b- slurmctld grants the allocation and returns details
2a- srun sends a step create request to slurmctld
2b- slurmctld responds with a step credential
3- srun opens a socket for I/O
4- srun forwards the credential with task info to slurmd
5- slurmd forwards the request as needed
6- slurmd forks/execs slurmstepd
7- slurmstepd connects I/O to srun and launches tasks
8- on task termination, slurmstepd notifies srun
9- srun notifies slurmctld of job termination
10- slurmctld verifies termination of all processes via slurmd and releases resources for the next job
SLURM commands: system information

sinfo - report system status of nodes
squeue - report job and job step status
smap - report system, job, or step status with topology
sview - report and/or update system, job, step, partition, or reservation status with topology
scontrol - admin tool to view/update system, job, step, partition, or reservation status
sinfo command

sinfo - report system status of nodes or partitions
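
One way to turn sinfo output into a quick summary, as a sketch. It assumes the format string '%P %D %t' (partition, node count, compact state) together with -h to suppress the header. The helper name count_by_state is hypothetical.

```shell
# Sum node counts per state from "partition count state" lines on stdin.
# A node that belongs to several partitions is counted once per partition.
count_by_state() {
  awk '{ n[$3] += $2 } END { for (s in n) print s, n[s] }' | sort
}

if command -v sinfo >/dev/null 2>&1; then
  sinfo -h -o '%P %D %t' | count_by_state
else
  echo "sinfo not found; run this on a Slurm cluster"
fi
```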


squeue command

squeue - report status of jobs and job steps in the slurmctld daemon's records
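
A typical query, sketched here as an illustration: list the current user's pending jobs along with the reason the scheduler reports for each. The wrapper name pending_jobs is hypothetical. The format fields are %i (job ID), %j (job name), and %r (reason).

```shell
# List the current user's pending jobs: job ID, name, and the reason
# the scheduler gives for not having started them yet.
pending_jobs() {
  squeue --noheader --user="$(id -un)" --states=PENDING --format='%i %j %r'
}

if command -v squeue >/dev/null 2>&1; then
  pending_jobs
else
  echo "squeue not found; run this on a Slurm cluster"
fi
```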


scontrol command

scontrol - designed for system administrator use
Many fields can be modified
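
As an illustration of viewing and modifying a field, the sketch below shows a job's record and then sets its time limit. The function name show_and_limit and the JOBID environment variable are hypothetical.

```shell
# Show the full record of a job, then set its time limit (in minutes).
# Regular users may only decrease TimeLimit; raising it needs admin rights.
show_and_limit() {
  scontrol show job "$1" &&
  scontrol update JobId="$1" TimeLimit="$2"
}

# JOBID is a hypothetical environment variable holding an existing job ID.
if command -v scontrol >/dev/null 2>&1 && [ -n "${JOBID:-}" ]; then
  show_and_limit "$JOBID" 30
else
  echo "set JOBID and run this on a Slurm cluster"
fi
```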
SLURM commands: accounting

sacct - report accounting information by individual job and job step
sstat - report status of running jobs and job steps, in more detail than sacct
sreport - report resource usage by cluster, partition, user, account, etc.
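
A small sketch of scripting against sacct, assuming accounting storage is configured on the cluster. With -n (no header) and -P (fields separated by "|"), the output is easy to parse. The helper name job_state and the JOBID environment variable are hypothetical.

```shell
# Extract the State column for one job ID from `sacct -n -P` output on stdin.
# Job steps (for example "42.batch") are skipped by the exact ID match.
job_state() {
  awk -F'|' -v id="$1" '$1 == id { print $2 }'
}

# JOBID is a hypothetical environment variable holding a job ID.
if command -v sacct >/dev/null 2>&1 && [ -n "${JOBID:-}" ]; then
  sacct -j "$JOBID" -n -P --format=JobID,State,Elapsed | job_state "$JOBID"
else
  echo "set JOBID and run this where sacct is available"
fi
```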
Scheduling

sacctmgr - database management tool
add/delete clusters, accounts, users
get/set resource limits
sprio - view the factors comprising a job's priority
sshare - view current hierarchical fair-share information
sdiag - view statistics about scheduling module operations
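
The read-only commands above can be combined into a quick overview, sketched below. The function name scheduling_overview is hypothetical. sprio and sshare only give useful output when the multifactor priority plugin and accounting are configured, which is an assumption about the cluster.

```shell
# Print a read-only snapshot of scheduling information.
scheduling_overview() {
  sprio -l     # priority factors of pending jobs, long format
  sshare -a    # fair-share information for all users
  sdiag        # statistics about scheduler operations
}

if command -v sprio >/dev/null 2>&1; then
  scheduling_overview
else
  echo "sprio not found; run this on a Slurm cluster"
fi
```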
Documentation

https://slurm.schedmd.com/documentation.html
