Intro To Slurm
JOB SCHEDULER
WHAT IS SLURM?
Slurm is an open source cluster management and job scheduling system for Linux clusters.
www.slurm.schedmd.com
PARTITIONS
Partition             Description
production (default)  Standard CPU nodes
debug                 Standard CPU nodes for debug (fast allocation times)
maxwell               Nodes with Nvidia Maxwell GPUs
pascal                Nodes with Nvidia Pascal GPUs
mic                   Nodes with Intel Xeon Phi cards
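As a brief illustration, a job could be directed to one of the partitions listed above with the sbatch option --partition. This is only a sketch: --partition is a standard Slurm option that the slides do not show, and the job body is a placeholder.

```shell
#!/bin/bash
# Assumption: --partition is the standard sbatch option for choosing a
# partition; "debug" is one of the partition names listed above.
#SBATCH --partition=debug
#SBATCH --time=00:10:00

# Placeholder job body: report the partition the job is running in.
# Slurm sets SLURM_JOB_PARTITION inside a job; use a label otherwise.
partition="${SLURM_JOB_PARTITION:-not-in-a-job}"
echo "Running in partition: ${partition}"
```

Run outside of Slurm, the script simply prints the fallback label, which makes it easy to check before submitting.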
[Figure: scheduling timeline, CPU cores 1 to 12 versus time in hours, with three example job requests: John (12 CPU cores, 1 week), Mark (2 CPU cores, 5 hours), Lucy (1 CPU core, 7 hours).]
DETERMINE RESOURCES FOR JOB - OPTIMIZATION
Optimize resources request -> Lower queue wait time -> More research
SHEBANG
myjob.slurm
#!/bin/bash
• Specify the script interpreter (Bash).
• Must be the first line!
--nodes=N
• Request N nodes to be allocated. (Default: N=1)
--ntasks=N
• Request N tasks to be allocated. (Default: N=1)
• Unless otherwise specified, one task maps to one CPU core.
--mem=NG
• Request N gigabytes of memory per node. (Default: N=1)
--time=d-hh:mm:ss
• Request d days, hh hours, mm minutes and ss seconds. (Default: 00:15:00)
--job-name=<string>
• Specify a name for the job allocation. (Default: batch file name)
--output=<file_name>
• Write the batch script’s standard output to the specified file.
• If not specified, the output is saved in the file slurm-<jobid>.out.
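The options above can be combined into a single batch file. The following is a minimal sketch, assuming a hypothetical job called example_job; the resource values are placeholders rather than recommendations, and the echo line stands in for the real workload.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=2G
#SBATCH --time=0-02:00:00
#SBATCH --job-name=example_job
#SBATCH --output=example_job.out

# Placeholder workload: report where the job is running.
# Slurm sets SLURM_JOB_ID inside a job; "none" is used for a dry run.
job_id="${SLURM_JOB_ID:-none}"
node_name="$(uname -n)"
echo "Job ${job_id} running on ${node_name}"
```

Because the #SBATCH lines are ordinary comments to Bash, the same file can be run directly with bash for a quick check before it is submitted.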
CREATE A BATCH JOB SCRIPT - EMAIL NOTIFICATION
--mail-user=<address>
• Send email to address.
• It accepts multiple comma separated addresses.
--mail-type=<event>
• Define the events for which you want to be notified, for example BEGIN, END, FAIL or ALL.
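A short sketch of the two email options in a batch file follows. The address user@example.com is a placeholder, and END and FAIL are standard Slurm event names chosen here only for illustration.

```shell
#!/bin/bash
# Placeholder address: replace with your own.
#SBATCH --mail-user=user@example.com
# Send a message when the job ends or fails (comma separated event list).
#SBATCH --mail-type=END,FAIL

# Placeholder job body.
echo "Job body goes here"
```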
SUBMIT JOB TO THE SCHEDULER
sbatch batch_file
• Submit batch_file to Slurm.
• If successful, it returns the job ID of the submitted job.
How do I remove a job from the queue?
scancel jobid
• Cancel the job corresponding to the given jobid and remove it from the queue.
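The submit and cancel steps can be scripted together. The sketch below assumes sbatch prints its usual success line of the form "Submitted batch job 1234567"; parse_job_id is a hypothetical helper name, and myjob.slurm is the batch file name used earlier in the slides.

```shell
#!/bin/bash
# Extract the job ID from the line sbatch prints on success,
# which has the form "Submitted batch job 1234567".
parse_job_id() {
    local message="$1"
    # Strip everything up to and including the last space.
    echo "${message##* }"
}

# Run the real commands only where Slurm and the batch file are present.
if command -v sbatch >/dev/null 2>&1 && [ -f myjob.slurm ]; then
    jobid=$(parse_job_id "$(sbatch myjob.slurm)")
    echo "Submitted job ${jobid}"
    # To remove the job from the queue again:
    # scancel "${jobid}"
fi
```

Keeping the job ID in a variable makes it straightforward to pass to scancel or to the status commands later.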
[Figure: job PRIORITY, with FAIRSHARE and AGE as contributing factors.]
CHECK JOB STATUS
squeue -u vunetid
• Show the queued jobs for user vunetid.
STATUS
R = Running
PD = Pending
CA = Cancelled
NODELIST (REASON)
• For running jobs shows the allocated nodes.
• For pending jobs shows the wait reason:
Priority           Other jobs in queue have higher priority.
Resources          Insufficient resources available on the cluster.
AssocGrpCpuLimit   Reached maximum number of allocated CPUs by all jobs belonging to the user’s account.
AssocGrpMemLimit   Reached maximum amount of allocated memory by all jobs belonging to the user’s account.
AssocGrpTimeLimit  Reached maximum amount of allocated time by all jobs belonging to the user’s account.
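To make the status codes easier to read, the queue listing could be post-processed as in the sketch below. describe_state is a hypothetical helper covering only the three codes listed above; the squeue format string uses the standard fields %i (job ID), %t (compact state) and %R (nodes or reason).

```shell
#!/bin/bash
# Translate the compact state codes from the STATUS legend into words.
describe_state() {
    case "$1" in
        R)  echo "Running" ;;
        PD) echo "Pending" ;;
        CA) echo "Cancelled" ;;
        *)  echo "Other ($1)" ;;
    esac
}

# Query the queue only where squeue is available.
if command -v squeue >/dev/null 2>&1; then
    # -h drops the header line; the last field holds the nodes or the wait reason.
    squeue -h -u "${USER}" -o "%i %t %R" | while read -r jobid state reason; do
        echo "${jobid}: $(describe_state "${state}") ${reason}"
    done
fi
```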
RETRIEVE JOB INFORMATION
rtracejob jobid
• Print requested and utilized resources (and more) for the given jobid.
JOB ARRAYS
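--array=0-7
• Create eight array tasks with IDs 0, 1, 2, 3, 4, 5, 6, 7.
my_program file_${SLURM_ARRAY_TASK_ID}
• Each task reads its own ID from SLURM_ARRAY_TASK_ID. Example for task 4: my_program file_4, with output file job_1234567_task_4.out.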
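A complete array job file might look like the sketch below. The output pattern is an assumption chosen to reproduce the file name job_1234567_task_4.out: %A and %a are the standard sbatch filename patterns for the array job ID and the task ID. my_program is the placeholder name from the slide, so the command is echoed rather than executed.

```shell
#!/bin/bash
#SBATCH --array=0-7
# Assumption: %A expands to the array job ID and %a to the task ID.
#SBATCH --output=job_%A_task_%a.out

# Slurm sets SLURM_ARRAY_TASK_ID in each task; default to 4 for a dry run.
task_id="${SLURM_ARRAY_TASK_ID:-4}"
input_file="file_${task_id}"

# my_program is a placeholder, so print the command instead of running it.
echo "my_program ${input_file}"
```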
#!/bin/bash
#SBATCH
…
myfile=$( ls DataDir | awk -v line=${SLURM_ARRAY_TASK_ID} '{if (NR==line) print $0}' )
my_program ${myfile}
1 ls DataDir: get the list of file names in the data directory in alphabetical order.
2 |: send the list to awk.
3 -v line=${SLURM_ARRAY_TASK_ID}: pass the value of the bash variable SLURM_ARRAY_TASK_ID to the awk variable "line".
4 {if (NR==line) print $0}: print only the line of the file list whose line number NR equals the job task ID. awk counts lines from 1, so the array indices should start at 1.
5 my_program ${myfile}: pass the file name in the myfile variable to the main program.
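The file selection can be tried outside of Slurm before submitting an array job. The sketch below is a dry run under stated assumptions: a temporary directory with three hypothetical files stands in for DataDir, and the task ID defaults to 2 when SLURM_ARRAY_TASK_ID is not set.

```shell
#!/bin/bash
# Hypothetical demo directory standing in for DataDir.
datadir=$(mktemp -d)
touch "${datadir}/a.dat" "${datadir}/b.dat" "${datadir}/c.dat"

# Slurm sets SLURM_ARRAY_TASK_ID in a real array task; default to 2 here.
task_id="${SLURM_ARRAY_TASK_ID:-2}"

# Same selection as in the slide: keep the line whose number equals the task ID.
myfile=$( ls "${datadir}" | awk -v line="${task_id}" 'NR==line {print $0}' )
echo "Task ${task_id} would process: ${myfile}"

# Remove the demo directory again.
rm -rf "${datadir}"
```

Quoting the variables, as done here, keeps the selection working when directory names contain spaces.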
MULTITHREADED JOBS
POSIX THREADS
Example 1: 1 node, 1 task, 8 CPUs per task
Example 2: 1 node, 2 tasks, 4 CPUs per task
--cpus-per-task=N
• Request N CPU cores to be allocated for each task.
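A multithreaded job request could be sketched as follows. It assumes the program is threaded with OpenMP, which reads its thread count from OMP_NUM_THREADS; the slides do not say which threading library is used, and my_threaded_program is a hypothetical name, so the command is only echoed.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# default to 1 thread when the script runs outside of a job.
threads="${SLURM_CPUS_PER_TASK:-1}"

# Assumption: the program uses OpenMP and honours OMP_NUM_THREADS.
export OMP_NUM_THREADS="${threads}"
echo "Would run my_threaded_program with ${OMP_NUM_THREADS} threads"
```

Tying the thread count to the allocation avoids starting more threads than the number of CPU cores requested.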
DISTRIBUTED MEMORY JOBS
--nodes=N
• Request N nodes to be allocated.
--tasks-per-node=N
• Request N tasks per node.
• Unless otherwise specified, one task maps to one CPU core.
Example: 2 nodes, 8 tasks per node, 1 CPU per task
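A distributed memory request of 2 nodes with 8 tasks per node could be sketched as below. Two assumptions: the option is spelled --ntasks-per-node, as in the sbatch manual, and my_mpi_program is a hypothetical executable that is launched with srun only if it exists.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

# Slurm sets SLURM_JOB_NUM_NODES inside a job; default to 2 for a dry run.
nodes="${SLURM_JOB_NUM_NODES:-2}"
tasks_per_node=8
total_tasks=$(( nodes * tasks_per_node ))
echo "Launching ${total_tasks} tasks on ${nodes} nodes"

# Hypothetical program: start one copy per task only where srun and the
# executable are actually available.
if command -v srun >/dev/null 2>&1 && [ -x ./my_mpi_program ]; then
    srun ./my_mpi_program
fi
```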
salloc options
• Obtain job allocation with shell access.
• Accepts all the same options previously seen for sbatch.
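Since salloc accepts the same options as sbatch, an interactive request can reuse the resource options described earlier. The sketch below only assembles and prints such a command; the values are placeholders and salloc_cmd is a hypothetical variable name.

```shell
#!/bin/bash
# Build an interactive request from the same options used for batch jobs.
salloc_cmd=(salloc --nodes=1 --ntasks=1 --mem=2G --time=00:30:00)

# Print the command to type at the prompt on the gateway.
echo "${salloc_cmd[*]}"
```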
[Figure: Gateway and Compute node.]
qSummary -g group
• Show the total number of jobs and CPU cores allocated or waiting for allocation for the selected group.
Check overall cluster utilization
showLimits -g group
3 or logic errors.
www.accre.vanderbilt.edu/slurm
NEED MORE HELP?
www.accre.vanderbilt.edu/faq
www.accre.vanderbilt.edu/help