Carlos Fenoy García: Real-Time Monitoring of Slurm Jobs with InfluxDB, September 2016
• Problem description
• Our solution
• Conclusions
Problem description
• Monitoring jobs is becoming more difficult on newer systems with larger amounts of resources, as jobs tend to share compute nodes.
• "Standard" monitoring tools hide individual job usage inside the per-host resource metrics.
Current Slurm profiling
• Pros
– No need for a central monitoring storage or to send data through the network
– Uses the existing shared filesystem
– Light-weight collection and storage of data
• Cons
– If a node dies, the HDF5 file may be left corrupted and unrecoverable
– No data can be retrieved until the job finishes
– The filesystem cannot be mounted with root squash
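For reference, the existing HDF5 profiling described above is enabled roughly as follows. This is a minimal sketch using parameter names from the Slurm documentation; the directory path and sampling frequency are illustrative assumptions:

```
# slurm.conf -- select the HDF5 profiling plugin and a sampling interval
AcctGatherProfileType=acct_gather_profile/hdf5
JobAcctGatherFrequency=30

# acct_gather.conf -- where profile files are written on the shared filesystem
ProfileHDF5Dir=/shared/profile_data
ProfileHDF5Default=None
```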
Our solution
Default metrics:
– CPUFrequency
– CPUTime
– CPUUtilization
– Pages
– RSS
– ReadMB
– WriteMB
• The default profiling level is set to ALL when nothing else is specified, so that information from the job script itself is also collected
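As a usage sketch, this is equivalent to a user requesting full profiling explicitly at submission time (`--profile` is a standard sbatch option; the job script name is illustrative):

```
sbatch --profile=all job.sh
```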
Sending data to InfluxDB
• The data can also be sent to a Logstash server to store it in a different database.
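InfluxDB ingests data points over HTTP in its line protocol (`measurement,tags fields timestamp`). The sketch below shows how a profiling sample could be serialized in that format; the tag names (`job`, `step`, `task`, `host`) are illustrative assumptions, not the plugin's actual schema, while the measurement name comes from the default metrics listed above:

```python
# Sketch of the InfluxDB line protocol that a profiling collector could emit.
# Tag/field names here are assumptions for illustration only.
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Serialize one point as: measurement,tag1=v1,... field1=v1,... timestamp"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

point = to_line_protocol(
    "CPUUtilization",
    {"job": "1234", "step": "0", "task": "0", "host": "node001"},
    {"value": 87.5},
    1473000000000000000,
)
print(point)
# CPUUtilization,host=node001,job=1234,step=0,task=0 value=87.5 1473000000000000000
```

Tagging each point with the job, step, and task is what makes per-job queries possible later, instead of only per-host aggregates.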
Our solution (II)
• Pros
– Light-weight collection and storage of data
– All the information is available almost in real-time
– No information stored locally on the nodes, and no possibility of data
corruption due to a node crash
– Per-job and per-task information gives a better understanding of resource usage
• Cons
– Requires a central server to which all collected data is sent
Examples
[Dashboard screenshots shown in the original slides]
Conclusions
• InfluxDB: http://www.influxdata.com
• Grafana: http://www.grafana.org
• Slurm: http://slurm.schedmd.com