
Real-time monitoring Slurm jobs with InfluxDB

September 2016

Carlos Fenoy García


Agenda

•  Problem description

•  Current Slurm profiling

•  Our solution

•  Conclusions

Problem description

•  Monitoring of jobs is becoming more difficult on new systems with larger
amounts of resources, as jobs tend to share compute nodes.

•  “Standard” monitoring tools hide individual job usage inside the aggregate
resource monitoring of each compute host

Current Slurm profiling

•  Slurm supports profiling of applications using HDF5 as storage


–  It samples resource usage every few seconds
–  Stores the information in one HDF5 file per host
–  Once the job has finished, users have to merge all the per-host files to
create a single per-job file
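
As a concrete illustration of this workflow (the job ID and file names are
hypothetical; srun --profile and sh5util are the standard Slurm tools involved):

    # Run a job step with task profiling enabled
    srun --profile=task ./my_app

    # After the job finishes, merge the per-host HDF5 files
    # into a single per-job file
    sh5util -j 12345 -o job_12345.h5
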
Current Slurm profiling (II)

•  Pros
–  No need for central monitoring storage or to send data through the
network
–  Uses the existing shared filesystem
–  Light-weight collection and storage of data

•  Cons
–  If one node dies, the HDF5 file may be corrupt and irrecoverable
–  No data can be retrieved until the job finishes
–  The filesystem cannot be mounted with root squash

Our solution

•  Using the same base as the HDF5 profiling plugin, export the information
to an InfluxDB server

•  Collects exactly the same information as the HDF5 plugin

•  A small buffer is used to avoid sending data for every sample collected

•  Information is sent to the central server using libcurl

InfluxDB and Grafana

•  “InfluxDB is an open source database written in Go specifically to handle
time series data with high availability and high performance requirements.”
(influxdata.com)

•  InfluxDB has a REST API to insert and query data

•  Integrated with Grafana for nice dashboards
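
Both operations can be exercised directly; a minimal sketch against the
InfluxDB v1 REST API (host, port, database name and values are assumptions):

    # Insert a data point
    curl -XPOST 'http://localhost:8086/write?db=slurm' \
      --data-binary 'CPUTime,job=24,host=node001 value=99'

    # Query it back
    curl -G 'http://localhost:8086/query?db=slurm' \
      --data-urlencode "q=SELECT \"value\" FROM \"CPUTime\" WHERE \"job\" = '24'"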


Metrics collected

Default metrics:
–  CPUFrequency
–  CPUTime
–  CPUUtilization
–  Pages
–  RSS
–  ReadMB
–  WriteMB

With additional profiling plugins it is also possible to collect information
from InfiniBand, Lustre and Energy.
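
For example, a Grafana panel could chart one of these metrics per task with an
InfluxQL query along these lines ($timeFilter and $interval are Grafana
template variables; the job ID is hypothetical):

    SELECT mean("value") FROM "CPUUtilization"
    WHERE "job" = '24' AND $timeFilter
    GROUP BY time($interval), "task"
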
Configuration

•  3 new parameters added to the acct_gather.conf file

–  ProfileInfluxDBHost: the host to send the data to
–  ProfileInfluxDBDatabase: the InfluxDB database in which to store the data
–  ProfileInfluxDBDefault: the default profiling level

•  The default profiling level is set to ALL if nothing else is specified, so
that information from the job script itself is also collected (an example
configuration follows)
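
A minimal acct_gather.conf sketch using these parameters (host, port and
database name are example values; this also assumes the InfluxDB profiling
plugin is selected via AcctGatherProfileType in slurm.conf):

    # acct_gather.conf (example values)
    ProfileInfluxDBHost=influxdb.example.com:8086
    ProfileInfluxDBDatabase=slurm
    ProfileInfluxDBDefault=ALL
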
Sending data to InfluxDB

•  A small 16KB buffer is used to aggregate some data before sending

•  The InfluxDB line protocol is used to send the data

–  METRIC[,TAGS] value=VALUE [TIMESTAMP]

–  CPUTime,job=24,step=1,task=2,host=node001 value=99 1460713153

•  Floating-point data is sent with 2 decimal places of precision
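
For instance, several buffered samples can be flushed together as one
newline-separated line-protocol payload (all values are illustrative):

    CPUUtilization,job=24,step=1,task=2,host=node001 value=98.00 1460713153
    RSS,job=24,step=1,task=2,host=node001 value=204800 1460713153
    ReadMB,job=24,step=1,task=2,host=node001 value=1.25 1460713153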


Sending data (II)

•  Information is sent via libcurl to the database server


–  INFLUXDB_SERVER/write?db=slurm&rp=default&precision=s

–  If an error is returned by the server, the data is dropped
–  Some profiling data may be lost

•  You can also send the data to a Logstash server to store it in a different DB.
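
What the plugin does over libcurl can be reproduced from the shell for testing
(the server name is an example; samples.txt would contain line-protocol
samples like the ones shown earlier):

    curl -i -XPOST \
      'http://influxdb.example.com:8086/write?db=slurm&rp=default&precision=s' \
      --data-binary @samples.txt
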
Our solution (II)

•  Pros
–  Light-weight collection and storage of data
–  All the information is available almost in real-time
–  No information stored locally on the nodes, and no possibility of data
corruption due to a node crash
–  Information available per job/task enhances understanding of cluster
usage

•  Cons
–  Needs a central server to receive all the collected data
Examples

(Three slides of example screenshots, not reproduced in this text version)

Conclusions

•  Easy-to-set-up monitoring system

–  1 daemon
–  1 config file on the compute nodes

•  Real-time monitoring => faster reactions to issues

•  Better monitoring => better understanding of the usage of the cluster

•  Monitoring information related to jobs and not only nodes


GitHub: https://github.com/cfenoy/influxdb-slurm-monitoring

References

•  InfluxDB: http://www.influxdata.com

•  Grafana: http://www.grafana.org

•  Slurm: http://slurm.schedmd.com

•  Slurm profiling: http://slurm.schedmd.com/hdf5_profile_user_guide.html


Doing now what patients need next
