ESNODE Core delivers high-frequency GPU telemetry for power-aware optimization: power, utilization, thermals, and health via a Prometheus /metrics endpoint. It includes per-GPU stats, an optional local TSDB, daemon or DaemonSet deployment, and standalone or managed modes for cluster-level power tuning.

ESNODE | Source Available BUSL-1.1 | Copyright (c) 2024 Estimatedstocks AB

ESNODE-Core

[Screenshot: ESNODE-Core TUI home screen]

This repository contains the source, build tooling, and documentation for the ESNODE-Core Agent. The ESNODE-Pulse controller is maintained separately (licensed) and is not part of this codebase.

Supported server OS targets (AI infra focused)

  • Ubuntu Server (primary, ~60–70% of AI fleet; best CUDA/driver/toolchain support)
  • RHEL-compatible: RHEL / Rocky Linux / AlmaLinux (enterprise compliance, FIPS-ready)
  • NVIDIA DGX OS (Ubuntu-based, pre-tuned for DGX appliances)
  • SUSE Linux Enterprise Server (SLES) (enterprise/HPC niches)
  • Debian (research/custom environments)

Packaging is provided as tar.gz plus optional deb/rpm; Windows zips are included for hybrid labs. On Linux, install scripts set up binaries under /usr/local/bin and create systemd units so esnode-core runs without extra PATH tweaks (ESNODE-Pulse ships separately).

ESNODE-Core is a GPU-aware host metrics exporter for Linux nodes. It exposes CPU, memory, disk, network, and GPU telemetry at /metrics in Prometheus text format so observability stacks can scrape a single endpoint per node. Agents can run standalone or attach to an ESNODE-Pulse control plane while keeping Prometheus/OTLP outputs unchanged.
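
A minimal Prometheus scrape config for this one-endpoint-per-node layout might look like the following sketch (job name and targets are placeholders, not something this repo ships):

scrape_configs:
  - job_name: "esnode"
    scrape_interval: 15s
    static_configs:
      - targets: ["gpu-node-1:9100", "gpu-node-2:9100"]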

GPU/MIG visibility and build requirements

  • GPU metrics require the gpu feature (enabled by default). MIG metrics additionally require building with the gpu-nvml-ffi feature and enabling enable_gpu_mig in config; otherwise MIG metrics remain zero (see the build sketch after this list).
  • gpu_visible_devices (or NVIDIA_VISIBLE_DEVICES) filters which GPUs are scraped; empty/all scrapes all.
  • mig_config_devices (or NVIDIA_MIG_CONFIG_DEVICES) further filters which GPUs are considered for MIG scraping when enable_gpu_mig is true.
  • k8s_mode emits a small set of legacy-compatible GPU metrics with a single gpu label (suffix _compat) using Kubernetes/CDI resource-style names (nvidia.com/gpu, nvidia.com/gpu-<mig-profile>-mig) alongside the UUID+index labeled metrics to reduce dashboard breakage.
  • MIG compatibility labels are only emitted when k8s_mode is enabled; MIG info/metrics still require gpu-nvml-ffi + enable_gpu_mig.
  • enable_gpu_events starts an NVML event loop for XID/ECC/power/clock events (best-effort). The loop is non-blocking with a short poll and may miss very bursty streams; counters are monotonic but not guaranteed exact.
  • gpu-nvml-ffi-ext is an optional feature gate for extra NVML FFI (PCIe field counters, etc.). These are best-effort and unverified without suitable hardware; placeholders remain zero when unsupported.
  • NVSwitch/copy-engine clocks/power-cap reason codes are exposed as gauges but rely on NVML support; many remain zero on hardware/driver versions that do not surface them.
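
Putting the MIG requirements above together, a hedged sketch (feature and config key names are from this README; the cargo invocation and package name are assumptions):

# build with the NVML FFI feature that MIG metrics need
cargo build --release -p esnode-core --features gpu-nvml-ffi   # package name assumed

# esnode.toml: turn on MIG scraping; optionally narrow which GPUs are considered
enable_gpu_mig = true
# mig_config_devices = "..."   # optional filter; value format depends on your MIG layout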

Features (v0.1 scope)

  • Single binary with zero-config defaults (0.0.0.0:9100, 5s interval).
  • Collectors: CPU, memory, disk, network, GPU (NVML-based; gracefully disabled if unavailable).
  • Power-aware: optional power collector reads RAPL/hwmon/BMC paths for CPU/package/node power; GPU power via NVML.
  • Self-observability: scrape duration + error counters per collector.
  • Health endpoint at /healthz.
  • JSON status endpoint at /status (/v1/status alias) with node load, power, temps, GPU snapshot (identity/health/MIG tree), and last scrape/errors; SSE stream at /events for near-real-time loops (see the curl sketch after this list).
  • Degradation signals: disk busy/latency, network drops/retrans, swap spikes, and GPU throttle/ECC flags roll up into esnode_degradation_score; surfaced in /status and the TUI.
  • Control-plane safety: orchestrator /orchestrator/* binds loopback-only by default; set orchestrator.allow_public=true to expose, and use orchestrator.token to require Authorization: Bearer <token>. Auth successes/failures and task/heartbeat actions emit structured audit logs (target audit).
  • Local TSDB defaults to a user-writable XDG path so non-root runs no longer fail on /var/lib/esnode/tsdb; override with local_tsdb_path when you want /var/lib.
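
Quick checks against a running agent (endpoints from the list above; default port 9100, jq optional):

curl -s http://localhost:9100/healthz
curl -s http://localhost:9100/metrics | grep esnode_degradation_score
curl -s http://localhost:9100/status | jq .
curl -sN http://localhost:9100/events    # SSE; -N disables curl's buffering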

Power- & Model-awareness (roadmap)

  • Power: keep esnode_gpu_power_watts and add optional esnode_cpu_package_power_watts, esnode_node_power_watts, and PDU/BMC readings where available (RAPL/hwmon/IPMI).
  • Model metadata: pair ESNODE with app-exported metrics such as model_tokens_total, model_requests_total, and latency histograms labeled by model/namespace.
  • Efficiency: use PromQL like sum(rate(model_tokens_total[5m])) / sum(avg_over_time(esnode_node_power_watts[5m])) for tokens-per-watt (power in watts is a gauge, so average it over the window rather than rate() it) and scale to tokens-per-kWh; see docs/monitoring-examples.md for dashboards and the sketch below.
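
Spelled out: rate(model_tokens_total) is tokens per second, power in watts is joules per second, so their ratio is tokens per joule; one kWh is 3.6e6 J. A hedged PromQL sketch using the metric names proposed above:

# tokens per watt (equivalently, tokens per joule)
sum(rate(model_tokens_total[5m]))
  / sum(avg_over_time(esnode_node_power_watts[5m]))

# tokens per kWh: scale tokens-per-joule by 3.6e6 J/kWh
sum(rate(model_tokens_total[5m]))
  / sum(avg_over_time(esnode_node_power_watts[5m])) * 3.6e6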

Licensing & Distribution

  • Source-available under ESNODE BUSL-1.1 (see LICENSE).
  • Trademarks governed by docs/TRADEMARK_POLICY.md; no rebranding or redistribution of binaries.
  • Contributions require agreement to docs/CLA.md.
  • Official binaries and commercial terms are controlled solely by Estimatedstocks AB.

Product split (effective now)

  • ESNODE-Core (public, source-available)
    • License: current ESNODE BUSL-style terms.
    • Usage: free for internal use; redistribution/trademark restrictions still apply.
    • Distribution: public binaries at https://esnode.co/downloads.
  • ESNODE-Pulse (licensed-only, revenue-based; separate repo)
    • License: proprietary.
    • Revenue rule: ≤ USD 2M revenue → free starter license after registration; > USD 2M → paid subscription required before production use.
    • Distribution: not in this repository. Binaries are provided only after registration via https://esnode.co/products#pulse (or designated portal).

Components

  • esnode-core: per-node collector exposing Prometheus /metrics, JSON /status (/v1/status), and SSE /events.
  • esnode-orchestrator: optional autonomous resource manager (embedded lib, CLI-configurable).
  • esnode-pulse: licensed controller that polls agents and centralizes policy/alerts (not included in this repository).

See docs/architecture.md and docs/platform-matrix.md for topology and build targets.

Build & Run

cargo build --workspace --release
./target/release/esnode-core
  • Cross-compiling on macOS for x86_64-unknown-linux-gnu/aarch64-unknown-linux-gnu requires the corresponding GNU toolchains (e.g., via a Homebrew cross-toolchain tap such as messense/macos-cross-toolchains); a sketch follows.
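
A hedged cross-compile sketch (the tap and linker names below are the common community toolchain, not something this repo pins):

rustup target add x86_64-unknown-linux-gnu
brew tap messense/macos-cross-toolchains
brew install x86_64-unknown-linux-gnu

# .cargo/config.toml: tell cargo which linker to use for the target
[target.x86_64-unknown-linux-gnu]
linker = "x86_64-unknown-linux-gnu-gcc"

cargo build --workspace --release --target x86_64-unknown-linux-gnu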

Configuration precedence: CLI flags > env vars > esnode.toml > defaults. See docs/quickstart.md for full examples.

  • Config flags of interest (a TOML sketch follows this list):
    • enable_gpu_mig (default false) – turn on MIG scraping when built with gpu-nvml-ffi.
    • enable_gpu_events (default false) – run NVML event loop (best-effort) for XID/ECC/clock/power events.
    • enable_gpu_amd (default false) – experimental AMD/ROCm collector scaffolding; emits no metrics unless rsmi/rocm-smi support is added.
    • k8s_mode (default false) – emit compatibility labels using Kubernetes/CDI resource names alongside UUID/index labels.
    • gpu_visible_devices / NVIDIA_VISIBLE_DEVICES – filter visible GPUs.
    • mig_config_devices / NVIDIA_MIG_CONFIG_DEVICES – filter MIG-capable GPUs when enable_gpu_mig is true.
    • Optional gpu-nvml-ffi-ext feature enables additional NVML field-based counters (PCIe/etc.), best-effort only.
    • enable_app + app_metrics_url – app/model metrics collector uses a 2s HTTP timeout; slow or hung endpoints are skipped for that scrape without blocking other collectors.
    • Orchestrator control API (/orchestrator/*) is exposed only on loopback listeners by default; set orchestrator.allow_public=true explicitly if you need to serve it on non-loopback addresses and configure orchestrator.token to require Authorization: Bearer <token>.
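
Collected into a minimal esnode.toml sketch (key names are from this README; file layout and nesting are assumptions):

enable_gpu_mig = false        # only meaningful when built with gpu-nvml-ffi
enable_gpu_events = false     # best-effort NVML event loop
enable_gpu_amd = false        # experimental scaffolding; emits no metrics yet
k8s_mode = false              # adds _compat metrics with Kubernetes/CDI names
gpu_visible_devices = "all"   # empty/all scrapes all GPUs

[orchestrator]
allow_public = false          # keep the control API loopback-only
token = ""                    # set to require Authorization: Bearer <token>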

Local TSDB path (default): when enable_local_tsdb is true, the agent now resolves local_tsdb_path to $XDG_DATA_HOME/esnode/tsdb or ~/.local/share/esnode/tsdb so non-root runs don’t fail on /var/lib. Set ESNODE_LOCAL_TSDB_PATH or the config key if you want /var/lib/esnode/tsdb and ensure the directory is writable by the agent user.
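
For example (ESNODE_LOCAL_TSDB_PATH is from this README; the esnode service user is an assumption):

# non-root default: the TSDB lands under the XDG path automatically
esnode-core

# opting back into /var/lib: pre-create it writable by the agent user, then point the agent at it
sudo install -d -o esnode -g esnode /var/lib/esnode/tsdb
ESNODE_LOCAL_TSDB_PATH=/var/lib/esnode/tsdb esnode-core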

Releases & GitHub tagging

  • Tagging vX.Y.Z on the default branch triggers .github/workflows/release.yml, which:
    • Runs cargo test --workspace --locked --target <triple> on Linux (x86_64), macOS (aarch64), and Windows (x86_64).
    • Builds release binaries with default features for the same triples.
    • Packages artifacts as tar.gz (Linux/macOS) or zip (Windows) and attaches them to the GitHub Release created from the tag.
  • Artifact names:
    • esnode-core-linux-x86_64.tar.gz
    • esnode-core-macos-aarch64.tar.gz
    • esnode-core-windows-x86_64.zip
  • Binaries are built with default features; MIG metrics still require gpu-nvml-ffi and enable_gpu_mig when running on MIG-capable hosts.
  • For additional targets or feature builds, run cargo build --release --locked --target <triple> locally and publish as needed.
  • Manual packaging: scripts/dist/esnode-core-release.sh (optionally with ESNODE_VERSION=X.Y.Z) builds and collects Linux tar/deb/rpm bundles (and a Windows zip if the toolchain is available) under public/distribution/releases/. Push a tag vX.Y.Z after verification to publish GitHub release artifacts automatically (tagging example below).
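
Cutting a release then reduces to tagging the verified commit:

git tag -a v0.1.1 -m "esnode-core v0.1.1"   # version is an example
git push origin v0.1.1                      # triggers .github/workflows/release.yml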

Community & policies

  • Contribution guidelines: see CONTRIBUTING.md.
  • Code of Conduct: see CODE_OF_CONDUCT.md.
  • Security reporting: see SECURITY.md.
  • Support & upgrade policy: see docs/support-policy.md.
  • Metric label migration (UUID-first): see docs/migrations.md.
  • Sponsorship: see docs/sponsorship.md (GitHub Sponsors for ESNODE).
  • Containers: see docs/container.md for distroless build/run instructions.

Agent ↔ Server connection (summary)

  • Standalone: full local CLI/TUI, toggle metric sets, /metrics always on.
  • Connect to server: esnode-core server connect --address <host:port> [--token <join_token>] persists the server address + IDs and flips the agent to managed mode (local tuning disabled, metrics plane untouched); see the session sketch after this list.
  • Disconnect: esnode-core server disconnect returns to standalone.
  • Status: esnode-core server status shows server, cluster ID, node ID, last contact.
  • TUI: esnode-core cli shows full menu when standalone; shows managed read-only panel when attached to ESNODE-Pulse.
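
A typical managed-mode session (controller address is a placeholder):

esnode-core server connect --address pulse.internal:7443 --token "$JOIN_TOKEN"
esnode-core server status       # server, cluster ID, node ID, last contact
esnode-core server disconnect   # back to standalone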

Operator notes (day 1)

  • TUI surfaces degradation flags/score on the Node Overview, Network & Disk, and Agent Status screens; the orchestrator screen shows token/loopback exposure.
  • App collector uses a 2s timeout; slow endpoints are skipped per scrape to avoid blocking other collectors.
  • TSDB: defaults to the XDG path (~/.local/share/esnode/tsdb); opt into /var/lib/esnode/tsdb explicitly and pre-create it with correct permissions.
  • Orchestrator: keep it loopback-only unless allow_public=true and a token is set; audit logs appear under tracing target audit. A curl sketch follows.
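
If you do expose the control API, requests need the bearer token. A sketch (only the /orchestrator/* prefix is documented here, so the exact endpoint below is hypothetical):

curl -s -H "Authorization: Bearer $ESNODE_ORCH_TOKEN" \
  http://127.0.0.1:9100/orchestrator/status   # hypothetical endpoint under /orchestrator/*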

Developer notes

  • cargo test --workspace includes a TUI render smoke test using ratatui’s test backend (no PTY required).
  • New metrics live in docs/metrics-list.md; gap tracking in docs/gap-logbook.md.
  • Local HTTP defaults avoid privileged paths; adjust in crates/agent-core/src/config.rs if changing defaults.

Install packages (Agent – public)

Fastest path (recommended):

curl -fsSL https://raw.githubusercontent.com/ESNODE/ESNODE-Core/main/public/install.sh | sh

Notes:

  • Installs the esnode-core binary under /usr/local/bin and (on Linux) enables a systemd service with a default /etc/esnode/esnode.toml.
  • To pin a version or avoid systemd setup, pass args via sh -s --:
    curl -fsSL https://raw.githubusercontent.com/ESNODE/ESNODE-Core/main/public/install.sh | sh -s -- --version 0.1.0 --no-service

Example commands (adjust version/OS paths):

  • Ubuntu/Debian (.deb):
    wget -O esnode-core_0.1.0_amd64.deb https://esnode.co/downloads/esnode-core_0.1.0_amd64.deb
    sudo dpkg -i esnode-core_0.1.0_amd64.deb
    sudo systemctl enable esnode-core && sudo systemctl start esnode-core
  • RHEL/CentOS (.rpm):
    wget -O esnode-core-0.1.0-1.x86_64.rpm https://esnode.co/downloads/esnode-core-0.1.0-1.x86_64.rpm
    sudo rpm -i esnode-core-0.1.0-1.x86_64.rpm
    sudo systemctl enable esnode-core && sudo systemctl start esnode-core
  • Generic Linux (tar.gz):
    wget -O esnode-core-0.1.0-linux-amd64.tar.gz https://esnode.co/downloads/esnode-core-0.1.0-linux-amd64.tar.gz
    tar xvf esnode-core-0.1.0-linux-amd64.tar.gz
    sudo mv esnode-core /usr/local/bin/
    esnode-core --version
  • Windows/macOS artifacts will follow the same public repo layout when available.

ESNODE-Pulse (Enterprise controller – licensed)

  • Not distributed publicly or within this codebase. Binaries are provided only after registration/approval.
  • Revenue rule: ≤ USD 2M revenue → free starter license (registration required). > USD 2M → paid subscription before production use.
  • Request access: https://esnode.co/products#pulse (submit company + revenue band, accept terms).
  • ESNODE-Pulse binaries must not be mirrored to the public distribution paths; see the private repository for build details.

Deployment artifacts

  • Docker: Dockerfile (builds from public/distribution/releases/linux-amd64/esnode-core-0.1.0-linux-amd64.tar.gz)
  • systemd: deploy/systemd/esnode-core.service
  • Kubernetes DaemonSet: deploy/k8s/daemonset.yaml

Kubernetes deploy (operators & developers)

  • Build/pull image: docker build -t myregistry/esnode-core:0.1.0 -f Dockerfile . (uses public/distribution/releases/linux-amd64/esnode-core-0.1.0-linux-amd64.tar.gz).
  • Apply manifests (headless service + ConfigMap + DaemonSet):
    kubectl apply --dry-run=client -f deploy/k8s/esnode-configmap.yaml
    kubectl apply --dry-run=client -f deploy/k8s/esnode-service.yaml
    kubectl apply --dry-run=client -f deploy/k8s/esnode-daemonset.yaml
    kubectl apply -f deploy/k8s/
  • ConfigMap (esnode.toml) uses loopback-only orchestrator by default, enables TSDB at /var/lib/esnode/tsdb (hostPath volume), and keeps collectors on.
  • DaemonSet runs hostNetwork+hostPID, privileged for NVML access, and mounts /dev plus the TSDB hostPath. Probes hit /healthz; port 9100 is exposed via a headless Service for scraping (verification sketch after this list).
  • Override the image: field and namespace as needed; set local_tsdb_path to match your volume; set orchestrator.token and allow_public only when intentionally exposing the control API.
  • If building multi-arch images, supply the matching tarball for each arch (e.g., linux-arm64); current Dockerfile targets linux-amd64.
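
Verifying a rollout (the label selector and namespace are assumptions; match them to your manifests):

kubectl get pods -l app=esnode-core -o wide
kubectl port-forward "$(kubectl get pod -l app=esnode-core -o name | head -n1)" 9100:9100 &
curl -s http://localhost:9100/healthz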

Helm install (from repo checkout)

helm upgrade --install esnode-core ./deploy/helm/esnode-core \
  --set image.repository=myregistry/esnode-core \
  --set image.tag=0.1.0 \
  --set tsdb.hostPath=/var/lib/esnode/tsdb \
  --set config.orchestrator.allowPublic=false \
  --set config.orchestrator.token=""

Adjust hostPath, namespace (-n), tolerations/nodeSelector, and orchestrator/token as needed.

Terraform module (wrapping Helm)

Example using the local chart:

module "esnode_core" {
  source = "./deploy/terraform/esnode-core"

  namespace                = "default"
  release_name             = "esnode-core"
  image_repository         = "myregistry/esnode-core"
  image_tag                = "0.1.0"
  tsdb_host_path           = "/var/lib/esnode/tsdb"
  orchestrator_allow_public = false
  orchestrator_token       = ""
}

Run terraform init && terraform apply with a configured kubecontext. Update the image/tag and host paths to match your environment.

Documentation

  • Quickstart: docs/quickstart.md
  • Metrics reference: docs/metrics-list.md
  • Monitoring examples: docs/monitoring-examples.md
  • Architecture: docs/architecture.md
  • Platform matrix: docs/platform-matrix.md
  • Dashboards & alerts: docs/dashboards/grafana-esnode-core.json and docs/dashboards/alerts.yaml (import into Grafana/Prometheus)
  • Smoke test script: scripts/smoke.sh (builds, runs core+pulse locally, curls endpoints)

Kubernetes Deployment

ESNODE-Core is designed to run as a DaemonSet on Kubernetes, providing monitoring for each node.

Artifacts

  • Docker Image: myregistry/esnode-core:0.1.0 (Replace myregistry with your actual registry)
  • Release Tarball: esnode-core-0.1.0-linux-amd64.tar.gz
  • Checksum: public/distribution/releases/linux-amd64/esnode-core-0.1.0-linux-amd64.tar.gz.sha256

Deployment Steps

  1. Configure: Edit deploy/k8s/esnode-configmap.yaml to adjust esnode.toml settings (e.g., specific orchestrator URL).

  2. Apply Manifests:

    kubectl apply -f deploy/k8s/esnode-configmap.yaml
    kubectl apply -f deploy/k8s/esnode-daemonset.yaml
    kubectl apply -f deploy/k8s/esnode-service.yaml

Requirements

The DaemonSet automatically requests privileges for hardware monitoring:

  • privileged: true security context
  • hostPID: true for process monitoring
  • hostNetwork: true (optional, but recommended for agent connectivity)
  • /dev and /proc mounts

Ensure your nodes support these capabilities. For GPU monitoring, the DaemonSet sets NVIDIA_VISIBLE_DEVICES=all.
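
The relevant pod-spec fields look roughly like this (a sketch; deploy/k8s/esnode-daemonset.yaml is authoritative):

spec:
  hostNetwork: true
  hostPID: true
  containers:
    - name: esnode-core
      securityContext:
        privileged: true
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
      volumeMounts:
        - name: dev
          mountPath: /dev
        - name: proc
          mountPath: /proc
  volumes:
    - name: dev
      hostPath:
        path: /dev
    - name: proc
      hostPath:
        path: /proc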
