This repository contains the source, build tooling, and documentation for the ESNODE-Core Agent. The ESNODE-Pulse controller is maintained separately (licensed) and is not part of this codebase.
- Ubuntu Server (primary, ~60–70% of AI fleet; best CUDA/driver/toolchain support)
- RHEL-compatible: RHEL / Rocky Linux / AlmaLinux (enterprise compliance, FIPS-ready)
- NVIDIA DGX OS (Ubuntu-based, pre-tuned for DGX appliances)
- SUSE Linux Enterprise Server (SLES) (enterprise/HPC niches)
- Debian (research/custom environments)
Packaging is provided as tar.gz plus optional deb/rpm; Windows zips are included for hybrid labs. On Linux, install scripts set up binaries under /usr/local/bin and create systemd units so esnode-core runs without extra PATH tweaks (ESNODE-Pulse ships separately).
ESNODE-Core is a GPU-aware host metrics exporter for Linux nodes. It exposes CPU, memory, disk, network, and GPU telemetry at /metrics in Prometheus text format so observability stacks can scrape a single endpoint per node. Agents can run standalone or attach to an ESNODE-Pulse control plane while keeping Prometheus/OTLP outputs unchanged.
- GPU metrics require the `gpu` feature (enabled by default). MIG metrics additionally require building with the `gpu-nvml-ffi` feature and enabling `enable_gpu_mig` in config; otherwise MIG metrics remain zero.
- `gpu_visible_devices` (or `NVIDIA_VISIBLE_DEVICES`) filters which GPUs are scraped; empty/`all` scrapes all. `mig_config_devices` (or `NVIDIA_MIG_CONFIG_DEVICES`) further filters which GPUs are considered for MIG scraping when `enable_gpu_mig` is true.
- `k8s_mode` emits a small set of legacy-compatible GPU metrics with a single `gpu` label (suffix `_compat`) using Kubernetes/CDI resource-style names (`nvidia.com/gpu`, `nvidia.com/gpu-<mig-profile>-mig`) alongside the UUID+index labeled metrics to reduce dashboard breakage.
- MIG compatibility labels are only emitted when `k8s_mode` is enabled; MIG info/metrics still require `gpu-nvml-ffi` + `enable_gpu_mig`.
- `enable_gpu_events` starts an NVML event loop for XID/ECC/power/clock events (best-effort). The loop is non-blocking with a short poll and may miss very bursty streams; counters are monotonic but not guaranteed exact.
- `gpu-nvml-ffi-ext` is an optional feature gate for extra NVML FFI (PCIe field counters, etc.). These are best-effort and unverified without suitable hardware; placeholders remain zero when unsupported.
- NVSwitch/copy-engine clocks/power-cap reason codes are exposed as gauges but rely on NVML support; many remain zero on hardware/driver versions that do not surface them.
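As a concrete starting point, a minimal sketch of building with MIG support; the `-p` package name is an assumption (use whichever crate owns the feature), and `enable_gpu_mig` must still be set in config (see the config example further below):

```sh
# Sketch: build with the NVML FFI feature so MIG metrics can be collected.
# The package name passed to -p is an assumption; adjust to the crate that owns the feature.
cargo build --release -p esnode-core --features gpu-nvml-ffi
# Optional extra NVML field counters (best-effort):
cargo build --release -p esnode-core --features gpu-nvml-ffi,gpu-nvml-ffi-ext
```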
- Single binary with zero-config defaults (`0.0.0.0:9100`, 5s interval).
- Collectors: CPU, memory, disk, network, GPU (NVML-based; gracefully disabled if unavailable).
- Power-aware: optional power collector reads RAPL/hwmon/BMC paths for CPU/package/node power; GPU power via NVML.
- Self-observability: scrape duration + error counters per collector.
- Health endpoint at `/healthz`.
- JSON status endpoint at `/status` (`/v1/status` alias) with node load, power, temps, GPU snapshot (identity/health/MIG tree), and last scrape/errors; SSE stream at `/events` for near-real-time loops.
- Degradation signals: disk busy/latency, network drops/retrans, swap spikes, and GPU throttle/ECC flags roll up into `esnode_degradation_score`; surfaced in `/status` and the TUI.
- Control-plane safety: the orchestrator API (`/orchestrator/*`) binds loopback-only by default; set `orchestrator.allow_public=true` to expose it, and use `orchestrator.token` to require `Authorization: Bearer <token>`. Auth successes/failures and task/heartbeat actions emit structured audit logs (target `audit`). See the curl sketch below this list.
- Local TSDB defaults to a user-writable XDG path so non-root runs no longer fail on `/var/lib/esnode/tsdb`; override with `local_tsdb_path` when you want `/var/lib`.
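A quick curl sketch against a locally running agent, assuming the default listener on port 9100; the specific path under `/orchestrator/*` and whether the orchestrator API shares that port are assumptions:

```sh
# Sketch: exercise the local endpoints on the default port.
curl -s http://127.0.0.1:9100/healthz
curl -s http://127.0.0.1:9100/metrics | grep esnode_degradation_score
curl -s http://127.0.0.1:9100/status       # JSON snapshot; /v1/status is an alias
curl -sN http://127.0.0.1:9100/events      # SSE stream; Ctrl-C to stop
# Orchestrator API stays loopback-only unless allow_public=true; endpoint path is a placeholder.
curl -s -H "Authorization: Bearer $ESNODE_ORCH_TOKEN" "http://127.0.0.1:9100/orchestrator/<endpoint>"
```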
- Power: keep `esnode_gpu_power_watts` and add optional `esnode_cpu_package_power_watts`, `esnode_node_power_watts`, and PDU/BMC readings where available (RAPL/hwmon/IPMI).
- Model metadata: pair ESNODE with app-exported metrics such as `model_tokens_total`, `model_requests_total`, and latency histograms labeled by `model`/`namespace`.
- Efficiency: use PromQL like `sum(rate(model_tokens_total[5m])) / sum(avg_over_time(esnode_node_power_watts[5m]))` for tokens-per-watt (power is a gauge, so average it rather than `rate()` it) and scale to tokens-per-kWh; see `docs/monitoring-examples.md` for dashboards and the query sketch below.
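For a one-off check, the same expression can be evaluated against the Prometheus HTTP API; the server address is an assumption, and the metric names match the examples above:

```sh
# Sketch: query tokens-per-watt from a Prometheus server scraping these metrics.
# localhost:9090 is an assumption; point it at your Prometheus instance.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(model_tokens_total[5m])) / sum(avg_over_time(esnode_node_power_watts[5m]))'
```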
- Source-available under ESNODE BUSL-1.1 (see `LICENSE`).
- Trademarks governed by `docs/TRADEMARK_POLICY.md`; no rebranding or redistribution of binaries.
- Contributions require agreement to `docs/CLA.md`.
- Official binaries and commercial terms are controlled solely by Estimatedstocks AB.
- ESNODE-Core (public, source-available)
- License: current ESNODE BUSL-style terms.
- Usage: free for internal use; redistribution/trademark restrictions still apply.
- Distribution: public binaries at https://esnode.co/downloads.
- ESNODE-Pulse (licensed-only, revenue-based; separate repo)
- License: proprietary.
- Revenue rule: ≤ USD 2M revenue → free starter license after registration; > USD 2M → paid subscription required before production use.
- Distribution: not in this repository. Binaries are provided only after registration via https://esnode.co/products#pulse (or designated portal).
- `esnode-core`: per-node collector exposing Prometheus `/metrics`, JSON `/status` (`/v1/status`), and SSE `/events`.
- `esnode-orchestrator`: optional autonomous resource manager (embedded lib, CLI-configurable).
- `esnode-pulse`: licensed controller that polls agents and centralizes policy/alerts (not included in this repository).
See `docs/architecture.md` and `docs/platform-matrix.md` for topology and build targets.
```sh
cargo build --workspace --release
./target/release/esnode-core
```

- Cross-compiling on macOS for `x86_64-unknown-linux-gnu` / `aarch64-unknown-linux-gnu` requires the corresponding GNU toolchains (e.g., `brew install x86_64-unknown-linux-gnu`); see the sketch below.
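A minimal cross-compile sketch, assuming the Homebrew `messense/macos-cross-toolchains` tap supplies the GNU toolchain; the tap name and linker binary are assumptions, so adapt them to whichever toolchain you install:

```sh
# Sketch: cross-compile the workspace for x86_64 Linux from macOS.
brew tap messense/macos-cross-toolchains    # assumption: toolchain source
brew install x86_64-unknown-linux-gnu
rustup target add x86_64-unknown-linux-gnu
CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_LINKER=x86_64-unknown-linux-gnu-gcc \
  cargo build --workspace --release --target x86_64-unknown-linux-gnu
```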
Configuration precedence: CLI flags > env vars > `esnode.toml` > defaults. See `docs/quickstart.md` for full examples.
- Config flags of interest:
  - `enable_gpu_mig` (default false) – turn on MIG scraping when built with `gpu-nvml-ffi`.
  - `enable_gpu_events` (default false) – run the NVML event loop (best-effort) for XID/ECC/clock/power events.
  - `enable_gpu_amd` (default false) – experimental AMD/ROCm collector scaffolding; emits no metrics unless rsmi/rocm-smi support is added.
  - `k8s_mode` (default false) – emit compatibility labels using Kubernetes/CDI resource names alongside UUID/index labels.
  - `gpu_visible_devices` / `NVIDIA_VISIBLE_DEVICES` – filter visible GPUs.
  - `mig_config_devices` / `NVIDIA_MIG_CONFIG_DEVICES` – filter MIG-capable GPUs when `enable_gpu_mig` is true.
- Optional `gpu-nvml-ffi-ext` feature enables additional NVML field-based counters (PCIe, etc.), best-effort only.
- `enable_app` + `app_metrics_url` – the app/model metrics collector uses a 2s HTTP timeout; slow or hung endpoints are skipped for that scrape without blocking other collectors.
- Orchestrator control API (`/orchestrator/*`) is exposed only on loopback listeners by default; set `orchestrator.allow_public=true` explicitly if you need to serve it on non-loopback addresses and configure `orchestrator.token` to require `Authorization: Bearer <token>`.
Local TSDB path (default): when `enable_local_tsdb` is true, the agent now resolves `local_tsdb_path` to `$XDG_DATA_HOME/esnode/tsdb` or `~/.local/share/esnode/tsdb` so non-root runs don't fail on `/var/lib`. Set `ESNODE_LOCAL_TSDB_PATH` or the config key if you want `/var/lib/esnode/tsdb`, and ensure the directory is writable by the agent user.
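Putting the common keys together, a minimal `esnode.toml` sketch; the top-level layout and the `[orchestrator]` table are assumptions inferred from the key names above, so treat `docs/quickstart.md` as authoritative:

```sh
# Sketch: write an example config; key layout is an assumption.
sudo tee /etc/esnode/esnode.toml >/dev/null <<'EOF'
enable_gpu_mig    = false
enable_gpu_events = false
k8s_mode          = false
enable_local_tsdb = true
local_tsdb_path   = "/var/lib/esnode/tsdb"   # omit to keep the XDG default

[orchestrator]
allow_public = false
token        = ""
EOF
```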
- Tagging `vX.Y.Z` on the default branch triggers `.github/workflows/release.yml`, which:
  - Runs `cargo test --workspace --locked --target <triple>` on Linux (x86_64), macOS (aarch64), and Windows (x86_64).
  - Builds release binaries with default features for the same triples.
  - Packages artifacts as tar.gz (Linux/macOS) or zip (Windows) and attaches them to the GitHub Release created from the tag.
- Artifact names: `esnode-core-linux-x86_64.tar.gz`, `esnode-core-macos-aarch64.tar.gz`, `esnode-core-windows-x86_64.zip`.
- Binaries are built with default features; MIG metrics still require `gpu-nvml-ffi` and `enable_gpu_mig` when running on MIG-capable hosts.
- For additional targets or feature builds, run `cargo build --release --locked --target <triple>` locally and publish as needed.
- Manual packaging: `scripts/dist/esnode-core-release.sh` (optionally with `ESNODE_VERSION=X.Y.Z`) builds and collects Linux tar/deb/rpm bundles (and a Windows zip if the toolchain is available) under `public/distribution/releases/`. Push a tag `vX.Y.Z` after verification to publish GitHub release artifacts automatically; see the sketch below.
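End to end, that flow looks roughly like this (the version number is illustrative):

```sh
# Sketch: package locally, inspect the bundles, then tag to trigger the release workflow.
ESNODE_VERSION=0.1.0 scripts/dist/esnode-core-release.sh
ls public/distribution/releases/
git tag v0.1.0
git push origin v0.1.0
```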
Community & policies:
- Contribution guidelines: see `CONTRIBUTING.md`.
- Code of Conduct: see `CODE_OF_CONDUCT.md`.
- Security reporting: see `SECURITY.md`.
- Support & upgrade policy: see `docs/support-policy.md`.
- Metric label migration (UUID-first): see `docs/migrations.md`.
- Sponsorship: see `docs/sponsorship.md` (GitHub Sponsors for ESNODE).
- Containers: see `docs/container.md` for distroless build/run instructions.
- Standalone: full local CLI/TUI, toggle metric sets, `/metrics` always on.
- Connect to server: `esnode-core server connect --address <host:port> [--token <join_token>]` persists the server + IDs and flips to managed mode (local tuning disabled, metrics plane untouched); see the sketch below this list.
- Disconnect: `esnode-core server disconnect` returns to standalone.
- Status: `esnode-core server status` shows server, cluster ID, node ID, last contact.
- TUI: `esnode-core cli` shows the full menu when standalone; shows a managed read-only panel when attached to ESNODE-Pulse.
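A typical attach/detach sequence; the server address and join token below are placeholders:

```sh
# Sketch: attach to an ESNODE-Pulse server, inspect state, then detach.
# Address and join token are placeholders; use the values issued by your Pulse deployment.
esnode-core server connect --address pulse.example.internal:9443 --token "$JOIN_TOKEN"
esnode-core server status
esnode-core server disconnect
```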
- TUI surfaces degradation flags/score on Node Overview, Network & Disk, Agent Status; orchestrator screen shows token/loopback exposure.
- App collector uses a 2s timeout; slow endpoints are skipped per scrape to avoid blocking other collectors.
- TSDB: defaults to XDG (`~/.local/share/esnode/tsdb`); opt into `/var/lib/esnode/tsdb` explicitly and pre-create it with correct perms.
- Orchestrator: keep loopback-only unless `allow_public=true` and `token` is set; audit logs appear under tracing target `audit`.
- `cargo test --workspace` includes a TUI render smoke test using ratatui's test backend (no PTY required).
- New metrics live in `docs/metrics-list.md`; gap tracking in `docs/gap-logbook.md`.
- Local HTTP defaults avoid privileged paths; adjust in `crates/agent-core/src/config.rs` if changing defaults.
Fastest path (recommended):
```sh
curl -fsSL https://raw.githubusercontent.com/ESNODE/ESNODE-Core/main/public/install.sh | sh
```

Notes:
- Installs the `esnode-core` binary under `/usr/local/bin` and (on Linux) enables a `systemd` service with a default `/etc/esnode/esnode.toml`.
- To pin a version or avoid systemd setup, pass args via `sh -s --`:

```sh
curl -fsSL https://raw.githubusercontent.com/ESNODE/ESNODE-Core/main/public/install.sh | sh -s -- --version 0.1.0 --no-service
```
Example commands (adjust version/OS paths):
- Ubuntu/Debian (`.deb`):

```sh
wget -O esnode-core_0.1.0_amd64.deb https://esnode.co/downloads/esnode-core_0.1.0_amd64.deb
sudo dpkg -i esnode-core_0.1.0_amd64.deb
sudo systemctl enable esnode-core && sudo systemctl start esnode-core
```
- RHEL/CentOS (`.rpm`):

```sh
wget -O esnode-core-0.1.0-1.x86_64.rpm https://esnode.co/downloads/esnode-core-0.1.0-1.x86_64.rpm
sudo rpm -i esnode-core-0.1.0-1.x86_64.rpm
sudo systemctl enable esnode-core && sudo systemctl start esnode-core
```
- Generic Linux (`tar.gz`):

```sh
wget -O esnode-core-0.1.0-linux-amd64.tar.gz https://esnode.co/downloads/esnode-core-0.1.0-linux-amd64.tar.gz
tar xvf esnode-core-0.1.0-linux-amd64.tar.gz
sudo mv esnode-core /usr/local/bin/
esnode-core --version
```
- Windows/macOS artifacts will follow the same public repo layout when available.
- Not distributed publicly or within this codebase. Binaries are provided only after registration/approval.
- Revenue rule: ≤ USD 2M revenue → free starter license (registration required). > USD 2M → paid subscription before production use.
- Request access: https://esnode.co/products#pulse (submit company + revenue band, accept terms).
- ESNODE-Pulse binaries must not be mirrored to the public distribution paths; see the private repository for build details.
- Docker: `Dockerfile` (builds from `public/distribution/releases/linux-amd64/esnode-core-0.1.0-linux-amd64.tar.gz`)
- systemd: `deploy/systemd/esnode-core.service`
- Kubernetes DaemonSet: `deploy/k8s/daemonset.yaml`
- Build/pull image: `docker build -t myregistry/esnode-core:0.1.0 -f Dockerfile .` (uses `public/distribution/releases/linux-amd64/esnode-core-0.1.0-linux-amd64.tar.gz`).
- Apply manifests (headless service + ConfigMap + DaemonSet):

```sh
kubectl apply --dry-run=client -f deploy/k8s/esnode-configmap.yaml
kubectl apply --dry-run=client -f deploy/k8s/esnode-service.yaml
kubectl apply --dry-run=client -f deploy/k8s/esnode-daemonset.yaml
kubectl apply -f deploy/k8s/
```
- ConfigMap (`esnode.toml`) uses a loopback-only orchestrator by default, enables the TSDB at `/var/lib/esnode/tsdb` (hostPath volume), and keeps collectors on.
- DaemonSet runs hostNetwork+hostPID, privileged for NVML access, and mounts `/dev` plus the TSDB hostPath. Probes hit `/healthz`; port 9100 is exposed via a headless Service for scraping.
- Override `image:` and namespace as needed; set `local_tsdb_path` to match your volume; set `orchestrator.token` and `allow_public` only when intentionally exposing the control API.
- If building multi-arch images, supply the matching tarball for each arch (e.g., `linux-arm64`); the current Dockerfile targets `linux-amd64`.
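Before rolling the image out to a cluster, a quick local smoke test with Docker; `--gpus all` assumes the NVIDIA container toolkit and can be dropped on non-GPU hosts:

```sh
# Sketch: run the freshly built image locally and hit the health endpoint.
docker run -d --name esnode-core --gpus all -p 9100:9100 myregistry/esnode-core:0.1.0
curl -s http://127.0.0.1:9100/healthz
docker rm -f esnode-core   # clean up
```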
```sh
helm upgrade --install esnode-core ./deploy/helm/esnode-core \
  --set image.repository=myregistry/esnode-core \
  --set image.tag=0.1.0 \
  --set tsdb.hostPath=/var/lib/esnode/tsdb \
  --set config.orchestrator.allowPublic=false \
  --set config.orchestrator.token=""
```

Adjust hostPath, namespace (`-n`), tolerations/nodeSelector, and orchestrator/token as needed.
Example using the local chart:
module "esnode_core" {
source = "./deploy/terraform/esnode-core"
namespace = "default"
release_name = "esnode-core"
image_repository = "myregistry/esnode-core"
image_tag = "0.1.0"
tsdb_host_path = "/var/lib/esnode/tsdb"
orchestrator_allow_public = false
orchestrator_token = ""
}Run terraform init && terraform apply with a configured kubecontext. Update the image/tag and host paths to match your environment.
- Quickstart: `docs/quickstart.md`
- Metrics reference: `docs/metrics-list.md`
- Monitoring examples: `docs/monitoring-examples.md`
- Architecture: `docs/architecture.md`
- Platform matrix: `docs/platform-matrix.md`
- Dashboards & alerts: `docs/dashboards/grafana-esnode-core.json` and `docs/dashboards/alerts.yaml` (import into Grafana/Prometheus)
- Smoke test script: `scripts/smoke.sh` (builds, runs core+pulse locally, curls endpoints)
ESNODE-Core is designed to run as a DaemonSet on Kubernetes, providing monitoring for each node.
- Docker image: `myregistry/esnode-core:0.1.0` (replace `myregistry` with your actual registry)
- Release tarball: `esnode-core-0.1.0-linux-amd64.tar.gz`
- Checksum: `public/distribution/releases/linux-amd64/esnode-core-0.1.0-linux-amd64.tar.gz.sha256`
- Configure: edit `deploy/k8s/esnode-configmap.yaml` to adjust `esnode.toml` settings (e.g., a specific orchestrator URL).
- Apply manifests:

```sh
kubectl apply -f deploy/k8s/esnode-configmap.yaml
kubectl apply -f deploy/k8s/esnode-daemonset.yaml
kubectl apply -f deploy/k8s/esnode-service.yaml
```
The DaemonSet automatically requests privileges for hardware monitoring:
- `privileged: true` security context
- `hostPID: true` for process monitoring
- `hostNetwork: true` (optional, but recommended for agent connectivity)
- `/dev` and `/proc` mounts
Ensure your nodes support these capabilities. For GPU monitoring, the DaemonSet sets `NVIDIA_VISIBLE_DEVICES=all`.
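Once applied, a quick way to confirm the rollout; the DaemonSet name and the `app=esnode-core` label selector are assumptions, so match them to your manifests:

```sh
# Sketch: check the DaemonSet and port-forward one pod to read metrics.
kubectl get daemonset esnode-core -o wide
kubectl port-forward "$(kubectl get pods -l app=esnode-core -o name | head -n1)" 9100:9100 &
sleep 2
curl -s http://127.0.0.1:9100/metrics | head
```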
