AI-powered root cause analysis for HPC and GPU clusters. Automated diagnosis, real-time streaming, human-in-the-loop remediation. Runs entirely on your infrastructure.
Everything you need to monitor, diagnose, and remediate GPU cluster issues.
LangGraph ReAct agent queries Prometheus metrics, Loki logs, and K8s state to find root causes automatically.
Visualize your entire GPU fleet: racks, nodes, and GPUs. See health status at a glance on an interactive floor plan.
Watch the AI think in real time. See tool calls, reasoning steps, and results streamed live to the GUI.
Operator approval for high-risk actions. The AI proposes, you approve. Full audit trail for every action.
Runs entirely on your infrastructure. Self-hosted vLLM inference. No data leaves your cluster.
Single-binary installer. Deploy the server, add GPU nodes from the GUI, and agents deploy automatically via SSH.
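To make the "AI proposes, you approve" flow concrete, here is a minimal Python sketch of an approval gate with an audit trail. It is an illustration under stated assumptions, not XPerf's actual implementation: the action names, the HIGH_RISK set, AuditLog, and run_action are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable

# Hypothetical action names; a real deployment would define its own risk policy.
HIGH_RISK = {"reboot_node", "drain_node", "reset_gpu"}


@dataclass
class AuditLog:
    """In-memory audit trail; every proposed action is recorded with its decision."""
    entries: list = field(default_factory=list)

    def record(self, action: str, target: str, decision: str) -> None:
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "target": target,
            "decision": decision,
        })


def run_action(action: str, target: str,
               approve: Callable[[str, str], bool], log: AuditLog) -> bool:
    """Return True if the action may proceed.

    High-risk actions require the operator callback `approve` to return True;
    everything else is allowed automatically. Both paths are logged.
    """
    if action in HIGH_RISK:
        if not approve(action, target):
            log.record(action, target, "rejected")
            return False
        log.record(action, target, "approved")
    else:
        log.record(action, target, "auto")
    return True
```

In this sketch the approve callback stands in for the operator's decision in the GUI. A rejected proposal is still written to the log, so the trail covers every action the AI proposed, not only the ones that ran.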
Connect to your XPerf Server and start monitoring your GPU cluster.
Or build from source: cd apps/gui && flutter build macos