llmtop: an nvtop for local LLM inference

Zero-config autodiscovery of vLLM, Ollama, llama.cpp and friends, and the GB10 unified-memory problem that made the GPU panel interesting.

I usually have several inference engines running at once on my workstation: a vLLM server here, an Ollama daemon there, sometimes a llama.cpp build and a small router in front of all of them. nvidia-smi tells me the GPU is busy, but not which engine is busy, which model is loaded, how full the KV cache is, or how many requests are queued. So I built a tool that answers those questions in one screen, and ran the build itself as a small fan-out of coding agents.

llmtop is an nvtop for local LLM inference. You run one command and it finds every inference engine on the machine, works out what each one is serving and how it is configured, and shows live GPU and serving metrics in a terminal UI. There are no flags, no config file, and no manual port list. The headline feature is that it just finds everything.

pip install git+https://github.com/rxxusp/llmtop.git
llmtop

Autodiscovery is the hard part

Finding engines reliably means not trusting any single signal. llmtop cross-correlates three independent ones. First, it walks the process table for known launchers such as vllm serve, llama-server, ollama, text-generation-launcher and sglang.launch_server. Second, it scans common serving ports on loopback plus any ports owned by those processes. Third, it fingerprints each open port with cheap read-only calls and classifies by the shape of the response: an OpenAI-style list from /v1/models, the vllm-prefixed Prometheus series from /metrics, the Ollama /api/tags model list, the llama.cpp /props document, and so on.
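As a rough illustration of classifying by response shape, here is a minimal sketch. It is not llmtop's actual code: the function name classify_response and the returned labels are made up for this example, and the exact keys it checks (the "vllm:" series prefix, "models" for Ollama, "default_generation_settings" for llama.cpp, "object": "list" for OpenAI-style servers) are my assumptions about what those endpoints usually return.

```python
import json
from typing import Optional


def classify_response(path: str, status: int, body: str) -> Optional[str]:
    """Guess an engine from one probe response. None means "no evidence"."""
    if status != 200:
        return None  # errors, including 401, prove nothing about the engine
    if path == "/metrics":
        # Assumption: vLLM's Prometheus series names start with "vllm:".
        if any(line.startswith("vllm:") for line in body.splitlines()):
            return "vllm"
        return None
    try:
        doc = json.loads(body)
    except ValueError:
        return None  # HTML or other non-JSON content is not a match
    if not isinstance(doc, dict):
        return None
    if path == "/api/tags" and isinstance(doc.get("models"), list):
        return "ollama"
    if path == "/props" and "default_generation_settings" in doc:
        return "llama.cpp"
    if (path == "/v1/models" and doc.get("object") == "list"
            and isinstance(doc.get("data"), list)):
        return "openai-compatible"
    return None
```

Checking the parsed structure rather than just "did it return 200" is what keeps an HTML page or an unrelated JSON document from being mistaken for an engine.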

The interesting failures showed up immediately when I pointed it at my own machine. An auth-gated router returned HTTP 401 to every probe path, and the first version of several adapters treated a 401 as a positive match, so whichever adapter ran first claimed the router as its own engine type. A separate web UI on another port returned HTML for most paths but a real JSON version document for one of them, which fooled a weak fallback check. The fix is a discipline worth stating plainly: a distinctive endpoint returning 401 is not proof of any specific engine. Only the generic OpenAI adapter treats a 401 as "present but blocked", and the specific adapters claim an auth-gated port only when the process scan already named that engine. Set LLMTOP_API_KEY and the protected endpoints become introspectable again.
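The 401 rule is small enough to write down as a decision function. This is a sketch of the rule as described, not the real adapter code: claim_on_401, the adapter labels, and the "claimed"/"blocked" return values are hypothetical names chosen for the example.

```python
from typing import Optional

GENERIC = "openai-compatible"  # hypothetical label for the generic adapter


def claim_on_401(adapter: str, process_hint: Optional[str]) -> Optional[str]:
    """Decide what an adapter may conclude when its endpoint returns 401.

    process_hint is the engine named by the process scan for this port,
    or None when the scan could not attribute the port to any launcher.
    """
    if adapter == GENERIC:
        return "blocked"  # something is there, but it needs credentials
    if process_hint == adapter:
        return "claimed"  # the process scan already named this engine
    return None  # a bare 401 is not evidence for a specific engine
```

With this rule an auth-gated router with no matching launcher process ends up as a blocked generic endpoint instead of being claimed by whichever specific adapter happened to run first.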

The GB10 unified-memory problem

The GPU panel turned out to be the part with the most teeth, because I run this on an NVIDIA GB10 where the GPU shares memory with system RAM. On that device the obvious NVML call for memory simply refuses to answer:

nvmlDeviceGetMemoryInfo(h)   -> NVMLError: Not Supported (code 3)
nvmlDeviceGetEnforcedPowerLimit(h) -> Not Supported
nvmlDeviceGetClockInfo(h, NVML_CLOCK_MEM) -> Not Supported

A naive monitor either crashes here or reports a bogus zero total. The correct behavior is to wrap every NVML getter on its own, so that one unsupported field never costs the others. Utilization, temperature, power draw, and the SM clock all work fine on GB10, so those are shown. Total memory, the power cap, the memory clock, and fan speed are not supported, so those degrade to n/a rather than to a misleading zero. When the memory query fails, llmtop labels the device as unified memory and reports the system RAM total instead, which is the honest figure for a shared pool.
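A minimal sketch of the "wrap every getter on its own" idea follows. The helper names (safe, read_fields, nvml_getters, memory_total) are invented for this example, and the code takes the pynvml module as a parameter so it can be read and run without a GPU. The NVML function names are the standard pynvml bindings; reading system RAM through os.sysconf is my assumption about one way to get the shared-pool total on Linux.

```python
import os
from typing import Any, Callable, Dict, Optional, Tuple


def safe(getter: Callable[[], Any]) -> Optional[Any]:
    """Run one getter; any failure becomes None instead of propagating."""
    try:
        return getter()
    except Exception:  # pynvml raises NVMLError subclasses such as Not Supported
        return None


def read_fields(getters: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Evaluate each field independently so one failure never costs the others."""
    return {name: safe(getter) for name, getter in getters.items()}


def nvml_getters(nvml: Any, handle: Any) -> Dict[str, Callable[[], Any]]:
    """Build lazy getters; nvml is the imported pynvml module."""
    return {
        "util_pct": lambda: nvml.nvmlDeviceGetUtilizationRates(handle).gpu,
        "temp_c": lambda: nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU),
        "power_mw": lambda: nvml.nvmlDeviceGetPowerUsage(handle),
        "power_cap_mw": lambda: nvml.nvmlDeviceGetEnforcedPowerLimit(handle),
        "sm_clock_mhz": lambda: nvml.nvmlDeviceGetClockInfo(handle, nvml.NVML_CLOCK_SM),
        "mem_clock_mhz": lambda: nvml.nvmlDeviceGetClockInfo(handle, nvml.NVML_CLOCK_MEM),
        "fan_pct": lambda: nvml.nvmlDeviceGetFanSpeed(handle),
        "mem_total_bytes": lambda: nvml.nvmlDeviceGetMemoryInfo(handle).total,
    }


def memory_total(fields: Dict[str, Any]) -> Tuple[Optional[int], str]:
    """Return (bytes, label); fall back to system RAM when NVML has no total."""
    if fields.get("mem_total_bytes") is not None:
        return fields["mem_total_bytes"], "dedicated"
    ram = safe(lambda: os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES"))
    return ram, "unified"
```

A caller would render any None as n/a. Because the getters are lazy lambdas, even a missing symbol in the NVML binding degrades to None for that one field rather than aborting the whole sample.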

The pleasant surprise is that per-process GPU memory still works through NVML even when the total does not. That means llmtop can attribute memory to individual engines on a device where the aggregate is unavailable. There is one more wrinkle: my vLLM runs inside a container as root, so the socket-to-process-ID mapping that would normally link a port to a process is denied without elevated privileges. NVML, however, hands back the GPU process directly. So when the port owner is unknown, llmtop walks the GPU process and its ancestors, classifies the tree by command line, and attributes the memory to the matching engine. The result is that the vLLM process holding 102 GiB shows up correctly against the vLLM server on its port, with its real launcher process ID and uptime recovered as a side effect.
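The ancestor walk can be sketched over a plain process table. This is an illustration under assumptions, not the real implementation: the table format (pid mapped to a parent pid and a command line), the function names, and the substring matching are all invented here, and a real version would fill the table from /proc or a library and take the starting pid from NVML's per-process list.

```python
from typing import Dict, Optional, Tuple

# Launcher markers taken from the launchers named earlier; order matters
# because matching is a simple substring test.
MARKERS = [
    ("vllm", "vllm serve"),
    ("llama.cpp", "llama-server"),
    ("tgi", "text-generation-launcher"),
    ("sglang", "sglang.launch_server"),
    ("ollama", "ollama"),
]

ProcTable = Dict[int, Tuple[int, str]]  # pid -> (parent pid, command line)


def classify_cmdline(cmdline: str) -> Optional[str]:
    """Return the engine whose launcher marker appears in the command line."""
    for engine, marker in MARKERS:
        if marker in cmdline:
            return engine
    return None


def attribute_gpu_process(pid: int, table: ProcTable) -> Optional[Tuple[str, int]]:
    """Walk from a GPU process up through its ancestors.

    Returns (engine, launcher_pid) for the nearest process whose command
    line names a known launcher, or None when nothing in the chain matches.
    """
    seen = set()
    while pid in table and pid not in seen:  # seen guards against cycles
        seen.add(pid)
        ppid, cmdline = table[pid]
        engine = classify_cmdline(cmdline)
        if engine is not None:
            return engine, pid
        pid = ppid
    return None
```

The GPU process itself is often an anonymous worker, which is why the walk matters: the launcher a few levels up is the process that carries the recognizable command line, and its pid is what gives you the real uptime.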

Read-only by default

A monitor for inference servers must never perturb them. llmtop only ever issues cheap introspection and metrics requests. It never sends a completion or generation request, so it never costs a token and never changes server state. Everything is time-bounded and every sampling path is written to degrade rather than raise, so a missing GPU, a server that goes away mid-scan, or a partial metrics response leaves you with what is known and the rest marked n/a, never a stack trace.
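Here is one way a probe with those properties could look, using only the standard library. It is a sketch under assumptions rather than llmtop's code: probe and show are hypothetical names, the 0.5 second default is arbitrary, and the urllib timeout applies to each socket operation rather than to the request as a whole.

```python
import http.client
import urllib.error
import urllib.request
from typing import Any, Optional, Tuple


def probe(url: str, timeout: float = 0.5,
          max_bytes: int = 1 << 20) -> Optional[Tuple[int, str]]:
    """Issue one GET and return (status, body), or None on any failure.

    No request body is ever sent, so this cannot trigger a generation.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(max_bytes).decode("utf-8", "replace")
            return resp.status, body
    except urllib.error.HTTPError as err:
        return err.code, ""  # for example 401: reachable but gated
    except (OSError, ValueError, http.client.HTTPException):
        return None  # refused, timed out, malformed URL, truncated reply


def show(value: Any) -> str:
    """Render a possibly missing value for the UI."""
    return "n/a" if value is None else str(value)
```

Returning None instead of raising pushes the "unknown" state all the way to the rendering layer, where it becomes n/a.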

Adapters are the extension point

Supporting a new engine is one small module with three methods: detect, describe, and metrics. The discovery loop, the GPU sampling, the rate derivation, and the UI are all shared, so an adapter only has to know how to recognize its server and read its models and counters. The repository ships adapters for vLLM, llama.cpp, Ollama, TGI, SGLang, a generic OpenAI-compatible fallback, and a catch-all unknown, and the contract for writing your own is documented.
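To give a feel for the size of an adapter, here is a hypothetical one. Only the three method names come from the text above; the signatures, the Fetch callable, and the returned dictionaries are my guesses for illustration, and the documented contract in the repository is the authority. The example reads Ollama's /api/tags and /api/ps documents, assuming each carries a "models" list.

```python
from typing import Any, Callable, Dict, Optional, Protocol

# Hypothetical: a shared helper that GETs a path and returns parsed JSON or None.
Fetch = Callable[[str], Optional[Dict[str, Any]]]


class Adapter(Protocol):
    name: str

    def detect(self, fetch: Fetch) -> bool: ...
    def describe(self, fetch: Fetch) -> Dict[str, Any]: ...
    def metrics(self, fetch: Fetch) -> Dict[str, Any]: ...


class OllamaLikeAdapter:
    """Illustrative adapter; not the one shipped in the repository."""

    name = "ollama"

    def detect(self, fetch: Fetch) -> bool:
        doc = fetch("/api/tags")
        return isinstance(doc, dict) and isinstance(doc.get("models"), list)

    def describe(self, fetch: Fetch) -> Dict[str, Any]:
        doc = fetch("/api/tags") or {}
        models = [m.get("name") for m in doc.get("models", []) if isinstance(m, dict)]
        return {"models": models}

    def metrics(self, fetch: Fetch) -> Dict[str, Any]:
        doc = fetch("/api/ps") or {}
        return {"loaded_models": len(doc.get("models", []))}
```

Passing the fetch helper in, rather than letting the adapter open its own connections, is one way to keep timeouts and the read-only guarantee in shared code.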

llmtop is built with Python and Textual and is MIT licensed. The source is on GitHub at github.com/rxxusp/llmtop. There is also a headless mode, llmtop --json, that prints one discovery and metrics snapshot for scripting. If you run models locally and want to see the whole stack at a glance, give it a try.
