local-engine-router: one endpoint for a fleet of local inference engines

An OpenAI- and Ollama-compatible reverse proxy that auto-swaps the GPU to the engine that owns the requested model, and the memory-settle wait that makes it work on a unified-memory box.

On a unified-memory box, you get to keep exactly one heavy model resident at a time. My workstation is an NVIDIA GB10 where the GPU shares memory with system RAM, and the same constraint shows up on Apple Silicon: there is one pool, and a 100 GiB model fills most of it. So I run a fleet of inference engines but only one of them can hold the GPU at any moment. What I actually wanted was to address all of my models by name from a single URL, and have the machine quietly load the right engine when a request for one of its models came in. local-engine-router is the small piece that does that, and v0.3.0 is now out.

It is a single-port reverse proxy that is both OpenAI- and Ollama-compatible, and it fronts a fleet of local inference engines behind one endpoint. When a request arrives, it reads the model field, finds the engine that owns that model, and if that engine is not the one currently resident on the GPU, it swaps to it. It speaks to vLLM, llama.cpp, SGLang, Ollama, TabbyAPI, LM Studio, KoboldCpp, MLX, LocalAI, ramalama, and Modular MAX.

git clone https://github.com/rxxusp/local-engine-router
cd local-engine-router && pip install .
local-engine-router --config config.yaml   # after you write a config.yaml

Why not an existing tool

The cluster-scale projects solve a different problem. GPUStack and the vLLM production stack assume you have hardware to spare and want to schedule across it, which is overkill on a box that holds one model. llama-swap swaps llama.cpp processes neatly, but it is llama.cpp shaped, and nothing I found cleanly fronts heterogeneous backends behind one OpenAI endpoint. I did not want to pick one engine and live inside it. I wanted to keep vLLM for the models it serves best, llama.cpp for the quants it loads, Ollama for convenience, and reach all of them by name from one address. The router is the thin layer that makes a single machine behave that way.

The swap, and the one step everyone gets wrong

The swap is the centerpiece, so it is worth walking through. A request comes in for a model the resident engine does not own. The router drains the in-flight requests on the outgoing engine, frees its VRAM, and then it waits. This waiting is the part that no other tool I know of does, and it is the reason the project exists.

When you free a large allocation, the kernel does not hand those pages back the instant the process exits. There is a lag while the OS reclaims them. If you start the incoming engine immediately, its pre-flight memory check reads MemAvailable, sees a number that is smaller than the memory that is really about to be free, and aborts with a spurious out-of-memory error. The model that would have fit fine refuses to load. The router avoids this by polling MemAvailable after the free and waiting for it to plateau, so the incoming engine only starts once the reclaimed memory has actually settled. On Linux it reads /proc/meminfo directly for the fast path, and falls back to psutil on macOS and Windows.

Once memory has settled, the router starts the target engine and waits for it to come up. Cold starts are not quick. Loading a large model can take minutes, and a streaming client sitting on an open connection will time out or give up during that window unless something keeps the channel alive. So the router holds streaming clients open across the cold start with keep-alive frames, and it sends the protocol-correct kind for each surface: SSE comment frames on the /v1/* paths, bare newlines on the Ollama /api/* paths. If a client disconnects while waiting, the router notices and never leaves a pinned GPU behind it. A dropped connection does not leak a loaded model.

What you configure

Everything is YAML, and there is no Python to write. You describe each engine with one of four lifecycle types. generic_process is the common case, where the router launches and stops a server such as llama.cpp or vLLM. api_swap is for engines like TabbyAPI that load and unload models over HTTP rather than by restarting a process. ollama drives the Ollama daemon. And ds4 is an advanced escape hatch for driving a systemd-managed service, for the cases the first three do not cover. There are copy-paste presets for all eleven backends, so the usual starting point is to paste a preset and edit a path.

engines:
  llamacpp:
    type: generic_process
    base_url: http://127.0.0.1:8080
    start_cmd: [llama-server, -m, /models/qwen2.5-7b.gguf, --port, "8080"]
    ready_path: /health
  ollama:
    type: ollama
    base_url: http://127.0.0.1:11434

models:
  - id: qwen2.5-7b-instruct
    engine: llamacpp
  - id: llama3.1:8b
    engine: ollama

Both API surfaces are first class. The OpenAI /v1/* routes and the Ollama-native /api/* routes both trigger swaps, so a tool that speaks either dialect can drive the whole fleet without knowing which engine is behind a given model name. Process and memory control are cross-platform. There are Prometheus metrics for the swaps and the engines. The test suite is hermetic and runs 206 tests with no GPU and no real engines in the loop, which is what lets me trust changes to the swap logic. It is pip-installable, there is a multi-arch Docker image, and it is MIT licensed.

What it does not do

Two honest limitations. First, a non-streaming request that triggers a swap blocks for the entire duration of that swap. The keep-alive trick only helps streaming clients, so for plain request-response calls you should set a generous read-timeout on the client and expect the occasional multi-minute wait when the model has to change. Second, exactly one engine holds the GPU at a time. That is not a missing feature, it is the design. On a single unified-memory pool there is only room for one heavy model, and the router's job is to make switching between them clean rather than to pretend you can run them all at once.

It pairs naturally with llmtop, which I wrote earlier to observe this same single-GPU local-inference setup. The router decides which engine owns the GPU, and llmtop shows you which one currently does. The source for the router is on GitHub at github.com/rxxusp/local-engine-router. If you run several engines on one memory-constrained machine and want to reach all of your models from one endpoint, give it a try.

✎local-engine-router: one endpoint for a fleet of local inference engines