▲local-engine-router

shipped

2026·python · fastapi · llm · vllm · ollama · llama.cpp · gpu · proxy · openai-api·github.com/rxxusp/local-engine-router

A single-port, OpenAI- and Ollama-compatible reverse proxy that fronts a fleet of local inference engines and auto-swaps the GPU to whichever one owns the requested model.

On a unified-memory machine like an NVIDIA GB10 or Apple Silicon, the GPU and system RAM share one pool, so only one heavy model fits at a time. local-engine-router lets you address a whole fleet of models from a single endpoint anyway. It reads the model field on each request, finds the engine that owns that model, and if that engine is not the one currently resident, it swaps the GPU over to it.

The swap is the interesting part. It drains in-flight work on the outgoing engine, frees its VRAM, and then waits for the kernel to actually reclaim that memory by polling MemAvailable until it plateaus. Skipping that wait is the mistake a naive swapper makes: the incoming model's pre-flight check reads a stale free-memory number and aborts with a spurious out-of-memory error. Streaming clients are held open across the cold start with protocol-correct keep-alive frames, and a dropped connection never leaks a pinned GPU.

Engines are described in YAML with no Python. Four lifecycle types cover llama.cpp, vLLM, SGLang, Ollama, TabbyAPI, LM Studio, KoboldCpp, MLX, LocalAI, ramalama, and Modular MAX, with copy-paste presets for each. Both the OpenAI /v1 and the Ollama-native /api surfaces are first class and trigger swaps. It is cross-platform, exposes Prometheus metrics, ships a hermetic test suite and a multi-arch Docker image, and is MIT licensed.

view on github ← back to projects