Key Takeaways
- The trigger for a gateway is operational, not architectural — when "which model served this and what did it cost" becomes a real question, a gateway pays for itself
- One model at low volume needs no gateway; calling the provider directly is the correct, cheapest answer until the problem grows — One model at low volume needs no gateway; calling the provider directly is the correct, cheapest answer until the problem grows
- The first fork is hosted vs self-hosted: a managed control plane buys speed, a self-hosted router keeps prompts in your perimeter and removes the margin — The first fork is hosted vs self-hosted: a managed control plane buys speed, a self-hosted router keeps prompts in your perimeter and removes the margin
- Leaving an aggregator is rarely all-or-nothing — wrap it with a gateway that treats it as one provider, then move high-volume models direct later
The instinct to add a gateway too early
There is a pattern I have watched repeat with every infrastructure trend: a useful component appears, and teams adopt it a layer too early, before the problem it solves actually exists for them. The LLM gateway is the current instance. It is a genuinely good idea — one endpoint in front of many models, with routing, failover, caching, observability, and spend control — and for a single model at modest volume it is overhead you do not need yet. The first question is not "which gateway," it is "do I have the problem a gateway solves."
The trigger is operational, not architectural
You need a gateway when the work around the model call gets complicated, not when you add a second model for its own sake. Concretely: you are calling several models and want each request routed to the cheapest capable one; you need resilience when a provider has an outage; you want to cap and attribute spend across teams; or you need to answer "which model served this request, and what did it cost" and currently cannot. When two or three of those are true, a gateway stops being overhead and starts paying for itself. Until then, calling the provider directly is the correct, cheaper answer, and I would not apologize for the simplicity.
The first fork: hosted or self-hosted
Once you do need one, the decision that matters most is not the feature matrix — it is data residency. A managed control plane gets you running fastest and takes operations off your plate, but it sits in your request path as a hosted dependency. A self-hosted router keeps every prompt inside your perimeter and strips any per-call margin, at the cost of you owning its uptime. If your prompts carry anything you would not want leaving your network, that single fact decides the fork before you compare a single feature. I have seen teams burn weeks on a feature bake-off when the residency requirement had already made the call for them.
Wrap, don't rip and replace
The other mistake is treating adoption as all-or-nothing. If you are already on an aggregator and want governance or self-hosting, you do not have to migrate off it — a router or control plane can treat that aggregator as one backend behind it, so the application keeps calling the same endpoint while you add the layer you were missing. This is also how I would leave a provider I had outgrown: wrap it, add the control, then move the highest-volume model direct once the math justifies removing the margin. Reversible steps beat a big-bang migration almost every time.
Where the gateway sits in the cost picture
People reach for a gateway expecting it to fix their bill. It helps — central routing to the cheapest capable model and caching repeat calls are real levers — but the larger savings usually live elsewhere, in the work I have written about in tokenmaxxing and free AI quota arbitrage: matching the model to the task, exploiting free tiers and off-peak windows, and not paying for context you do not need. A gateway makes those levers easier to pull from one place. It does not substitute for the discipline of pulling them.
The call I'd make
Start without a gateway. Go direct, keep it simple, and let the operational pain tell you when you have the problem. When you do — several models, real resilience needs, spend you cannot attribute, observability you cannot get — decide the hosted-versus-self-hosted fork on data residency first, then wrap what you already run rather than replacing it. The gateway is a control point worth having at the right time. The skill is not picking one; it is knowing when you have earned the right to add it.
Frequently asked questions
What is an LLM gateway?
A single endpoint in front of multiple model providers that adds the operational layer raw APIs leave out: routing each request to the cheapest or healthiest model, failing over when one provider degrades, caching prompts, tracking per-request cost, and governing spend. Your application calls the gateway once, in an OpenAI-compatible shape, and the gateway decides which model actually serves the request. It is a control point, not a passthrough.
Do I actually need one?
If you call one model at low volume, no — go direct and keep the stack simple. A gateway earns its place when you route across several models, need resilience against any single provider failing, want to cap and attribute spend across teams, or need request-level observability you do not get from raw APIs. The trigger is operational complexity, not model count alone: the moment "which model served this request and what did it cost" becomes a question you cannot answer, the gateway has already paid for itself.
Should I self-host or use a managed gateway?
Decide data residency before features. A managed control plane (Portkey and peers) is fastest to adopt and offloads operations, at the cost of a hosted dependency in your request path. A self-hosted router (LiteLLM and the open-source options) keeps prompts inside your perimeter and removes any per-call margin, at the cost of running it yourself. If prompt data must not leave your network, that decides it before any feature comparison does.
Is a gateway just an extra hop that adds latency?
It adds a hop, but a well-run gateway usually nets out neutral-to-positive because of caching, smarter routing, and avoided failures. The honest risk is not latency — it is putting a single point of failure in your critical path. Mitigate it the way you would any critical dependency: a self-hosted gateway you control, health checks, and a direct-to-provider fallback for your highest-volume model. If you cannot tolerate the gateway being down, design for it to fail open.
Can I keep my current setup and add a gateway later?
Yes, and that is usually the right sequence. Because a router like LiteLLM or a control plane like Portkey can treat your current provider — including an aggregator like OpenRouter — as one backend, you can wrap what you have rather than rip and replace. The application keeps calling an OpenAI-compatible endpoint; you gain governance, caching, or self-hosting in front, then move your highest-volume model direct once the economics justify stripping the margin.
How does a gateway relate to LLM cost?
It is one of the larger levers, because it routes each request to the cheapest capable model and caches repeat calls — but it is not magic, and it adds its own margin if managed. The bigger cost wins usually come from the things I have written about separately: matching the model to the task, exploiting free tiers and off-peak windows, and not burning tokens on context you do not need. A gateway makes those levers easier to pull centrally; it does not replace the discipline.
Ready to Transform Your AI Strategy?
Get personalized guidance from someone who's led AI initiatives at Adidas, Sweetgreen, and 50+ Fortune 500 projects.