The AI Inference Stack

Jake Flomenberg (author) and Richard Li (co-author)
If you want to deploy a GenAI-powered app or agentic experience, you need to manage not just cost and control but also fine-tune every aspect of quality. To make that work, you need a robust AI inference stack—across training, evaluation, and production.

Recently, we introduced the Agentic Runtime Stack—the software architecture required to build and operate agentic systems. These systems depend on large language models (LLMs) to reason, plan, and navigate ambiguity. While many companies rely on hosted APIs from providers like OpenAI, Anthropic, and Google, there's a growing shift toward self-hosting open-source models. 

The motivations are clear: self-hosting gives companies more control over data, lower cost per token, and the ability to optimize compute for quality. As open-source models like DeepSeek, Qwen, and Llama achieve state-of-the-art performance, teams increasingly want the flexibility to make the right tradeoffs—allocating more compute for harder problems, adjusting decoding strategies, and fine-tuning latency versus accuracy based on the task.

Importantly, inference doesn’t just matter in production. A high-quality inference stack is also essential during training, where it helps evaluate intermediate checkpoints under real-world conditions. It’s equally critical during evaluation, especially in workflows like direct preference optimization (DPO) and reinforcement learning from human feedback (RLHF), where generating, scoring, and ranking responses at scale directly influences the next training step.

And to do that, you need a robust AI inference stack.

AI Inference Stack

The AI inference stack is the software stack needed to run large language models. A year ago, this stack wasn’t really a “stack” at all: you picked an inference engine like vLLM (server) or Ollama (desktop). But as usage has scaled and the ecosystem matured, a more sophisticated set of tools and components has started to emerge.

What’s driving this shift is a growing focus on post-training optimization—everything that happens after a model is trained, but before (or during) deployment. This includes inference-time strategies to improve quality like reranking, longer context windows, and ensemble decoding. These techniques spend more compute to improve accuracy, but doing so efficiently requires infrastructure support. At the same time, post-training also includes workflows like model evaluation, preference tuning (e.g., DPO), and fine-tuning checkpoint selection—all of which rely heavily on fast, flexible inference. Many of these techniques were once proprietary, used only inside top labs, but are now being published and operationalized in open source. The result is a fast-evolving inference stack—blending application-level insight with infra-level performance tuning—that’s becoming foundational to how modern LLMs are deployed and improved.
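
To make one of these inference-time strategies concrete, here is a minimal sketch of best-of-n sampling with reranking against a self-hosted model served over an OpenAI-compatible API (the kind of server vLLM exposes). The endpoint URL, model name, and placeholder scorer are illustrative assumptions, not part of any specific product.

```python
from openai import OpenAI

# Illustrative sketch: point the OpenAI client at a self-hosted,
# OpenAI-compatible endpoint (e.g., a vLLM server). The URL and model
# name below are assumptions for the example.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def score(text: str) -> float:
    # Placeholder scorer: swap in a reward model or an LLM-as-judge call.
    return float(len(text))

def best_of_n(prompt: str, n: int = 4) -> str:
    """Spend extra compute at inference time: sample n candidates, keep the best."""
    candidates = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.9,  # encourage diversity across samples
        )
        candidates.append(resp.choices[0].message.content)
    return max(candidates, key=score)

print(best_of_n("Draft a one-paragraph summary of our refund policy."))
```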

We’ve identified five major layers to the AI inference stack: orchestration and routing, KV caching, the inference engine, compute infrastructure management, and GPU clouds. Reflecting the diversity in how organizations deploy their inference stacks, we see three main deployment models: a “build-your-own” approach where teams mix and match technical components, an integrated deployment approach where a single integrated stack runs on your own compute, and an “inference-as-a-service” approach that provides an API for uploading your own bespoke models.

GPU Clouds

AI inference requires GPUs—the more, the better. Hyperscalers such as AWS, Azure, and Google Cloud all offer GPU instances, but many teams also evaluate GPU-focused marketplaces and specialty clouds like CoreWeave, Lambda Labs, and The San Francisco Compute Company. These alternatives can deliver advantages in cost (particularly for spot or committed-use pricing), early access to niche or next-generation hardware, and tailored support for AI workloads. Hyperscalers, for their part, provide a unified platform that lets organizations leverage their existing workflows and expertise.

Compute infrastructure management

The compute infrastructure layer handles the deployment and physical resources (cloud or on-premises) that all other components run on: provisioning and managing the servers, GPUs, and networking so that LLM inference can scale reliably. Kubernetes is the most common approach to compute infrastructure management for AI inference today, and virtually all AI inference software providers include a Kubernetes installation and configuration option.
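
As a rough illustration of what this layer looks like in practice, the sketch below uses the Kubernetes Python client to declare a GPU-backed deployment running an inference server. The container image, model name, namespace, and replica count are illustrative assumptions.

```python
from kubernetes import client, config

# Minimal sketch: declare a GPU-backed Deployment for an inference server.
# Image, model, namespace, and replica count are illustrative assumptions.
config.load_kube_config()

container = client.V1Container(
    name="llm-server",
    image="vllm/vllm-openai:latest",  # assumed serving image
    args=["--model", "meta-llama/Llama-3.1-8B-Instruct"],
    ports=[client.V1ContainerPort(container_port=8000)],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="llm-inference"),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels={"app": "llm-inference"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "llm-inference"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```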

Inference engine

At the core of the stack is the inference engine: the optimized runtime that actually runs the model to generate outputs. This layer encompasses the software that loads LLM weights onto hardware and executes the neural network forward pass efficiently. A state-of-the-art inference engine is highly optimized for performance – it can batch multiple queries together, utilize mixed-precision or quantized math, and streamline memory usage to keep GPUs busy. Systems like vLLM or SGLang constantly introduce new algorithms that can dramatically improve performance.
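
To give a feel for what using an inference engine directly looks like, here is a minimal vLLM sketch; the model name and sampling settings are illustrative.

```python
from vllm import LLM, SamplingParams

# Minimal sketch of offline inference with vLLM; the model name is illustrative.
# The engine handles batching, memory management, and optional quantization
# under the hood.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Explain the tradeoffs of self-hosting open-source LLMs."], params
)
print(outputs[0].outputs[0].text)
```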

KV Cache (Distributed Key-Value Cache)

Large language models spend a lot of time recomputing attention over the same tokens, especially with long prompts or conversation history. The KV caching layer addresses this by storing the model’s intermediate key-value pairs (from the transformer’s attention mechanism) and reusing them for subsequent tokens or repeated context. By caching these computed attention states, the model doesn’t have to redo work for each new token – improving latency and efficiency for long or streaming responses. In a modern inference system, this cache may be distributed or unified across sessions so that multiple requests can leverage common prefixes or past computations. 
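
For example, vLLM exposes automatic prefix caching, which reuses cached attention states across requests that share a long common prefix; the model name and input file below are illustrative assumptions.

```python
from vllm import LLM, SamplingParams

# Sketch: enable automatic prefix caching so requests that share a long prefix
# (a system prompt, a document) reuse previously computed KV entries.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    enable_prefix_caching=True,
)

shared_context = open("contract.txt").read()  # long shared prefix (illustrative)
questions = ["Who are the parties?", "What is the termination clause?"]

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(
    [f"{shared_context}\n\nQuestion: {q}\nAnswer:" for q in questions], params
)
for out in outputs:
    print(out.outputs[0].text)
```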

Orchestration and routing

Traditionally, infrastructure routing was simple and blind: requests were directed to the nearest or least busy server based on load or latency, without any understanding of the request itself. In a modern AI inference stack, routing must be payload-aware. The orchestrator inspects each incoming prompt—looking at factors like length, structure, context tokens, and model requirements—and uses that information to make smarter decisions. It might route long prompts to a node with a warm KV cache, batch short prompts together for efficient GPU use, or steer complex requests to a more capable model. It also considers things like model versioning, dynamic scaling, fallback logic, and policy enforcement. In effect, routing becomes the system’s strategic brain: not just moving traffic, but actively shaping how each request is served to optimize for latency, cost, and output quality.
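
A toy payload-aware router might look like the sketch below; all thresholds, heuristics, and backend pool names are hypothetical, not a reference implementation.

```python
# Hypothetical payload-aware router: thresholds, heuristics, and backend
# pool URLs are illustrative only.
def looks_complex(prompt: str) -> bool:
    # Placeholder heuristic; production routers often use a small classifier model.
    keywords = ("plan", "prove", "multi-step", "step by step")
    return any(k in prompt.lower() for k in keywords)

def route(prompt: str, context_tokens: int) -> str:
    if context_tokens > 8_000:
        # Long contexts go to a pool with warm KV caches and long-context models.
        return "http://long-context-pool:8000/v1"
    if looks_complex(prompt):
        # Complex requests are steered to a larger, more capable model.
        return "http://large-model-pool:8000/v1"
    # Short, simple prompts are batched onto the cheapest pool.
    return "http://small-model-pool:8000/v1"

print(route("Summarize this ticket.", context_tokens=350))
```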

Observability and evaluation

Observability in the AI inference stack is not just about tracking hardware metrics or request latency. Instead, observability is an active feedback loop: it not only alerts on performance anomalies but also continuously validates output quality under real-world workloads and feeds insights back into retraining and optimization. By ingesting every prompt and its response, these systems surface critical issues such as hallucinations, model drift, or accuracy regressions, and correlate them with model versions, routing decisions, and cache states.
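
A bare-bones version of that feedback loop might look like the following sketch; the record schema, log sink, quality threshold, and scoring source are assumptions for illustration.

```python
import json
import time

# Hypothetical feedback-loop hook: log every prompt/response with the metadata
# needed to correlate quality issues with model versions, routes, and cache state.
LOG_PATH = "inference_log.jsonl"  # illustrative sink; real systems stream to a store

def flag_for_review(record: dict) -> None:
    ...  # e.g., enqueue for human review or add to a regression eval set

def record_inference(prompt, response, model_version, route, cache_hit, judge_score):
    record = {
        "ts": time.time(),
        "model_version": model_version,
        "route": route,
        "cache_hit": cache_hit,
        "judge_score": judge_score,  # e.g., from an LLM-as-judge evaluator
        "prompt": prompt,
        "response": response,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
    if judge_score is not None and judge_score < 0.5:  # illustrative threshold
        flag_for_review(record)
```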

Integrated inference stacks

Integrated inference stacks bundle multiple layers of the AI inference stack into a cohesive platform, reducing the configuration and operational overhead of stitching together the different layers. Most of these end-to-end solutions default to high-performance inference engines (usually vLLM) and layer in additional routing, caching, and other capabilities. These platforms often introduce an “inference graph” abstraction to define not only how data flows through caches, preprocessors, model stages, and postprocessors, but also how specialized sub‑tasks—such as disaggregated prefill and decode—are routed to the optimal hardware. By treating the entire inference workflow as a single, versioned graph, teams gain simpler upgrades, targeted hardware matching, and unified observability across every step of the pipeline.
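
To give a flavor of the “inference graph” idea, here is a toy sketch; real platforms expose richer, versioned abstractions, so treat the stage names and structure as assumptions.

```python
from typing import Any, Callable

# Toy "inference graph": each named stage is a callable over the request.
# Real platforms version these graphs and map stages to specific hardware.
class InferenceGraph:
    def __init__(self) -> None:
        self.stages: list[tuple[str, Callable[[Any], Any]]] = []

    def add_stage(self, name: str, fn: Callable[[Any], Any]) -> "InferenceGraph":
        self.stages.append((name, fn))
        return self

    def run(self, request: Any) -> Any:
        for name, fn in self.stages:
            request = fn(request)  # a real system would emit per-stage traces here
        return request

graph = (
    InferenceGraph()
    .add_stage("cache_lookup", lambda r: r)
    .add_stage("prefill", lambda r: r)      # could target prefill-optimized GPUs
    .add_stage("decode", lambda r: r)       # and decode-optimized GPUs separately
    .add_stage("postprocess", lambda r: r)
)
print(graph.run({"prompt": "hello"}))
```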

Inference-as-a-Service

Inference-as-a-service providers offer a fully managed cloud stack for running large language models. Users can select from pre-trained models or upload their own, and the platform handles everything else—from infrastructure provisioning to scaling, caching, and API exposure. Many of these services also extend beyond inference, offering features for fine-tuning, evaluation, and model management, making them useful across the broader model lifecycle. Compared to best-of-breed stack components, these platforms tend to be more opinionated, and can be grouped into several broad and overlapping categories:

  • Operations-centered solutions such as Amazon Bedrock, Fireworks.AI, Together.AI, and Fal, which provide high-efficiency, zero-MLOps inference solutions. The pricing model for these providers tends to be token-metered.
  • Closely related to the ops-centered solutions are the AI hardware companies such as Cerebras, SambaNova, and Groq, which all provide inference-as-a-service, but on their own bespoke hardware systems.
  • Developer-centered solutions such as Modal and BaseTen provide SDKs that facilitate different parts of the modern inference workflow. These solutions tend to be compute-metered.

Opportunities

Thus far, the inference-as-a-service providers who are not building their own chips have seen the most robust monetization. However, we do believe that some of their pricing power has come simply from their access to GPUs. As GPU supply constraints ease, there may be some pricing pressure to right-size the value provided by this layer versus the underlying GPUs. We believe that opportunities could arise for new specialized inference-as-a-service layers, potentially including things like:

  • Batch Inference: Batch inference enables higher GPU utilization and lower costs by processing large volumes of inputs in parallel, making it well-suited for non-latency-sensitive workloads like embedding generation or document summarization—and as GPU supply improves, it could create opportunities for specialized inference platforms offering optimized batch processing services.
  • Inference Time Compute: As models increasingly rely on inference-time reasoning and tool use, new frameworks may be needed to manage and optimize this emerging variable—creating opportunities for platforms that offer experimentation environments, cost control, and dynamic compute allocation tailored to inference-time decision paths.
  • Agents: In a world of autonomous agents and agent swarms, the number and diversity of models invoked per task could increase dramatically—driving demand for inference layers that support dynamic orchestration, fine-grained visibility, and efficient handling of concurrent model execution.
  • Low-Rank Adaptation & Fine-Tuned Model Hosting: As parameter-efficient fine-tuning methods like LoRA proliferate, platforms will be needed to manage, serve, and route among thousands of customized models per enterprise or user. Specialized infrastructure that supports fast weight merging, caching, and version control could enable scalable fine-tuned model delivery at inference time (see the sketch after this list).
  • Multi-modal Orchestration: With the growth of multi-modal models spanning text, image, video, and audio, inference platforms that can intelligently route and batch heterogeneous inputs while optimizing resource allocation across modalities will be well-positioned—especially in use cases requiring high throughput and latency consistency.
  • Privacy-Preserving or On-Prem Inference: Inference for sensitive or regulated data will increasingly require solutions that guarantee locality, auditability, and privacy. Specialized providers focused on compliant inference—via confidential compute, zero-trust setups, or sovereign cloud deployments—may maintain pricing power as general-purpose inference becomes more commoditized.
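
On the LoRA hosting point above, here is a rough sketch of what serving many adapters over a single base model can look like with vLLM’s multi-LoRA support; the model name, adapter name, and adapter path are illustrative assumptions.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Sketch of multi-LoRA serving: one base model, many lightweight adapters.
# Model name, adapter id, and adapter path are illustrative assumptions.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)

params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(
    ["Translate this support ticket into a structured JSON record."],
    params,
    lora_request=LoRARequest("customer-a-adapter", 1, "/adapters/customer-a"),
)
print(outputs[0].outputs[0].text)
```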

As we said in the prior stack overview: let’s build it together! If you are building in this space, we’d love to hear from you. Reach me on LinkedIn. Thanks to Richard, my co-author, for collaborating with me on this. You can find him at thelis.org.
