Production AI Demands Statefulness That Might Require a Whole New Approach to the Inference Layer
An Intel AI engineer and the CEO of a YC-backed inference startup are converging on the same conclusion: inference is becoming a state management problem, and the engineering required to solve it has been around for decades.

The most widely deployed open-source inference server in production today solved its core scaling problem by borrowing from a 1960s operating system textbook. The team behind vLLM's PagedAttention found that existing inference systems were wasting 60 to 80 percent of KV cache memory on fragmentation and over-reservation, and fixed it by treating the KV cache the way an OS treats RAM: partition it into blocks, allocate on demand, reclaim what's idle. The insight came from virtual memory paging, not from ML research.
That's probably the most useful signal available right now about where production AI infrastructure is actually going. The inference layer was built to be stateless: request in, tokens out, cache discarded. That held for chatbots. It buckles under multi-turn agents, long-context RAG where prefill dominates the inference bill, and real-time voice applications where 200 milliseconds of added latency breaks the product. Inference is becoming a state management problem, and the practitioners solving it are reaching for engineering that predates the models by decades.
Vamshi Nagireddy is a Senior Software Engineer at Intel, where he works on AI workload optimization and inference performance across the company's hardware stack. He's published technical deep-dives on vLLM architecture, GPU hardware fundamentals, and KV cache quantization techniques, and to him, the vLLM precedent isn't a one-off. The most consequential architecture decision in modern inference serving was borrowed from OS design, and the engineers who made it were thinking about memory management. "That's the meaningful signal about where the field is headed. The inference problems that matter at scale are systems problems," Nagireddy says.
Storage engineering in disguise
Teams are conflating two fundamentally different optimization problems, Nagireddy argues, and the confusion is costing them. "Quantization is a representation problem. You're asking how few bits you can use to approximate weights without destroying model quality. It's static, an offline decision." KV cache optimization, he says, is a different animal entirely. "You're asking how to manage the memory lifecycle for activations that are dynamically generated at runtime. There's no offline calibration phase. The pressure is on latency, memory bandwidth, and concurrency."
The KV cache doesn't behave like model weights. Weights are relatively stable. Activations drift as context grows, and in RAG settings they're highly input-dependent. Standard calibration techniques assume a stable distribution, and that assumption breaks at exactly the moment it matters most: under long contexts and heavy load. "You need a block allocator. You need reference counting. You need an eviction policy. You need to understand bandwidth characteristics, working set versus cold data." The building blocks, says Nagireddy, are storage engineering. The data just happens to be attention keys and values.
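The three pieces Nagireddy lists (a block allocator, reference counting, an eviction policy) can be sketched in a few dozen lines. This is a toy illustration under stated assumptions, not how vLLM or any other server implements it: the `BlockPool` class, the `BLOCK_TOKENS` value, and the choice of least-recently-released eviction are all hypothetical, and real systems track GPU memory rather than integer ids.

```python
import math
from collections import OrderedDict

BLOCK_TOKENS = 16  # assumed number of tokens stored per KV cache block


def blocks_needed(num_tokens: int) -> int:
    """Blocks required to hold the keys and values for num_tokens tokens."""
    return math.ceil(num_tokens / BLOCK_TOKENS)


class BlockPool:
    """Toy KV cache pool: fixed-size blocks, reference counts, LRU eviction."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))  # blocks holding no cached data
        self.refs: dict[int, int] = {}       # block id -> active reference count
        # Blocks with zero references whose contents are still cached,
        # ordered from least to most recently released.
        self.idle: OrderedDict[int, None] = OrderedDict()

    def allocate(self) -> int:
        """Hand out a block, evicting the oldest idle block if none are free."""
        if self.free:
            block = self.free.pop()
        elif self.idle:
            block, _ = self.idle.popitem(last=False)  # eviction happens here
        else:
            raise MemoryError("every block is referenced by an active request")
        self.refs[block] = 1
        return block

    def share(self, block: int) -> None:
        """Let another request reuse a block, for example a shared prompt prefix."""
        if block in self.idle:
            del self.idle[block]  # cached block comes back into active use
            self.refs[block] = 1
        else:
            self.refs[block] += 1  # raises KeyError for an unknown block

    def release(self, block: int) -> None:
        """Drop one reference; the block stays cached until it is evicted."""
        self.refs[block] -= 1
        if self.refs[block] == 0:
            self.idle[block] = None
```

The design choice worth noticing is that `release` does not return a block to the free list. A block with no references stays cached and only loses its contents when `allocate` runs out of free blocks, which is the working-set-versus-cold-data distinction in miniature.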
Old caching toolkit, new caching contract
Traditional database caches operate on byte-level locality. Two queries that access adjacent rows share physical proximity, and the cache exploits that proximity for speed. KV cache access patterns work differently. Two requests that share a system prompt share meaning, not just bytes.
Nagireddy argues that this semantic structure is what makes the state management challenge genuinely new, even as the engineering toolbox remains familiar. "Prefix sharing isn't just an optimization hint. It's a semantic property of the data," he says. "The architecture question becomes how you build a cache that's aware of semantic locality, not just byte-level data."
That's where concepts like prefix hash trees start to matter, along with eviction policies informed by the model itself about what context it actually needs. The optimization target shifts from "what data was accessed recently" to "what data is semantically relevant to the current computation." The engineering patterns are decades old. The data they're operating on is brand new.
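One way to make a cache aware of shared prefixes is to hash each block of tokens together with the hash of the block before it, so that two equal hashes imply an identical prefix up to that point. The sketch below is a minimal illustration of that idea, assuming a small block size and SHA-256; the function names and the `BLOCK_TOKENS` value are hypothetical and not taken from any specific server.

```python
import hashlib

BLOCK_TOKENS = 4  # assumed block size, kept small for illustration


def prefix_block_hashes(token_ids: list[int],
                        block_tokens: int = BLOCK_TOKENS) -> list[str]:
    """Return one chained hash per full block of tokens.

    Each hash covers the block's tokens and the previous block's hash,
    so it identifies the whole prefix ending at that block.
    """
    hashes: list[str] = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_tokens  # skip partial tail
    for start in range(0, full, block_tokens):
        block = token_ids[start:start + block_tokens]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        digest = hashlib.sha256(payload).digest()
        hashes.append(digest.hex())
        parent = digest
    return hashes


def shared_prefix_blocks(a: list[str], b: list[str]) -> int:
    """Count how many leading blocks two requests have in common."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count
```

Two requests that start with the same eight-token system prompt and then diverge would share their first two block hashes and differ from the third onward. Because each hash is chained to its parent, a block containing the same tokens in a different context gets a different hash.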
"Teams need to stop treating KV cache as an implementation detail inside the inference server and start treating it as an infrastructure component. Something you monitor, something you tune, something you have an operational model for. The same way you have a caching strategy for your database layer, you need an explicit KV cache strategy for your inference," says Nagireddy.
He points to where chip manufacturers are placing their bets as further evidence. Nvidia is investing in NVLink bandwidth scaling and high-bandwidth memory. AMD and Intel are doing the same. "The compute ceiling is rising faster than the memory bandwidth ceiling," Nagireddy says. "That imbalance is going to force the software stack to catch up."
Orchestration is the last mile
If Nagireddy sees the problem from the memory hierarchy up, Arko Chattopadhyay sees it from the customer deployment down. As Co-Founder and CEO of Pipeshift, a YC-backed startup providing managed inference infrastructure for real-time AI workloads, Chattopadhyay works with production teams whose inference costs and latency SLAs show up on the P&L.
"Orchestration is the core problem here. A much bigger bottleneck than people understand today," Chattopadhyay says. He compares it to logistics. "Everything can be great, with nearly everything solved. But if the last mile isn't solved, consumers aren't happy."
That last mile covers a sprawling surface area: capacity management across multiple cloud providers, cold start optimization, disaster recovery pools, distributed proxies that keep routing within the same region as the deployment, and shared KV cache pools across scaling replicas so that adding a new node doesn't mean warming up the cache from scratch.
Real-time workloads multiply every one of these problems. "You won't see a 200-millisecond increase in latency if it's not a real-time use case," Chattopadhyay says. "As soon as you move to real time and it's a voice agent or code generation where you're rendering React components, you notice that difference." Add spiky usage patterns that shift from morning to night, and the challenge compounds further.
Chattopadhyay sees a market shift underway. For the last two years, most teams accepted whatever SLA their inference provider gave them. Now the conversation has flipped. "We're seeing companies say, 'We do not want to deliver the SLA to our end customer that our inference provider is giving us. We want to deliver an SLA to our end customer and we need an inference provider that can orchestrate the infra toward that.'" That shift moves inference from a commodity API call to a state management problem teams need to own.
When the model disappears
That kind of shift doesn't happen in a vacuum. Real incidents are driving it.
Chattopadhyay describes a customer who came to Pipeshift after a model deprecation went wrong. OpenAI has retired more models in 2026 than in all prior years combined, and the pace has caught teams off guard. "They didn't miss the deprecation notice," says Chattopadhyay. "They worked on it. They literally pushed their sprints for two weeks to move between model versions. But they ended up not integrating it in one piece of the product, and that went down the next day when the deprecation happened."
The fallback strategy most teams rely on, routing to a different model when the primary one disappears, doesn't actually solve the problem. "Swapping the model means new prompts. Swapping the model means quality differences. Swapping the model means different performance numbers," Chattopadhyay says. He calls model fallbacks "a band-aid for a bullet hole."
That fragility, combined with unpredictable cost growth as inference spend climbs faster than revenue, is pushing production teams toward dedicated deployments on open-source models. Chattopadhyay estimates that 80 to 90 percent of most teams' workloads can run on open-source models at lower cost and with controlled SLAs, leaving only a thin edge-case layer on frontier models.
The metrics gap that keeps teams guessing
The most surprising thing Chattopadhyay encounters when onboarding new customers is a visibility problem. Teams that spent two years building sophisticated eval frameworks to measure model quality have almost nothing equivalent for their inference infrastructure.
"The most common thing we see is they don't know what their P90 usage is like, what their P99 usage is like," Chattopadhyay says. "How does their workload pattern shift through the day? Do they have cache metrics? How much of their input is actually being replicated across queries?"
Chattopadhyay frames the gap as a maturity problem the industry hasn't reached yet. "People had to spend two years burning their hands to get to a point where everybody has their own harness, everybody is building their own eval frameworks. This will be two more years of work for people to understand that they have to track all of these metrics on the infra layer."
When he does get a customer who has invested in inference observability, the onboarding is dramatically faster. "I take all of this data, go back and benchmark different GPUs and models, and I'll exactly know what to give them," he says. "When they don't have it, it becomes a test cycle."
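The numbers Chattopadhyay asks about can be derived from ordinary request logs. The sketch below is a rough illustration, assuming a hypothetical `RequestLog` record whose fields (`minute`, `input_tokens`, `cached_tokens`, `latency_ms`) stand in for whatever a team's serving stack actually exports; it uses a simple nearest-rank percentile rather than any particular monitoring tool's definition.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class RequestLog:
    minute: int          # minute-of-day bucket the request arrived in
    input_tokens: int    # prompt tokens sent with the request
    cached_tokens: int   # prompt tokens served from an already cached prefix
    latency_ms: float    # end-to-end latency observed by the caller


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def requests_per_minute(logs: list[RequestLog]) -> Counter:
    """How the workload is spread across the day, one count per minute bucket."""
    return Counter(log.minute for log in logs)


def token_cache_hit_rate(logs: list[RequestLog]) -> float:
    """Share of all input tokens that were replicated from earlier queries."""
    total = sum(log.input_tokens for log in logs)
    cached = sum(log.cached_tokens for log in logs)
    return cached / total if total else 0.0
```

Feeding the per-minute counts from `requests_per_minute` into `percentile` gives P90 and P99 usage, and the same function applied to `latency_ms` gives tail latency. The hit rate answers the replication question directly: a value of 0.6 would mean 60 percent of input tokens had already been seen as a prefix.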
The next inference server holds state
Nagireddy doesn't hedge on what comes next. "The inference server gets rebuilt first," he says. "And it's already starting."
Current inference serving systems, vLLM and TensorRT-LLM included, were designed with the assumption that KV cache is ephemeral and local. It lives on the GPU. It dies with the request. That assumption, Nagireddy argues, is now "load-bearing in the wrong direction." Production demands state persistence, prefix sharing across requests, and cache-aware routing. None of that bolts onto a system that was never designed to hold state.
"I think we'll see a new generation of inference servers in coming months that looks much more like a database runtime than a model serving framework," Nagireddy says. Serving logic and memory management become equal concerns. Routing becomes a data locality problem: send the request to the instance that already has the relevant prefix cached, rather than round-robin load balancing across nodes that each start cold. The orchestration layer, whether Kubernetes or a custom scheduler, currently has no visibility into KV cache state when making those decisions. That has to change.
Nvidia is already treating this as a product requirement. At GTC 2026, the company announced CMX, a context memory storage platform that introduces a new storage tier between GPU memory and traditional network storage, built specifically for persisting KV cache and agent state across inference sessions. The shift Nagireddy describes isn't speculative. The largest chip company in the world is building infrastructure around the assumption that inference state needs to survive the request that created it.
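The cache-aware routing Nagireddy describes, where a request goes to the instance that already holds its prefix, can be sketched as a small selection function. This is an illustration under assumptions, not a real scheduler: the `Replica` record, the `pick_replica` function, and the tie-break on active request count are hypothetical, and it presumes the router can see which prefix block hashes each replica has cached, which is exactly the visibility today's orchestration layers lack.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    cached: set[str] = field(default_factory=set)  # prefix block hashes held
    active_requests: int = 0                       # current load on the replica


def cached_prefix_len(request_hashes: list[str], cached: set[str]) -> int:
    """Count the leading blocks of the request that the replica already holds."""
    count = 0
    for block_hash in request_hashes:
        if block_hash not in cached:
            break
        count += 1
    return count


def pick_replica(request_hashes: list[str], replicas: dict[str, Replica]) -> str:
    """Prefer the longest cached prefix; break ties by the lightest load."""
    return max(
        replicas,
        key=lambda name: (
            cached_prefix_len(request_hashes, replicas[name].cached),
            -replicas[name].active_requests,
        ),
    )
```

When no replica holds any part of the prefix, the function falls back to the least loaded one, so it degrades to ordinary load balancing rather than failing. A production router would also have to weigh how much load is worth trading for a cache hit, which this sketch ignores.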
"The team that wins in this space comes from someone who deeply understands storage systems, distributed systems, distributed caching," he says. "Someone who built a CDN, who looks at the problem and says, 'This is actually a solved problem. We're just applying it to a new data type.'"
The inference layer was designed to be stateless. Production AI demands state. The engineering required to close that gap is the oldest kind of infrastructure work in the stack, applied to the newest data type the industry has ever had to serve.




