Inference Economics Will Decide Whether Next-Generation Recommendation Systems Reach Production
Mukul Surajiwale, Staff Machine Learning Scientist at Etsy, explains why recommendation systems are shifting from pointwise prediction to LLM-based generative architectures.

The recommendation systems running in production at most major tech companies are the final, most refined version of an architecture that has been incrementally improved for over a decade. The underlying approach, matching a query and a set of user signals against items to predict click or purchase likelihood, has stayed consistent even as the models underneath it grew more powerful.
That paradigm is now changing. A new generation of recommendation systems uses the same autoregressive architecture that powers LLMs, predicting the next item a user will want the same way a language model predicts the next word. The infrastructure required to serve these systems at scale looks nothing like what came before, and the question of whether they reach production depends almost entirely on inference economics.
Mukul Surajiwale is a Staff Machine Learning Scientist at Etsy, the global marketplace for handcrafted and unique goods. He was a founding member of HubSpot's AI group, led applied LLM research at Shopify, and built multimodal retrieval systems at Writer. His career spans the full arc from classical ML recommendation systems through deep learning to the current generative frontier.
"The future of serving next-generation recommendation systems comes down to one thing," Surajiwale says. "Can you make the economics work from an inference standpoint?"
The paradigm shift
Surajiwale describes current production systems as the "ultimate V10, final edition" of what he calls the Gen 2 architecture: deep neural networks making pointwise predictions about whether a user will engage with a specific item given a query and user history. That approach has served the industry well, but the contours of the problem itself are now changing.
"We're kind of on this new frontier," Surajiwale says. "These recommendation systems use the same architecture that things like ChatGPT or Claude use in order to predict the next token. We use that same methodology to predict items or things that people would want to view." The shift moves model parameter counts from tens of millions or hundreds of millions into single- and double-digit billions.
That makes the inference infrastructure fundamentally different: GPU-based compute, specialized inference chips distinct from training hardware, and architecture designed specifically for fast serving. Surajiwale expects that within one to two years, every company doing recommendations will have at least invested in this approach, if not shipped it.
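The next-item framing described in this section can be illustrated in miniature. The sketch below treats item IDs as tokens and decodes greedily, one item at a time. It is only an illustration: a bigram count table stands in for the billion-parameter decoder, and the sessions, item names, and function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical browsing sessions; item IDs play the role of tokens.
SESSIONS = [
    ["mug", "teapot", "tea_towel"],
    ["mug", "teapot", "coaster"],
    ["mug", "teapot", "tea_towel"],
    ["ring", "necklace", "earrings"],
]

def fit_bigram(sessions):
    """Count which item follows which: a toy stand-in for a trained decoder."""
    counts = defaultdict(Counter)
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            counts[prev][nxt] += 1
    return counts

def next_item_distribution(counts, history):
    """Estimate P(next item | history), conditioning only on the last item."""
    followers = counts.get(history[-1], Counter())
    total = sum(followers.values())
    return {item: n / total for item, n in followers.items()} if total else {}

def generate(counts, history, steps):
    """Greedy autoregressive decoding: predict, append, repeat."""
    out = list(history)
    for _ in range(steps):
        dist = next_item_distribution(counts, out)
        if not dist:
            break  # no observed continuation for the last item
        out.append(max(dist, key=dist.get))
    return out

counts = fit_bigram(SESSIONS)
print(generate(counts, ["mug"], steps=2))  # ['mug', 'teapot', 'tea_towel']
```

A real generative recommender would condition on the full history rather than the last item, which is where the parameter counts and serving costs discussed above come from.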
Long-term meets short-term at the buyer profile
Etsy's recommendation challenge illustrates why the shift matters for production systems. The marketplace carries handcrafted inventory that buyers often do not know exists. Effective recommendation requires understanding not just what someone searched for in the current session, but their long-term aesthetic preferences, purchase history, and latent interests.
"It's actually a very difficult problem," Surajiwale says. "The better buyer profile that we can have, the better we can do proactively suggesting things that you didn't even know you wanted or didn't even know to search for." His team models long-term and short-term intent separately, with distinct models contributing to the final ranking. Feature freshness matters, but not uniformly: some signals need real-time updates while others can refresh weekly or monthly. "Any amount of feature drift is potentially a missed opportunity," Surajiwale says. "But not everything needs to be refreshed immediately."
Migration only makes sense when it enables the impossible
Surajiwale applies the same pragmatism to infrastructure decisions. When evaluating whether to migrate from a proven database to a vector-oriented system, the bar is not marginal improvement. It is whether the new system enables capabilities that are technically impossible on the current stack.
"The return on investment can't just be that it makes the existing system a little bit better," Surajiwale says. "It needs to allow you to build new capabilities or features that your current system doesn't allow because of a technical limitation." Vector databases designed for scalable embedding search and approximate nearest neighbor retrieval clear that bar for retrieval-augmented generation use cases. Incremental speed improvements on existing workflows do not.
The constraint that determines whether the generative recommendation paradigm reaches production is not model quality; the models already work. The constraint is whether the inference layer can serve them at production scale, at production latency, and at a cost the business can sustain. "You can have the best model architecture," Surajiwale says. "But if you can't serve it at scale in production, then it's not going to work. Models have always been limited by infra."
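A back-of-envelope calculation shows why the economics dominate. The sketch below uses the common approximation that a dense transformer forward pass costs about 2 FLOPs per parameter per token. Every number in it (tokens per request, GPU throughput, utilization, hourly price) is a hypothetical placeholder rather than a figure from Surajiwale or Etsy, and it ignores memory bandwidth, caching, and batching effects that matter in real serving.

```python
def gpu_cost_per_million_requests(params, tokens_per_request,
                                  peak_flops_per_sec, utilization,
                                  dollars_per_gpu_hour):
    """Rough compute-only serving cost; every input is an assumption."""
    # Approximation: ~2 FLOPs per parameter per token for a forward pass.
    flops_per_request = 2 * params * tokens_per_request
    gpu_seconds = flops_per_request / (peak_flops_per_sec * utilization)
    return gpu_seconds * 1_000_000 * dollars_per_gpu_hour / 3600

# Hypothetical hardware and traffic numbers, not figures from the article.
ASSUMED = dict(tokens_per_request=200, peak_flops_per_sec=1e14,
               utilization=0.3, dollars_per_gpu_hour=2.0)

small = gpu_cost_per_million_requests(params=100e6, **ASSUMED)
large = gpu_cost_per_million_requests(params=5e9, **ASSUMED)
print(f"cost ratio, 5B vs 100M parameters: {large / small:.0f}x")
```

Under this model the cost ratio depends only on the parameter counts, so moving from 100 million to 5 billion parameters multiplies compute cost by about 50 whatever hardware is assumed.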




