Data Teams Catch Silent AI Failures By Tracing Systems Before Locking Them Down

The Data Wire - News Team

July 1, 2026

Sathish Kumar Subramani watches enterprise AI systems closely before governing them, since the costliest failures pass every check while quietly handing customers wrong answers.

The AI failures that do the most damage in production are the ones nothing catches. The dashboards look healthy, and the model still gives a customer the wrong answer with full confidence. Stopping that starts with watching how a system behaves before adding any controls on top. The common instinct runs the other way: guardrails go in first, before a team has any clear picture of what the system is doing in the wild.

Sathish Kumar Subramani is a Senior Software Engineering Manager at FineLine Technologies, a global RFID and barcode labeling provider for retail and supply chain operations. With more than two decades in software, he still works close to the code, and that hands-on view shapes how he approaches anything heading into production. "We think the system's working fine. Technically it's functioning, but logically it's failing. We can't govern what we can't see," Subramani says.

The analogy he reaches for comes from manufacturing. "When you build a car, once the factory inspection is complete and it's out on the road, we still trace its performance through dashboards, monitoring range and other metrics. That has to continue through the entire life of the car. We have to implement architectural governance the same way, across every layer," Subramani says.

For an AI system, that means wiring visibility into the API gateway, the data layer, and the points where users send input and get a response. Trouble can surface at any of them, so a single checkpoint leaves the rest exposed. A check built into the runtime catches a problem as it happens, while the same rule written only into a policy document sits unenforced until someone thinks to look.
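
As a rough illustration of visibility at more than one layer, the sketch below wraps functions in a small tracing decorator. The layer names, the `EVENTS` list, and the two example functions are hypothetical stand-ins, not part of any system described here. A real deployment would likely send these events to a logging or tracing backend.

```python
import functools
import time

# In-memory stand-in for a real log or tracing backend (assumption).
EVENTS = []


def traced(layer):
    """Record layer, function name, outcome, and duration for every call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "error"  # overwritten only if the call returns normally
            try:
                result = fn(*args, **kwargs)
                status = "ok"
                return result
            finally:
                EVENTS.append({
                    "layer": layer,
                    "call": fn.__name__,
                    "status": status,
                    "seconds": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@traced("data")
def fetch_policy(topic):
    # Hypothetical data-layer lookup.
    return f"policy text for {topic}"


@traced("gateway")
def handle_request(question):
    # Hypothetical gateway handler that calls into the data layer.
    return fetch_policy("returns")
```

Calling `handle_request(...)` leaves one event per layer in `EVENTS`, so a failure in the data layer is recorded even when the gateway itself returns normally.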

When nothing looks broken

To show the kind of error he means, Subramani describes a customer-support tool running on retrieval-augmented generation (RAG), pulling answers from a company's return policy. That policy covers physical goods. The company later adds digital products, and no one revises the policy to match. So when a customer asks whether a digital purchase can be returned, the model tells them to mail it in for a refund. Nothing errors out and the logs stay clean, so nothing prompts the team to look.

A crash gets noticed in minutes. An error like this can run for weeks, repeating the same wrong answer to everyone who asks, and it usually surfaces only when a complaint or a mistaken refund reaches the team. By then the damage sits in lost trust, well outside anything a dashboard tracks.

"When we move an application into production, we have to trace every piece of data. We track whatever input the user provides, and we audit the response the AI gives," Subramani says. In practice, that means logging the question a user asks next to the answer the model returns, then checking that the two line up against the current rules. The model follows the data it's given. What no one catches is the gap between a rule written for one kind of product and a question about another.

Guardrails come last for a reason

Most teams reach for controls the moment a system goes live. The sequence Subramani recommends runs the other way. "In phase one we implement basic log tracing. In phase two we expand those logs to observe what's happening across the system. Only in phase three, once we have full visibility, do we bring in the guardrails," he says.
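
Phase one could start as small as the sketch below, which stores each user question next to the answer the model returned. The record fields, the `sink` list, and the helper names are assumptions made for illustration, not a description of Subramani's tooling.

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict, field


@dataclass
class TraceRecord:
    """One user question paired with the answer the model returned."""
    question: str
    answer: str
    source_ids: list  # hypothetical IDs of the documents retrieved for the answer
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def write_trace(record, sink):
    """Serialize the record as one JSON line and append it to the sink."""
    line = json.dumps(asdict(record), sort_keys=True)
    sink.append(line)
    return line


def answers_for(sink, phrase):
    """Return (question, answer) pairs whose question mentions the phrase."""
    pairs = []
    for line in sink:
        rec = json.loads(line)
        if phrase.lower() in rec["question"].lower():
            pairs.append((rec["question"], rec["answer"]))
    return pairs
```

With records like these, a reviewer can pull every answer given to questions that mention "digital" and compare them with the rules currently in force.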

A control added before a team understands the system tends to guard the wrong thing. The early phases earn that understanding, and they do double duty. Logging and observation also show where the data comes from and which version of a policy the model is drawing on. That lineage decides whether an answer holds up. When the source data is stale or skewed, the model's output follows it, so the data and the model have to be governed together.
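
One way to make that lineage checkable is sketched below. It assumes each source document carries a version, the product categories it was written for, and a last-reviewed date, and that the product category behind a question is already known. All field names and the 180-day threshold are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PolicySource:
    doc_id: str
    version: str
    covers: frozenset    # product categories the policy was written for
    last_reviewed: date


def lineage_flags(source, product_category, today, max_age_days=180):
    """Return warnings about the source behind an answer; empty means no flags."""
    flags = []
    # The question is about a category the policy never addressed.
    if product_category not in source.covers:
        flags.append(f"out_of_scope:{product_category}")
    # The policy has not been reviewed within the allowed window (assumed value).
    if (today - source.last_reviewed).days > max_age_days:
        flags.append("stale_source")
    return flags
```

Applied to the return-policy example, a source that covers only physical goods would be flagged as out of scope for a question about a digital purchase, even though the retrieval and the model both ran without error.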

Once that view exists, the guardrails have somewhere sensible to sit. The highest-leverage spot is the entry point, since anything that gets past it is already inside the model's reach and harder to pull back. "Sometimes a user mistakenly enters their credit card or personal information, and that data should be masked before it ever reaches the LLM," Subramani says. A rail at the input catches that kind of slip and masks the sensitive field before the model processes it, so it never lands in a log or gets repeated back later. The same approach covers the response on the way out, and the handoff between the model and any tool it calls. Those checkpoints carry more weight once systems start acting on their own.
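
A minimal sketch of such an input rail follows. It handles only card numbers: it looks for runs of 13 to 19 digits and masks those that pass the Luhn checksum. The pattern, the placeholder text, and the function names are assumptions, and other personal data would need separate detectors.

```python
import re

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask_cards(text):
    """Replace likely card numbers with a placeholder before the text goes anywhere else."""
    def replace(match):
        digits = re.sub(r"[ -]", "", match.group())
        return "[CARD REDACTED]" if luhn_valid(digits) else match.group()
    return CARD_CANDIDATE.sub(replace, text)
```

The Luhn check keeps ordinary long numbers, such as order IDs, from being masked by mistake. Running `mask_cards` before the prompt is logged or sent to the model keeps the raw number out of both.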

Zero trust becomes the only workable posture

Security has long pointed outward. Firewalls keep attackers out, and whatever sits inside the perimeter is trusted by default. An autonomous agent breaks that logic. It can call APIs and act on databases on its own, so the power to do damage now sits inside, in a component the system is built to trust. "With agentic AI, the agent itself can cause harm through its own actions. It can damage the application or delete the entire database. Because the agent creates the harm, we have to govern every action it takes," Subramani says.

The risk grows when agents start calling each other. One hands a task to a second, the second passes part of it to a third, and no person reviews any of those handoffs. The fix is to drop that default trust. Each agent gets only the access its task needs, and any action that is hard to reverse stays with a person. "Even if an agent drafts an email, sending that message has to remain a manual, human decision. Applying strict governance at that stage stops the system from autonomously sending out incorrect information," Subramani says.

An unsent draft can always be deleted, while a delivered message cannot be recalled. The deeper shift is in what the default looks like. Access is granted narrowly and widened only as an agent proves itself, drawing on the same visibility those earlier phases produce. "We have to prioritize zero-trust and least-privilege implementations, and add guardrails wherever possible. In the agentic era, the landscape has completely changed, so our approach to keeping systems safe has to change with it," Subramani adds.
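
The posture described above can be sketched as a gate that every agent action passes through. The agent IDs, action names, and the set of irreversible actions below are hypothetical. Which actions count as hard to reverse is a policy choice each team would make for itself.

```python
from dataclasses import dataclass, field

# Hypothetical action names treated as hard to reverse (assumption).
IRREVERSIBLE = frozenset({"send_email", "delete_records"})


@dataclass
class AgentGrant:
    agent_id: str
    allowed: frozenset   # the only actions this agent may request


@dataclass
class ActionGate:
    grants: dict
    pending: list = field(default_factory=list)  # actions waiting for a person
    audit: list = field(default_factory=list)    # every request and its outcome

    def request(self, agent_id, action):
        grant = self.grants.get(agent_id)
        if grant is None or action not in grant.allowed:
            decision = "denied"                  # default is no access
        elif action in IRREVERSIBLE:
            self.pending.append((agent_id, action))
            decision = "needs_human_approval"    # a person makes the final call
        else:
            decision = "allowed"
        self.audit.append((agent_id, action, decision))
        return decision
```

Under this sketch, an agent granted `draft_email` and `send_email` can draft freely, but a send request is queued for a person, and any action outside its grant is denied and recorded.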

Keeping the model in-house narrows the risk

For all the complexity surrounding AI architecture, the primary hurdle stalling enterprise adoption is straightforward. The core anxiety for many leaders is what happens if their sensitive data gets out, and that worry frequently outweighs their confidence in the technology itself. "If data is exposed to the public, it's a massive issue. That's the exact point where everyone holds back on bringing LLMs into their production applications. I discuss this with multiple technology leaders, and many simply don't want to cross that line," Subramani says.

His answer removes the part that scares them. Running an open-source model on the company's own infrastructure significantly reduces the risk of data exposure, keeping sensitive information within the organization's controlled environment. "We can take a powerful open-source LLM like Llama and deploy it entirely within our own environment. It runs strictly on our own servers, which keeps sensitive data from being transmitted to third-party AI providers. This local framework has to be explained to mid-market technology leaders so they can finally move forward with AI adoption," Subramani says.

For the engineering and product teams he advises, this local framework is often what moves stalled projects off the whiteboard and into development. It gives wary technology leaders access to powerful AI capabilities without requiring sensitive data to leave their controlled environment, providing the reassurance needed for a successful first deployment. However, it does not eliminate risk entirely. Instead, it shifts the responsibility from the AI vendor to the organization's internal technology teams.

The views and opinions expressed are those of Sathish Kumar Subramani and do not represent the official policy or position of any organization.
