AI was supposed to prevent downtime. Instead, it’s creating new kinds of outages

1 5 minutes read


AI-related downtime – A Splunk report warns that AI systems aimed at eliminating downtime are increasingly linked to new failure modes, including model drift, buggy production deployments, and “broken integrations.” The survey finds downtime losses are climbing, executives struggle

For years, enterprise leaders treated downtime like a solvable problem: automate more of the right work, reduce human error, and design systems that catch issues before customers feel them. The pitch for AI was operational certainty—fewer interruptions, fewer mistakes, fewer surprises.

But a new Splunk report on AI-related downtime suggests the reality is turning messier. Half of surveyed organizations say they’ve experienced downtime tied to incorrect AI automation or model drift. Nearly one-third blame bugs introduced by embedding AI into production systems. And the costs, the report says, are mounting fast.

Splunk, a unit of Cisco, commissioned the study with Oxford Economics, surveying 2,000 executives at Global 2000 companies. It estimates that unplanned downtime now costs businesses $600 billion annually, up 50% in just two years. The report puts the price at about $15,000 per minute of downtime. It also estimates businesses lose an average of $300 million annually before the interruption is formally considered a crisis.

The figure that resonates is the gap between expectation and experience. Splunk calls it a “reliability paradox”: the more aggressively companies deploy AI to eliminate operational risk, the more they end up managing a newer, less predictable kind of it.

Kamal Hathi, senior VP and general manager of Splunk, puts the problem in plain terms: “Organizations are deploying AI into mission-critical systems without clearly defined escalation paths.” He says they lack monitoring tuned to detect model drift and that there’s no clear ownership when things go wrong.

The financial fallout extends beyond IT budgets. Hathi notes that stock prices drop an average of 3.4% per major incident. He also points to ransomware payouts that have nearly tripled to $40 million, and to regulatory fines averaging $51 million.

AI was marketed as a risk reducer. But the report’s central warning is that speed is reshaping failure itself.

As AI expands from copilots and chat interfaces toward autonomous agents, companies are sometimes moving faster than their operational safeguards can keep up. Hathi says businesses aren’t so much misunderstanding AI’s value as underestimating what responsible deployment requires. He compares the mindset to treating AI deployment like a software upgrade, even though AI learns from shifting environments and interacts with systems in ways that don’t follow deterministic logic. “Resilience can’t be an afterthought,” he says, meaning the ability to absorb disruption, recover quickly, and maintain continuity.

The report finds a stark mismatch in how companies view the systems they’re rolling out: 44% of organizations use agentic AI, while 68% worry those systems may behave unpredictably. At the same time, AI-targeted attacks are evolving. Prompt injection and data poisoning, two ways malicious actors manipulate what an AI system sees or learns to alter its behavior, are on the rise. Nearly one in four organizations has encountered them, and 77% of technology leaders believe cybercriminals armed with generative AI will increase downtime at their organizations.

Hathi says autonomy can’t be granted all at once. “Agentic systems need to earn their autonomy incrementally,” he says. “They must be governed by visibility and accountability at every step — not deployed at scale and monitored retroactively.”

Down the chain, Splunk describes a failure style that often doesn’t look like an outage at all.

Greg Leffler, director of developer evangelism and lead evangelist at Splunk, says AI-related downtime rarely resembles a traditional collapse. Instead of a single dramatic break, it can show up as a slow erosion of system behavior that spreads before anyone starts looking closely.

He points to two recurring patterns. The first is model drift, described as “an automation pipeline making correct decisions six months ago whose training data no longer reflects current traffic.” By the time anyone notices, he says, the damage can already be moving across interconnected services. The second pattern is broken integrations, where an AI system acts on incomplete data and triggers a chain of failures across connected systems that no single team fully owns or monitors end to end. In both cases, confidence degrades gradually until something critical finally tips over.
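To make the drift pattern concrete, here is a minimal sketch of one common way to notice it: comparing the distribution of a single input feature at training time against what the model sees today, using the Population Stability Index (PSI). The report does not prescribe this method. The function names, the ten-bucket layout, and the 0.2 alert threshold are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
) -> float:
    """Compare two samples of one feature; larger values mean more drift."""
    lo = min(baseline)
    hi = max(baseline)
    # Fall back to a width of 1.0 when the baseline is constant.
    width = (hi - lo) / bins or 1.0

    def bucket_shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the edge buckets.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor keeps the logarithm defined for empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    expected = bucket_shares(baseline)
    actual = bucket_shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def drift_detected(
    baseline: Sequence[float],
    current: Sequence[float],
    threshold: float = 0.2,  # rule-of-thumb value, assumed for this sketch
) -> bool:
    """Flag drift when the PSI exceeds the chosen threshold."""
    return population_stability_index(baseline, current) > threshold
```

Run on a schedule against live traffic, a check like this could raise an alert while the model still appears to be working, which is the window Leffler says teams tend to miss.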

Leffler argues that teams too often deploy AI on the assumption that it is self-correcting, an assumption traditional infrastructure never allowed. “The engineering discipline applied to software releases (staged rollouts, canary testing, rollback procedures) must now apply to every production model carrying decision-making authority,” he says.
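As a rough illustration of what that discipline might look like for a model, the sketch below widens traffic to a new model in stages and rolls back at the first stage where its error rate regresses past a tolerance. Everything here is hypothetical: `staged_rollout`, `StageResult`, the `observe_error_rate` callback, and the one-percentage-point tolerance are illustrative names and values, not anything described in the report.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class StageResult:
    traffic_share: float  # fraction of traffic served by the new model
    error_rate: float     # error rate observed at that stage


def staged_rollout(
    stages: Sequence[float],
    observe_error_rate: Callable[[float], float],
    baseline_error_rate: float,
    tolerance: float = 0.01,  # assumed allowance above the baseline
) -> Tuple[str, List[StageResult]]:
    """Widen traffic stage by stage; roll back on the first regression."""
    history: List[StageResult] = []
    for share in stages:
        # The callback stands in for real monitoring of the canary.
        rate = observe_error_rate(share)
        history.append(StageResult(share, rate))
        if rate > baseline_error_rate + tolerance:
            # Stop widening and report a rollback to the caller.
            return "rolled_back", history
    return "promoted", history
```

In practice the callback would query live monitoring for the canary cohort. The point of the structure is that the rollback decision is made before the model reaches full traffic.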

Yet the sharpest finding isn’t about whether AI can perform. It’s about whether anyone can identify what went wrong. Only 38% of surveyed technology executives reported consistently identifying the root cause of downtime incidents, despite heavy investment in monitoring platforms.

Leffler explains why that’s happening. As automation absorbs more routine operational decisions, fewer engineers build the deep system intuition needed to diagnose failures when automation breaks. And modern tech stacks often lean on external AI providers and third-party services, leaving teams with limited direct visibility. He calls this a compounding opacity problem: interconnected risk layers that sit largely outside what can be observed.

His answer is a division of labor between agents and people. “Agentic systems should independently diagnose issues, execute routine fixes, and perform code rollbacks, but escalate any higher-stakes decision for human approval,” Leffler says.
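One way to picture that split is as an allowlist: the agent may carry out a short set of routine actions on its own, and anything else is routed to a person. The sketch below assumes a hypothetical `handle` function and action names invented for illustration. It also records every decision in an audit log, in line with the report's emphasis on visibility and accountability.

```python
from typing import Dict, List

# Actions the agent may take without approval (names are illustrative).
AUTONOMOUS_ACTIONS = {"diagnose", "routine_fix", "code_rollback"}


def handle(action: str, audit_log: List[Dict[str, str]]) -> str:
    """Decide whether an agent may act alone, and record the decision."""
    if action in AUTONOMOUS_ACTIONS:
        decision = "execute"
    else:
        # Anything not explicitly allowed is treated as higher stakes.
        decision = "escalate_to_human"
    audit_log.append({"action": action, "decision": decision})
    return decision
```

Defaulting unknown actions to escalation is a deliberate choice in this sketch: a new capability has to be added to the allowlist before the agent can use it unsupervised.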

That technical answer also lands on a cultural one. “If engineering teams aren’t measuring reliability with the same rigor they measure velocity, governance frameworks will always lose to ship timelines,” he adds.

There’s another risk layer the report says is harder to count: AI used beyond the official stack.

Splunk describes “shadow AI” as a higher-stakes cousin to the earlier era of shadow IT. Earlier unapproved tools (software, cloud services, or collaboration platforms) created security and compliance problems, but they typically didn’t shape operations in the same quiet way. Shadow AI changes the equation because it can influence operational behavior while leaving little trace of how or why decisions were made.

The report says 66% of organizations report employees using unapproved AI tools at work to write code, generate business outputs, and automate decisions. It notes this often happens without centralized visibility into what data those tools access or how their recommendations influence production environments.

Hathi frames the challenge as more than a single gap in policy. “It’s all three: a policy problem, a visibility problem, and a governance problem,” he says. He argues policy alone won’t solve it and says organizations need an evaluation system for what AI should do, backed by a telemetry layer grounded in logs, metrics, and traces.

The thread tying these findings together is the same uncomfortable reality: downtime costs are rising, AI is being embedded into mission-critical systems, and the mechanisms for oversight and escalation often lag behind deployment. Model drift, broken integrations, opaque third-party dependencies, and shadow AI all point to failures that can develop quietly, until they aren’t quiet anymore.

Hathi suggests the competitive stakes are now about operational control. “Every competitor now has access to similar models and cloud infrastructure,” he says. “Resilience, governance, and observability are becoming the real differentiators. The enterprises that internalize that first will define what operational excellence means in the AI era.”

Splunk Cisco AI downtime model drift agentic AI observability monitoring reliability paradox prompt injection data poisoning ransomware payouts regulatory fines shadow AI enterprise IT

Sarah Walker 51 minutes ago

