Claude “AI shrinkflation” explained: Anthropic fixes harness changes

Claude harness – Anthropic says Claude’s perceived decline came from three surrounding “harness” changes. It has reverted key settings, fixed a caching bug, and reset limits.
For weeks, developers reported a troubling pattern: Claude appeared to get worse mid-task—less consistent reasoning, more token waste, and outputs that felt harder to trust.
On the surface, "degradation" stories around AI models can sound like mere vibes. In this case, Misryoum saw the claims harden into a technical argument: users compared their own before-and-after experiences and circulated session-level observations, while others pointed to benchmark shifts. The result was a widening trust gap, especially among teams using Claude Code and related tooling for sustained engineering work.
Anthropic's response is important because it reframes the issue. The company says the underlying model weights did not regress and that its API and inference layer were unaffected. Instead, Misryoum's reading of Anthropic's post-mortem is that the problem lived in the product "harness": the system prompt rules, caching behavior, and operational defaults wrapped around the model.
The controversy gained momentum in early April 2026, when engineers began publishing deeper breakdowns of Claude sessions and tool calls. One thread of analysis argued that reasoning depth dropped sharply, leading to reasoning loops and a tendency to chase the simplest path that still looked correct, until the task got complex. Users also complained that limits seemed to be consumed faster than expected, feeding the suspicion that performance was being traded for throughput.
The three “harness” changes Anthropic says caused the quality drop
Anthropic identifies three separate product-layer changes that, in combination, made Claude behave less like the “research-first” assistant some teams expected.
First was the default reasoning effort change on March 4, when Claude Code's default moved from high to medium. Anthropic's rationale was interface-focused: reduce the UI latency that can make tools appear frozen while the model works. But the side effect, according to the company, was a noticeable drop in complex-task performance. When reasoning effort is reduced, the model can still produce answers, yet it may do so with fewer internal checks, which is exactly where sophisticated coding and multi-constraint problem solving often depend on depth.
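A purely illustrative sketch of why a default change like this bites silently (the parameter name and pass budgets below are invented, not Anthropic's actual tooling): callers who never pinned the setting inherit the new, shallower budget without any code change on their side.

```python
# Hypothetical illustration only: the parameter name and budgets are
# invented and do not come from Anthropic's tooling.
CHECK_PASSES = {"low": 1, "medium": 2, "high": 4}  # invented budgets

def verification_passes(reasoning_effort: str = "medium") -> int:
    """How many self-check passes a harness might budget per answer."""
    return CHECK_PASSES[reasoning_effort]

# Callers who relied on the old default ("high") and never set the
# parameter explicitly silently dropped from 4 passes to 2 when the
# default changed.
print(verification_passes())        # the new default budget
print(verification_passes("high"))  # the old behavior, now opt-in
```

The point of the sketch: nothing in the caller's code changed, so the regression shows up only in output quality, not in any diff.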
Second was a caching logic bug shipped on March 26. Misryoum's takeaway from Anthropic's explanation is that a performance optimization meant to prune "thinking" from idle sessions misfired: instead of clearing the thinking history once after an hour of inactivity, it cleared it on every turn after the first subsequent interaction. That kind of state loss doesn't always break a single response. It breaks *continuity*, and continuity is what makes longer agentic workflows feel coherent rather than repetitive.
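A minimal Python reconstruction of that failure mode, with invented names and timings (Anthropic has not published the actual code): the buggy path latches an idle flag and wipes state on every later turn, while the intended path prunes exactly once after the gap.

```python
IDLE_LIMIT = 3600  # seconds; prune "thinking" after an hour of inactivity

class Session:
    """Hypothetical session state; names and structure are invented."""

    def __init__(self, now: float):
        self.thinking = ["plan", "step 1"]  # accumulated reasoning context
        self.last_active = now
        self.was_idle = False

    def turn_buggy(self, now: float) -> None:
        # Bug: the idle flag latches, so once the gap is crossed the
        # history is cleared on this turn *and every turn after it*.
        if now - self.last_active > IDLE_LIMIT:
            self.was_idle = True
        if self.was_idle:
            self.thinking.clear()
        self.last_active = now

    def turn_fixed(self, now: float) -> None:
        # Intended: prune once, on the first turn after the idle gap.
        if now - self.last_active > IDLE_LIMIT:
            self.thinking.clear()
        self.last_active = now

# After the idle gap, the buggy session keeps losing newly built context:
buggy = Session(now=0)
buggy.turn_buggy(now=4000)         # gap > 1h: history cleared (expected)
buggy.thinking.append("new step")  # model rebuilds some context
buggy.turn_buggy(now=4010)         # cleared again 10 seconds later (the bug)
print(buggy.thinking)              # []

fixed = Session(now=0)
fixed.turn_fixed(now=4000)
fixed.thinking.append("new step")
fixed.turn_fixed(now=4010)         # short gap: context survives
print(fixed.thinking)              # ['new step']
```

This is why the symptom read as repetitiveness rather than outright failure: each individual reply still worked, but the model kept re-deriving context it should have retained.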
Third was a system prompt verbosity limit introduced on April 16. The company added instructions keeping tool-call text under 25 words and final responses under 100 words, in an effort to reduce verbosity in Opus 4.7. Anthropic says this backfired, producing a measurable decline in coding-quality evaluations.
Why these changes feel like “nerfing,” even when weights didn’t change
From a user's perspective, the effect is what matters, especially when you're using an AI system as a tool rather than a toy. A model can preserve the same core capabilities while the wrapper around it changes how it plans, remembers, and communicates. Misryoum sees this as the core reason "AI shrinkflation" took off as a metaphor: on paper the product still sells the same experience, but daily output quality and efficiency shift.
Reasoning effort defaults, caching behavior, and system-prompt constraints can all alter the model's workflow without touching model weights. That's a different kind of regression: one that's operational and behavioral. It also explains why some users felt the degradation across multiple interfaces (such as the Claude Agent SDK and Claude Cowork) while the Claude API reportedly remained unaffected.
There's also a broader industry lesson here. As AI products mature, "quality" isn't solely a model-side property; it's a pipeline property. UI responsiveness improvements, token-management rules, and caching strategies are legitimate engineering goals, but they can create subtle failure modes when they interact with long-context tasks, tool use, and multi-step reasoning.
Anthropic’s fixes and safeguards—and what developers should watch next
Anthropic says it has resolved the issues by reverting the reasoning effort change and the verbosity prompt, and by fixing the caching bug in version v2.1.116. For teams that rely on consistent behavior in production workflows, those reversions matter because they reset the conditions that triggered the observed decline.
The company is also moving toward stronger internal validation and change control. Misryoum notes three safeguards in particular: expanded evaluation suites (including per-model evaluations and ablations for prompt changes), tighter gating so system changes are limited to their intended targets, and enhanced internal dogfooding that requires staff to use the exact public builds. Together, these steps are meant to catch regressions before developers experience them as lost trust.
There's one more practical step: Anthropic has reset usage limits for all subscribers as of April 23, aiming to address the token waste and performance friction caused by the bugs. For end users, that's not just a refund gesture; it's a signal that operational issues are being treated as product reliability failures, not side effects.
Looking ahead, Anthropic says it plans to use its @ClaudeDevs accounts to provide deeper reasoning behind product decisions and maintain a more transparent dialogue with its developer base. If that becomes consistent, Misryoum expects it could shorten the rumor-to-benchmark-to-outcry cycle that often follows behavioral regressions in fast-moving AI stacks.
For developers, the immediate takeaway is simple: when you see "degradation" claims, don't assume the weights changed. Ask what changed in the harness: defaults, caching, prompt constraints, and tool-call behavior. In this episode, Anthropic's own diagnosis suggests that even small wrapper changes can compound into a noticeably worse experience.