XBOW tests Anthropic’s Mythos Preview for offensive security

4 8 minutes read


XBOW says Anthropic’s Mythos Preview markedly improves vulnerability discovery, especially when source code is available, while also showing clear limits around exploit validation, judgment, and live-site constraints.

For XBOW’s team, the moment came when the model stopped sounding impressive in theory and started producing leads that their security work could actually stress-test.

About three months ago, Anthropic invited XBOW to help assess a new model it believed represented a significant shift in capability. XBOW put Mythos Preview through its security gauntlet: benchmarks, workflows, interactive use, and integrations.

Weeks of trial and calibration followed. Now, after early access to Mythos Preview for “early capability testing,” XBOW is publishing what it says it found, down to how the tests were run and where the model’s strengths and weaknesses showed up.

XBOW’s headline takeaway is blunt: Mythos Preview is “a major advance.” It is substantially better than prior models at finding vulnerability candidates, particularly when source code is available. In XBOW’s account, the model communicates with unusual technical precision, reasons well about code, and shows strong promise in complex areas like native-code analysis and reverse engineering.

But XBOW also draws a line between useful vulnerability leads and the hard work that follows. “It’s not magic,” the evaluation stresses, because “a model is a brain without a body.” Source code audits are mostly “brain activity,” XBOW argues, while live site pentests need a “body” with skill and control that matches the model’s power.

The testing work was designed to measure that gap.

XBOW built a diverse team of 10 experts from different parts of the company to assess the model from different directions. Every model was tested with XBOW’s internal benchmarking system, which it also used to analyze Opus 4.7 and GPT-5.5. The system uses open-source applications where vulnerabilities were previously discovered, freezes them at the vulnerable version, and runs XBOW’s agents against them.

This time, XBOW broadened beyond the standard measurement angles and evaluated: the model’s judgment regarding threat modeling, vulnerability validation, and safety; its ability to read source code versus interact with live systems; and its ability to find exploits the team was not yet looking for in its standard assessments, including native app vulnerabilities.

A terminology note runs through XBOW’s report because “Mythos” can mean different things. In this evaluation, XBOW explored Mythos Preview both inside Claude Code and as a raw model, using it via its API as an engine for XBOW’s agents. XBOW separates these cases because orchestration, tools, prompting, and live-site access “materially affect outcomes.”

In interactive use, XBOW’s testers were struck by how little direction the model needed. One tester, quoted in the evaluation, said: “This is a lot closer to ‘just go and find something’ than anything I’ve seen so far.”

XBOW then gave it its own source code, and says Mythos Preview found weaknesses. The report describes the findings as “nothing truly terrible,” but also says there were several items the team wanted to repair.

When XBOW tried the model on open-source software, the pace accelerated: by the end of week one, XBOW says it had “quite a few new vulnerabilities” that had to be disclosed.

Benchmarks produced a different kind of reaction—less awe, more data. XBOW says the model’s results showed where it was “runaway powerful” and where it represented only a “modest advance.”

A key distinction is repeated throughout the evaluation: finding a vulnerability isn’t the same as proving it’s exploitable.

XBOW’s web exploit benchmark is built to test whether a model can help it find validated, actionable vulnerabilities in live website environments. A case passes only when the system finds a validated way to act on the vulnerability (PoC||GTFO) after a series of 80 “actions.” XBOW defines an action as something like a shell or a Python script using standard commands or XBOW’s suite of attack tools.
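The pass rule can be sketched in a few lines. This is only an illustration, not XBOW’s harness: the function and parameter names (`case_passes`, `run_action`, `is_validated_poc`) are hypothetical, and it assumes the 80 actions act as a cap, so a lead that is never proven within that series counts as a miss.

```python
from typing import Callable, Iterable, Optional

ACTION_BUDGET = 80  # the article's figure: a case is judged over a series of 80 actions


def case_passes(
    actions: Iterable[str],
    run_action: Callable[[str], Optional[str]],
    is_validated_poc: Callable[[str], bool],
    budget: int = ACTION_BUDGET,
) -> bool:
    """Return True only if an action within the budget yields a validated PoC.

    All three arguments are hypothetical stand-ins: `actions` is whatever the
    agent decides to run, `run_action` executes one script and returns its
    output (or None), and `is_validated_poc` is the independent check that
    the output really demonstrates the vulnerability.
    """
    for count, action in enumerate(actions, start=1):
        if count > budget:
            break  # budget exhausted: an unproven lead does not count as a pass
        result = run_action(action)
        if result is not None and is_validated_poc(result):
            return True
    return False
```

The point of the structure is that the validator, not the model, has the final word: a plausible-sounding lead never turns a failing case into a passing one.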

Opus 4.7 is not included in the chart because XBOW says Opus 4.7 interacts with its system in a unique way, making the stat “less relevant.”

Even so, compared to the newest model at the time—Opus 4.6—XBOW reports a strong increase for Mythos Preview. XBOW says the number of false negatives was cut by 42%. In a variation where XBOW gave both models the site’s source code, the cut was even larger: 55%.

XBOW says that theme held up across the work: Mythos Preview is impressive at writing code, but even more impressive at reading it.

The evaluation also describes how performance changed under different framing. XBOW says Mythos Preview finds vulnerabilities in significantly fewer iterations than Opus 4.6, though the gap to GPT-5.5 was less pronounced. XBOW further argues the story becomes clearer when looking token-for-token rather than action-for-action, noting that a model could take many small steps or fewer large steps, and that this should matter less than output length.

XBOW also changes how it reads the results. Instead of relying only on mean pass rate (the probability of finding a vulnerability), it says it is “more instructive” to look at the odds of discovery: the hit rate divided by the miss rate.
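The odds measure is simple to compute. The sketch below uses hypothetical pass rates, not XBOW’s results, to show why the two scales can tell different stories near the top of the range.

```python
def discovery_odds(pass_rate: float) -> float:
    """Odds of discovery: hit rate divided by miss rate."""
    if not 0.0 <= pass_rate < 1.0:
        raise ValueError("pass_rate must be in [0, 1); odds are undefined at 1")
    return pass_rate / (1.0 - pass_rate)


# Hypothetical pass rates, chosen only to show how the two scales differ.
for rate in (0.50, 0.90, 0.95):
    print(f"pass rate {rate:.0%} -> odds {discovery_odds(rate):.1f} to 1")
```

Going from a 90% to a 95% pass rate looks like a five-point gain, but the odds roughly double, from about 9 to 1 to about 19 to 1, because the miss rate has been halved.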

Under those considerations, XBOW says Mythos Preview shows “absolutely unprecedented precision” in homing in on a vulnerability.

Still, XBOW insists that live-site validation remains the hard part.

The evaluation argues that many exploitable issues don’t show up as obvious defects in application source code. They emerge from configuration, dependencies, deployment choices, or the way otherwise safe components are combined.

XBOW uses a scenario to make the point: a dependency on its own could be safe, and the source code could be safe. But when the source code uses the dependency in an unsafe way, a vulnerability can appear.

That matters to XBOW’s pitch because XBOW runs pentests targeting a live site “the way an attacker sees it,” while Mythos Preview as used by tools like Project Glasswing is framed as excelling at auditing source code “the way a developer sees it.”

XBOW then tests the imbalance directly: because of how XBOW harvests its web benchmark set, it says it is possible to find the vulnerability from code alone in that set. So XBOW asks whether Mythos Preview can find something interesting without live-site access.

XBOW says that even for these benchmarks, where the vulnerability is purely in the code, removing access to the live site hurts performance more than removing access to source code. XBOW takes this as reinforcing its claim that it provides a “safe, structured way” for models to interact with real application behavior and prove which findings are actually exploitable.

At that point, XBOW frames the question that keeps defenders up at night: which findings are exploitable, reproducible, safe to test, and worth fixing.

In XBOW’s view, the answer lies in combining Mythos Preview’s source-code analysis with a live-site validation layer. It says the best detection pattern comes when XBOW orchestrates Mythos Preview to analyze source code for a lead, probe the live site to understand how the weakness reflects in deployment, and craft an exploit from it.
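That three-stage pattern can be pictured as a small pipeline. This is a rough sketch under stated assumptions, not XBOW’s orchestration code: `Lead`, `Finding`, `run_pipeline`, and the three callables are hypothetical names, and each stage is passed in as a plain function so the control flow is visible.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Lead:
    location: str    # where in the source the weakness was spotted
    hypothesis: str  # what the model thinks might be exploitable


@dataclass
class Finding:
    lead: Lead
    proof: str  # evidence that the exploit worked against the live site


def run_pipeline(
    find_leads: Callable[[], Iterable[Lead]],
    probe_live: Callable[[Lead], Optional[str]],
    craft_exploit: Callable[[Lead, str], Optional[str]],
) -> List[Finding]:
    """Source analysis -> live probe -> exploit; keep only proven findings."""
    findings: List[Finding] = []
    for lead in find_leads():                      # stage 1: read the source
        observation = probe_live(lead)             # stage 2: check the deployment
        if observation is None:
            continue                               # weakness not reachable live
        proof = craft_exploit(lead, observation)   # stage 3: try to prove it
        if proof is not None:
            findings.append(Finding(lead, proof))
    return findings
```

A lead that cannot be observed on the live site, or cannot be turned into a working exploit, simply drops out, which mirrors the article’s distinction between vulnerability candidates and validated findings.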

The evaluation adds another comparison: even though Mythos Preview suffers when denied live-site access, XBOW says other models suffer even more. That, it says, confirms Mythos’ greatest strength is reading source code.

The report also describes results across other areas. Mythos Preview’s judgment is described as mixed. XBOW says the model’s judgment across command safety, threat modeling, and trace triage is often careful and precise, yet also literal and conservative. It rejected false positives better than many predecessors, but sometimes lost true positives when the evidence didn’t satisfy its criteria or when the intended rule was broader than the written one.

XBOW singles out a command safety benchmark as a moment that “slightly shocked” its testers. The benchmark asks models to consider whether a given script is safe to execute without impacting the target site. XBOW says it hand-labeled a large set of example cases near the edge of the decision boundary. In that evaluation, Haiku 4.5 achieved 90.1% accuracy.

XBOW says it optimized prompts for Haiku 4.5, so it compares against Opus 4.6, which had 81.2% accuracy. Mythos Preview, XBOW says, had only 77.8%. When XBOW looked deeper into the reasoning, it says the model often had a point: the script was technically not against the letter of the rules, but against the spirit. XBOW contrasts this with Opus 4.6, which it says prioritized the spirit, while Mythos prioritized the letter.

Outside web applications, XBOW says the model showed substantial strength in native-code vulnerability discovery and reverse engineering. In Chromium-related testing, it says Mythos Preview found more real bugs with fewer false positives than prior baselines. In V8 sandbox work, it says Mythos Preview identified true positives in a subtle threat model where previous approaches had produced many findings but no true positives. XBOW also says it was capable of triaging both its own results and competitor-model findings.

For reverse engineering, XBOW calls the results among the most striking, saying the model reasoned through unusual firmware and embedded systems contexts, including architectures and operating-system combinations that required more than rote pattern matching.

XBOW adds that browser interaction and visual acuity were strong enough for practical workflows. XBOW says its workflows often require models to interact with live websites through a browser interface and that visual acuity matters: the model needs to identify UI elements and click accurately. It says Mythos Preview performed “extremely well” on XBOW’s visual-acuity QA, roughly matching Sonnet 4.6 and dramatically outperforming Opus 4.6. XBOW says it was not perfectly pixel-accurate when asked for exact coordinates, but was practically effective at selecting the right browser actions.

The evaluation also notes Opus 4.7 “shone” at this benchmark, and suggests a different story than “Mythos Preview is good”: it says this could be an area where recent Anthropic models had begun to deteriorate, and that Anthropic had caught that deterioration and reversed it.

Then comes the cost reality check.

XBOW calls Mythos Preview a “true titan,” but says titans are expensive. It reports that Mythos Preview is not yet available over public APIs at the time of writing. It also says Anthropic mentioned Mythos Preview would be 5x as expensive as an Opus model, which is already among the more expensive options token-for-token.

XBOW asks what that means in practice: whether an agent powered by a different model given more time could deliver better accuracy for less cost.

XBOW says that, once it normalizes by estimated running cost, the picture is clearer. Mythos Preview is not “terribly inefficient” if someone wants high accuracy, but it is not “best-in-class” on XBOW’s benchmarks either.
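The trade-off between one expensive attempt and several cheaper ones can be made concrete with a toy calculation. Everything numeric here is an assumption for illustration: the single-attempt success rates are invented, the 5-to-1 cost ratio merely echoes the price multiple quoted in the article, and retries are treated as independent, which real attempts against the same target usually are not.

```python
def success_within_attempts(p_single: float, attempts: int) -> float:
    """Chance of at least one success in `attempts` independent tries."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** attempts


def attempts_within_budget(budget: int, cost_per_attempt: int) -> int:
    """Whole number of attempts a fixed budget can pay for."""
    if cost_per_attempt <= 0:
        raise ValueError("cost_per_attempt must be positive")
    return budget // cost_per_attempt


BUDGET = 5  # arbitrary cost units

# Hypothetical models: (success rate per attempt, cost per attempt).
pricey = success_within_attempts(0.60, attempts_within_budget(BUDGET, 5))
cheaper = success_within_attempts(0.30, attempts_within_budget(BUDGET, 1))

print(f"one attempt with the pricier model:    {pricey:.1%}")
print(f"five attempts with the cheaper model:  {cheaper:.1%}")
```

Under these made-up numbers the cheaper model comes out ahead on the same budget, but the result flips easily: if the cheaper model’s per-attempt rate is low enough, or its failures are strongly correlated, repeated tries add little.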

XBOW ties that to a comparison it references: it says Point Estimate’s analysis of AI Security Institute benchmarking of Mythos Preview vs GPT-5.5 suggests Mythos Preview is powerful, but the real choice is whether to pay for an agent to use Mythos Preview for a bit or to use GPT-5.5 for as long as needed, depending on the use case.

XBOW’s closing argument is that frontier models have taken a major step forward in vulnerability discovery. It says Mythos Preview is strong at finding candidate vulnerabilities, especially from source code, and shows impressive ability across web, native-code, and reverse-engineering tasks.

But XBOW says the model needs the right harness and tools to reach its full potential. It also says it should be one of several options, because in some tasks it may be more sensible to let another model try several times than to let Mythos Preview try once.

Such considerations, XBOW says, are part of why it maintains a cadre of models instead of restricting itself to a single one.

XBOW ends by inviting readers to see Mythos Preview’s vulnerability validation capabilities in practice, saying contact for a demo is available. The post is labeled “Sponsored and written by XBOW.”


Ana Souza 1 hour ago


4 Comments

Dylan Myers says:
June 10, 2026 at 4:42 pm
So wait, they tested a model to find hacks? Cool cool.
Karen Thompson says:
June 10, 2026 at 4:44 pm
This sounds like it’s basically teaching someone how to break into things, just with extra steps. “Gauntlet”?? like a game. I dunno I’m not buying the safety vibe.
Marco Alvarez says:
June 10, 2026 at 4:46 pm
They said it’s better at finding vuln candidates especially with source code… so it’s only useful if you already have the code, right? which makes me think it’s not that big a deal. But also they mention “interactive use” so maybe it can still figure it out without it, I’m confused.
Tiffany Jenkins says:
June 10, 2026 at 4:48 pm
Honestly I don’t trust any of this. First they’re like “major advance” and “unusual technical precision,” then it says limits around exploit validation and live-site constraints. That just sounds like it stopped at the part where it becomes illegal lol. Also “early capability testing” feels like they want credit before anyone can say it’s dangerous.

