AI bots ignore evidence. Can we trust them with science?

A simple pen trick posed to major chatbots exposed a stubborn failure: even when shown live evidence, the systems didn’t update their predictions. The same problem shows up in a larger study of AI agents doing laboratory reasoning, where most tasks included ignored evidence.
A pen held perfectly level with both hands should, on paper, pivot downward the moment one end is released. That’s the ordinary physics answer—and it’s exactly what several popular chatbots told YouTuber FatherPhi.
But FatherPhi didn’t stop at the prediction. He filmed himself performing the experiment and showed it to each chatbot in a live video. When he released one side, the pen stayed out horizontally, and he was able to hold it there with just one hand.
Then came the surreal part: FatherPhi asked ChatGPT, “What just happened?” The bot replied that it saw the pen rotate exactly as expected. In separate videos, the other chatbots struggled in similar ways. They could identify the pen’s color and brand, but they could not update their predictions based on the new evidence FatherPhi showed them.
This wasn’t presented as a vision failure. Something subtler was breaking the link between what the system saw and what it concluded.
Walter Quattrociocchi, a computer scientist at Sapienza University of Rome, argues the lesson goes beyond one quirky demonstration. Developers could train a chatbot to give the correct answer to this particular pen problem, Quattrociocchi said, but that wouldn’t fix a broader issue: the tendency of these systems to fail to incorporate new data while working through a problem. If models keep treating evidence as optional, their performance in science, medicine, and other high-stakes settings may not be as reliable as people assume.
A more rigorous test of that concern appeared in a recent study of AI agents designed to reason through common chemistry scenarios. The researchers evaluated whether these agents could actually work like scientists.
The study focused on AI agents built on top of underlying large language models. The agents could act like an “Iron Man suit,” linking an LLM to tools so it could perform tasks independently: calling out to external software, simulated experiments, and, crucially, tools that could run real lab equipment.
The researchers gave the agents laboratory reasoning tasks, such as determining which chemicals are present in a mystery solution. They then tracked what happened at each step across 619 scientific reasoning tasks performed by the AI agents.
The results were blunt. In 68 percent of those tasks, the agents ignored evidence at least once. They made claims without any supporting evidence in 53 percent of the tasks. And they successfully used contradictory evidence to change their output only 26 percent of the time. The team reported the findings April 20 on arXiv.org.
One of the study’s experiments involved materials scientist N. M. Anoop Krishnan’s team connecting an AI agent to an atomic force microscope. The agent gathered its own evidence as it reasoned through questions related to chemistry research, with a screen showing the process.
Krishnan, a materials scientist at the Indian Institute of Technology Delhi in India, said human scientists follow an “iterative process” of coming up with a hypothesis, designing and performing experiments, and then revisiting those initial ideas as evidence accumulates. “That’s not the case with AI,” Krishnan said. “Even when you have clear evidence that shows that a particular line of investigation is not correct, [the AI] refuses to change the hypothesis or the plan.”
Kevin Jablonka, a study coauthor who leads a lab studying AI in materials science at Friedrich Schiller University Jena in Germany, framed the problem as a matter of trust in process, not just outcomes. “In science, you can’t typically trust a result unless you also trust the process it took to get there,” Jablonka said. He added that a “transparent and meaningful” process is essential.
The study’s benchmark, Quattrociocchi said, goes “a little bit beyond the classical idea of benchmark.” Most benchmarks measure results alone: did the system land on the right answer? Here, Krishnan, Jablonka and colleagues built a benchmark that checks the process the AI agent uses on the way to an answer.
The question now is whether “reasoning models” change anything. AI companies have developed what they call reasoning models, language models trained to break questions down and follow step-by-step processes. Once trained, they can output text at each step of their process, supposedly describing how the system is “thinking.” They can be paired with agent tools or used alone.
But Subbarao Kambhampati, a computer scientist at Arizona State University in Tempe, warns that the sense of thinking can be an illusion. In a 2025 lecture, he asked the audience to imagine talking to a fitness trainer over the phone. If the trainer tells you to do 10 crunches, you might make some noises that sound like you’re working hard and then say you’re done. You didn’t do them, but the trainer can’t tell. Similarly, Kambhampati previously said, reasoning models may be imitating what people say as they think through problems, without any actual reasoning.
“In general, telling whether a system is actually doing reasoning to solve the reasoning problem or using memory to solve the reasoning problem is impossible,” he previously told Science News.
Kambhampati and others point to patterns that don’t fit real scientific reasoning: models can get intermediate steps right but the final answer wrong, or the reverse. Models trained on nonsense reasoning steps can still produce right answers. And if the system can’t be trusted to verify its own intermediate process, the stability of its conclusions becomes harder to defend.
As for what that means for AI agents paired with reasoning models, researchers say it remains to be seen how they’ll perform on the new benchmark. But the skepticism isn’t waiting around.
For Jablonka and Krishnan, the bottom line is not that these systems should be discarded. They argue AI agents that combine tools, large language models, and reasoning models can be useful in science when the task is well-defined, when “we know exactly what we want,” Jablonka said. Krishnan notes that open-ended scientific reasoning is still beyond the technology’s current reach.
Yet this view clashes with what many companies push, according to Quattrociocchi. He says the message from big tech—and even parts of the scientific community—is that a new form of intelligence is emerging that will make people better. Quattrociocchi doesn’t see that.
He worries that today’s systems produce words and other content based on statistics without verification, and that this corrodes how knowledge is built. “The architecture of knowledge as we have known it until now is under attack,” he said. “Actually, I’m scared.”
Krishnan and Jablonka lean the other direction. If researchers understand the limitations of AI agents and reasoning models, Krishnan said, “we can actually improve [the technology] and lead it towards enabling meaningful and disruptive discoveries.”
For now, though, the pen experiment—and the 619-task benchmark built around whether evidence gets ignored or acted on—leaves a simple demand hanging in the air: show us the reasoning that can revise itself when reality changes.
Seems like a setup. If it was “live evidence” then why didn’t it magically know? AI is just guessing anyway.
I don’t get how they can show proof on video and the bot is still like “yep same outcome.” That’s kinda scary if you’re trusting it for science stuff.
Wait so the pen didn’t drop like physics says? Or did the guy just hold it differently? Either way, the bot saying it “saw it rotate” feels like it’s making up what happened instead of actually learning from the clip.
AI bots ignoring evidence is why I don’t trust anything that says it’s “reasoning.” Like people keep acting like these chat things are scientists when they can’t even adjust a basic pen trick. Also I’m not sure why it needed a YouTuber to test it… sounds like the whole point was already obvious?