Anthropic’s Claude Code gets help from 1,000 engineers
Anthropic is improving Claude Code through a project internally known as “Marlin,” using feedback and evaluations from about 1,000 human software engineers. Contractors working on the effort create prompts, run A/B tests between model outputs, and iterate unde
For people who write code for a living, the newest AI coding tools can feel like they’re “getting it” faster—until you look behind the curtain.
An Anthropic project is using feedback from about 1,000 human software engineers to improve the performance of Claude Code, the AI coding tool whose recent advancements have disrupted the vibe-coding industry. The effort, known internally at Snorkel AI as “Marlin,” aims to fine-tune Claude Code’s answers so the tool can mimic the level of work a professional developer would produce.
Anthropic and other AI companies often outsource data work to third parties like Snorkel. Those firms hire contractors to teach models specialist subjects and perform tasks meant to improve outputs. Contractor interviews and training material tied to these projects offer a window into the largely unseen workforce operating across the world.
Two contractors working on the Anthropic project told Business Insider that they are being paid $280 per task to create prompts and review code. Each task takes about an hour, the contractors said, with some submissions needing additional back-and-forth with Snorkel’s approval layer. The contractors also said they did not know what version of the models they were evaluating.
The “Marlin” work is ongoing. Contractors were directed to A/B test code written by two different models, comparing the outputs and selecting which they preferred, following project guidelines from Snorkel reviewed by Business Insider. One contractor said the project aimed to help ensure the model could deliver the level of detail expected in the prompt, essentially training Claude Code to write simplified, easier-to-maintain code.
That shift mirrors a broader trend in AI training work. As AI systems become more capable, data-labeling platforms have moved from generalist tasks to increasingly specialized work that requires field expertise or postgraduate degrees. Snorkel’s website says it works with people with advanced degrees, including PhDs, MDs, and JDs, or equivalent experience. The company says top experts earn over $3,000 a week.
Snorkel describes its clients as top labs, including Google, Mistral, and Anthropic. The Silicon Valley-based startup, founded in 2019 by Stanford researchers, says it creates datasets to improve AI models and creates tests for AI companies’ chatbots. Snorkel raised $100 million in Series D funding at a $1.3 billion valuation in May 2025. The company cut 13% of its workforce in September, Business Insider previously reported.
The job itself is structured like engineering work, not abstract labeling. Project “Marlin” instructed contractors to create scenarios in which software developers might use Claude Code. Contractors were told to select a GitHub repository from a list of thousands, then create a pull request, in which a developer proposes changes such as new features or bug fixes. They also had to create a prompt: a set of questions explaining what is expected of the model.
In one task, a contractor prompted the model to reorganize how the system stores and handles “execution metadata,” described as extra information about how things are run. The goal was to make the code clearer and easier for developers to work with, without changing how the product or feature actually works.
The model returned two sets of code. The contractor then chose which output they thought was more efficient. Contractors were also instructed to give follow-up prompts to “test how models handle conversation context,” according to the project directions.
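The comparison step described above can be pictured as a simple record of pairwise judgments. The sketch below is illustrative only: the article does not say how Snorkel or Anthropic store or use these judgments, and the `Comparison` class, `win_rates`, and `to_preference_pair` are hypothetical names invented for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    """One hypothetical A/B judgment: a prompt, two outputs, and the reviewer's pick."""
    prompt: str
    output_a: str
    output_b: str
    preferred: str  # "a" or "b"


def win_rates(comparisons):
    """Return the fraction of comparisons won by each side."""
    if not comparisons:
        return {"a": 0.0, "b": 0.0}
    counts = Counter(c.preferred for c in comparisons)
    total = len(comparisons)
    return {side: counts[side] / total for side in ("a", "b")}


def to_preference_pair(comparison):
    """Turn one judgment into a (chosen, rejected) record, a common shape for preference data."""
    if comparison.preferred == "a":
        chosen, rejected = comparison.output_a, comparison.output_b
    elif comparison.preferred == "b":
        chosen, rejected = comparison.output_b, comparison.output_a
    else:
        raise ValueError(f"preferred must be 'a' or 'b', got {comparison.preferred!r}")
    return {"prompt": comparison.prompt, "chosen": chosen, "rejected": rejected}
```

Under these assumptions, a reviewer's pick becomes a chosen/rejected pair, and tallying picks across many tasks shows how often one model's output is preferred over the other's.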
Another task focused on security. The contractor prompted the model for a security fix tied to how MLflow, an open-source machine learning platform, downloads Python packages when loading certain models. The task’s instructions required an evaluation of production-ready code based on correctness, security, reliability, and maintainability. The fix was also required to “properly block command injection attempts while still allowing all legitimate whitelisted pip options.”
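To make that requirement concrete, here is a minimal sketch of one general way to allowlist pip options. It is not MLflow's code and not the fix the contractor reviewed, neither of which appears in the material. The allowlist contents and the `build_pip_command` helper are assumptions chosen for illustration.

```python
import sys

# Hypothetical allowlist for illustration: option name -> whether it takes a value.
ALLOWED_PIP_OPTIONS = {
    "--no-deps": False,
    "--index-url": True,
    "--extra-index-url": True,
    "--trusted-host": True,
}


def build_pip_command(requirement, options):
    """Build an argv list for `pip install`, rejecting anything off the allowlist.

    The result is meant to be run without a shell, so characters such as
    ';' or '|' are never interpreted as shell syntax.
    """
    argv = [sys.executable, "-m", "pip", "install"]
    i = 0
    while i < len(options):
        name, has_equals, inline_value = options[i].partition("=")
        if name not in ALLOWED_PIP_OPTIONS:
            raise ValueError(f"pip option not allowed: {options[i]!r}")
        if ALLOWED_PIP_OPTIONS[name]:
            if has_equals:
                value = inline_value
            else:
                i += 1
                if i >= len(options):
                    raise ValueError(f"missing value for {name}")
                value = options[i]
            # A value that looks like an option could smuggle in an extra flag.
            if not value or value.startswith("-"):
                raise ValueError(f"invalid value for {name}: {value!r}")
            argv += [name, value]
        else:
            if has_equals:
                raise ValueError(f"{name} does not take a value")
            argv.append(name)
        i += 1
    # The package spec must not be mistaken for an option either.
    if not requirement or requirement.startswith("-"):
        raise ValueError(f"invalid requirement: {requirement!r}")
    argv.append(requirement)
    return argv
```

A caller would pass the returned list to `subprocess.run(cmd, check=True)` without `shell=True`. The design choice in this sketch is to validate options against a known set and keep arguments as a list, rather than trying to escape a command string.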
None of this work ends in a finished PR on a developer’s desk. The process runs through evaluation loops, selection steps, and approvals—so the model can be pushed closer to what the prompt expects and what a developer would accept.
The industry’s contractor-heavy approach extends beyond Snorkel. Startups including Scale AI, Mercor, and Handshake are among the platforms that pay hundreds of thousands of contractors around the world to filter, rank, and train AI responses for major tech companies. The work, as described in the source material, helps improve systems used in areas ranging from self-driving cars to chatbots built by companies such as OpenAI and Meta.
Neither Anthropic nor Snorkel responded to requests for comment from Business Insider.
In the end, Claude Code’s latest gains arrive with a familiar modern twist: a software tool that improves in the lab—and sharpens through the work of people whose names rarely make it into release notes.
So it’s basically just crowd-sourcing code again?
280 per task for an hour is wild. I read that as “they’re paying 280 bucks” like it’s nothing, and meanwhile people are acting like the AI is doing all the work. Also why don’t they know which model version??? That seems sketchy.
Wait, “Marlin” is the name of the contractors project? I thought it was like a new programming language or something lol. If they’re A/B testing outputs, isn’t that just humans picking favorites? That’s not really “training” so much as… choosing the cleaner answers?
This is exactly why I don’t trust AI code tools. If 1,000 engineers are feeding it prompts and checking results, then it’s just stealing work but with extra steps. And they say they made it write “simplified” maintainable code… well yeah, because humans probably keep rejecting the messy stuff. Also $280 a task? That’s gonna attract the cheapest “engineers” or whatever, not the best. Idk, feels like a bait-and-switch for vibe-coding.