The promise of AI-powered software development is profoundly appealing: describe what you want, and watch the product materialize. However, as AI coding tools matured through 2025, research and real-life production experiences revealed a complex reality. Productivity gains are real, but they are highly conditional, unevenly distributed, and often absorbed by downstream bottlenecks.
As we reflect on how AI-augmented engineering evolved in 2025, let’s examine the findings from four prominent studies published last year, along with our own insights at AgileEngine. Looking at this data can shed light on how engineering teams can more effectively integrate AI agents and assistants into production workflows.
If you want the actionable advice immediately, scroll down to “Practical implementation guide,” but you will be missing out.
The research landscape
Study 1: METR’s controlled field experiment
METR (Model Evaluation & Threat Research), a non-profit focused on AI research, recently conducted a highly controlled randomized trial with surprising results. The trial tracked 16 experienced open-source developers as they completed 246 real-world tasks, randomly categorized as “AI-allowed” or “AI-disallowed.” The researchers then compared the developers’ performance across the two categories.
Now, for the surprising results: METR estimated that using AI tools slowed task completion down by 19%. The research also included predictions from different expert groups, and, needless to say, the expectation gap is significant:
- 24% faster task completion for “AI-allowed” tasks forecast by participating developers before the study
- 20% speed increase estimated by the same developers after the study completion
- 39% speedup expected by economists
- 38% speedup predicted by ML experts
Despite the small sample size, the study demonstrated strong internal validity through real-world tasks and highly experienced participants. On average, the developers had five years of experience with the repositories they worked on and 1,500 prior commits each. For the “AI-allowed” task category, they primarily used Cursor Pro with Claude Sonnet 3.5/3.7.
Study 2: Enterprise AI deployment at Tata 1mg
The results of a longitudinal study conducted at the HealthTech company Tata 1mg paint a more optimistic picture. During this year-long study, the company tracked 300 of its engineers as they adopted a custom-built AI tool for coding assistance and automated code review. The study data gathered by the company demonstrates growth across key productivity metrics:
- 31.8% reduction in pull request (PR) review cycle time (attributed to the review agent, not just code generation)
- 60.1% increase in overall production code volume
- Adoption quality: The top 30 adopters achieved a 61% increase in the quantity of shipped code, while the bottom 30 adopters saw an 11% decline
- Seniority impact: Junior engineers (SDE1) demonstrated a 77% productivity increase while mid-level and senior engineers gained ~45%
- Acceptance rates stabilized at 35–38% despite massive growth in code generation volume
Study 3: Anthropic’s internal usage analysis
Anthropic, the company behind the Claude LLMs, regularly publishes results from its internal experiments and studies. In one such study, it analyzed 200,000 Claude Code transcripts, surveyed 132 engineers, and conducted 53 interviews to better understand AI adoption patterns and their impact on real-world workflows. Here are the key findings:
- Self-reported gains: Engineers reported a 50% increase in productivity (up from 20% the prior year).
- New work: 27% of AI-assisted work represented tasks that wouldn’t have been done otherwise (e.g., fixing papercuts, ad-hoc tooling).
- Verification: Most engineers could fully delegate only 0–20% of their work.
- Autonomy: Consecutive autonomous tool calls doubled (9.8 to 21.2) in six months, indicating AI is handling longer chains of logic.
- Task complexity: Average complexity handled increased from 3.2 to 3.8 (on a 1–5 scale).
Study 4: Faros AI’s telemetry analysis
In a study that dwarfs Anthropic’s and Tata 1mg’s in terms of sample size, Faros AI analyzed telemetry from 1,255 teams and more than 10,000 developers. This dataset reveals what the company calls an “AI productivity paradox”: while AI accelerates individual performance, progress often stalls at the organizational level.
- Throughput: Developers using AI completed 21% more tasks and merged 98% more PRs.
- The bottleneck: PR review time increased by 91%.
- Quality signals: Average PR size skyrocketed by 154% while bugs per developer increased by 9%.
- Organizational impact: Correlations between AI adoption and company-wide delivery metrics (DORA metrics) were weak or nonexistent.
How does this align with our experience at AgileEngine?
The four studies above differ in key findings but converge on many essential points. Developers report feeling more productive when using AI to write, understand, and debug code. Measurable gains also seem higher for junior engineers and professionals venturing outside their primary area of expertise (e.g., “citizen developers” generating code or engineers becoming more “full-stack”).
These points generally align with what we’ve observed while tracking AI use at AgileEngine. Here are some of the most prominent adoption outcomes and anecdotal findings:
- The 50% benchmark: Engineers and PMs consistently reported self-estimated 50%+ reductions in development time, with gains primarily in the coding phase.
- Architectural validation: In one case, AI enabled a non-expert to validate a service extraction from a monolith “in less than half the time,” after which the project transitioned to domain experts for production implementation.
- Role expansion: PMs used Claude Code to build functional UIs for demos and pull from engineering backlogs, rather than waiting for engineering capacity to become available.
- Root-cause analysis speedup: Sourcegraph Deep Search identified the root cause of a bug in 2 minutes — the same job could’ve taken up to 4 hours if done manually. Cursor then helped fix the bug, write tests, and update the changelog — all in 10 minutes.
- Model efficiency gap: One comparison showed that Claude Opus 4 completed a task in 5 minutes (2 prompts), whereas Sonnet 4 took 1 hour (8 prompts).
Though anecdotal, these points likely reflect the outcomes most engineering teams are seeing. While some of the high hopes surrounding AI are 100% justified, the technology requires a strategic approach to meaningfully drive productivity.
The productivity paradox: why coding speed ≠ delivery speed
The most critical insight for engineering leadership is the disconnect between coding speed and delivery velocity. The Faros AI data illustrates this vividly:
| Metric | The “AI effect” | Implication |
|---|---|---|
| Code generation | +98% PRs merged | Developers are writing much faster |
| Code complexity | +154% PR size | The units of work are getting larger and harder to parse |
| The bottleneck | +91% review time | Human reviewers cannot keep up with the pace of AI code generation |
| Quality | +9% bugs | Faster code generation correlates with more defects |
| Net result | Flat DORA metrics | Near-zero organizational acceleration |
What’s our takeaway here? AI tools shift the constraint from authoring code to reviewing and testing. If you equip developers with AI without upgrading your review pipeline, you will create a backlog of unreviewed code and increase technical debt.
Reconciling the contradictions: context is king
Why did METR record a 19% slowdown, while Tata 1mg found a 60% volume increase, and Faros AI reports a 98% growth in merged PRs? The difference lies in task familiarity and repository complexity.
The “expert penalty” (METR study)
The METR study participants were experts working on codebases they had maintained for around five years. As a result, there was likely a lot of subtle context that was unavailable to AI assistants because it wasn’t written down in the code.
This context gap is likely what slowed them down. The study participants had to spend more time refining their prompts, reading verbose code, and debugging subtle errors. Think of it as making the AI do the proverbial 20% of work that requires 80% of effort — and a deep tacit knowledge of the codebase. Doing this work themselves would have been faster.
The study identified several factors contributing to the slowdown:
- Implicit retained knowledge: Developers had important context they couldn’t easily convey to AI.
- Repository size and complexity: AI performed worse in complex environments with extensive implicit context.
- AI overoptimism: Developers used AI even when it wasn’t helpful, driven by overconfidence.
- Low reliability: Developers accepted less than 44% of the AI output.
- High developer familiarity: The more familiar developers were with their codebase, the greater the slowdown.
The generalist advantage (Tata 1mg & Anthropic studies)
The Tata 1mg and Anthropic studies examined broader, more diverse engineering populations working on varied tasks, often outside their core specialties.
Why they benefited from AI: A backend engineer building a React frontend or a junior developer navigating a new codebase benefits immensely from using AI to bridge knowledge gaps. According to an Anthropic backend engineer tasked with building a complex UI, “it did a way better job than I ever would’ve. I would not have been able to do it, definitely not on time.”
What this all tells us is that AI acts as a floor-raiser, not necessarily a ceiling-raiser. It helps you do things you are bad at, but it may slow you down on things you are already elite at.
The paradox of supervision
Another important concept that emerged from Anthropic’s research is what the company calls the “paradox of supervision.” Essentially, using AI for coding requires human supervision, but supervising AI requires the very coding skills that may atrophy from overusing AI assistance.
An Anthropic engineer articulated this clearly:
“I worry much more about the oversight and supervision problem than I do about my skill set specifically… having my skills atrophy or fail to develop is primarily gonna be problematic with respect to my ability to safely use AI for the tasks that I care about versus my ability to independently do those tasks.”
This has profound implications:
- AI needs experienced users to be effective. Senior engineers who “developed that experience the hard way” can leverage AI effectively because they know what good code looks like.
- Skill development requires deliberate practice. Some engineers combat this by deliberately practicing without AI: “Every once in a while, even if I know that Claude can nail a problem, I will not ask it to. It helps me keep myself sharp.”
- Junior developers face the greatest risk. While juniors see the most significant productivity gains (77% vs 45%), they’re also most vulnerable to stunted skill development.
Skill development risks
The junior developer’s dilemma
When leveraging AI, engineers can gain a lot by following a principle integral to the Eames Office and often quoted by its founders, Charles and Ray Eames: “Never delegate understanding.” It is especially vital for junior developers.
The evidence is clear that junior developers benefit the most from AI augmentation in the short term. But this comes with a serious downside: junior engineers using AI miss the “collateral learning” that happens during hands-on problem-solving. Delegating problem-solving to tools like Claude means spending less time on the incidental lessons that aren’t directly needed to solve the task at hand but are critical to professional development.
The apprenticeship crisis
With AI, one more critical learning venue is at risk. Matt Beane, PhD, Associate Professor at UC Santa Barbara, has studied intelligent automation and skill development across more than 30 occupations. His research shows that traditional apprenticeship models erode when experts can perform small, routine tasks with AI rather than assigning them to juniors.
Senior developers do more than review pull requests; they teach juniors how to think architecturally. Pair programming is important for the juniors’ growth, as it transmits tacit knowledge that’s hard to document. As engineers increasingly work with AI, fewer opportunities remain for the hands-on collaboration essential to helping novices become professionals.
Code ownership and responsibility
The code that an LLM outputs is code you sign with your name when you commit it to git. There is no “Gemini did it” or “Claude did it” — there is only “I did it.” This creates a review burden, as reflected in some of the studies we examined above.
When leveraging AI, METR study participants spent 9% of their time reviewing and cleaning LLM output. Three-quarters of participants reported reading every line of AI-generated code, and 56% said they needed to make major changes to clean it up before accepting.
In both the METR and Tata 1mg studies, developers rejected more code than they accepted (acceptance below 44% in METR, 35–38% at Tata 1mg). Essentially, most AI-generated code doesn’t meet production standards and requires careful inspection by a professional, which brings us back to the paradox of supervision.
Lessons learned from scaling AI impact at AgileEngine
As a software engineering company, AgileEngine has found AI augmentation transformative for both internal and client-facing workflows. Based on our experience adopting AI, we’ve documented several workflow patterns that apply to most software development projects:
- AI tools orchestration. Remember the earlier case of identifying a bug’s root cause with Sourcegraph and then handing it over to Cursor for bug fixing, writing tests, and updating the changelog? Using specialized tools to their strengths, with the developer managing handoffs, outperforms expecting any single tool to handle end-to-end workflows.
- Shift from “author” to “architect.” We see benefits in a structured workflow where humans set direction while AI accelerates execution. Teams that see the most gains enforce a “review-before-generate” pattern: AI proposes plans, and humans validate them before coding begins.
- Don’t trust, verify. Successful teams instruct agents to ensure 80%+ test coverage as part of the agentic code-generation task, not as an afterthought (see the coverage-gate sketch after this list). AI is also useful for generating end-to-end tests, supporting PR testing, and potentially reducing the workload on dedicated AQA engineers.
- Structured validation for larger initiatives. AI can be extremely helpful when validating architectural decisions in a step-by-step process. Here’s what these steps looked like in a recent project focused on the delivery of quick spikes and PoCs:
- AI-assisted review of the related documents and identification of possible approaches
- Detailing of the promising approaches and AI-assisted documentation of all key decisions
- Review of AI-generated coding plans
- Step-by-step implementation of the plan, with Git commits between steps
- Review of the changes by commit while ensuring unit test coverage
- Validation of the resulting changes via integration and manual end-to-end tests
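To make the “don’t trust, verify” principle enforceable, the coverage requirement can live in the pipeline rather than in the prompt alone. Here is a minimal sketch, assuming a Python project that uses pytest with the pytest-cov plugin; the src package name and the 80% bar are placeholders to adapt to your project:

```python
import subprocess
import sys

# Placeholders: adjust the package under test and the coverage bar for your project.
PACKAGE = "src"
MIN_COVERAGE = 80

def run_suite_with_coverage() -> int:
    """Run the test suite and fail if coverage drops below the agreed threshold."""
    result = subprocess.run([
        sys.executable, "-m", "pytest",
        f"--cov={PACKAGE}",
        f"--cov-fail-under={MIN_COVERAGE}",  # pytest-cov makes the run fail below the bar
        "--cov-report=term-missing",         # show uncovered lines to the reviewer
    ])
    return result.returncode

if __name__ == "__main__":
    # The same gate works in CI and as a local check after an agentic coding session.
    sys.exit(run_suite_with_coverage())
```

Agents can be instructed to run this gate before declaring a task done, which turns “80%+ coverage” into a verifiable output rather than a claim.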
A few more insights from the Faros AI study
Learning from other people’s experience is often better than learning from your own, so let’s look at the insights from Faros AI. The three points focus on the characteristics of organizations that successfully achieved strategic productivity growth via AI augmentation:
- Data-driven decision making. Top performers instrument the full development lifecycle — tracking throughput, flow efficiency, test coverage, and code quality. This observability establishes a pre-AI baseline and reveals where AI accelerates value or stalls it (see the metrics sketch after this list).
- Strong platform foundations. High-performing organizations treat AI enablement as a product, with platform teams building centralized prompt libraries, managing model deployment, and supporting telemetry integration. This also includes robust CI/CD pipelines, testing infrastructures, and review processes.
- AI-first mindset. Top-performing companies also leverage AI as a catalyst for structural change. They explicitly define where AI should be applied, set usage expectations by role, and embed AI training into onboarding and workflows.
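As a small illustration of the instrumentation point above, cycle time and change failure rate can be computed from data most teams already export. The sketch below is hypothetical: the record shapes are invented stand-ins for whatever your Git host and deployment tooling provide.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

# Hypothetical records exported from your Git host and deployment tooling.
@dataclass
class PullRequest:
    opened_at: datetime
    merged_at: datetime

@dataclass
class Deployment:
    caused_incident: bool

def cycle_time_hours(prs: list[PullRequest]) -> float:
    """Median open-to-merge time in hours, a more honest signal than LOC or PR counts."""
    durations = [(pr.merged_at - pr.opened_at).total_seconds() / 3600 for pr in prs]
    return median(durations)

def change_failure_rate(deploys: list[Deployment]) -> float:
    """Share of deployments that led to an incident or rollback."""
    return sum(d.caused_incident for d in deploys) / len(deploys)

prs = [PullRequest(datetime(2025, 6, 2, 9), datetime(2025, 6, 3, 15)),
       PullRequest(datetime(2025, 6, 4, 10), datetime(2025, 6, 4, 18))]
deploys = [Deployment(False), Deployment(True), Deployment(False), Deployment(False)]

print(f"Median cycle time: {cycle_time_hours(prs):.1f} h")        # 19.0 h
print(f"Change failure rate: {change_failure_rate(deploys):.0%}")  # 25%
```

Tracking these figures alongside AI adoption metrics is what makes it possible to tell whether extra PR volume is turning into faster, safer delivery.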
Practical implementation guide
Moving from organization-level principles to practical recommendations, here are specific technical strategies for production environments based on our experience.
1. Code review and PR culture
The Faros data (namely, 154% larger PRs and 91% longer review times) shows that, when used unsystematically, AI produces a backlog of unreviewed code. PRs serve a dual purpose in this respect: they’re both Pull Requests and Peer Reviews. They force you to write code you won’t be ashamed to show your colleagues. If you’re a sole developer on a project, it’s dangerously easy to slip into “vibe-coding” mode, where you don’t check generated code nearly as rigorously as you should.
Here are a few recommendations to be mindful of:
- Maintain production code standards: AI-generated code should pass the same review bar as human-written code. Professional engineers read every line of AI-generated code before committing. Passing tests and appearing to work correctly doesn’t mean the code is actually good.
- Enforce atomic PRs via tooling: AI output tends to be verbose, so humans must enforce brevity. Use automated rule enforcement tools to warn on PRs that exceed a predefined size (e.g., over 400 lines of code); see the sketch after this list.
- AI-augmented reviews: Use code review assistants for “pre-flight” checks. AI is better at catching syntax errors, typing issues, and test coverage gaps than humans, so you can reserve human cognitive load for architectural reviews.
- Context-aware commit messages: Demand high-quality AI-generated summaries for every PR to help reviewers navigate them more quickly.
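Here is the sketch referenced in the “atomic PRs” point above: a lightweight CI step that blocks oversized diffs. It assumes a main base branch and a 400-line budget, both of which are conventions to adjust rather than prescriptions from the studies.

```python
import subprocess
import sys

# Assumed conventions: base branch is "main", budget is 400 changed lines.
BASE_BRANCH = "main"
MAX_CHANGED_LINES = 400

def changed_lines(base: str = BASE_BRANCH) -> int:
    """Count added + deleted lines between the base branch and HEAD."""
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t", 2)
        # Binary files report "-" instead of line counts; skip them.
        if added.isdigit() and deleted.isdigit():
            total += int(added) + int(deleted)
    return total

if __name__ == "__main__":
    size = changed_lines()
    if size > MAX_CHANGED_LINES:
        print(f"PR too large: {size} changed lines (limit {MAX_CHANGED_LINES}). "
              "Consider splitting it into atomic PRs.")
        sys.exit(1)  # fail the CI job so the rule is actually enforced
    print(f"PR size OK: {size} changed lines.")
```

Run it as a required check so the size budget is enforced automatically instead of being renegotiated in every review.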
2. Task selection
Teach your team to triage tasks based on the METR and Anthropic findings:
| Task type | AI suitability | Reason |
|---|---|---|
| Boilerplate / CRUD | 🟢 High | High validation speed, low creation cost |
| New tech stack | 🟢 High | Bridges knowledge gaps (e.g., a Python dev writing Go code) |
| QoL improvements / Ad-hoc tools | 🟢 High | Low risk; wouldn’t be done otherwise |
| Complex refactoring | 🟡 Medium | AI struggles with broad context; requires heavy review |
| Core architecture | 🔴 Low | The tacit knowledge requirement is high |
| Critical security path | 🔴 Low | AI can be “smart in dangerous ways” (e.g., by creating problematic solutions that only experienced users can recognize as such) |
The key insight here is that it makes the most sense to use AI more heavily when validation is cheap relative to creation. At the same time, it’s generally wise to limit or avoid AI use for complex tasks that require familiarity, or when review overhead exceeds the time and effort required to write the code.
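If you want to bake this triage into a team playbook or a pre-task checklist, the table boils down to a simple heuristic. The sketch below is purely illustrative: the categories and the downgrade rule mirror the table and the validation-versus-creation principle, not any study’s formal model.

```python
from enum import Enum

class Suitability(Enum):
    HIGH = "use AI heavily"
    MEDIUM = "use AI with mandatory deep review"
    LOW = "prefer unassisted, expert-led work"

# Mirrors the triage table above; categories and ratings are illustrative, not exhaustive.
TRIAGE = {
    "boilerplate_crud": Suitability.HIGH,
    "new_tech_stack": Suitability.HIGH,
    "qol_or_adhoc_tooling": Suitability.HIGH,
    "complex_refactoring": Suitability.MEDIUM,
    "core_architecture": Suitability.LOW,
    "critical_security_path": Suitability.LOW,
}

def triage(task_type: str, validation_cheaper_than_creation: bool) -> Suitability:
    """Downgrade AI usage when reviewing the output would cost more than writing it."""
    rating = TRIAGE.get(task_type, Suitability.MEDIUM)
    if rating is Suitability.HIGH and not validation_cheaper_than_creation:
        return Suitability.MEDIUM
    return rating

print(triage("new_tech_stack", validation_cheaper_than_creation=True))     # Suitability.HIGH
print(triage("boilerplate_crud", validation_cheaper_than_creation=False))  # Suitability.MEDIUM
```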
3. Mitigating skill atrophy and the “junior dilemma”
- “Explain this” requirement: Engineers must always be able to explain why the AI’s solution works during code review. As a result, one can never unquestioningly accept AI-generated code.
- AI supervision apprenticeships. Taking the previous suggestion one step further, teams can turn pair programming sessions into “trio programming,” where seniors coach juniors in validating and refining AI suggestions. In this model, juniors can use AI to attempt solutions, and then explain their approach and the AI’s output to seniors. This way, juniors can develop domain knowledge and learn to effectively supervise AI, while seniors can dedicate saved time to mentorship.
- Unassisted hours: Consider designating specific tasks, code areas, or time blocks for unassisted development. Encouraging “unassisted hours” of this kind can be hugely beneficial for learning and skills development.
- Documentation and information search: AI summaries can be invaluable, but they are not a substitute for reading official documentation. In the METR study, developers spent less time actively searching for information when handling “AI-allowed” tasks, a dangerous trend given that AI output can prove unreliable.
- Knowledge sharing between humans. If you’re a manager, create opportunities for junior developers to work alongside their senior colleagues. As companies adopt AI tools, collaborations of this kind become less frequent, yet they remain critical for professional development.
4. Git hygiene and code ownership
- Make preliminary commits. When working with LLMs, commit early and often to your local branch. If your code is in good shape, commit it. You can squash commits and clean up git history before pushing to remote, but if your agent backs itself into a corner, it might reset all your local changes.
- Don’t allow AI to mess with Git, ever. Restrict AI tools from executing git commands. Let humans manage version control explicitly.
- No “blind merges.” Developers reject almost two-thirds of AI output, often because generated code does not meet production standards. Again, accepting AI code without a thorough review is negligence.
- You sign every LOC. As the submitter, you own the bugs. “The AI wrote it” is never a valid defense during an incident. The code that an LLM outputs is code you sign with your name when you commit it.
5. Context management for AI sessions
- Re-sync after manual edits. If you work turn by turn, alternating between manual edits and your agent, ask the AI to re-read the files you’ve changed. Many current tools won’t automatically know about your manual changes.
- Don’t fear fresh sessions. Long sessions degrade in quality, cost increasingly more tokens (and money), and become less productive over time. Starting fresh with a clean context often produces better results.
- Invest in reusable prompts and guidelines. Since you’ll be starting fresh sessions regularly, prepare accordingly. Write prompt libraries and agent instructions. Share them with your team. Put them in version control.
When it comes to context management, standardized formats become invaluable. Check out agents.md, an open format for guiding coding agents that’s already used by over 60,000 open-source projects. Think of AGENTS.md as a README for agents: a dedicated, predictable place to provide context and instructions. Many AI tools — including Cursor, Windsurf, Codex, GitHub Copilot’s coding agent, and others — can automatically read these guidelines.
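For illustration, a minimal AGENTS.md might look like the fragment below. Since the format is free-form markdown, every section, command, and project name here is a made-up example of one team’s conventions rather than a required schema.

```markdown
<!-- Illustrative example only; adapt sections and commands to your project. -->
# AGENTS.md

## Project overview
Payments service in services/payments. Python 3.12, FastAPI, PostgreSQL.

## Setup and checks
- Install dependencies: `pip install -e ".[dev]"`
- Run tests before proposing a diff: `pytest --cov=payments --cov-fail-under=80`
- Lint: `ruff check .`

## Conventions
- Keep changes small: one logical change per PR, ideally under 400 changed lines.
- Never run `git push`, `git rebase`, or other history-rewriting commands; humans manage version control.
- Add or update unit tests for every behavior change.

## Things to avoid
- Do not edit database migration files without an explicit instruction.
```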
Beyond AGENTS.md, many AI coding tools also allow you to save workflows, custom instructions, or project-specific rules that persist across sessions. Use these features to encode your team’s conventions, code style preferences, and project-specific context. This investment pays off quickly, especially when you’re starting fresh sessions regularly.
Key takeaways and closing thoughts
- The Productivity Paradox is real. Individual gains (21% more tasks, 98% more PRs) don’t automatically translate to organizational impact. With AI, coding is faster, but delivery often isn’t. Don’t measure productivity by lines of code, because AI tends to inflate LOCs and PR counts. Instead, measure cycle time and change failure rate.
- Context determines impact. AI slows down experts on familiar codebases (19% slowdown) while helping generalists tackle unfamiliar domains much faster (77% for junior devs). AI is a floor-raiser, but it’s not necessarily a ceiling-raiser.
- Adoption quality matters more than quantity. Heavy users who use AI well see 61% gains; heavy users who use it poorly experience an 11% decline. Track whether AI usage is actually improving outcomes.
- Human review is now the bottleneck. PR review time increased 91% even as PR volume doubled. Organizations must invest in review processes alongside AI tools. Without improving review, testing, and deployment, AI’s benefits are absorbed by system bottlenecks. As Amdahl’s Law reminds us, overall speedup is limited by the parts of the process you don’t accelerate.
- The supervision paradox is real. Using AI effectively requires skills that can atrophy from AI overuse. You need senior engineers to supervise AI, but if juniors rely solely on AI, they will never become senior experts capable of such supervision. To mitigate this issue, you need to proactively invest in mentorship and skill acquisition.
- Junior developers face a dilemma. While seeing the biggest short-term gains (77%), they face the highest long-term development risks. “Never delegate understanding” is probably the most important principle for junior developers right now.
- Review standards can’t slip. Most AI code is rejected or significantly modified. Larger PRs (+154%) mean more bugs (+9%), and with more bugs, automated testing is no longer optional.
- Expect a “J-curve.” Productivity may dip initially as teams learn when to use the tools and when to ignore them. In the Tata 1mg study, gains kept growing over six months before stabilizing.
- Invest in reusable context. Use standardized formats like agents.md and your tool’s custom instruction features to encode conventions and reduce session startup costs.
AI is not magic — it’s more like a power tool. When handled by a novice without supervision, it can be useless — or outright dangerous. But in the hands of an expert with the right process, it is a force multiplier.
Need expert support to optimize your engineering process to maximize the gains? By leveraging our expert consultancy and AI-assisted engineering services, companies can build top-notch digital products up to five times faster. Book a call with our experts to explore optimal strategies for AI adoption and product development.