Instructing an AI coding agent to follow TDD best practices doesn’t just fail to help: it pushes code regressions to nearly 10%. That’s the brutal reality exposed in Pepe Alonso’s new paper on Test-Driven Agentic Development (TDAD).
The research delivers a stark warning against treating “prompt engineering” as a strategy. The team compared standard coding workflows against TDAD, a method that uses an AST-based graph to pinpoint exactly which tests are impacted by a code change—before the agent writes a single line. The results were decisive: regressions plummeted by 70%, dropping from 6.08% to 1.82%, while resolution rates climbed significantly.
Here’s the kicker: when researchers simply prompted the model to “write tests first,” regressions spiked to 9.94%. Without context, the AI hallucinated irrelevant tests, creating noise instead of safety. The AST graph succeeded not because it enforced a workflow, but because it delivered precise, high-value context.
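To make the filtering idea concrete, here is a minimal sketch of AST-based test impact analysis in Python. This is not the paper's implementation: it assumes a toy convention where a test is "impacted" if it directly calls a function named in the change set, with no transitive dependency tracking. The function names (`called_names`, `impacted_tests`) and the example tests are illustrative.

```python
# Hypothetical sketch: select only the tests impacted by a change set,
# by walking the AST of the test file. Direct calls only; the paper's
# graph is far richer (transitive deps, methods, imports).
import ast

def called_names(source: str) -> dict[str, set[str]]:
    """Map each top-level function in `source` to the bare names it calls."""
    calls = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            names = set()
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    names.add(sub.func.id)
            calls[node.name] = names
    return calls

def impacted_tests(test_source: str, changed: set[str]) -> list[str]:
    """Tests that directly call a changed function (no transitive closure)."""
    return [name for name, deps in called_names(test_source).items()
            if name.startswith("test_") and deps & changed]

tests = """
def test_pricing():
    assert compute_price(2) == 4

def test_auth():
    assert login("u", "p")
"""

print(impacted_tests(tests, {"compute_price"}))  # only the pricing test is selected
```

The payoff is exactly the "precise, high-value context" point above: instead of handing the agent every test in the suite, you hand it the two or three tests its change can actually break.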
For builders of AI organizations, the implication is clear: Context > Procedure.
We’ve been saying this for months. The future of autonomous development isn’t a smarter generalist LLM—it’s a multi-agent architecture where “Specialists” filter the world for the “Doers.” The AST graph in this paper is a Specialist. It reduces cognitive load, keeping the coder focused and effective.
Even more powerful? An “Auto-Improvement Loop” that boosted resolution rates from 12% to 60% on complex tasks. This is quantifiable proof of MachineMachine’s core thesis: double-loop learning. An agent that fixes code is useful. One that updates its own testing graph based on failures is indispensable.
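The double-loop idea can be sketched in a few lines. This is our illustration, not the paper's mechanism: a hypothetical `TestGraph` where the inner loop fixes code and the outer loop repairs the graph itself whenever an unpredicted test fails, so the same blind spot never recurs.

```python
# Hypothetical sketch of double-loop learning over a test-impact graph.
# Inner loop: the agent fixes code. Outer loop: when a test the graph did
# NOT predict still fails, the graph records the missing edge.
from collections import defaultdict

class TestGraph:
    def __init__(self):
        self.edges = defaultdict(set)  # function name -> tests known to depend on it

    def predicted_tests(self, changed_fn: str) -> set[str]:
        return set(self.edges[changed_fn])

    def record_failure(self, changed_fn: str, failed_test: str) -> None:
        # Outer loop: a surprise failure becomes a permanent edge.
        self.edges[changed_fn].add(failed_test)

graph = TestGraph()
# A change to compute_price broke a test the graph never predicted:
graph.record_failure("compute_price", "test_invoice_total")
# Next time compute_price changes, that test is in the predicted set.
print(graph.predicted_tests("compute_price"))
```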
There’s a catch: building and maintaining AST graphs is computationally expensive and brittle on legacy or messy codebases. If the code can’t be parsed, the system collapses. This isn’t a silver bullet for every repo—we still need generalist reasoning for the messy, unstructured reality of real-world code.
At MachineMachine, we’re integrating this graph-based impact analysis into BenchmarkSuite v2. This directly addresses the synthesis truncation failures we tracked last week. Instead of fighting token limits by cramming entire test suites into a prompt, we’re building agents that aggressively filter for only the highest-impact tokens.
Stop lecturing your agents about “clean code” protocols. Start giving them better context.
Get early access at machinemachine.com/early-access
MachineMachine is building the platform for autonomous AI organizations. Early access →