Optimizing for success is the fastest way to build a stupid system.

That’s the central insight of Internalizing Agency from Reflective Experience (Ge et al., 2026). The paper shows that standard outcome-driven reinforcement learning—where models are rewarded only for correct answers—leads to “distribution sharpening.” In simple terms: the agent gets better at winning, but at the cost of creativity and exploration. It converges on known paths and stops trying new ones.
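To see the failure mode concretely, here’s a toy sketch (ours, not the paper’s code): a softmax policy choosing among three nearly-equal solution paths, trained with REINFORCE on a 0/1 outcome reward. The success rates are made up; the collapse is the point.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
success_prob = np.array([0.80, 0.75, 0.70])  # hypothetical per-path success rates
logits = np.zeros(3)                         # the policy's only parameters

for step in range(2000):
    probs = softmax(logits)
    arm = rng.choice(3, p=probs)
    reward = float(rng.random() < success_prob[arm])  # 1 only for a "correct answer"
    grad = -probs
    grad[arm] += 1.0                                  # REINFORCE: grad of log pi(arm)
    logits += 0.1 * reward * grad                     # reinforce whatever just won

probs = softmax(logits)
entropy = -(probs * np.log(probs + 1e-12)).sum()
print(probs.round(3), "entropy:", entropy.round(3))
# Typical result: one path holds nearly all the probability mass and entropy
# sits near 0. Exploration is gone, even though the runner-up paths were
# almost as good.
```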

The fix? LEAFE—Learning from Alternative Futures via Exploration. Instead of discarding failed trajectories, LEAFE makes them valuable. When an agent fails, it pauses. It backtracks to critical decision points, explores plausible alternatives, and generates reflections on what went wrong and what could have been different. These reflective experiences are then distilled into the model through supervised fine-tuning, effectively teaching it to learn from its own reasoning.
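For the curious, the loop looks roughly like this. This is a minimal sketch under our own naming; every function here (run_agent, decision_points, rollout_from, reflect, finetune) is a hypothetical placeholder for a model-backed component, not the paper’s API.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: list      # the agent's actions/reasoning so far
    success: bool    # did the final outcome succeed?

def run_agent(task):          # placeholder: roll out the current policy on a task
    return Trajectory(steps=["plan", "act", "act"], success=False)

def decision_points(traj):    # placeholder: locate the critical branch points
    return range(len(traj.steps))

def rollout_from(traj, i):    # placeholder: backtrack to step i, try an alternative
    return Trajectory(steps=traj.steps[:i] + ["alternative"], success=True)

def reflect(failed, alt):     # placeholder: the model writes down the lesson
    return "Diverged at the branch point; the alternative branch succeeded."

def finetune(model, examples):  # placeholder: ordinary supervised fine-tuning
    return model

def leafe_round(model, tasks):
    examples = []
    for task in tasks:
        traj = run_agent(task)
        if traj.success:
            continue                        # successes need no repair
        for i in decision_points(traj):     # pause and backtrack
            alt = rollout_from(traj, i)     # explore a plausible alternative
            if alt.success:                 # found a better future
                examples.append((traj.steps, reflect(traj, alt)))
                break
    return finetune(model, examples)        # distill reflections into weights
```

The key design choice: the training signal comes from the contrast between a failed path and its repaired alternative, not from the outcome alone.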

This shift—from outcome optimization to reflective learning—builds agents that adapt, question, and recover. They don’t just chase rewards; they understand them.

If you’re working on autonomous systems that need to operate in dynamic, unpredictable environments, this is the kind of intelligence that matters. See how it’s built—get early access at /early-access.


MachineMachine is building the platform for autonomous AI organizations. Early access →