Developer Creativity vs Productivity in the AI Era

Q: Why does creativity matter more than productivity in software engineering now?

Because AI made the productive part of the job cheap. When an assistant can write the function, the test, and the glue code in seconds, the speed of producing code stops being a scarce resource, and scarce resources are what differentiate people. What AI cannot supply is the judgment about which code is worth writing, the taste to keep a system coherent, and the creativity to reframe a problem so most of the work disappears. Productivity is now the baseline everyone shares; creativity is the part that still separates a valuable engineer from an expensive one.

Q: How should you measure software engineering effectiveness with AI?

Measure outcomes and judgment, not motion. Effectiveness is whether the right problem got solved, whether the system stayed comprehensible, and whether the work created leverage for other people, not how many commits or story points were logged. The SPACE framework from Microsoft Research is useful here precisely because it refuses to reduce productivity to a single activity number. The honest version of measurement asks "did we build the right thing, and is the system better for it?" rather than "how fast did the keyboard move?"

Q: Is developer productivity still important with AI?

Yes, but it has changed from a differentiator into a baseline. Shipping working software efficiently still matters the way being able to swim matters in a triathlon: it is table stakes, not the thing that wins the race. Because AI raises everyone's throughput at once, productivity gains compress into a band where almost everyone looks fast. The competitive question moves up a level: given that everyone can now produce code quickly, who chooses the right things to produce? That is where the remaining advantage lives.

Q: What skills make a developer valuable when AI writes the code?

Problem selection, system taste, and judgment under ambiguity. Problem selection is knowing what not to build, which becomes more valuable as the cost of building drops. Taste is keeping a codebase coherent when an assistant will happily generate ten plausible-but-divergent ways to do the same thing. Judgment is deciding when "good enough" is genuinely good enough and when it is a trap. None of these are typing skills, and none of them are things you can hand to a model, which is exactly why they hold their value.

Q: What replaces velocity metrics for engineers in the AI era?

Replace velocity with leverage and decision quality. Track problems reframed or removed, abstractions killed before they metastasized, and the work that other people no longer had to do because of a senior engineer's call. These are harder to count than deployment frequency, which is precisely why they resist gaming. A velocity number tells you the team moved; a leverage signal tells you the team moved in a direction worth moving. Goodhart's Law guarantees that any easy-to-game number will eventually be gamed, so the metrics worth keeping are the ones that are inconvenient to fake.

Thomas Prommer Technology Executive & CTO Connect on LinkedIn

Published: May 30, 2026

Updated: June 12, 2026

75% of enterprise engineers on AI code assistants by 2028 (Gartner)

<10% were using them in early 2023; the baseline moved fast

1975 the year Goodhart wrote the law that explains the rest

1 metric that still measures the thing once AI can game the others: judgment

Key Takeaways

Throughput became a commodity. — When an AI assistant can produce the function, the test, and the boilerplate in seconds, the speed of writing code stops being scarce. What stays scarce is deciding which function is worth writing at all.
Your scorecard is now measuring the wrong half. — Deployment frequency, story points, and lines shipped tracked the commodity part of the job. The moment AI can inflate every one of them, Goodhart's Law turns each into a target that stops measuring the thing.
Problem selection is the leverage. — The most expensive code is the code that should never have been written. AI lowers the cost of building, which raises the cost of building the wrong thing. Choosing what not to build is the skill that compounds.
Measure judgment, not motion. — Reward the engineer who deleted a roadmap item, killed a bad abstraction, or reframed the problem so half the work disappeared. None of that shows up on a velocity chart, and all of it is what you actually want more of.

A few years ago I set the engineering scorecard for a platform team I was running. We tracked the usual things: deployment frequency, lead time, story points burned, pull requests merged. It felt rigorous. The numbers went up and to the right, the dashboards were green, and for a while I believed the dashboards. Then I noticed that the most valuable engineer on the team was the one whose numbers were unremarkable. He did not merge the most PRs. What he did was walk into a planning meeting, listen for ten minutes, and say "we shouldn't build this." He was right often enough that he probably saved the company more money in deleted roadmap items than the rest of us earned in shipped ones. My scorecard had no column for him.

I have been turning that over ever since, and AI has turned a quiet doubt into a structural problem. When a coding assistant can produce the function, the test, and the boilerplate in seconds, the part of the job my scorecard measured just became cheap. Gartner's read on this is blunt: in its Software Engineering 2030 research, the firm argues that AI should amplify a developer's creativity rather than merely speed up delivery, because generative models cannot replicate the creativity, critical thinking, and problem-solving that engineers bring. I think that is the most important sentence anyone has written about our profession in a decade, and most engineering orgs are still measuring the opposite of it.

When productivity becomes a commodity

Here is the shift in one line: AI did not make engineers more productive so much as it made productivity ordinary. Gartner expects 75% of enterprise software engineers to be using AI code assistants by 2028, up from fewer than one in ten in early 2023. When a capability moves from rare to near-universal in five years, it stops being an edge. It becomes a floor. Everyone gets faster at roughly the same time, which means relative speed, the thing a velocity chart actually captures, compresses toward meaninglessness.

This is the position worth sitting with. Productivity has not become unimportant. It has become a commodity. And the rule with commodities is that you do not win by producing more of them; you win by deciding what to do with the cheap supply. Cement is a commodity. Nobody differentiates a building on the price of cement. They differentiate it on what the architect chose to do once the cement got cheap.

I keep reaching for the endurance comparison because it is the cleanest one I know. In a triathlon, everyone in the field can swim, bike, and run. Raw fitness is the entry fee, not the advantage. The athletes who win the close races win on pacing judgment: knowing when to spend and when to hold, reading the day's conditions, choosing which surge to answer and which to ignore. Hand every competitor a faster bike and you have not changed who wins; you have just raised the floor and made judgment matter more, because now the race is decided by smaller margins that fitness alone can no longer cover. AI is the faster bike. It does not settle the race. It moves the decisive skill up a level, to the part of the work that was always about choices rather than output.

The principle: Goodhart's Law

There is a name for what goes wrong when you keep optimizing the old number into this new world. Charles Goodhart wrote it in 1975: when a measure becomes a target, it ceases to be a good measure. Velocity metrics were always a little vulnerable to this. A team that gets graded on story points learns to inflate story points. But the vulnerability was tolerable because gaming the number took real effort. AI removed the effort.

Think about what an assistant does to a velocity scorecard. Commits per week? Trivial to inflate. Lines of code? An assistant will generate a thousand lines as happily as ten. Pull requests merged? Spin them up faster than anyone can review them. Deployment frequency? Ship the cheap, safe, low-value changes that move the counter and skip the hard, ambiguous work that does not. Every metric we used as a proxy for "this engineer is doing valuable work" is now a metric an assistant can satisfy without the value ever showing up. The number stopped measuring the thing the instant the thing got cheap to fake.

This is why I no longer trust a green velocity dashboard the way I used to. A team can look maximally productive on every quantitative axis while shipping a quarter of work that should have been one deleted ticket and a conversation. The dashboard is not lying about activity. It is lying about value, because the link between the two was severed the moment AI could supply activity on demand.

What stays scarce

If throughput is the commodity, the question becomes: what did not get commoditized? Three things, in my experience, and none of them fit in a counter.

The first is problem selection. The most expensive code in any company is the code that should never have been written. It has to be maintained, secured, migrated, and reasoned around forever, and it earns nothing. AI lowered the cost of building, which mechanically raised the relative cost of building the wrong thing. The engineer who kills a feature in planning is now worth more than the engineer who ships it flawlessly, because the flawless shipping is the part a model can help with and the killing is not. My unremarkable-numbers engineer was doing the highest-leverage work on the team, and my scorecard was structurally blind to it.

The second is taste. An assistant will cheerfully generate ten plausible, divergent ways to solve the same problem, and it has no opinion about which one keeps your system coherent a year from now. Coherence is a human judgment about a thousand small consistencies that no individual decision makes obvious. The engineer who keeps a codebase legible while ten people and three models all contribute to it is doing something AI actively makes harder, not easier. Taste is the counterweight to generation, and generation just got free.

The third is judgment under ambiguity: knowing when "good enough" is genuinely good enough and when it is a trap, when to spend a week on resilience and when to ship and watch. This is the part of senior engineering that was never really about code. It was about reading a situation. AI does not read situations. It completes patterns.

What I measure and reward instead

So I stopped grading motion and started trying to grade judgment, which is harder and less tidy and far more honest. I do not have a clean four-metric dashboard to hand you, and I have grown suspicious of anyone who does. But the direction is clear enough to act on.

There is a second reason to distrust velocity, separate from Goodhart, and Will Larson states it cleanly in his systems model of LLM adoption. Model the delivery pipeline as a flow — tickets moving through coding, testing, and deployment, with errors running back from each stage — and the math says the throughput dial barely matters: "the constraint on this system is errors discovered in production, and any technique that changes something else doesn't make much of an impact." So even an honest, ungameable velocity number would be the wrong thing to chase, because raising production speed is not what raises shipped value. Worse, as Larson puts it, "any approach that increases development velocity while also increasing production error rate is likely net-negative." Velocity is not just gameable; it is the wrong lever. The number that actually tracks the work is closer to the rate of errors a team keeps out of production over time, and the judgment that does that is exactly the part no model supplies.

Reward decisions avoided. The roadmap item that got deleted, the abstraction that got killed before it metastasized, the integration that someone argued the company out of needing: these are real, high-value outputs that produce no commits. I now ask leads, in review, what their team chose not to build this quarter and why. The answers are more diagnostic of engineering quality than any burndown chart I have ever read.

Reward leverage. The senior engineer's job is increasingly to create capacity for everyone around them: the design that let three teams stop duplicating work, the framing that turned a six-month project into a six-week one. Leverage is inconvenient to count, which, by Goodhart, is exactly what makes it worth tracking: a number that is hard to fake stays attached to the thing it measures. The SPACE framework from Microsoft Research is useful here for the same reason. It deliberately refuses to collapse productivity into one activity metric, because its authors understood that the single, convenient number is the one that breaks first.

And reward taste explicitly, even though it feels unrigorous to put it on a scorecard. I would rather promote the engineer whose systems other engineers find pleasant to work in than the one whose commit graph looks busiest. The first is creating durable value. The second may just be the person most willing to let a model type for them.

What creativity actually looks like in engineering

When people hear "creativity" in a software context they tend to picture something flashy: a clever algorithm, a novel architecture, a flash of inspiration. That is the least of it. The creativity that matters most in engineering is quieter and more consequential: it is the ability to see a problem from an angle nobody else in the room is standing at. The engineer who realizes the reporting feature three teams are scoping is really a caching problem, and solves it once instead of three times. The one who notices that the "we need a new service" conversation is actually a "we need to delete two services" conversation. This is creative work in the strict sense, generating a possibility that did not exist before someone thought of it, and it is precisely the kind of leap a pattern-completing model does not make, because the model can only recombine the framings it was given.

There is a reason this maps onto problem-framing rather than problem-solving. AI is extraordinary at solving a well-posed problem and useless at noticing that the problem is mis-posed. Posing the problem correctly is most of the value, and it has always been the part that separated the engineers I wanted to keep from the ones I could replace. AI has not changed that hierarchy. It has sharpened it, by automating the bottom of it and leaving the top exactly where it was. The work that survives commoditization is the work that decides what the work should be.

The trap of false objectivity

The hardest thing about moving from velocity to judgment is that velocity feels objective and judgment feels like opinion. A deployment-frequency number does not argue back. It sits on a dashboard looking like physics. Promotion committees love it precisely because it lets them avoid the uncomfortable work of forming a view about whether someone's contributions were any good. But that objectivity was always a costume. A number that is easy to produce and easy to defend is not the same as a number that is true, and the SPACE authors made exactly this point years before AI forced the issue: any single activity metric, treated as the measure of productivity, will mislead you about the thing you actually care about.

I would rather defend a subjective, well-reasoned judgment about an engineer's value than hide behind an objective number that measures the wrong quantity precisely. The second feels safer and is more wrong. A measurement system that optimizes for what is easy to count will reliably produce a team that is good at being counted, which is not the same as a team that is good. The discipline is to keep asking whether the number still points at the thing you wanted, and to be willing to throw the number out when it stops, because trusting the convenient metric over your own judgment is how you ship constantly and build the wrong product.

The uncomfortable part for engineering leaders

The move from measuring productivity to measuring judgment is uncomfortable precisely because productivity was easy and judgment is not. Velocity metrics gave us the comforting fiction that engineering effectiveness was countable, and we built whole performance-review systems on that fiction. Those systems are now optimizing for the one part of the job that stopped being scarce. Every engineer who games them with an assistant is behaving rationally; the incentive is broken, not the person.

I think the honest path forward is to accept that the most valuable engineering work has always resisted measurement, and that AI has finally made that resistance impossible to ignore. The organizations that keep rewarding velocity will fill up with people who produce a great deal of cheap, fast, gameable output. The ones that learn to recognize and reward creativity, taste, and problem selection will keep the engineers who do the part of the work no model can do. I have written more on where this leaves the discipline as a whole in my piece on the future of software engineering, and on why cheaper code does not mean less of it in Jevons paradox in software development. The structural counterpart to all of this is how you shape the org itself, which I get into in team topologies in the AI era. For the broader market view, the team at We The Flywheel covers it in the future of software development.

My old scorecard had no column for the engineer who said "we shouldn't build this." That was a flaw when I wrote it. Now it is a liability. The whole point of the job is moving up to where his work lives, and the measurement has to follow, because the number stopped measuring the thing, and pretending otherwise just rewards the people best at gaming what no longer matters.

Frequently Asked Questions

Why does creativity matter more than productivity in software engineering now?

Because AI made the productive part of the job cheap. When an assistant can write the function, the test, and the glue code in seconds, the speed of producing code stops being a scarce resource, and scarce resources are what differentiate people. What AI cannot supply is the judgment about which code is worth writing, the taste to keep a system coherent, and the creativity to reframe a problem so most of the work disappears. Productivity is now the baseline everyone shares; creativity is the part that still separates a valuable engineer from an expensive one.

How should you measure software engineering effectiveness with AI?

Measure outcomes and judgment, not motion. Effectiveness is whether the right problem got solved, whether the system stayed comprehensible, and whether the work created leverage for other people, not how many commits or story points were logged. The SPACE framework from Microsoft Research is useful here precisely because it refuses to reduce productivity to a single activity number. The honest version of measurement asks "did we build the right thing, and is the system better for it?" rather than "how fast did the keyboard move?"

Is developer productivity still important with AI?

Yes, but it has changed from a differentiator into a baseline. Shipping working software efficiently still matters the way being able to swim matters in a triathlon: it is table stakes, not the thing that wins the race. Because AI raises everyone's throughput at once, productivity gains compress into a band where almost everyone looks fast. The competitive question moves up a level: given that everyone can now produce code quickly, who chooses the right things to produce? That is where the remaining advantage lives.

What skills make a developer valuable when AI writes the code?

Problem selection, system taste, and judgment under ambiguity. Problem selection is knowing what not to build, which becomes more valuable as the cost of building drops. Taste is keeping a codebase coherent when an assistant will happily generate ten plausible-but-divergent ways to do the same thing. Judgment is deciding when "good enough" is genuinely good enough and when it is a trap. None of these are typing skills, and none of them are things you can hand to a model, which is exactly why they hold their value.

What replaces velocity metrics for engineers in the AI era?

Replace velocity with leverage and decision quality. Track problems reframed or removed, abstractions killed before they metastasized, and the work that other people no longer had to do because of a senior engineer's call. These are harder to count than deployment frequency, which is precisely why they resist gaming. A velocity number tells you the team moved; a leverage signal tells you the team moved in a direction worth moving. Goodhart's Law guarantees that any easy-to-game number will eventually be gamed, so the metrics worth keeping are the ones that are inconvenient to fake.

For CTOs & Tech Leaders

Need Expert Technology Guidance?

20+ years leading technology transformations. Get a technology executive's perspective on your biggest challenges.

Schedule Consultation View Tech Guides