Engineering Metrics Anti-Patterns

Velocity theater is real and most engineering teams are running it. Here are the metrics that actively damage culture and what to track instead.

Tech Talk News Editorial · 6 min read
Tags: engineering metrics, engineering culture, devops, dora metrics, team health

The problem with measuring engineering isn't that it's impossible. It's that most organizations reach for the metrics that are easy to collect rather than the ones that are useful. Lines of code, story points, PR count per week, hours online. These are all measurable. They're also all gaming traps, and the teams that figure that out early become noticeably better than the ones that don't.

Most engineering metrics do more harm than good. I'll defend that. A metric that can be gamed without improving the underlying outcome isn't a neutral measurement tool. It's an incentive structure that shapes behavior, and if the metric doesn't map to what you actually care about, that incentive structure is actively working against you. Engineers are smart people. They will optimize for whatever you measure, and if what you measure is easy story points, you'll get lots of easy story points. Understanding why these metrics fail matters as much as knowing what to replace them with.

Story Points: Velocity Theater

Story points were designed as a planning tool -- a relative measure of effort that helps teams forecast how much work they can take on in a sprint. They were never designed as a performance metric, and the practitioners who originated them have been consistent about this: Ron Jeffries, who may have invented story points, has publicly said he regrets how they came to be used. When organizations start comparing velocity across teams, treating it as a productivity KPI, or using it to justify headcount decisions, two things reliably happen.

First, point inflation. Teams learn that high-velocity sprints are rewarded, so estimates drift upward. The 1-point card from last year is now a 3-point card, not because the work got harder but because the metric started mattering. This happens gradually and often without conscious manipulation.

Second, quality degradation. Teams optimize for closing cards, not for the quality of what they ship. Reviews become rubber stamps. Testing gets rushed. Technical debt accumulates faster than it would in a team optimizing for different signals. The velocity number looks good. The system degrades.

The specific harm of velocity comparison across teams is even more toxic. Teams working on complex, high-uncertainty work -- infrastructure refactoring, architectural changes, debugging hard bugs -- will always produce fewer points than teams doing greenfield feature work. Comparing them on velocity rewards the wrong kind of work and punishes teams that are solving the hardest problems.

Velocity in story points is almost always useless as a performance signal. If it shows up in your metrics, the right question isn't how to improve it. It's whether you should be measuring it at all.

Lines of Code: Measuring the Wrong Output

Lines of code as a productivity measure has been ridiculed since at least the 1980s. Bill Gates noted that measuring software productivity by lines of code is like measuring aircraft manufacturing progress by weight. The observation is correct. Yet engineering dashboards at some organizations still include commit counts, file change counts, and "code output" metrics that are essentially proxy measures for LOC.

The incentive this creates: engineers write more code than necessary. Refactors that would reduce a 500-line module to 200 lines don't register as positive on LOC metrics. Clever algorithms that replace brute-force approaches look bad. The engineer who deletes 2,000 lines of dead code and improves system reliability has produced negative value by this measure. Some of the best engineering work I've seen involved deleting code, not writing it.

The only output that matters in software is working software that solves real problems reliably. That has essentially zero correlation with lines of code.

PR Count and Merge Rate

PR count as a productivity signal incentivizes small, easy PRs over well-structured, appropriately-sized changes. Engineers gaming this metric produce a stream of trivial PRs -- fix a typo, update a config, rename a variable -- that consume reviewer attention without delivering proportional value.

The more insidious version is merge rate: how many PRs does the team merge per week? This metric causes reviewers to approve PRs quickly rather than carefully. The review process, one of the highest-value quality gates in software engineering, gets degraded because the incentive structure rewards throughput over thoroughness.

There's a legitimate version of PR cycle time as a metric -- how long does a PR spend open before merging -- that's worth tracking as a developer experience signal. A median PR cycle time of four hours is fine. A P95 of five days tells you there's a code review bottleneck somewhere. But cycle time is about surfacing friction, not rewarding volume.
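The median/P95 split is easy to compute once you have open and merge timestamps for each PR. A minimal sketch, assuming you've already pulled (opened_at, merged_at) pairs from your forge's API; the PR data below is invented for illustration:

```python
import math
from datetime import datetime
from statistics import median

# Hypothetical PR records: (opened_at, merged_at) pairs.
prs = [
    (datetime(2024, 5, 1, 9),  datetime(2024, 5, 1, 13)),  # 4 hours
    (datetime(2024, 5, 2, 10), datetime(2024, 5, 2, 12)),  # 2 hours
    (datetime(2024, 5, 3, 9),  datetime(2024, 5, 8, 9)),   # 5 days
    (datetime(2024, 5, 4, 14), datetime(2024, 5, 4, 18)),  # 4 hours
]

cycle_hours = [(merged - opened).total_seconds() / 3600 for opened, merged in prs]

def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct% of the data."""
    ordered = sorted(values)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

# The median shows the typical experience; the P95 surfaces the stalls.
p50 = median(cycle_hours)
p95 = percentile(cycle_hours, 95)
print(f"median: {p50:.1f}h, p95: {p95:.1f}h")  # median: 4.0h, p95: 120.0h
```

The point of reporting both numbers: a healthy median with a terrible P95 is exactly the "most PRs are fine, some rot in review for a week" pattern that volume metrics hide.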

Code Coverage as a Correctness Proxy

Code coverage -- what percentage of your codebase is executed by the test suite -- is a useful minimum threshold, not a quality measure. A codebase with 90% coverage can still have catastrophic bugs if the tests are poorly written. The metric incentivizes writing tests that execute code paths without making meaningful assertions about correctness.

The most common anti-pattern: an engineer adds a test that calls a function and asserts that it doesn't throw an exception. The function is now "covered." The test adds no protection against regressions in the function's actual behavior.
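The difference is easy to see side by side. A sketch with a hypothetical `apply_discount` function: both tests below produce identical coverage numbers, but only one protects the behavior:

```python
# Hypothetical function under test.
def apply_discount(price: float, percent: float) -> float:
    if not 0 <= percent <= 100:
        raise ValueError("percent out of range")
    return price * (1 - percent / 100)

# Coverage-only test: executes the happy path, so the function reads as
# "covered" -- but a regression that returns the wrong amount still passes.
def test_apply_discount_runs():
    apply_discount(100.0, 50.0)  # no assertion on the result

# Behavioral test: pins down the outputs that actually matter.
def test_apply_discount_values():
    assert apply_discount(100.0, 50.0) == 50.0
    assert apply_discount(200.0, 25.0) == 150.0
    try:
        apply_discount(100.0, 150.0)
        assert False, "expected ValueError for an out-of-range percent"
    except ValueError:
        pass
```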

Coverage thresholds create another problem: they're easy to game and produce perverse incentives around the marginal test. Engineers writing tests to hit a coverage number write different tests than engineers writing tests to prevent regressions in behavior they care about. You want the second kind of engineer.

DORA Metrics: The Best Starting Point That Can Also Be Gamed

DORA metrics -- deployment frequency, lead time for changes, change failure rate, and recovery time -- are the best-validated quantitative measures of software delivery performance. They're correlated with actual business outcomes in a way that the anti-pattern metrics are not. The research is longitudinal, broad, and peer-reviewed. Start here.
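Two of the four numbers are trivial to compute once you've decided what counts as a deployment and what counts as a failure. A minimal sketch over a hypothetical 30-day deploy log (the field names and classifications are invented; note that the incident classification is itself a judgment call):

```python
from datetime import date

# Hypothetical 30-day deployment log.
deploys = [
    {"day": date(2024, 6, 3),  "caused_incident": False},
    {"day": date(2024, 6, 5),  "caused_incident": True},
    {"day": date(2024, 6, 10), "caused_incident": False},
    {"day": date(2024, 6, 17), "caused_incident": False},
    {"day": date(2024, 6, 24), "caused_incident": True},
]

window_days = 30
deploy_frequency = len(deploys) / window_days  # deploys per day
change_failure_rate = sum(d["caused_incident"] for d in deploys) / len(deploys)

print(f"{deploy_frequency:.2f} deploys/day, "
      f"{change_failure_rate:.0%} change failure rate")
```

The arithmetic is the easy part; the definitions feeding it are where the metric lives or dies, as the next paragraph shows.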

They can also be gamed. I've seen teams inflate deployment frequency by splitting deployments into trivially small changes that don't represent independent units of value. I've seen change failure rate look good because the team classifies incidents conservatively. I've seen lead time reported from "code review approved" to deployment rather than from first commit to deployment, which understates the actual pipeline friction.
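The lead-time trick in particular is easy to see with numbers. A sketch with an invented timeline for one change:

```python
from datetime import datetime, timedelta

# Hypothetical timeline for a single change.
first_commit    = datetime(2024, 6, 3, 9, 0)
review_approved = datetime(2024, 6, 6, 16, 0)
deployed        = datetime(2024, 6, 6, 17, 0)

# Honest measurement: first commit -> production.
lead_time_full = deployed - first_commit      # 3 days, 8 hours

# Gamed measurement: approval -> production. It hides the three days
# the change spent in development and review.
lead_time_gamed = deployed - review_approved  # 1 hour

print(f"full: {lead_time_full}, gamed: {lead_time_gamed}")
```

Same pipeline, same change, an 80x difference in the reported number, depending only on where you start the clock.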

DORA metrics are useful because they're directionally right, not because they're precise. Use them as navigation. Don't treat them as scorekeeping. A metric that doesn't trigger an action when it moves is a metric you should sunset.

What to Measure Instead

For team health, the SPACE framework adds dimensions that DORA misses: engineer satisfaction, collaboration quality, and cognitive load. These require surveys and qualitative input. Quarterly targeted surveys of 5-8 questions about specific friction points produce more signal than annual all-hands NPS questions. Ask specific questions: "How confident are you that your changes won't break something unrelated?" produces signal. "Is the developer experience good?" produces noise.

For individual performance evaluation -- which is different from team health measurement -- the most useful signals are scope of impact, quality of technical decision-making over time, effectiveness as a collaborator, and how the team around them improves. These require judgment and conversation, not dashboards. The instinct to make individual performance measurement more quantitative is understandable but counterproductive. The dimensions that matter most for a senior engineer don't fit in a spreadsheet.

The principle that should guide metric selection: track the things you actually care about, and resist the pressure to make the measurement simpler than the reality. The teams that get this right stop measuring what's easy to measure and start measuring what actually predicts whether software is getting better. A system that looks healthy on three bad metrics is not a healthy system. A team that's improving on the four DORA dimensions and reports high satisfaction in surveys is almost certainly getting better at shipping software that works. That's the target state.
