Benchmaxxing Is the New Vanity Metric

Thomas Prommer, Technology Executive & AI Strategy Consultant

Published: May 24, 2026

Updated: June 6, 2026

1,000 monthly searches for "bench maxing"

70 monthly searches for "benchmaxxing"

2 completely different meanings, same word

0 benchmarks that survive being optimized for

Key Takeaways

Two meanings, same pattern. The gym bro chasing a bench press 1RM and the AI lab chasing an MMLU score are both benchmaxxing: optimizing for a visible number that has stopped measuring what it was supposed to measure.
The gym version is at least honest. Nobody in the gym pretends a 1RM bench press proves functional strength. In AI, labs present gamed benchmark scores as proof of capability, and enterprise buyers make procurement decisions on them.
Goodhart's Law runs everything. Lines of code. Story points. NPS scores. Bench press PRs. MMLU percentages. Token consumption counts. The pattern repeats because humans are wired to optimize whatever number is visible, and visible numbers are always gameable.
The vocabulary is the real contribution. Tokenmaxxing names the input side of AI's measurement crisis. Benchmaxxing names the output side. Together they give the industry a way to talk about Goodhart failures without reaching for a 1975 economics paper every time.

Two Meanings, Same Pattern

If you search "bench maxing" today, you get two completely different worlds. The first is gym culture: forums, Reddit threads, and YouTube videos about maximizing your bench press one-rep max. The second, smaller but growing, is AI: researchers and engineers writing about the practice of optimizing models for leaderboard scores rather than for doing useful work. The word is spelled differently by each camp — single x for the lifters, double x for the "-maxxing" internet suffix — but they are describing the same behavior.

In both cases, someone is pursuing a visible number past the point where the number still measures the thing it was supposed to measure. The lifter who arches until his range of motion is three inches has technically benched 405, but the number has stopped telling you anything about his pressing strength. The lab that fine-tunes on MMLU-format data and cherry-picks checkpoints has technically scored 89.9%, but the number has stopped telling you anything about the model's capability on your tasks. Both are benchmaxxing. The metric became the target and stopped being a good metric.

Charles Goodhart wrote the law that explains this in 1975, and the internet has been generating new examples of it ever since.

The Gym Bro and the AI Lab

The parallel is not just metaphorical. The social dynamics are structurally identical.

In the gym, ego lifting is a known phenomenon. The lifter selects weight based on what the audience — or the Instagram post — requires rather than what produces useful training stimulus. The incentive is social signaling: looking strong in front of other lifters. The bench press is the canonical example because it is the most public lift — everyone can see how much weight is on the bar. The result is a subculture of lifters who can produce a single maximal rep under competition conditions but whose pressing strength, shoulder stability, and actual athletic capability do not match the number they claim.

In AI, the dynamic is the same but the incentive is capital acquisition rather than social approval. Labs select their evaluation methodology based on what the audience — press, VCs, enterprise buyers — requires. Benchmark scores are the most public metric. They appear in announcements, comparison tables, and procurement documents. The result is an industry of models that produce impressive numbers on published tests but whose practical capability, reliability, and consistency do not match the scores they claim.

The gym bro at least knows he is competing. The lift has rules, judges, and a community that can distinguish a legitimate rep from a bounced, half-range ego set. The AI lab presents its benchmaxxed scores as science. There are no judges. The evaluation methodology is chosen by the same lab that trained the model. The journalist writing the article and the VP filling out a vendor comparison spreadsheet have no way to distinguish a legitimate evaluation from a gamed one.

That difference in self-awareness is the gap between a sport and a con.

Goodhart's Law Runs Everything

The benchmaxxing pattern is not new. It recurs in every field that adopts quantitative metrics because humans are structurally incapable of leaving a visible number un-optimized. The history is consistent enough to be a law, and Goodhart's Law, in its usual paraphrase, states it: when a measure becomes a target, it ceases to be a good measure.

Lines of code. Early software management measured programmer productivity by lines of code written. Programmers responded by writing verbose code. The metric was abandoned when managers noticed that the most productive engineers often produced the fewest lines because they solved the problem more elegantly. Lines of code measured output volume, not output quality, and optimizing for it produced worse software.

Story points. Agile teams adopted story points as a complexity estimate. Within months, teams started inflating point estimates to look more productive in sprint reviews. Points stopped measuring complexity and started measuring the team's relationship with its manager. Many experienced engineering leaders now treat point velocity as a team health signal, not a productivity metric, because the number has been benchmaxxed past the point of usefulness.

NPS scores. Net Promoter Score was designed as a proxy for customer loyalty. Companies started gaming it by survey timing (ask right after a positive interaction), by filtering (exclude detractors from the sample), and by incentivizing (offer discounts for high ratings). A company reporting an NPS of 70 might have genuinely loyal customers or might have a well-tuned survey pipeline. The number no longer distinguishes between the two.
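The filtering effect is easy to see with arithmetic. Here is a minimal sketch, assuming the standard NPS definition (percentage of promoters scoring 9 or 10, minus percentage of detractors scoring 0 to 6). The `nps` helper and the survey responses are hypothetical, made up purely for illustration.

```python
def nps(scores):
    """Net Promoter Score: percent promoters (9-10) minus percent detractors (0-6)."""
    if not scores:
        raise ValueError("need at least one response")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


# Hypothetical survey: 40 promoters, 30 passives, 30 detractors.
responses = [10] * 40 + [8] * 30 + [3] * 30

# Honest number: every response counts.
honest = nps(responses)

# Gamed number: quietly drop 20 of the 30 detractors from the sample.
filtered = [s for s in responses if s > 6] + [3] * 10
gamed = nps(filtered)

print(honest)  # 10.0
print(gamed)   # 37.5
```

Same customers, same answers, and the reported score nearly quadruples. Nothing about loyalty changed; only the sample did.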

Bench press PRs. A one-rep max was designed to measure peak pressing force. Competitive powerlifting rules allow extreme back arching that reduces range of motion to a few inches, turning the lift into a test of arch flexibility rather than pressing strength. A lifter who benches 500 pounds with a three-inch range of motion and a lifter who benches 400 with a full range are performing different movements that happen to share a name and a scoring system.

MMLU scores. MMLU was designed to measure general knowledge across academic domains. Labs responded by training on multiple-choice question formats, decontaminating selectively (or not at all), and cherry-picking checkpoints. A model scoring 89.9% might be genuinely knowledgeable or might be well-calibrated to the specific format of MMLU's 57-subject question bank. The score no longer reliably distinguishes between the two.
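Checkpoint cherry-picking alone can inflate a score, even with no contamination. The toy simulation below is a sketch under loud assumptions: every checkpoint has the same hypothetical true accuracy (0.80), the benchmark has 1,000 independent questions, and the only difference between checkpoints is sampling noise. None of these numbers come from a real lab or a real benchmark run.

```python
import random


def benchmark_score(true_accuracy, n_questions, rng):
    """Fraction correct on one benchmark run for a model with a fixed true accuracy."""
    correct = sum(rng.random() < true_accuracy for _ in range(n_questions))
    return correct / n_questions


def simulate(true_accuracy=0.80, n_questions=1000, n_checkpoints=20, seed=0):
    """Score several equally capable checkpoints on the same noisy benchmark."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    return [
        benchmark_score(true_accuracy, n_questions, rng)
        for _ in range(n_checkpoints)
    ]


scores = simulate()
average = sum(scores) / len(scores)  # what an honest report would be close to
best = max(scores)                   # what a cherry-picked report shows

print(f"average over checkpoints: {average:.3f}")
print(f"cherry-picked best:       {best:.3f}")
```

The maximum of several noisy measurements is never below their average, so reporting only the best checkpoint overstates the score by construction. The gap grows with more checkpoints and shrinks with more questions.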

Token consumption. Companies like Meta and Shopify tracked token consumption as a proxy for AI adoption. Engineers responded by running bots that burn tokens in loops. Tokenmaxxing is the input-side version of the same Goodhart pattern — a metric that was supposed to measure engagement and ended up measuring gaming.

Developer velocity. Engineering orgs adopted deployment frequency, story points, and lines-shipped as proxies for output, and AI coding tools made every one of them trivial to inflate. Will Larson's systems model of LLM adoption shows the deeper failure underneath the gaming: even an honest velocity number is the wrong target, because "the constraint on this system is errors discovered in production, and any technique that changes something else doesn't make much of an impact." Worse, "any approach that increases development velocity while also increasing production error rate is likely net-negative." Velocity is benchmaxxing for the delivery pipeline — a number that looks like progress while the thing it was meant to track, working software actually shipped, stays flat or slips backward.

The pattern is always the same. Someone invents a proxy because the real thing is hard to measure. The proxy works until people start optimizing for it. Then the proxy drifts from the thing it was supposed to represent, and eventually it measures mostly its own optimization. The original question goes unanswered. How strong is this lifter? How capable is this model? Nobody knows, but the number looks great.

What I Actually Do Instead

I should admit that I benchmaxxed my own model selection for longer than I would like to recall. I spent weeks comparing leaderboard rankings when I was choosing between Claude, GPT-4, and Gemini for agentic coding workflows. The model I ranked highest on paper hallucinated through a database migration in production. The model I had ranked third handled the same migration cleanly on the first pass. That was the week I stopped reading leaderboards and started building private eval sets from my actual tasks.

The practical consequence of understanding benchmaxxing is that you stop trusting leaderboards and start trusting your own evaluations. This does not mean benchmarks are worthless. They serve a real purpose as internal regression tests and as coarse filters for eliminating models with obvious gaps. The problem is not the benchmark. The problem is treating an internal calibration tool as an external ranking system.

For AI models, I evaluate by shipping with them. I do not care what a model scores on MMLU. I care whether it can complete a multi-file refactor in my codebase without introducing regressions and whether its code review catches real bugs. Can it follow my instructions, or does it drift into whatever the RLHF reward model rewards? The model has never seen my codebase or my prompts in its training data, which makes this the one evaluation that cannot be gamed. Build a private eval set from your actual work. Fifty tasks. Score them yourself. That is worth more than every public leaderboard combined.
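A private eval set does not need infrastructure. The sketch below shows one possible shape, not a description of my actual tooling: `Task`, `evaluate`, `pass_rate`, and `stub_model` are all hypothetical names, the two tasks are placeholders, and the automated `passes` checks stand in for the judgment you would apply when scoring by hand.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    name: str
    prompt: str
    passes: Callable[[str], bool]  # your own pass/fail criterion for this task


def evaluate(model: Callable[[str], str], tasks: list[Task]) -> dict[str, bool]:
    """Run every task through the model and record pass/fail per task."""
    return {t.name: t.passes(model(t.prompt)) for t in tasks}


def pass_rate(results: dict[str, bool]) -> float:
    """Share of tasks that passed."""
    return sum(results.values()) / len(results)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; swap in your own client here."""
    if "migration" in prompt:
        return "ALTER TABLE users ADD COLUMN email TEXT;"
    return ""


# Placeholder tasks; a real set would be drawn from your own work.
tasks = [
    Task(
        name="add-column-migration",
        prompt="Write a migration that adds an email column to users.",
        passes=lambda out: "ADD COLUMN email" in out,
    ),
    Task(
        name="rename-function",
        prompt="Rename fetch_user to get_user in this file.",
        passes=lambda out: "get_user" in out,
    ),
]

results = evaluate(stub_model, tasks)
print(results)             # {'add-column-migration': True, 'rename-function': False}
print(pass_rate(results))  # 0.5
```

Keeping per-task results, not just the pass rate, is the point: the failures tell you which model fits your work, and a single aggregate number would just be a smaller leaderboard.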

For fitness, the equivalent is evaluating by function rather than by competition numbers. I do not care about my bench press 1RM. I care whether I can press my bodyweight for sets of five and whether my shoulders hold up under overhead training. No audience, no leaderboard, no number to optimize for its own sake.

The commonality is the move from public metrics to private evaluation. Public metrics are useful as baselines: a model that cannot score 80% on MMLU probably has real gaps, and a lifter who cannot bench 135 probably has a strength deficit. But once you are past the baseline, the public number tells you almost nothing about fit for your specific context. The only evaluation that matters is the one conducted on your tasks, with your criteria, scored by you.

The Vocabulary Is the Contribution

These terms matter, but not because the concepts are new. Goodhart's Law is fifty years old. Vanity metrics have been criticized since the first software engineer was ranked by lines of code. What is new is the vocabulary.

Tokenmaxxing names the input side of AI's measurement crisis: companies tracking how much AI is consumed rather than how much value it produces. Benchmaxxing names the output side: labs tracking how well models score on tests rather than how well they perform on work. Together, the two words give the industry a compact way to describe the full Goodhart failure mode without having to explain the economics paper every time.

That matters because vocabulary shapes what an industry is able to discuss. Before "technical debt" had a name, teams struggled to explain why they needed to spend time on code that already worked. After the term entered common usage, a developer could say "we have tech debt in the auth module" and the room understood. Naming the problem is half of solving it, because nameable problems get budget and attention while unnamed problems stay invisible.

Benchmaxxing and tokenmaxxing are the technical-debt vocabulary for AI measurement. They name the specific ways the industry's metrics have drifted from the things those metrics were supposed to measure. The next time a VP asks "why shouldn't we just pick the model that tops the MMLU leaderboard?" and the answer is "because the leaderboard has been benchmaxxed," the conversation can move forward to "so how do we evaluate instead?" rather than stalling on a thirty-minute explanation of Goodhart's Law, benchmark contamination, and checkpoint cherry-picking.

The words do the work. That is the contribution.

FAQ

What does benchmaxxing mean?

Benchmaxxing has two meanings depending on context. In gym and fitness culture, it means obsessively maximizing your bench press numbers — chasing a one-rep max as a status metric rather than pursuing balanced functional strength. In AI and machine learning, it means optimizing models specifically for high scores on public benchmarks rather than for real-world capability. Both meanings share the same underlying pattern: pursuing a visible number past the point where the number still measures what it was designed to measure. The "-maxxing" suffix comes from the same internet subculture that produced looksmaxxing and gymmaxxing, where it signals optimization pushed to an extreme.

Is benchmaxxing the same as Goodhart's Law?

Benchmaxxing is a specific instance of Goodhart's Law, which states that when a measure becomes a target, it ceases to be a good measure. In the gym context: bench press weight was originally a measure of upper-body strength, but once lifters started optimizing specifically for 1RM bench numbers (with arch, leg drive, reduced range of motion), the number stopped correlating with functional pressing power. In the AI context: MMLU scores were designed to measure general knowledge, but once labs started fine-tuning on benchmark-format data and cherry-picking evaluation checkpoints, the scores stopped correlating with practical model capability. Both are Goodhart's Law — benchmaxxing is just what it looks like when Goodhart applies to a competitive leaderboard.

What is the difference between benchmaxxing and tokenmaxxing?

Benchmaxxing and tokenmaxxing are two sides of the same measurement problem. Benchmaxxing games output metrics: AI labs inflate what their models appear to be capable of by optimizing for benchmark scores. Tokenmaxxing games input metrics: once companies track token consumption as a proxy for productivity, employees inflate how productive they appear by burning tokens. Benchmaxxing produces fake capability claims; tokenmaxxing produces fake productivity claims. Together they describe the full Goodhart failure mode of AI measurement in 2026. For the factual explainer on benchmaxxing, see the definition on ctaio.dev. For tokenmaxxing, see the companion piece.

Why do AI companies game benchmarks?

Because benchmark rankings drive the three things that matter most to an AI lab: press coverage, enterprise procurement, and fundraising. A model that "tops the MMLU leaderboard" gets a TechCrunch headline that feeds into enterprise RFPs and VC pitch decks. A model that "performs well on a representative sample of customer tasks" does not. The incentive structure rewards benchmark performance regardless of whether the scores translate to real-world capability. Labs are not being irrational — they are optimizing for the metric their market rewards, which is exactly what Goodhart's Law predicts will happen.

How is bench press max similar to AI benchmark scores?

Both are single-number summaries of a complex capability that get treated as the capability itself. A bench press 1RM tells you what someone can lift once under competition conditions — it says nothing about endurance, functional pressing strength, shoulder health, or athletic performance. An MMLU score tells you what a model can do on 57 categories of multiple-choice academic questions — it says nothing about instruction following, nuanced reasoning, code generation quality, or real-world task completion. In both cases, the number is useful as a rough baseline but breaks down as a ranking tool once everyone starts optimizing for it. And in both cases, the people who actually use the capability (athletes, developers) evaluate it differently from the people who rank it (competition judges, leaderboard curators).

Related reading: For the factual AI definition and technical mechanics, see What Is Benchmaxxing? on ctaio.dev. For the companion metric problem on the input side, see the tokenmaxxing definition and my personal tokenmaxxing stack.
