How We Evaluated Training-Data Vendors at Scale

The procurement rubric that survives the sales decks: paid blinded pilots, rework rate as the headline metric, off-ramp clauses, and the question almost no one asks.

Data labellers comparing annotated rows of training data across multiple vendor outputs

Key Takeaways

  • The lane decision precedes the vendor decision — Four lanes: high-end RLHF, large-scale workforce, platform-led, synthetic. Pick the lane first; the vendor field inside a lane is short.
  • Rework rate is the headline metric — Delivery quality is hard to measure honestly. Rework rate is easy to measure and correlates more tightly with eventual model performance than any other vendor-side number.
  • The five stages that actually work — Rubric, shortlist, paid pilot, score, contract with off-ramp. Skip the off-ramp once and you will write this essay yourself.
  • Hybrid combinations are the norm — Premium-plus-workforce, synthetic-plus-edge-case-human, platform-plus-BPO. Pure single-vendor production programmes are rare in this category.

Why training-data procurement is harder than the deck makes it look

Procurement processes for training-data vendors look superficially like software procurement and behave nothing like it. The deliverable is bespoke. The quality definition is project-specific. The supply chain includes humans whose performance varies day to day. The pricing is sales-led and opaque across nine of the ten serious vendors in the category. Most procurement frameworks built for the SaaS market produce the wrong answer when applied here without modification.

I have run versions of this procurement exercise inside organisations of three different sizes, on different timelines, with different shapes of internal expertise available. The patterns are consistent enough that this piece can describe them as a single playbook. The companion vendor-comparison piece on wetheflywheel.com/en/guides/ai-training-data-providers-2026/ covers the field of ten serious vendors and how they split across the four lanes (high-end RLHF, large-scale workforce, platform-led, synthetic). This piece is the procurement-side companion to it.

Stage one: write the rubric before the first vendor call

The mistake I see most often is treating the rubric as something the procurement team will discover during the evaluation. The shape of "what good looks like" emerges from vendor conversations, the criteria get backfilled, and the final scoring is shaped by the vendor that ran the most thorough sales process. That is a defensible way to run an enterprise software evaluation. It is not a defensible way to run a training-data evaluation, because the vendors will anchor the criteria to their structural strengths within the first thirty minutes of every conversation.

The rubric should be on paper before any vendor calls. Six fields:

  1. The task. Specifically what is being labelled, with examples drawn from the production data pipeline.
  2. Languages. The full list, with volume splits if multilingual.
  3. The quality bar. Agreement rate target with the internal reference set, edge-case consistency target, false-positive-tolerance per label class.
  4. Volume and timeline. Total labels, peak weekly capacity, the realistic delivery window.
  5. QA model. Multi-pass, multi-rater, expert adjudication, or a hybrid. Specified explicitly.
  6. Reference set. The blinded ground-truth set the vendor will be scored against. Size and origin specified.
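
Written down, the rubric is small enough to live in a single structured record, which makes it harder to backfill criteria mid-evaluation. A minimal sketch in Python; every field name, threshold, and example value below is illustrative rather than drawn from any specific programme:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProcurementRubric:
    """Written before the first vendor call; frozen so nobody backfills it."""
    task: str                        # what is being labelled, with pipeline examples
    languages: dict[str, float]      # language -> share of total volume
    agreement_target: float          # minimum agreement rate with the reference set
    edge_case_target: float          # minimum consistency on the hard tail
    fp_tolerance: dict[str, float]   # label class -> maximum false-positive rate
    total_labels: int
    peak_weekly_capacity: int
    delivery_weeks: int
    qa_model: str                    # e.g. "multi-rater with expert adjudication"
    reference_set_size: int          # blinded ground truth you control

# Illustrative values only -- every number here is an assumption, not a benchmark.
rubric = ProcurementRubric(
    task="intent classification on support transcripts",
    languages={"en": 0.7, "de": 0.2, "fr": 0.1},
    agreement_target=0.92,
    edge_case_target=0.80,
    fp_tolerance={"refund_request": 0.02, "other": 0.05},
    total_labels=500_000,
    peak_weekly_capacity=60_000,
    delivery_weeks=12,
    qa_model="multi-rater with expert adjudication tier",
    reference_set_size=2_000,
)
```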

The reference set is the load-bearing field. Without a blinded ground-truth set you control, you are scoring vendors against each other and against their own claims rather than against ground truth. The cost of building the reference set is three to five days of senior internal time. It is the most under-budgeted part of the procurement and the part that decides whether the rest of the process produces a defensible answer.

The reference-set diagnostic
If your evaluation does not start with a reference set you control, you are not running a procurement. You are running a sales-led demo tour with a contract at the end.

Stage two: shortlist three vendors per lane

Not five. Not one. Three is the number that gives you negotiating leverage on the eventual production contract, and three forces you to commit to actual differentiation rather than a tour of the category.

The lane decision precedes the shortlist. The four lanes (high-end RLHF, large-scale workforce, platform-led, synthetic) are documented in the WTF vendor comparison. Pick the lane based on the failure mode you most need to engineer around: frontier-model RLHF goes to high-end, multilingual volume goes to workforce, an internal team that needs tooling goes to platform-led, narrow domain at scale goes to synthetic. Once the lane is picked, the vendor field is small; shortlisting three of the three or four real options per lane is a reasonable scope.

Most production programmes I have run end up with vendors from two lanes in parallel, but the procurement exercise is cleaner if it runs lane-by-lane and the hybrid combination decision comes after the pilot stage rather than before it.

Stage three: run paid, blinded pilots

Free pilots run on vendor priorities. Paid pilots run on yours. The cost difference (typically $5k–$30k per vendor, depending on volume) is small relative to the cost of an annual production contract; the information density is much higher.

Run two to three vendors in parallel on identical scope. Two weeks is the right window. The vendor teams have time to set up their workforce properly, the data has time to flow through the QA model, and your internal evaluators have time to inspect the output against the reference set without rushing.

Two questions to ask specifically during the pilot:

  • How does your QA scale with task complexity? Almost no one asks this, and the answer is the single best predictor of production performance.
  • What does your edge-case escalation path look like? The strong vendors have a clear expert-adjudication tier for difficult cases. The weak ones apply a single QA model uniformly and rework rate climbs with complexity.

The single question that separates serious vendors from the rest is what their QA model does when the labels get hard. Most procurement decks never ask it.

Stage four: score against rubric and reference

The scoring conversation should take half a day with two senior people in the room. The headline metric is agreement rate with the reference set. The under-measured metrics are rework percentage and edge-case consistency. The combination of all three predicts production quality better than any single number.

A defensible scoring spreadsheet has six columns:

  • Agreement rate per label class.
  • Rework percentage flagged by the internal review.
  • Edge-case consistency on the harder ten percent of the reference set.
  • Throughput against the timeline commitment.
  • Communication quality during the pilot.
  • A notes field for the structural observations that do not fit into a number.

The total score is informative, but the conversation about what each vendor is genuinely strong at and where the risk is concentrated is more useful than the rolled-up rank.
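
A minimal sketch of the automatic part of that rollup, assuming pilot output and the blinded reference are dictionaries keyed by item ID and labels compare by simple equality; the function and field names are illustrative, and rework percentage and communication quality still come from human review rather than this comparison:

```python
from collections import defaultdict

def score_vendor(pilot, reference, hard_ids):
    """Roll up one vendor's pilot output against the blinded reference set.

    pilot, reference: dict of item_id -> label.
    hard_ids: set holding the hardest ~10% of reference item IDs.
    """
    per_class_hits = defaultdict(int)
    per_class_total = defaultdict(int)
    hard_hits = 0
    for item_id, truth in reference.items():
        per_class_total[truth] += 1
        if pilot.get(item_id) == truth:
            per_class_hits[truth] += 1
            if item_id in hard_ids:
                hard_hits += 1
    return {
        "agreement_by_class": {
            cls: per_class_hits[cls] / total
            for cls, total in per_class_total.items()
        },
        "overall_agreement": sum(per_class_hits.values()) / len(reference),
        "edge_case_consistency": hard_hits / len(hard_ids),
    }
```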

The vendor that scores second on the headline metric and first on the conversation about strengths is often the right pick, because the second-place headline is usually a small gap and the strengths conversation predicts how the relationship will operate at production volume.

Stage five: the production contract (and the off-ramp clause)

The four contract terms that matter most:

  1. SLA on rework. Not on delivery, on rework. The metric the vendor is on the hook for is the rate at which delivered labels need to be redone, not the rate at which labels arrive; otherwise the incentive is to deliver fast and rework slowly. A measurement sketch follows this list.
  2. Capacity commitments. Peak weekly volume the vendor will honour without re-pricing. Most production programmes have a launch spike and the spike capacity is the binding constraint.
  3. Off-ramp clause. Explicit termination terms, data-portability commitments, no exclusivity clauses buried in the schedule of work. This is the step almost everyone skips, and it is the single piece of paperwork that pays for itself every time.
  4. Pricing transparency. Per-label cost broken out by complexity tier. The opaque-quote vendors will resist this; the strong ones already have the breakdown in their internal pricing model. Insist on it.
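
On the first of those terms, it helps to agree the rework measurement with the vendor at signing. A minimal sketch of a weekly SLA check, assuming the contract pins a rework ceiling per complexity tier; the tier names and ceilings are illustrative:

```python
# Illustrative ceilings per complexity tier -- the real numbers come out of
# the pilot and go into the contract schedule, not out of this sketch.
SLA_REWORK_CEILING = {"simple": 0.03, "standard": 0.05, "complex": 0.10}

def check_rework_sla(batches):
    """batches: iterable of (tier, delivered_count, reworked_count) tuples.
    Returns the tiers currently in breach of their contractual ceiling."""
    delivered, reworked = {}, {}
    for tier, done, redone in batches:
        delivered[tier] = delivered.get(tier, 0) + done
        reworked[tier] = reworked.get(tier, 0) + redone
    breaches = {}
    for tier, ceiling in SLA_REWORK_CEILING.items():
        if delivered.get(tier, 0) == 0:
            continue  # no deliveries in this tier yet
        rate = reworked[tier] / delivered[tier]
        if rate > ceiling:
            breaches[tier] = round(rate, 4)
    return breaches

# Example: the complex tier breaches its 10% ceiling.
print(check_rework_sla([("standard", 4000, 150), ("complex", 600, 90)]))
# -> {'complex': 0.15}
```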

Negotiate the off-ramp at signing. The vendor relationship is at its best in the first six weeks. Asking for the off-ramp during the honeymoon does not signal distrust; it signals that you have run this procurement before.

Hybrid combinations that work in practice

Most production programmes I have advised on end up running two or three vendors in parallel, not one. The combinations that recur:

  • Premium + workforce: Scale or Surge for RLHF plus Toloka or Appen for classical labelling. High-end-plus-scale; premium spend on the parts that drive quality, workforce-scale spend on the volume work.
  • Synthetic + human edge cases: Snorkel AI for programmatic labelling plus a workforce vendor for the hard tail. Cuts overall cost by 40% or more on suitable lanes; right when image, text, or code-generation labelling dominates and the edge cases are well defined.
  • Platform + bring-your-own-workforce: Labelbox for tooling plus a BPO or in-house team. Right when you already have annotator capacity and want to upgrade the tooling rather than the people.
  • Regulated-domain expert + volume: iMerit for medical/financial expert work plus Sama or Appen for the routine. Common in healthcare, financial-services, and legal AI programmes where cost per label varies 5–20x by complexity.

The pattern across all four is the same: premium spend on the parts that drive eventual model quality, workforce-scale or synthetic spend on the volume work. Pure single-vendor programmes optimise for procurement simplicity at the cost of total programme economics, and they are rare in mature production stacks.

What I would tell a buyer running this for the first time

  1. The reference set is the load-bearing artefact. Build it before talking to vendors. Make it blinded. Use it consistently across the pilot stage.
  2. Paid pilots, not free trials. Different conversation, different vendor behaviour, different data quality.
  3. The off-ramp clause is the cheapest insurance you will ever buy. Write it in.
  4. Rework rate beats delivery quality as a headline metric. A vendor that delivers fast but reworks 30 percent of labels costs more in total than a vendor that delivers slower at 5 percent rework; worked numbers follow this list.
  5. Plan for two vendors, not one. The hybrid combinations are the production-economics default. Plan the procurement to converge there rather than into a single-vendor pattern that will need re-procuring inside a year.
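
The arithmetic behind point four, under two loud assumptions: reworked labels are re-billed at the unit price, and every reworked label consumes internal review time on both passes. The prices here are illustrative:

```python
def effective_cost_per_accepted_label(unit_price, rework_rate, review_cost):
    """Vendor billing (original pass plus re-billed rework) plus internal
    review effort on both passes. All inputs are per-label dollar figures."""
    vendor_cost = unit_price * (1 + rework_rate)     # original batch + redo billing
    internal_cost = review_cost * (1 + rework_rate)  # review original + reworked
    return vendor_cost + internal_cost

# Illustrative numbers: the cheaper vendor reworks 30%, the pricier one 5%.
fast = effective_cost_per_accepted_label(0.08, 0.30, 0.06)    # ~= 0.182
steady = effective_cost_per_accepted_label(0.10, 0.05, 0.06)  # ~= 0.168
```

Even before counting schedule slip and the labels that escape rework undetected, the cheaper-but-sloppier vendor comes out more expensive per accepted label.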

Why is procurement in this category so much harder than software procurement?

Because the deliverable is bespoke, the quality definition is project-specific, the supply chain involves humans whose performance varies across days, and the pricing is opaque. Software vendors compete on a feature matrix you can read in twenty minutes. Training-data vendors compete on outputs that you have to inspect and measure yourself. The procurement cycle is longer because the evaluation work is longer. There is no way around that.

What is the single most-skipped step?

Writing the off-ramp clause into the first contract. Procurement teams skip it because the relationship is going well at signing time and nobody wants to lead with the divorce conversation. Then six months in, the vendor relationship gets bumpy, and you discover the contract has an exclusivity clause buried in the schedule of work that makes the second vendor uneconomical. The off-ramp clause is the single piece of paperwork that pays for itself every time, and it is the easiest to negotiate at signing because the vendor has not yet locked you in.

How long is a realistic procurement cycle?

Four to eight weeks for enterprise engagements. Week one for the rubric. Week two for the shortlist conversations. Weeks three and four for the paid pilots. Week five for scoring. Weeks six and seven for the production contract. Week eight is the buffer that always gets consumed by something. Compress this to two weeks and the rubric will be wrong, the pilots will be insufficient, and the contract will be missing the off-ramp.

How did you measure rework rate?

On the pilot output, I scored each vendor against a known reference set I controlled. The rework rate was the percentage of labels that I would have asked the vendor to redo if this had been a production batch. That measurement requires a real reference set; building one is a meaningful upfront cost (typically three to five days of senior internal time) and it is the only honest way to compare vendor outputs. Without it, you are scoring vendors against each other rather than against ground truth.
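
A minimal sketch of that measurement, assuming disagreement with the blinded reference stands in for "would have asked for a redo"; the real call involved human judgment on borderline cases, so treat this as the automatic first pass:

```python
def pilot_rework_rate(vendor_labels, reference_labels):
    """Share of pilot labels that would have been sent back in production.

    vendor_labels, reference_labels: dict of item_id -> label.
    Scores only the items that appear in the blinded reference set.
    """
    scored = [i for i in reference_labels if i in vendor_labels]
    if not scored:
        raise ValueError("no overlap between pilot output and reference set")
    redo = sum(1 for i in scored if vendor_labels[i] != reference_labels[i])
    return redo / len(scored)
```

In this simplified proxy, rework rate is one minus agreement rate; in practice the borderline disagreements went through a second human look before counting as rework.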

When should you go synthetic rather than human?

For narrow domains, yes; for frontier RLHF on subjective reasoning, not yet. Image classification, text classification, and code-generation post-training are the lanes where Snorkel AI or a programmatic alternative is genuinely competitive on quality and dramatically cheaper. RLHF on subjective tasks still benefits from human raters; the synthetic-to-human gap is narrowing but is still real. The right answer in most production programmes is a hybrid: synthetic primary supply, human annotators for the edge cases that drive model performance.

What is the question almost no one asks during vendor calls?

How does your QA scale with task complexity? Most vendors will tell you their QA process for standard labels. The interesting answer is what happens when the labels get harder: multi-step reasoning, subjective preference judgements, expert-domain knowledge work. The QA model that works for straightforward classification breaks for these. The vendors that handle complexity well usually have a tiered QA model with expert adjudicators in the loop for harder tasks. The vendors that struggle usually have a single QA model applied uniformly and the rework rate climbs sharply as complexity rises.

How does this connect to the WTF guide on the same category?

That guide ranks the ten leading vendors across eight axes (lane positioning, worker selection, QA depth, RLHF readiness, constitutional AI, pricing transparency, minimum engagement, and capability breadth). It is the procurement-grade comparison document. This piece is the operating-experience version that sits behind it. The two are designed to be read together; the WTF guide is the spec, this piece is the field test. The link is at wetheflywheel.com/en/guides/ai-training-data-providers-2026/.
