Who Will Win the RL Environment Market—and Why


In my last article, RL Environments for Agentic AI, I argued that reinforcement learning (RL) environments are becoming the constraint for agentic AI—because verification, not raw model capability, is what makes automation durable. That framing has a direct competitive implication: the teams that win won’t look like traditional tooling vendors; they’ll look like thought partners embedded with frontier labs, compounding trust and research depth over time. This second piece zooms in on what that means for the market outcome: who wins the RL environment layer, and why.

Between 2026 and 2030, the RL environment market will narrow decisively. What today looks like roughly 20 seed- to Series A-stage companies on relatively similar footing—be they forward-deployed teams, early environment builders, or research-heavy startups—will resolve into three to five market leaders, with one to two dominant platforms pulling meaningfully ahead.

This is not a land-grab driven by environment count or early demos. It is a selection process driven by two reinforcing advantages:

  1. Who earns “thought partner” trust from the frontier AI labs early, and
  2. Who builds the research organization capable of industrializing replication training and verification.

Labs today are capturing low-hanging fruit in many simple environments to teach AI to use tools, especially to train v0 of computer use. Once those basic applications are trained, labs will optimize less for the volume of environments shipped and more for complexity, quality, and what comes next, increasingly concentrating their spend on the few teams that help them push the frontier forward.

Here are six principles that will determine who breaks out in 2026-2030:

  1. Frontier labs will choose embedded thought partners, not on-demand vendors
  2. RL environments will evolve from brittle artifacts to automated infrastructure
  3. Replication training and hybrid verification define the research moat
  4. Lab trust and research depth are self-reinforcing
  5. Frontier lab work is the foundation; enterprise work is the multiplier
  6. Depth in core, complex domains (e.g., coding) beats environment breadth (many shallow environment apps)

1. Frontier Labs Will Choose Embedded Thought Partners, Not On-Demand Vendors

RL environments are still immature. Many are single-application, brittle to UI or workflow changes, and partly manual in how tasks are defined or graded. This is acceptable today, but it will not remain so as agent autonomy increases.

Agents can already run autonomously for two to three hours so long as the setting is constrained. As autonomy increases, training shifts from isolated tasks to long-horizon workflows spanning multiple environments, where state persists, decisions compound, and success only comes much later. This shift fundamentally changes what labs need.

As a result, frontier labs aren’t looking for vendors who can “build environments on request.” They’re looking for thought partners: teams who can help define how RL infrastructure must evolve as agents become more autonomous. In practice, this means co-designing new kinds of environments, stress-testing verification in ambiguous settings, and experimenting with replication training regimes that don’t yet have established playbooks.

Buyer concentration is a feature, not a bug. Winning a few anchor lab relationships matters far more than broad distribution. As a reminder, Scale AI had over 80% revenue concentration in roughly five AV customers going into 2019-2020. 

Key Takeaway: Labs gravitate toward teams they trust, like working with, and treat as extensions of their own research organizations. Frontier labs are increasingly the gravitational center for RL talent. Startups that maintain live research pipelines into academia and labs inherit that gravity; others slowly fall out of relevance.

2. Environments Become Environment Factories: From Brittle Tasks to Automated Infrastructure

The technical evolution of RL environments is clear. They will move from single-app, hand-curated tasks, manual or semi-manual grading, and fragile assumptions about tools and interfaces toward multi-environment workflows, automated task generation and variation, and hybrid verification systems that scale with agent capability.

The end state isn’t “better environments.” It’s automated RL infrastructure: environment factories that continuously assemble, orchestrate, test, and refresh environments as agents improve and software ecosystems evolve.
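To make the factory idea concrete, here is a minimal sketch of automated task generation and variation. Everything in it is illustrative and assumed: TaskSpec, EnvironmentFactory, and the field names are hypothetical stand-ins, not any lab's or vendor's actual API.

```python
# Minimal sketch of an environment factory: one workflow, many controlled
# variants. TaskSpec/EnvironmentFactory and all field names are illustrative.
import random
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    workflow: str          # the stable, high-value workflow being varied
    seed: int              # deterministically controls inputs / starting state
    tools: tuple           # which tools the agent may call this run
    constraints: dict = field(default_factory=dict)

class EnvironmentFactory:
    """Continuously assembles task variants instead of hand-curating them."""

    def __init__(self, workflows, tool_pool):
        self.workflows = workflows
        self.tool_pool = tool_pool

    def generate(self, n_variants):
        """Same workflow, varied surface conditions: different tool subsets,
        starting states, and constraints, all reproducible from the seed."""
        for workflow in self.workflows:
            for seed in range(n_variants):
                rng = random.Random(seed)  # seeded, so each variant is replayable
                yield TaskSpec(
                    workflow=workflow,
                    seed=seed,
                    tools=tuple(rng.sample(self.tool_pool, k=3)),
                    constraints={"max_steps": rng.choice([200, 500, 1000])},
                )

factory = EnvironmentFactory(
    workflows=["resolve-support-ticket"],
    tool_pool=["browser", "email", "crm", "terminal", "calendar"],
)
print(list(factory.generate(n_variants=3)))
```

The design choice worth noticing is that variation is generated, seeded, and replayable rather than hand-authored; that is what separates a factory from a catalog of demos.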

This shift will be gradual. RL environments will stay largely manual for some time; if the entire pipeline were already automatable, frontier labs would simply build it in-house. The fact that they haven’t is precisely why this market exists.

Key Takeaway: As agents improve, environment quality—not environment count—becomes the binding constraint. Labs will reward teams that can absorb complexity, keep environments stable, and anticipate what the next generation of training will require.

3. Replication Training and Hybrid Verification Are the Compounding Technical Moat

The core research challenge over the next five years is not simply building more RL environments, but turning environment training into a scalable system. This is where replication training becomes the defining primitive.

Replication training replaces bespoke, one-off environment building with controlled repetition at scale. Instead of constantly inventing new tasks, teams pick a small set of high-value, long-form workflows and run them thousands of times across slightly varied environments—different inputs, starting states, tool access, and constraints—while keeping the underlying workflow stable. Learning compounds not from novelty, but from repeated exposure across changing conditions.

Building reliable, extensible replication training is non-trivial. It requires:

  • Deterministic reset and replay, so workflows can be run repeatedly without drift
  • Environment abstraction layers, to allow variation without breaking realism
  • Parallel orchestration, to run large numbers of long-horizon trajectories
  • Instrumentation and telemetry, to surface where and why agents fail over time

The objective is to extract maximal learning signal from a minimal set of environments—driving down marginal cost while preserving fidelity.
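As a sketch of how these pieces fit together, assume the factory shape above plus hypothetical run_agent, build, score, and diagnose hooks; none of this is an established framework, just one plausible wiring.

```python
# Minimal sketch of a replication-training loop over one workflow.
# env_factory.build / env.score / env.diagnose and run_agent are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def run_replication(env_factory, run_agent, n_variants=1000, workers=32):
    """Run one long-form workflow across many varied conditions in parallel,
    keeping trajectories and failure telemetry rather than chasing novelty."""

    def one_rollout(spec):
        env = env_factory.build(spec)      # deterministic reset from spec.seed
        trajectory = run_agent(env, spec)  # long-horizon rollout
        return {
            "seed": spec.seed,
            "reward": env.score(trajectory),           # graded by the verifier stack
            "steps": len(trajectory),
            "failure_tags": env.diagnose(trajectory),  # telemetry: where/why it failed
        }

    # Parallel orchestration: many long-horizon trajectories at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one_rollout, env_factory.generate(n_variants)))
```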

Verification is the harder half of the system. In long-form, multi-environment workflows, there's rarely a single correct answer, and as agents improve, the reward signals often become noisier, not cleaner. That's why winning systems use closed, hybrid verification loops: experts score model outputs, models review expert judgments, humans supervise and correct models, and algorithms reconcile disagreement, detect drift, and curate edge cases.

In this regime, data work becomes intellectual—the job is to construct and maintain a reliable training signal in the presence of ambiguity. Replication training only works if this verification loop remains stable as scale increases. That stability is a research problem, not a tooling problem.
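One way to picture that closed loop, as a deliberately simplified sketch: the graders and the reconciliation rule below are hypothetical, not a description of any production system.

```python
# Minimal sketch of a closed hybrid verification loop. model_grader and
# expert_grader are hypothetical callables returning scores in [0, 1].
def verify(trajectory, model_grader, expert_grader, drift_log,
           disagreement_threshold=0.3):
    """Reconcile model and expert judgments into one reward signal,
    escalating disagreement instead of averaging it away."""
    model_score = model_grader(trajectory)    # cheap: runs on every rollout
    expert_score = expert_grader(trajectory)  # expensive: sampled or escalated

    gap = abs(model_score - expert_score)
    if gap > disagreement_threshold:
        # Disagreement is signal: log it for drift detection and curate the
        # trajectory as an edge case for human adjudication.
        drift_log.append({"trajectory": trajectory, "gap": gap})
        return None  # withhold the reward until a human corrects the judges

    return (model_score + expert_score) / 2   # agreement: usable training signal
```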

Prompt optimization systems like GEPA are not alternatives to RL infrastructure, but adjacent primitives that plug into the same replication and verification systems.

The core skills required to build this system are rare and cross-disciplinary:

  1. Systems engineering, to manage orchestration, replay, and scale
  2. Applied ML research, to design rewards and diagnose failure modes
  3. Human-in-the-loop design, to balance automation and judgment
  4. Product intuition, to decide which variations actually teach the model something new

Teams that master these skills turn replication training into a compounding advantage—and define the real moat in RL environments.

Key Takeaway: Replication training only becomes a moat when it's paired with a stable, closed-loop hybrid verification system, because the winner isn't the team with the most environments; it's the team that can reliably extract learning signal from a small set of long-form workflows at scale. That combination of repeatable training and compounding, ambiguity-resistant verification drives down marginal cost, prevents drift, and turns every run into durable advantage.

4. Research Lineage Compounds: Trust → Talent → Frontier Exposure → Better Systems

These dynamics create a powerful flywheel, and research credibility is the entry ticket. Teams with genuine research lineage—often signaled simply by who built them—earn trust faster with both labs and investors.

Key Takeaway: Teams with strong research organizations win deeper lab engagements. Those engagements expose them to frontier failure modes earlier than the rest of the market. That exposure feeds directly into better replication training systems, stronger verification pipelines, and lower marginal cost per new environment. Over time, this compounds into durable technical advantage.

5. Lab-First Builds the Core; Enterprise Scales the Distribution

While lab trust, research depth, and talent reinforce one another, they are not equal inputs. If one must serve as the foundation for breakout success, it is unequivocally frontier lab work. This is where RL infrastructure is actually built. Enterprise work, by contrast, is where that infrastructure is applied, customized, and monetized.

Frontier labs force teams to solve the hardest problems first. Environments must generalize across models. Replication training must work at scale. Verification must hold up under long-horizon ambiguity. Orchestration, replay, and evaluation-to-performance correlation cannot rely on customer-specific shortcuts. Lab buyers are purchasing infrastructure primitives—environments, tasks, training loops—not bespoke outcomes. That pressure produces reusable systems rather than one-off solutions.

Enterprise demand today looks different. Most enterprises are not buying RL environments or training infrastructure directly. They are buying agentic applications tied to concrete workflows and KPIs, often delivered through forward-deployed engineering. This work is valuable, but inherently more custom. Enterprise agents benefit disproportionately from infrastructure that has already been hardened upstream in lab settings—effectively “RL menus” rather than bespoke training from scratch.

The asymmetry matters. Infrastructure built for labs lowers the marginal cost, risk, and time-to-value of enterprise deployments. Replication training pipelines, environment factories, and hybrid verification systems developed for lab use increasingly power enterprise work behind the scenes. Over time, what looks like bespoke enterprise delivery becomes configuration on top of lab-derived infrastructure.

Key Takeaway: Frontier lab work is the foundation. It produces the training and verification infrastructure that later makes enterprise deployments cheaper, faster, and more scalable. The teams that break out will be lab-first platforms that use enterprise demand as a multiplier—not the other way around.

6. Depth in Core, Complex Domains Beats Environment Breadth

In the early stages of building enduring RL infrastructure, depth in a small number of complex domains matters far more than breadth across many shallow ones. Not all environments are equally valuable. The environments that matter most are those where success is hard to define, trajectories are long, tools are central, and failure modes are subtle. Coding and computer use sit squarely in this category, with coding as the holy grail.

Coding is not just another task—it is a meta-domain. It combines long-horizon reasoning, tool invocation, stateful context, error recovery, and verifiable outcomes. Crucially, it also offers dense feedback: tests pass or fail, programs compile or break, diffs can be evaluated, and performance can be measured. This makes coding one of the few domains where high-quality reinforcement signals are available at scale, even as the task itself remains cognitively demanding.

This is not just a theoretical advantage; it is already reflected in market demand. Coding is the largest application vertical for the leading AI labs. Claude Code reached roughly $1 billion in ARR within six months. Microsoft Copilot is already a multi-billion-dollar business, and OpenAI’s Codex is generating hundreds of millions in annualized revenue. These products sit at the frontier of long-horizon, tool-heavy agent behavior and are natural early consumers of RL environments.

The most successful AI-native applications—and the likely first enterprise adopters of RL environments to stay competitive—are overwhelmingly coding-centric. Cursor is approaching $2B in ARR, with teams like Windsurf/Cognition close behind, and a fast-growing cohort of newer startups such as Cline and Kilo AI emerging. These companies are not experimenting at the edges; they are pushing agents into production workflows where reliability, verification, and continuous improvement matter.

By contrast, many RL environment companies are adopting environment-first strategies, optimizing for surface area rather than depth. Teams build dozens of narrow environments (CRM editing, scheduling/email drafting, Slack clones) that demonstrate capability but rarely compound. These environments are short-horizon, weakly coupled to tools, and often rely on brittle heuristics or human-in-the-loop grading. They are easy to demo, hard to generalize, and produce limited reusable infrastructure. They provide initial value to labs training computer use, but they will quickly commoditize.

Depth compounds in ways breadth does not. Teams that invest deeply in coding and complex computer-use environments are forced to solve hard, reusable problems early: scalable replication training, hybrid verification, trajectory replay, and eval-to-performance correlation. Improvements in these systems transfer across tasks within the domain and, over time, into adjacent domains like data analysis, DevOps, and general computer use.

This is why coding and computer use are the fastest initial pickup markets for RL environments. They sit at the intersection of high economic value, high task complexity, and strong verifiability. Teams that win here are not just shipping better agents—they are building the core training and evaluation infrastructure that makes broader generalization possible later.

Key Takeaway: Early advantage in RL environments accrues to teams that go deep in a small number of complex, high-signal domains, especially coding and computer use. Breadth can be added later. Depth, once skipped, is hard to recover. Coding is the single best area of expertise on which to build a foundation.

Market Landscape Today (2026)

The emergence of RL environments doesn’t displace incumbent labeling vendors, but it reshapes where value accrues. Traditional labeling is roughly a $5B market today (growing in excess of 50% YoY). The broader AI training data market ahead is likely to be significantly larger—but structurally different. Scale and workforce orchestration mattered in the labeling cycle. In the next one, research depth, domain specialization, and systems-level integration matter far more.

Incumbents like Surge AI, Mercor, Turing, Invisible, and others already generate billions in aggregate revenue serving frontier labs with expert labeling, fine-tuning data, and evaluations. That demand is expanding, not shrinking—especially in coding, which is already the highest-value expert vertical across these platforms. Coding-heavy expert work is where long-horizon reasoning, tool use, and verifiability converge, making it the natural bridge from labeling to RL environments. 

That said, most incumbents are still optimized for a services-first, episodic workflow model, not for continuous learning systems. They aren't research-first orgs at their core today. RL environments require reusable environment abstractions, deterministic task design, scalable replication training, and tight eval-to-performance feedback loops. These are infrastructure and research problems, not labor problems. The founding makeup of teams like Bespoke and Applied Compute is a mix of ML scientists/engineers and forward-deployed engineers, not labelers, project managers, and software engineers.

It would not be surprising if one or two incumbents with strong research organizations successfully adapt to this new vertical. Surge AI is the clearest candidate today, and Snorkel has the intellectual foundations even if it lacks scale. But structurally, this market favors specialists built natively for RL environments. Depth compounds faster than breadth, and teams designed from day one to build training systems—not just manage expert throughput—are best positioned to capture the long-term value.

Which Companies Will Win in 2030?

By 2030, the winners in RL environments will not be the teams with the most demos, the widest environment catalogs, or the largest services organizations. They will be the teams that build real infrastructure—and do so in close partnership with frontier labs to industrialize replication training, verification, and long-horizon environment orchestration.

That reality narrows the field. There will likely be three to five significant winners as labs concentrate spend among a small set of trusted partners. This market may end up resembling data labeling, which already has roughly three $1B+ revenue players (Scale, Mercor, Surge) and more than five $100M+ players (Turing, Micro1, Invisible, Handshake, etc.).

Most early-stage players lack one of the two non-negotiables: deep research capability or sustained lab trust. Without both, teams risk stalling out as forward-deployed shops or commoditized environment builders while labs consolidate spend with a small number of thought partners. Incumbent labeling vendors will participate in this market, but few are structurally positioned to lead it.

In the end, this market will not be won by teams that apply RL environments fastest, but by those that help labs define how RL environments should exist at all. The enduring platforms will be built by teams that partner with frontier labs to turn ad-hoc experimentation into replication training systems, fragile graders into robust verification loops, and one-off demos into durable infrastructure.

That work is slow, research-heavy, and often invisible from the outside, but it compounds. And by the time RL environments become an obvious, enterprise-scale category, the winners will already be decided: not by market share today, but by who the labs trusted to build the training layer when it still mattered most.

Chris Zeoli
Author