Now that large language models (LLMs) and foundation models have become everyday concepts, the next big idea in AI is world models. For the past two years, the industry has poured capital and compute into making language models bigger, faster, and cheaper. That race produced real breakthroughs, but it also revealed a ceiling. Language models predict the next word. They can summarize a research paper, draft a contract, or write working code. What they cannot do is understand what happens when a robotic arm pushes a box off a shelf, or simulate how a warehouse layout change affects throughput over the next quarter. World models can. In the past 18 months, more than $2 billion has flowed into startups building these systems, and the trajectory of that capital tells me the market is beginning to price in a new layer of the AI stack. I think world models represent the most important architectural shift in AI since transformers, and the companies that master them will define the next decade of autonomy, simulation, and physical intelligence.
Table of contents
- What are world models?
- How world models differ from LLMs
- Where world models are already working
- What this means going forward
What are world models?
A world model is an AI system that builds an internal representation of how an environment works — its objects, physics, and rules — and uses that representation to predict what happens next given a particular action. Where a language model asks "what word comes next?", a world model asks "what state comes next?"
Think of a world model as a toddler learning about gravity. Toddlers can't learn it by reading a book, and they certainly can't derive F=mg from Newton's law of universal gravitation. Instead, they pick up the intuition by watching a bottle of milk or a toy fall off a table. The toddler doesn't need the formula. After observing enough falling objects, they build a working mental model of gravity that lets them predict new situations: if I push this cup, it will hit the floor. World models work the same way. They learn from observation (video, sensor data, physics simulations) and develop internal representations of how things move, collide, and change over time.
Architecturally, most world models combine three core modules. The first is perception: encoding the current environment into a compressed internal state. The second is prediction: simulating what happens next by rolling that state forward in time based on proposed actions. The third is planning: evaluating multiple possible action sequences against the predicted outcomes and selecting the best one. This perceive-predict-plan loop is what gives world models their distinctive capability. They don't just classify or generate — they reason about consequences.
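To make the loop concrete, here is a minimal sketch in Python. Everything in it is an assumption chosen for illustration: the world is a point mass on a line, the `State` class and the `perceive`, `predict`, and `plan` functions are hypothetical names, and the prediction step is hand-written physics rather than a learned model. A real system would learn both the encoder and the dynamics from data.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class State:
    """Compressed internal state (hypothetical: a point mass on a line)."""
    position: float
    velocity: float


def perceive(observation: dict) -> State:
    """Perception: keep only what matters for prediction, drop the rest."""
    return State(position=observation["x"], velocity=observation["v"])


def predict(state: State, action: float, dt: float = 0.1) -> State:
    """Prediction: roll the state forward one step, treating the action as an acceleration."""
    velocity = state.velocity + action * dt
    return State(position=state.position + velocity * dt, velocity=velocity)


def plan(state: State, goal: float, horizon: int = 5) -> tuple:
    """Planning: imagine every candidate action sequence and keep the one ending closest to the goal."""
    best_sequence, best_cost = None, float("inf")
    for sequence in product((-1.0, 0.0, 1.0), repeat=horizon):
        imagined = state
        for action in sequence:
            imagined = predict(imagined, action)
        cost = abs(imagined.position - goal)
        if cost < best_cost:
            best_sequence, best_cost = sequence, cost
    return best_sequence


# "lighting" stands in for irrelevant detail that perception discards.
observation = {"x": 0.0, "v": 0.0, "lighting": "dim"}
best_sequence = plan(perceive(observation), goal=1.0)
```

In a running system the agent would typically execute only the first action of `best_sequence`, observe again, and replan, so that prediction errors do not accumulate over the whole horizon.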
The concept itself isn't new. What's changed is compute, data, and architecture. The same transformer and diffusion breakthroughs that powered the LLM era now make it possible to train world models on massive video and simulation datasets, bringing a 30-year-old idea into practical reach for the first time.
How world models differ from LLMs
The distinction between world models and LLMs is not just a matter of degree — it is a difference in what the systems fundamentally learn. Understanding that gap matters because it determines which problems each approach can and cannot solve.
Learning method. LLMs train on massive text corpora by predicting the next token in a sequence. They learn statistical relationships between words. World models, by contrast, learn through observation and reinforcement — video streams, physics simulations, robotic sensor feeds — and build compressed representations of how environments evolve. An LLM learns that the sentence "the ball fell" is likely to be followed by "to the ground." A world model learns why the ball fell and where it will land given its velocity and the angle of the surface.
Causal understanding. LLMs recognize language about cause and effect, but they don't truly model it. They can write a convincing paragraph explaining why bridges fail under stress, but they cannot simulate the failure. World models internalize causal structure. Given a current state and an action, the model simulates forward and predicts the resulting state — not because it has read about similar situations, but because it has built an internal physics of the domain.
Spatial and temporal reasoning. LLMs operate in the space of text — a fundamentally one-dimensional, sequential medium. World models operate in three-dimensional (and often four-dimensional, including time) environments. They track objects, surfaces, forces, and movement in space. This is why world models are essential for robotics, autonomous driving, and any application where an AI system needs to understand and navigate the physical world.
Planning under uncertainty. When LLMs attempt multi-step reasoning, they do it autoregressively — one token at a time — with no ability to simulate the downstream consequences of their choices. Errors compound. World models can simulate multiple action sequences in parallel, evaluate outcomes, and select the path with the highest expected reward. This is the difference between guessing what might work and testing what will.
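One simple way to picture this is a random-search planner: sample many candidate action sequences, simulate each one several times through a noisy model, and keep the sequence with the best average outcome. The sketch below assumes a toy one-dimensional world with hand-written stochastic dynamics. The function names, the noise level, and the reward (negative distance to a goal) are all hypothetical, and the candidates are scored in a plain loop rather than truly in parallel.

```python
import random


def step(position: float, action: float, rng: random.Random) -> float:
    """Hypothetical stochastic dynamics: the action moves the agent, plus Gaussian noise."""
    return position + action + rng.gauss(0.0, 0.1)


def rollout_reward(start: float, actions: list, goal: float, rng: random.Random) -> float:
    """Simulate one imagined future and score it by how close it ends to the goal."""
    position = start
    for action in actions:
        position = step(position, action, rng)
    return -abs(position - goal)


def expected_reward(start, actions, goal, rng, n_samples=30):
    """Average over several noisy rollouts to estimate the expected reward."""
    total = sum(rollout_reward(start, actions, goal, rng) for _ in range(n_samples))
    return total / n_samples


def plan(start: float, goal: float, horizon: int = 4, n_candidates: int = 200, seed: int = 0) -> list:
    """Sample candidate action sequences and return the one with the highest estimated reward."""
    rng = random.Random(seed)
    candidates = [
        [rng.choice((-1.0, 0.0, 1.0)) for _ in range(horizon)]
        for _ in range(n_candidates)
    ]
    return max(candidates, key=lambda seq: expected_reward(start, seq, goal, rng))


best_actions = plan(start=0.0, goal=3.0)
```

The point of the design is that every candidate is tested against the model before anything is executed. Stronger planners refine the candidates iteratively instead of sampling them once, but the evaluate-then-select structure is the same.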
Where world models are already working
World models are no longer a research abstraction. Several companies are shipping products built on world model architectures, and the applications cluster into a few clear categories.
Robotics and physical AI. This is the most natural use case. A robot operating in an unstructured environment, such as a warehouse, a kitchen, or a construction site, needs to predict how objects will move, how forces interact, and how its own actions change the scene. Nvidia's Cosmos platform generates large volumes of synthetic training data for robotics by simulating realistic physical environments, letting developers train embodied AI systems without the cost and risk of real-world data collection. The approach compresses what would take years of physical-world data gathering into weeks of simulation. This matters because the bottleneck in robotics has never been the hardware; it's the data. Training on physical robots is slow, expensive, and hard on the equipment. World models break that constraint by generating millions of realistic training episodes in simulation, then transferring the learned behaviors to physical systems.
Video and content generation. World models are redefining what AI-generated video can do. Runway's General World Model (GWM-1) generates video that obeys real-world physics (objects fall, light refracts, materials deform consistently) because the underlying system models the scene rather than hallucinating plausible-looking pixels frame by frame. World Labs, co-founded by Fei-Fei Li, built Marble, a system that reconstructs navigable 3D environments from still images. The output isn't a flat render; it's a spatial world you can move through and interact with.
Simulation and digital twins. Industrial companies are using world models to simulate complex systems before committing capital. A logistics operator can model how a new warehouse layout affects throughput, error rates, and labor requirements. An energy company can simulate grid behavior under different load scenarios. The world model learns the dynamics of the system from historical data and then rolls forward under hypothetical conditions — the same perceive-predict-plan loop, applied to supply chains and infrastructure instead of robotic arms.
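As a rough illustration of "learn the dynamics from historical data, then roll forward," the sketch below fits a one-variable linear model to a made-up warehouse log and uses it to compare two staffing scenarios. The data, the variable names, and the assumption that the daily change in backlog depends linearly on headcount are all hypothetical. A production digital twin would use far richer state and a learned, nonlinear model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def rollout(backlog, staffing_plan, slope, intercept):
    """Roll the learned dynamics forward under a hypothetical staffing plan."""
    trajectory = [backlog]
    for workers in staffing_plan:
        # Predicted daily change in backlog; the backlog cannot go below zero.
        backlog = max(0.0, backlog + intercept + slope * workers)
        trajectory.append(backlog)
    return trajectory


# Hypothetical historical log: (workers on shift, change in order backlog that day).
history = [(8, 20), (10, 0), (12, -20), (9, 10), (11, -10)]
slope, intercept = fit_line(*zip(*history))

# Two what-if scenarios starting from the same backlog of 50 orders.
keep_staffing = rollout(50.0, [10] * 5, slope, intercept)
add_two_workers = rollout(50.0, [12] * 5, slope, intercept)
```

With this toy data the fitted model predicts that holding staffing at 10 leaves the backlog flat, while adding two workers clears it within the week. The value is in the comparison: the operator sees both futures before paying for either.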
Gaming and immersive environments. Google's Genie 3 generates real-time explorable environments from a text description or a single image, with no game engine required. Decart built playable simulations that run entirely on a world model, eliminating the traditional game development pipeline. These are early products, but they point toward a future where interactive 3D content can be generated on demand, in a fraction of the time and at a fraction of the cost of manual creation. For context, a single AAA game environment can take a team of artists months to build. A world model can generate a playable approximation in seconds.
What this means going forward
The next major platform companies won't just be better at talking about the world. They will be better at understanding how it works — how objects move, how systems respond to interventions, how environments evolve over time. That capability is what separates an AI that can describe a factory floor from one that can optimize it.
World models are still early. Training efficiency needs to improve by an order of magnitude before these systems can generalize across domains the way LLMs generalize across text. Real-world transfer — taking a model trained in simulation and deploying it reliably in physical environments — remains a hard engineering problem. But the direction is clear, the capital is flowing, and the technical foundations are advancing fast. I expect the next two to three years will produce the first breakout companies in this space: startups that combine world model architecture with proprietary data in robotics, simulation, or spatial computing to build products that were simply not possible in the LLM-only era. The winners will look different from LLM companies — they will be deeply vertical, grounded in domain-specific data, and measured by how well their models predict physical outcomes, not how fluently they produce text.
The shift from words to worlds is underway. If you are building in this space, I would like to hear from you.

