World models, and where memory sits.
Everyone agrees 2026 is the year of the world model. Fewer agree on what a world model is for. This is our reading of the field, and a precise claim about the layer underneath it: a memory of the real Earth that a model can cite by date and train on.
Chapter I
The year of the world model.
The labs that set the agenda all moved in the same direction this year: past language, toward systems that hold a model of physical space.
Fei-Fei Li has spent the year arguing that spatial intelligence is the next frontier of AI. The field has mastered words; it is only beginning on the harder problem of space, geometry, and the physical world. Her new lab, World Labs, exists to build exactly that, and raised a billion dollars in February on the thesis.
Jensen Huang put it as a layered stack at CES, a five-layer cake of AI in which the most valuable models are the ones that understand the physical world, not just text on a screen. NVIDIA’s own answer, Cosmos, is a family of world foundation models aimed squarely at physical AI.
Goldman Sachs, writing for people who allocate capital rather than train networks, called the world model AI’s missing link: the reason a system that aces a language benchmark still cannot be trusted to act in a warehouse, a field, or a flooded street.
We agree with the direction and want to be precise about one thing. A world model that imagines a plausible world is a different object from a record that remembers the real one. Both are useful. They are not the same, and an agent that acts on the first as if it were the second is how you get fluent, dated, confident error.
Chapter II
One stack, four jobs.
Niantic’s argument is that the best systems combine layers like the four below rather than pick one. Each does a distinct job, and each needs something to be true about a real place.
Language models reason. Give them the right context and they plan, summarise, and decide well. They hold no model of a specific place, so when the context runs out they reach for a prior and call it fact.
World foundation models imagine. NVIDIA’s Cosmos generates physically plausible video to train robots and autonomous vehicles; the point is a plausible future, not a faithful past. Plausible is the right target for synthetic data. It is the wrong target for an audit.
Large geospatial models localise. Niantic’s LGM places a camera in real 3D space to centimetres; Google’s AlphaEarth Foundations compresses years of observation into a planetary embedding you can query. These read the real Earth.
Earth memory is the layer below all of them. Not a model you call, but a dated, signed record of what was actually observed at a place. It is the substrate the other three layers can be grounded on, and that you can train your own model on top of.
Chapter III
Who is building what.
The term “world model” covers two very different things: systems that generate a world, and systems that read one. The distinction is not marketing. It decides whether an output is a guess or a record.
| Model | What it is | What it does |
|---|---|---|
| Read the real Earth · observation-grounded | ||
| Niantic Spatial · LGM | A Large Geospatial Model, the spatial counterpart to an LLM, trained on 30B+ geolocated images. | Localises a camera to near-centimetre 6DoF and reconstructs places. Niantic itself positions LGM as “ground truth” for world models, so this is a peer in spirit, not a foil. |
| Google · AlphaEarth Foundations | An embedding-field model: a 10 m planetary embedding fusing optical, radar, elevation, climate, and text into one annual layer. | Observation-derived, not generative; the closest peer to what we do. We differ on provenance, temporal recall, private ownership, and training, not on “we observe, they imagine.” |
| Generate a world · plausible, not factual | ||
| Google DeepMind · Genie 3 | A generative world model: navigable 720p worlds at 24 fps with roughly a minute of held state. | Imagines interactive environments from a prompt. DeepMind is explicit that it cannot reproduce real-world locations accurately. It renders a plausible place, not a particular one. |
| World Labs · Marble | Fei-Fei Li’s system for persistent, explorable 3D worlds generated from text, image, or video. | Builds worlds you can walk through. They are coherent and persistent, and they are invented, not a record of any real address. |
| NVIDIA · Cosmos | World foundation models for physical AI, generating synthetic training data for robots and autonomous systems. | Produces physically plausible scenarios to train policies on. The output is manufactured supervision, deliberately not an observed record. |
Two camps, one shelf. Genie, Marble, and Cosmos generate plausible worlds and say so. LGM and AlphaEarth read the real one. We sit with the second camp and add the parts it leaves out: provenance you can verify, time you can replay, a memory you own, and the ability to train your own model on it. The sharpest line about the generative camp is the one DeepMind already published. A generated world is a guess about the frame after next, not a record of the one that happened.
Chapter IV
Grounding, and the hallucination problem.
A generative model that is wrong about geometry is a hallucination engine pointed at the world. Grounding is the fix. Tie every output to a verifiable observation, or do not assert it.
The clearest statement of the problem this year came from Overture Maps. Without ground truth, they wrote, models invent places that do not exist, miss places that do, and attach the wrong attributes to the right locations. Their prescription is a structured, verifiable grounding layer for spatial retrieval: stable place identifiers that resolve to checkable facts.
The world-model literature has its own version of the warning. The fair critique of generative world models is not that they are useless; it is that, left ungrounded, they behave as high-fidelity hallucination engines rather than reliable simulators. They produce dynamical hallucinations: water that flows uphill for a frame, a wall that was never poured. Convincing, and not something to settle an insurance claim on.
There is a parallel failure one level up, at the agent. Agents fabricate tool executions: they report that they checked a source they never queried. Researchers have proposed tool receipts: a signed artefact proving a tool actually ran and returned what the agent claims. That is precisely the shape of a geo.qa answer.
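The receipt idea can be sketched in a few lines. This is a minimal illustration, not the geo.qa receipt format: the field names (`tool`, `args`, `result`, `ran_at`), the `lookup_cell` tool, and the sample values are all hypothetical. It uses a shared-secret HMAC from the Python standard library as a stand-in; a receipt that third parties verify offline against a public key would need an asymmetric signature instead.

```python
import hashlib
import hmac
import json


def canonical(payload: dict) -> bytes:
    """Serialise deterministically so signer and verifier hash identical bytes."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def issue_receipt(key: bytes, tool: str, args: dict, result: dict, ran_at: str) -> dict:
    """Bind tool name, arguments, result, and timestamp into one tagged record."""
    payload = {"tool": tool, "args": args, "result": result, "ran_at": ran_at}
    tag = hmac.new(key, canonical(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_receipt(key: bytes, receipt: dict) -> bool:
    """Recompute the tag over the payload and compare in constant time."""
    expected = hmac.new(key, canonical(receipt["payload"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["tag"])


# Hypothetical key and tool call, for illustration only.
key = b"demo-key-not-for-production"
receipt = issue_receipt(
    key, "lookup_cell", {"cell": "c-001"}, {"flooded": True}, "2026-05-04T10:00:00Z"
)
print(verify_receipt(key, receipt))  # True: the receipt matches what the tool returned

# An agent that reports a different result cannot reuse the old tag.
forged = {"payload": {**receipt["payload"], "result": {"flooded": False}}, "tag": receipt["tag"]}
print(verify_receipt(key, forged))  # False
```

The point of the design is that the claim and its proof travel together: an agent can only assert what it holds a valid receipt for.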
Chapter V
Where geo.qa sits.
Against the generative models, the contrast is record versus imagination. Against the observation-grounded ones it is sharper and more useful: a model you call versus a memory you own, cite, and train on.
| | A geospatial model you call | geo.qa Earth memory |
|---|---|---|
| Output | an inferred embedding or scene | an observed fact, dated |
| Provenance | black-box inference | a signed receipt, verifiable offline |
| Time | a snapshot or annual layer | every tslot, append-only |
| Fusion | a fixed set of inputs | your whole fleet, one address |
| Ownership | a hosted service | your tenancy, airgapped |
| To build on | a fixed, pretrained model | a substrate you train your own on |
We do not claim to be the first or only ground-truth layer. Niantic frames LGM that way for navigation; AlphaEarth is an open, observation-derived planetary embedding. Both are real, and good. The differentiation is not “we observe and they imagine.” For those two, that is false. It lies in what a planetary model you call does not give you.
Provenance you can cite. Every answer carries a signed receipt over its sources. You verify it offline against a public key; the embedding cannot hand you that.
Temporal recall. History is append-only. You can ask what one cell looked like on 4 May, then on 11 May, and diff the two. An annual layer only remembers the year.
Private ownership. The memory is your tenancy: your keys, your jurisdiction, airgapped if you need it. Not a multi-tenant API you rent.
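Temporal recall is the easiest of these to make concrete. The sketch below is a toy, in-memory illustration of an append-only history and a date-to-date diff. The `Observation` and `AppendOnlyMemory` names, the cell id, the year, and the values are assumptions made for the example, not the geo.qa API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    cell: str
    band: str
    day: date
    value: float


class AppendOnlyMemory:
    """Observations can be added but never edited or removed."""

    def __init__(self):
        self._log = []

    def append(self, obs):
        self._log.append(obs)

    def state_on(self, cell, day):
        """Latest value per band observed for `cell` on or before `day`."""
        state, latest = {}, {}
        for obs in self._log:
            if obs.cell != cell or obs.day > day:
                continue
            if obs.band not in latest or obs.day >= latest[obs.band]:
                latest[obs.band] = obs.day
                state[obs.band] = obs.value
        return state

    def diff(self, cell, earlier, later):
        """Bands whose value differs between the two dates, as (before, after)."""
        a, b = self.state_on(cell, earlier), self.state_on(cell, later)
        return {
            band: (a.get(band), b.get(band))
            for band in sorted(set(a) | set(b))
            if a.get(band) != b.get(band)
        }


# Made-up observations for one hypothetical cell.
mem = AppendOnlyMemory()
mem.append(Observation("c-001", "optical", date(2026, 5, 4), 0.31))
mem.append(Observation("c-001", "thermal", date(2026, 5, 4), 290.0))
mem.append(Observation("c-001", "optical", date(2026, 5, 11), 0.12))

change = mem.diff("c-001", date(2026, 5, 4), date(2026, 5, 11))
print(change)  # {'optical': (0.31, 0.12)}
```

Because nothing is overwritten, the 4 May state is still answerable after the 11 May observation arrives, which is what an annual layer cannot offer.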
Chapter VI
Train your own world model on the memory.
This is the difference that matters most. Most of the field ships one pretrained model. geo.qa lets you train your own on your own observed memory, and own both the data and the model.
    # a JEPA dynamics model, trained on your own Earth memory
    from geoqa import Memory, train

    mem = Memory(tenant="acme")
    model = train.jepa(
        memory = mem,
        bands = ["optical", "SAR-VV", "thermal", "weather"],
        horizon = "7d",
    )  # predict in latent space, the way LeCun argues for, not pixels
    model.deploy(airgapped=True)  # reads the same memory
The unified memory is not only something to query. Every cell × band × tslot fact is supervision: a dated, multimodal state of a real place, with its history intact. That is exactly the input a JEPA-style model wants. Yann LeCun’s argument is that the right move is to predict the next state in latent space, not to render the next pixel, and a memory of latents is the natural substrate for it.
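To show what "every fact is supervision" can mean in practice, here is a minimal sketch that turns cell × band × tslot facts into (state now, state later) training pairs. It assumes tslots are integer indices and that facts arrive as plain tuples; `build_pairs` and the sample values are hypothetical, and the encoder and latent predictor a JEPA-style model would train on these pairs are out of scope.

```python
from collections import defaultdict


def build_pairs(facts, bands, horizon):
    """Turn (cell, band, tslot, value) facts into (state_t, state_t+horizon) pairs.

    A state is the list of values for `bands` at one cell and tslot. Only
    tslots where every band was observed are used, so no value is invented.
    """
    table = defaultdict(dict)  # (cell, tslot) -> {band: value}
    for cell, band, tslot, value in facts:
        table[(cell, tslot)][band] = value

    def state(cell, tslot):
        row = table.get((cell, tslot))
        if row is None or any(b not in row for b in bands):
            return None
        return [row[b] for b in bands]

    pairs = []
    for cell, tslot in sorted(table):
        now, later = state(cell, tslot), state(cell, tslot + horizon)
        if now is not None and later is not None:
            pairs.append((cell, tslot, now, later))
    return pairs


# Made-up facts for one hypothetical cell; thermal is missing at tslot 2.
facts = [
    ("c-001", "optical", 0, 0.31),
    ("c-001", "thermal", 0, 290.0),
    ("c-001", "optical", 1, 0.12),
    ("c-001", "thermal", 1, 291.5),
    ("c-001", "optical", 2, 0.10),
]
pairs = build_pairs(facts, ["optical", "thermal"], horizon=1)
print(pairs)  # [('c-001', 0, [0.31, 290.0], [0.12, 291.5])]
```

Skipping incomplete tslots rather than filling them keeps every training target traceable to an observation, in line with the grounding argument above.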
So the loop closes. Fuse your fleet into one memory, train your own JEPA, forecasting, or detection model on it, and deploy that model back inside the same boundary, reading the same receipts it was trained on. You own the observations and you own the model. Nothing leaves, and every prediction can still be traced to what was actually seen. On one reference deployment we run, resolving a single cell among a few million takes single-digit milliseconds; treat that as an illustrative figure from one box, not a quoted SLA.
Stop renting a model. Own the memory.
Let the labs imagine worlds. Build on the one that was actually observed, and train your own model on it.
geo.qa · a vortx ground decoder · emem.dev open protocol