
World models, and where memory sits.

Everyone agrees 2026 is the year of the world model. Fewer agree on what a world model is for. This is our reading of the field, and a precise claim about the layer underneath it: a memory of the real Earth that a model can cite by date and train on.


Chapter I

The year of the world model.

The labs that set the agenda all moved in the same direction this year: past language, toward systems that hold a model of physical space.

Fei-Fei Li has spent the year arguing that spatial intelligence is the next frontier of AI. The field has mastered words; it is only beginning on the harder problem of space, geometry, and the physical world. Her new lab, World Labs, exists to build exactly that, and raised a billion dollars in February on the thesis.

Jensen Huang put it as a layered stack at CES, a five-layer cake of AI in which the most valuable models are the ones that understand the physical world, not just text on a screen. NVIDIA’s own answer, Cosmos, is a family of world foundation models aimed squarely at physical AI.

Goldman Sachs, writing for people who allocate capital rather than train networks, called the world model AI’s missing link: the reason a system that aces a language benchmark still cannot be trusted to act in a warehouse, a field, or a flooded street.

We agree with the direction and want to be precise about one thing. A world model that imagines a plausible world is a different object from a record that remembers the real one. Both are useful. They are not the same, and an agent that acts on the first as if it were the second is how you get fluent, dated, confident error.

# the distinction this memo turns on
imagine  → a plausible next frame.       (a guess, by design)
remember → a dated observation, cited.   (a record, with a receipt)
# geo.qa is the second, built to sit under the first
Spec. 1. The whole argument in two verbs. Generation and memory are complementary; conflating them is the failure mode.

Chapter II

One stack, four jobs.

Niantic’s own argument is that the best systems combine these layers rather than pick one. Each does a distinct job, and each needs something to be true about a real place.

Large Language Models: reason in language [reason]
World Foundation Models: imagine plausible physics [imagine]
Large Geospatial Models: localise in real 3D space [localise]
↑ grounded on ↑
geo.qa · Earth Memory: remember, cite & train on what was observed [ground truth]

Language models reason. Give them the right context and they plan, summarise, and decide well. They hold no model of a specific place, so when the context runs out they reach for a prior and call it fact.

World foundation models imagine. NVIDIA’s Cosmos generates physically plausible video to train robots and drivers; the point is a plausible future, not a faithful past. Plausible is the right target for synthetic data. It is the wrong target for an audit.

Large geospatial models localise. Niantic’s LGM places a camera in real 3D space to centimetres; Google’s AlphaEarth Foundations compresses years of observation into a planetary embedding you can query. These read the real Earth.

Earth memory is the layer below all of them. Not a model you call, but a dated, signed record of what was actually observed at a place. It is the substrate the other three layers can be grounded on, and that you can train your own model on top of.

Chapter III

Who is building what.

The term “world model” covers two very different things: systems that generate a world, and systems that read one. The distinction is not marketing. It decides whether an output is a guess or a record.

Read the real Earth · observation-grounded

Niantic Spatial · LGM
What it is: A Large Geospatial Model, the spatial counterpart to an LLM, trained on 30B+ geolocated images.
What it does: Localises a camera to near-centimetre 6DoF and reconstructs places. Niantic itself positions LGM as “ground truth” for world models, so this is a peer in spirit, not a foil.

Google · AlphaEarth Foundations
What it is: An embedding-field model: a 10 m planetary embedding fusing optical, radar, elevation, climate, and text into one annual layer.
What it does: Observation-derived, not generative; the closest peer to what we do. We differ on provenance, temporal recall, private ownership, and training, not on “we observe, they imagine.”

Generate a world · plausible, not factual

Google DeepMind · Genie 3
What it is: A generative world model: navigable 720p worlds at 24 fps with roughly a minute of held state.
What it does: Imagines interactive environments from a prompt. DeepMind is explicit that it cannot reproduce real-world locations accurately. It renders a plausible place, not a particular one.

World Labs · Marble
What it is: Fei-Fei Li’s system for persistent, explorable 3D worlds generated from text, image, or video.
What it does: Builds worlds you can walk through. They are coherent and persistent, and they are invented, not a record of any real address.

NVIDIA · Cosmos
What it is: World foundation models for physical AI, generating synthetic training data for robots and autonomous systems.
What it does: Produces physically plausible scenarios to train policies on. The output is manufactured supervision, deliberately not an observed record.

Two camps, one shelf. Genie, Marble, and Cosmos generate plausible worlds and say so. LGM and AlphaEarth read the real one. We sit with the second camp and add the parts it leaves out: provenance you can verify, time you can replay, a memory you own, and the ability to train your own model on it. The sharpest line about the generative camp is the one DeepMind already published. A generated world is a guess about the frame after next, not a record of the one that happened.

Chapter IV

Grounding, and the hallucination problem.

A generative model that is wrong about geometry is a hallucination engine pointed at the world. Grounding is the fix. Tie every output to a verifiable observation, or do not assert it.
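
The rule in that last sentence can be sketched as a gate in front of any answer. This is a minimal illustration, not geo.qa's implementation: the `Observation` and `Answer` types, the `grounded_flood_extent` function, and the choice to average matching observations are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Observation:
    cell: str     # hypothetical cell identifier
    band: str     # e.g. "NDWI" or "SAR-VV"
    date: str     # ISO date of the observation
    value: float  # observed flood extent, 0.0 to 1.0


@dataclass(frozen=True)
class Answer:
    asserted: bool                    # False means "we do not know"
    value: Optional[float]            # None when nothing backs the claim
    sources: Tuple[Observation, ...]  # the observations the value is tied to


def grounded_flood_extent(cell: str, date: str,
                          observations: Sequence[Observation]) -> Answer:
    """Assert a value only when at least one observation backs it."""
    matching = tuple(o for o in observations
                     if o.cell == cell and o.date == date)
    if not matching:
        # No observation for this place and date: decline rather than guess.
        return Answer(asserted=False, value=None, sources=())
    # Averaging is an assumed aggregation rule, chosen only for illustration.
    mean = sum(o.value for o in matching) / len(matching)
    return Answer(asserted=True, value=mean, sources=matching)
```

The point of the shape is that "no flooding" and "no data" are different answers: the first carries sources, the second asserts nothing.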

The clearest statement of the problem this year came from Overture Maps. Without ground truth, they wrote, models invent places that do not exist, miss places that do, and attach the wrong attributes to the right locations. Their prescription is a structured, verifiable grounding layer for spatial retrieval: stable place identifiers that resolve to checkable facts.

The world-model literature has its own version of the warning. The fair critique of generative world models is not that they are useless; it is that, left ungrounded, they behave as high-fidelity hallucination engines rather than reliable simulators. They produce dynamical hallucinations: water that flows uphill for a frame, a wall that was never poured. Convincing, and not something to settle an insurance claim on.

There is a parallel failure one level up, at the agent. Agents fabricate tool executions: they report that they checked a source they never queried. Researchers have proposed tool receipts: a signed artefact proving a tool actually ran and returned what the agent claims. That is precisely the shape of a geo.qa answer.

# an ungrounded answer: fluent, unverifiable
ask was this parcel underwater on 04 May?
→ "No flooding on record for that date." (invented)

# a grounded answer: same question, with a receipt
ask was this parcel underwater on 04 May?
→ flood_extent = 0.0 across 4 observations
  cell 8a1f0b27a… · NDWI · SAR-VV · optical · 2026-05-04
  receipt 0x9f3c… · ed25519 signed · verify offline
Spec. 2. A receipt is the agent’s tool receipt for the physical world: ed25519 over canonical CBOR, BLAKE3 hashed, checkable against the public key without calling us.
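
To make the "canonical bytes, then hash" half of that recipe concrete, here is a minimal sketch using only the Python standard library. It is a stand-in, not the geo.qa receipt format: sorted-key JSON substitutes for canonical CBOR, BLAKE2b substitutes for BLAKE3, and the ed25519 signature over the digest is left out entirely, since none of the three ships with the standard library. The function names and payload fields are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(payload: dict) -> bytes:
    # Stand-in for canonical CBOR: JSON with sorted keys and no whitespace,
    # so the same facts always serialise to the same bytes.
    return json.dumps(payload, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")


def digest(payload: dict) -> str:
    # Stand-in for BLAKE3: BLAKE2b (32-byte output) from hashlib.
    return hashlib.blake2b(canonical_bytes(payload), digest_size=32).hexdigest()


def matches(payload: dict, claimed_digest: str) -> bool:
    # Constant-time comparison of the recomputed digest with the claimed one.
    return hmac.compare_digest(digest(payload), claimed_digest)
```

A verifier would recompute the digest from the answer it was handed and then check a signature over that digest against the publisher's public key. Canonical encoding is what makes the first step possible: two parties holding the same facts must arrive at the same bytes.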

Chapter V

Where geo.qa sits.

Against the generative models, the contrast is record versus imagination. Against the observation-grounded ones it is sharper and more useful: a model you call versus a memory you own, cite, and train on.

Plate v · one cell, decomposed · band × tslot
Fig. v. A planetary embedding gives you one vector per place. A memory gives you every band at that place across every tslot, and the receipt for each.
 | A geospatial model you call | geo.qa Earth memory
Output | an inferred embedding or scene | an observed fact, dated
Provenance | black-box inference | a signed receipt, verifiable offline
Time | a snapshot or annual layer | every tslot, append-only
Fusion | a fixed set of inputs | your whole fleet, one address
Ownership | a hosted service | your tenancy, airgapped
To build on | a fixed, pretrained model | a substrate you train your own on

We do not claim to be the first or only ground-truth layer. Niantic frames LGM that way for navigation; AlphaEarth is an open, observation-derived planetary embedding. Both are real, and good. The differentiation is not “we observe and they imagine.” For those two, that is false. It is in the things a planetary model you call does not give you.

Provenance you can cite. Every answer carries a signed receipt over its sources. You verify it offline against a public key; the embedding cannot hand you that.

Temporal recall. History is append-only. You can ask what one cell looked like on 4 May, then on 11 May, and diff the two. An annual layer only remembers the year.

Private ownership. The memory is your tenancy: your keys, your jurisdiction, airgapped if you need it. Not a multi-tenant API you rent.
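
The temporal recall point is easiest to see as a data structure. The sketch below is a toy in-memory version, assuming tslots are ISO date strings and values are plain floats; `AppendOnlyMemory` and its methods are hypothetical names, not the geo.qa API.

```python
from collections import defaultdict


class AppendOnlyMemory:
    """Toy store of cell x band x tslot facts that never overwrites."""

    def __init__(self):
        # (cell, band) -> {tslot: value}
        self._facts = defaultdict(dict)

    def append(self, cell, band, tslot, value):
        history = self._facts[(cell, band)]
        if tslot in history:
            # Append-only: an existing fact is never replaced.
            raise ValueError(f"{cell}/{band}/{tslot} is already recorded")
        history[tslot] = value

    def at(self, cell, tslot):
        """Every band observed for `cell` at `tslot`."""
        return {band: hist[tslot]
                for (c, band), hist in self._facts.items()
                if c == cell and tslot in hist}

    def diff(self, cell, earlier, later):
        """Per-band change between two tslots, for bands present in both."""
        a, b = self.at(cell, earlier), self.at(cell, later)
        return {band: b[band] - a[band] for band in a.keys() & b.keys()}
```

With this shape, "what did the cell look like on 4 May, then on 11 May" is two calls to `at`, and the change between them is one call to `diff`. An annual layer has a single value per place per year, so there is nothing to subtract.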

Chapter VI

Train your own world model on the memory.

This is the difference that matters most. Most of the field ships one pretrained model. geo.qa lets you train your own on your own observed memory, and own both the data and the model.

train.py
# a JEPA dynamics model, trained on your own Earth memory
from geoqa import Memory, train

mem = Memory(tenant="acme")

# predict in latent space, the way LeCun argues for, not pixels
model = train.jepa(
    memory=mem,
    bands=["optical", "SAR-VV", "thermal", "weather"],
    horizon="7d",
)

# serve inside your own tenancy; the deployed model reads the same memory
model.deploy(airgapped=True)

train.jepa(memory): predict the next latent state of a place: a world model of your own ground, learned in latent space rather than pixels
train.forecast(band): project a band’s trajectory forward, cell by cell
train.detect(classes): your own categories on your own fused imagery
model.deploy(): serve it inside your tenancy, citing the same receipts
model.evaluate(): score against held-out tslots; nothing was overwritten

The unified memory is not only something to query. Every cell × band × tslot fact is supervision: a dated, multimodal state of a real place, with its history intact. That is exactly the input a JEPA-style model wants. Yann LeCun’s argument is that the right move is to predict the next state in latent space, not to render the next pixel, and a memory of latents is the natural substrate for it.
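
One way to read "every fact is supervision" is as (state now, state one horizon later) pairs cut from a cell's history, with later tslots held out for evaluation. The sketch below assumes tslots are integer indices and a state is any per-tslot value such as a list of band readings; `make_pairs` and `temporal_split` are hypothetical helpers, not part of the `geoqa` package shown above.

```python
def make_pairs(series, horizon):
    """Build supervision pairs from one cell's history.

    series:  list of (tslot, state) tuples, tslot an integer index.
    horizon: how many tslots ahead the target sits.
    Returns (tslot, target_tslot, state, target_state) tuples.
    """
    by_slot = dict(series)
    return [
        (t, t + horizon, state, by_slot[t + horizon])
        for t, state in series
        if t + horizon in by_slot  # skip inputs with no observed target
    ]


def temporal_split(pairs, cutoff):
    """Split by time so no held-out tslot leaks into training."""
    # Train only on pairs whose target is at or before the cutoff.
    train = [p for p in pairs if p[1] <= cutoff]
    # Evaluate only on pairs whose input is after the cutoff.
    held_out = [p for p in pairs if p[0] > cutoff]
    # Pairs that straddle the cutoff are dropped from both sets.
    return train, held_out
```

Because history is append-only, the held-out tslots are the values that were originally recorded, so a score against them is a score against the record.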

So the loop closes. Fuse your fleet into one memory, train your own JEPA, forecasting, or detection model on it, and deploy that model back inside the same boundary, reading the same receipts it was trained on. You own the observations and you own the model. Nothing leaves, and every prediction can still be traced to what was actually seen. On one reference deployment we run, resolving a single cell out of a few million takes single-digit milliseconds; treat that as an illustrative figure from one box, not a quoted SLA.

Stop renting a model. Own the memory.

Let the labs imagine worlds. Build on the one that was actually observed, and train your own model on it.

geo.qa · a vortx ground decoder · emem.dev open protocol