The data flywheel we haven’t built yet

Construction knows it needs to collect data.

It doesn’t yet know how.

Not really. We generate enormous amounts of it: models, drawings, specs, site observations, sensor feeds, cost records. But most of it is never captured with intent. It accumulates in silos, in formats that don’t talk to each other, carrying labels that mean different things to different teams. We haven’t agreed on what “quality” means for a construction dataset. We haven’t mapped the relationships between processes, outcomes, and the decisions that connect them.

Karol Hausman, co-founder of Physical Intelligence, spent the last few years solving a deceptively similar problem for robots. To teach a robot to pick up a cup, you can’t simulate the physics of the world; it’s too complex, too contextual. Instead, you collect real-world data, reach a deployable threshold, and then let deployed robots collect more data while doing useful work. Models improve. More robots are deployed. More data flows in. A flywheel.

The insight isn’t about robots. It’s about what happens when you finally understand what data you actually need and start collecting it with that in mind.

He also learned something humbling: “quality data” and “diverse data” are easy to say and almost impossible to define until you start building. You don’t theorise your way to a definition. You run experiments, the definitions sharpen, and you double down on what works.

AEC needs to go through exactly that process.

That’s the ontology problem. Not a technical problem, a literacy problem.

The industry needs to learn what it means to define a thing cleanly, to connect it to related things, and to make those connections machine-readable. That work isn’t glamorous. It doesn’t ship as a product. But it’s the substrate everything else runs on. Without it, AI can chat about your documents. It cannot reason about your buildings, your projects, or your institutional knowledge.

We can start now without waiting for perfect infrastructure. The unstructured data is already there: documents, conversations, drawings, reports. Protocols like MCP (the Model Context Protocol) let us start surfing it today. But surfing unstructured data and building structured knowledge aren’t alternatives. They’re phase one and phase two of the same loop.

Phase one: learn what’s in the data you already have.

Phase two: capture the next project’s data using the definitions from phase one.

Each project feeds the next.

That’s the flywheel construction hasn’t built yet.
