April 20, 2026
•
AI & Machine Learning
What Is Engineering Data Infrastructure and Why AI Fails Without It
Most engineering AI pilots fail at the infrastructure layer, not the model. What engineering data infrastructure is and why it matters.
I have spent the last four years building data infrastructure for engineering teams at automotive, aerospace, and industrial manufacturers. In that time, I have watched dozens of AI pilots succeed and dozens more fail. The failures all have the same shape.
It is never the model.
It is the layer underneath, the infrastructure that should have prepared the data for AI in the first place. Most companies do not have a name for this layer, so they do not invest in it. And then they spend six months on an AI pilot that never makes it past the prototype stage. McKinsey’s 2025 State of AI research confirms this pattern at scale: nearly two-thirds of organisations remain stuck in pilot mode, and only about 5.5% are generating meaningful financial returns from AI.
This article covers what engineering data infrastructure actually is, the three foundation types engineering orgs currently operate on, and how to know which one you are on.
The problem engineers don’t name
Every engineering team I meet has the same problem. They just do not call it the same thing.
A VP of Engineering calls it "data silos." A CAE engineer calls it "manual preprocessing." A Head of R&D calls it "slow time to insight." A data scientist hired to build AI models calls it "our data isn’t ready yet."
They are all describing the same issue. Simulation data lives in solver outputs. Test data lives in measurement systems. Manufacturing data lives in MES and SPC databases. Design intent lives in PLM. Every one of these systems produces information the others cannot read, and every cycle, someone has to bridge the gap by hand.
When I ask VPs what this costs them, almost none have calculated it. When we do the math together on a call (engineers per team × hours per cycle × cycles per year × fully loaded cost), the number is consistently between $2M and $6M per year for a mid-size engineering org. And that is just the waste. It does not count the AI initiatives that never get off the ground because the data is not ready.
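That back-of-envelope calculation is simple enough to sketch in a few lines. All four inputs below are illustrative assumptions; substitute your own team's numbers.

```python
# Back-of-envelope annual cost of manual data prep. All inputs are
# illustrative assumptions -- plug in your own organisation's figures.
def annual_data_prep_cost(engineers, hours_per_cycle, cycles_per_year,
                          loaded_hourly_cost):
    """Engineers per team x hours per cycle x cycles per year x cost."""
    return engineers * hours_per_cycle * cycles_per_year * loaded_hourly_cost

# Example: 25 engineers, 40 hours of data prep per design cycle,
# 20 cycles a year, $120/hour fully loaded.
cost = annual_data_prep_cost(25, 40, 20, 120)
print(f"${cost:,.0f} per year")  # $2,400,000 per year
```

Even with conservative inputs, the result usually lands well inside the $2M–$6M range above.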
What is "engineering data infrastructure"?
In software, infrastructure is well understood. Servers, databases, networks, the layer between the application and the hardware. You do not build an application and then figure out where it runs. You build on top of infrastructure that was designed first.
Engineering data has never had this. Every engineering team builds their data handling bottom-up, one workflow at a time, against whatever tools they happen to have licenses for. The result is not infrastructure. It is a pile of scripts and shared drives. NAFEMS, the international body for engineering simulation, has been documenting this pattern for two decades; they describe the typical shared-drive setup as "a huge digital landfill" when the structure is not enforced from the start.
Engineering data infrastructure means the layer that sits between the tools engineers use (solvers, test benches, PLM, MES) and the outputs engineers need (analysis, dashboards, AI models, reports). It extracts data from every source engineers generate it in, standardises schemas and units and metadata on ingest, stores results queryably rather than as flat files, and tracks lineage so you know which simulation generated which dataset.
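To make "standardise on ingest" concrete, here is a minimal, hypothetical sketch of a canonical record and an ingest step. The schema, field names, and unit table are invented for illustration; a real platform would cover far more quantities, units, and sources.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical canonical record: every source (solver, test bench, MES)
# is normalised into this one shape on ingest. Field names are illustrative.
@dataclass
class ResultRecord:
    source: str                 # e.g. "openfoam", "test_bench_07"
    quantity: str               # canonical name, e.g. "max_von_mises_stress"
    value: float                # always stored in SI units
    unit: str                   # unit after conversion
    lineage: dict = field(default_factory=dict)  # which run produced this
    ingested_at: str = ""

# Unit conversions to SI, applied once at ingest
# instead of re-derived in every downstream script.
TO_SI = {"MPa": ("Pa", 1e6), "mm": ("m", 1e-3), "Pa": ("Pa", 1.0)}

def ingest(source, quantity, value, unit, lineage):
    si_unit, factor = TO_SI[unit]
    return ResultRecord(
        source=source,
        quantity=quantity,
        value=value * factor,
        unit=si_unit,
        lineage=lineage,
        ingested_at=datetime.now(timezone.utc).isoformat(),
    )

rec = ingest("openfoam", "max_von_mises_stress", 212.0, "MPa",
             {"run_id": "run-0042", "mesh": "fine"})
print(rec.value, rec.unit)  # 212000000.0 Pa
```

The point of the sketch is the shape of the work: units and schemas are normalised once, at the boundary, and lineage travels with the data, so downstream analysis and AI training never have to reverse-engineer what a file meant.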
The 3 levels of data infrastructure engineering organisations actually operate on
Most engineering organisations are running on one of three foundation types. Understanding which one you are on matters more than any tool decision you will make this year.
Level 1 - No infrastructure (shared drives + tribal knowledge)
Simulation results land in folders on a shared drive. Metadata lives in file names and spreadsheets. Every analysis starts by finding the right file, figuring out what conventions the last engineer used, and rebuilding preprocessing logic from memory. This is where most engineering orgs still operate. NAFEMS research confirms that across industry, usage of generic information systems like shared drives to manage simulation data is essentially universal among organisations that have not yet scaled simulation.
Pros
- Zero tooling cost and near-zero setup time.
- Engineers can keep using whatever solver workflow they already have.
- No vendor dependency or platform lock-in.
- Fine for a single team of 1–3 engineers running one program.
Cons
- Rebuild cycle every program: 3–6 hours per design variant.
- Knowledge walks out when engineers leave. Every team change is a reset.
- No queryable layer; every comparison is a manual file search.
- AI and surrogate modelling are not viable on top of this.
- Breaks completely at 5+ engineers or multi-program workloads.
Verdict: The default. Works for very small teams on short horizons. Fails the moment scale or AI ambition enters the conversation.
Level 2 - Partial infrastructure (in-house scripts + PLM extension)
A senior engineer builds Python scripts to automate exports. IT extends the existing PLM (Teamcenter, Windchill, 3DEXPERIENCE) to store simulation artefacts. This is where most organisations land when they outgrow shared drives but have not evaluated purpose-built options.
Pros
- Leverages existing PLM investment and licensing.
- Good lineage and governance for simulation artefacts stored there.
- Custom scripts solve 60–70% of the manual preprocessing burden for one team.
- Acceptable for organisations where simulation is tightly coupled to CAD/design data.
Cons
- PLM systems were built for CAD data, not simulation data; they struggle with multi-gigabyte solver outputs and proprietary formats.
- In-house scripts concentrate knowledge in 1–2 people; when they leave, the pipeline often dies.
- Deployments typically take 12–18 months and run over budget.
- Does not extend to test data or manufacturing data, creates new silos.
- User experience for CAE engineers is often poor; they avoid the system, defeating the purpose.
Verdict: Bridge solution. Works for organisations with heavy PLM investment and moderate simulation data volume. Rarely the foundation that scales to AI readiness.
Level 3 - Full infrastructure (purpose-built engineering data layer)
A purpose-built platform designed specifically for engineering data: simulation, test, and manufacturing. It connects directly to Ansys, Siemens Simcenter, Altair, Hexagon, OpenFOAM, and test benches; standardises schemas, units, and metadata on ingest; and stores results in a queryable layer. Workflows are reusable across programs. This is the foundation the 5% of engineering AI pilots that reach production are running on.
Pros
- Engineering-native: built for solver outputs, not CAD files.
- Connects to existing tools without changing how engineers work.
- Deployment in weeks, not 12–18 months.
- Knowledge survives team changes; workflows are institutional assets.
- Queryable data makes AI, surrogates, and dashboards actually viable.
- Typical measurable ROI within one program cycle.
Cons
- Newer category, fewer public case studies than PLM extensions.
- Requires initial setup to connect solver environments and define canonical schemas.
- Not right for teams of 1–3 engineers on a single program.
- Paid platform (though typically 5–10× cheaper than PLM extensions).
Verdict: The foundation engineering organisations building for the next decade are investing in. Key Ward is in this category.
Every engineering AI pilot we see has a data infrastructure problem
The 95% pilot failure stat is widely cited but rarely interrogated. When you look at why pilots fail, there is a pattern.
The data scientist hired to build the model spends 80% of their time not on the model but on data prep: extracting from solvers, reformatting, aligning variables, filling gaps. The pilot works, narrowly, on one dataset. Then the team tries to extend it to a new program. None of the preparation logic carries over. The next engineer has to do it all again, from scratch, differently.
This is not a model problem. The model was fine. It is an infrastructure problem. McKinsey’s research on scaling AI in manufacturing puts it bluntly: companies cite workflow, data, and operating-model blockers as the primary reasons AI stalls at pilot stage, not model capability. The 5% of pilots that reach production all built the infrastructure first. Then the models were easy.
This is the same conclusion we document in our post on why 95% of engineering AI pilots fail: the separator between the successful 5% and the failing 95% is almost always the foundation that was built, or not built, before the pilot started.
How to overcome the engineering data infrastructure problem?
If you are reading this as a VP or a Head of R&D, the first thing to do is not buy anything. It is to calculate the real cost of not having this layer. Take a single engineering program, count the people, count the hours spent on data prep, multiply it out over the year. The number is almost always larger than the AI budget being debated in the same room.
Then pick one workflow, the one that costs your team the most per cycle, and build the infrastructure for that one workflow first. Do not try to do everything. This aligns with NAFEMS deployment guidance: start with a bounded scope, prove value, expand from there. Attempting a full enterprise rollout from day one is the most common failure pattern across 20 years of simulation data management projects.
For the full picture of what the infrastructure layer does for specific workflows, see our CAE data management deep-dive or the test data management post.

Stop running AI pilots on infrastructure that can’t support them
Two ways to move forward; pick what fits where you are.



