Every manufacturer who hears about the OIM eventually asks the same question: what happens if our data is messy?

The honest answer is that it doesn’t matter. Not because OIM ignores bad data, but because bad data is the default assumption. OIM was designed from the beginning for data that arrives broken, incomplete, or inconsistent. The question is not whether your data is dirty. The question is how far into the pipeline the problem gets before someone is told about it clearly.


ERP exports break in predictable ways.

The wrong report gets exported. A column gets renamed after an upgrade. Someone exports an empty date range and doesn’t notice. Someone drops two months of data when only one was needed. A Windows machine saves a file with a silent encoding mark that mangles the first column name. An operation’s end time is earlier than its start time because of a timezone mismatch.

None of these are edge cases. They are the normal operating conditions of a manufacturing floor where ERP administration is a part-time responsibility and nobody has time to audit every export before it drops into the folder.

The design philosophy for OIM’s ingestion layer is borrowed from the product’s core premise: the person running this is not a data engineer. They are a line supervisor, an ops admin, or whoever has their hand on the mouse that day. When something breaks, the system owes them an explanation they can act on, not a Python traceback they can’t read.


Pre-flight validation runs before OIM touches a single row of data.

There are fourteen checks. They run in order. The first failure stops everything, raises a plain-English error, and tells the user exactly what happened, why it usually happens, and what to do next. No partial writes. No corrupt output. The gate holds.
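The gate described above can be sketched as an ordered runner: each check either passes or returns a message, and the first failure stops everything. This is an illustrative sketch, not OIM's actual code; the check protocol, names, and error wording are assumptions.

```python
class ValidationError(Exception):
    """Raised with what happened, why it usually happens, and what to do next."""

def run_preflight(batch, checks):
    """Run checks in order. The first failure raises; nothing is written."""
    for check in checks:
        problem = check(batch)  # each check returns None, or a plain-English message
        if problem:
            raise ValidationError(problem)  # the gate holds: no partial writes
```

Because the runner raises on the first failure, a user sees one clear problem at a time instead of a wall of cascading errors, and no downstream write ever starts.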

A few that matter:

Encoding detection. Before DuckDB reads anything, the first four bytes of every file are checked. An Excel file renamed to .csv has magic bytes that identify it immediately. UTF-8 BOM — a hidden formatting mark that older Windows software prepends silently — will mangle the first column name in a way that produces a misleading “missing column” error downstream if not caught here. The check doesn’t guess. It reads the bytes and names the problem.

Report header detection. Many ERP systems prepend report title rows before the actual column headers. “Epicor Kinetic — Production Operations Report / Generated: 2024-01-15” sitting on top of your data is not a problem OIM passes back to the user. The pipeline scans forward row by row until it finds the row where the required columns are present, records the skip count per file, and proceeds. If two files in the same batch have different skip counts — different export configurations — the error names the specific files and explains why they can’t be loaded together.
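The forward scan can be sketched as follows. The required column names and the scan limit are assumptions for illustration; the real pipeline records the skip count per file and compares counts across the batch.

```python
import csv
import io

REQUIRED = {"work_order_id", "operation_sequence", "machine"}
MAX_SCAN = 20  # stop scanning after this many rows and raise a clear error

def find_header_row(text: str) -> int:
    """Return the 0-based index of the row containing the required columns."""
    rows = csv.reader(io.StringIO(text))
    for i, row in enumerate(rows):
        if i >= MAX_SCAN:
            break
        cells = {c.strip().lower() for c in row}
        if REQUIRED <= cells:  # every required column is present in this row
            return i
    raise ValueError(
        f"No row containing the required columns {sorted(REQUIRED)} "
        f"was found in the first {MAX_SCAN} rows."
    )
```

The return value is the per-file skip count: a title row and a generated-date row above the real header yield 2, and two files with different counts in one batch signal different export configurations.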

Grain integrity. Every row needs a work_order_id and operation_sequence. These are the composite key that the dedup logic runs on. A NULL in either field doesn’t produce an obvious error — it produces silent data loss, because dedup treats NULLs as equal and collapses all NULL-grain rows into a single survivor. The check catches this before the write and routes it back to the user.
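A minimal sketch of the grain check, assuming rows arrive as dictionaries; the field names come from the text, the row shape is an assumption.

```python
def check_grain(rows: list[dict]) -> list[int]:
    """Return indices of rows whose composite key is incomplete.

    A NULL in either key field would not error out downstream; dedup would
    treat NULLs as equal and silently collapse all NULL-grain rows into one.
    """
    bad = []
    for i, row in enumerate(rows):
        if row.get("work_order_id") in (None, "") or \
           row.get("operation_sequence") in (None, ""):
            bad.append(i)
    return bad
```

Returning the offending row indices, rather than just a boolean, is what lets the error message point the user at the exact rows to fix.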

Reversed timestamps. state_end_time before state_start_time. This happens. Timezone mismatches, midnight crossings, planned end times mixed with actual start times. Duration-based KPIs — utilization, OEE, shift analysis — are downstream of these columns. A negative duration doesn’t raise an error. It flows to the Silver layer as a negative number and quietly deflates every metric that touches it. The validation catches it at the gate.
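The reversed-timestamp gate reduces to a single comparison per row. Column names come from the text above; the row shape is an assumption.

```python
from datetime import datetime

def check_durations(rows: list[dict]) -> list[int]:
    """Return indices of rows where state_end_time precedes state_start_time.

    Left uncaught, such a row flows downstream as a negative duration and
    quietly deflates every duration-based KPI that touches it.
    """
    return [
        i for i, row in enumerate(rows)
        if row["state_end_time"] < row["state_start_time"]
    ]
```

Note that a midnight crossing recorded in two different timezones produces exactly this signature, which is why the check compares the timestamps directly rather than looking for a specific magnitude of error.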

The full list covers ten more scenarios. The point is not the length of the list. The point is that every check was written in response to something that actually happens in production ERP environments.



Validation gets you clean data at the gate. But some problems survive validation.

A machine name might be present, properly formatted, and still carry no usable information. TBD. UNKNOWN. PLACEHOLDER. A field that someone meant to fill in and never did. A field that the ERP defaulted to a junk value when no selection was made.

These values pass validation because they aren’t structurally broken. But they can’t be used for anything either.

This is where normalization comes in.

OIM runs a macro called normalize_val() on every dimensional value before anything gets mapped. The macro processes each value through a strict, ordered set of conditions and returns one of three results: a clean identity key ready for downstream mapping, or one of two sentinel values that identify why it failed.

The distinction between the two failure sentinels is the part that matters most.

'Needs Tagged' is an action item. One field, one machine, one trip to the ERP. An operator can fix it in five minutes. 'Unclassified' is an investigation. Something is wrong at a level that requires someone with data access to understand before any action makes sense.

Collapsing both into a single “Action Required” status loses the routing signal entirely. The operator sees a list of things to fix and doesn’t know which ones are five-minute jobs and which ones require a call to IT. The distinction is not cosmetic. It determines whether the right person is looking at the right problem.
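normalize_val() itself is a DuckDB macro; this Python sketch mirrors the ordered decision it describes. The junk-token list and the cleaning rule are illustrative assumptions, not the macro's actual conditions.

```python
JUNK_TOKENS = {"tbd", "unknown", "placeholder", "n/a", "none", "-"}

def normalize_val(value) -> str:
    # Structurally absent: someone with data access needs to investigate.
    if value is None:
        return "Unclassified"
    cleaned = str(value).strip().lower()
    if cleaned == "":
        return "Unclassified"
    # Present and well-formed, but carries no information: a five-minute
    # fix in the ERP for whoever owns that field.
    if cleaned in JUNK_TOKENS:
        return "Needs Tagged"
    # Clean identity key, ready for downstream mapping.
    return cleaned
```

The order of the conditions is the point: structural failures are ruled out first, junk values second, and only what survives both becomes an identity key.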


The downstream result of this is a mapping_status on every dimension record. Mapped. Needs Tagged. Unclassified. Every record has one. The data quality views in OIM surface them by facility, by production line, by machine, by operation. The plant manager can see at a glance how much of their data is clean, how much needs a simple tag, and how much needs deeper investigation.
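The at-a-glance rollup those views provide amounts to counting statuses per grouping. In OIM this lives in SQL views; the Python sketch below, with an assumed record shape, shows the same aggregation.

```python
from collections import Counter, defaultdict

def quality_rollup(records: list[dict]) -> dict[str, Counter]:
    """Count each mapping_status per facility.

    The same grouping can be run by production line, machine, or operation;
    only the grouping key changes.
    """
    by_facility: dict[str, Counter] = defaultdict(Counter)
    for rec in records:
        by_facility[rec["facility"]][rec["mapping_status"]] += 1
    return dict(by_facility)
```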

This is what it means to handle dirty data gracefully. Not pretend it isn’t there. Not fail silently and let it corrupt the numbers. Surface it. Classify it. Route it to the person who can fix it.

The data quality problem in manufacturing is not a technology problem. It is a visibility problem. Nobody fixes what they can’t see. The OIM makes it visible.