Years ago I was part of a project where we paid a Big Three consulting firm to build a data product for us. It cost serious money. It took months. When they handed it over, it didn’t work. Not “needs some polish” didn’t work. Didn’t work in ways that mattered. We sat around a table beating our heads against the wall trying to make it function under real conditions.

Eventually I put all the data on the table and said: this didn’t work. Full stop.

We moved on. But I never forgot what it felt like to take someone’s word for “it’s ready” and be wrong.


So when I got close to shipping OIM, I had a voice in the back of my head. The doubting engineer. The one who’s seen enough delivered software to know that “it works in dev” is not the same as “it works.”

OIM processes raw ERP exports and builds a dimensional data model on a laptop: no server, no cloud, no database administrator. That is a real promise. I needed to know it was actually true before I handed it to a customer.

The stress harness is what I built to answer that question.


The design is a 25-run adversarial test built around four phases. Not 25 random runs. 25 runs that simulate a skeptical early adopter putting the product through real conditions.

Phase 1 — Warm-up (runs 1-6). Small loads. First-ever run on a blank slate. Re-run the same data to test deduplication. Clean slate again. The behavior a cautious user exhibits when they are not sure they trust the software yet.

Phase 2 — Normal cadence (runs 7-16). Steady weekly drops at scale. 1 million rows. 5 million rows. Back to 1 million. The rhythm of a real production environment.

Phase 3 — The surprise (run 17). A massive 25-million-row historical backfill. The thing nobody plans for but every customer eventually does. “We want to load the last two years.” This is where software breaks.

Phase 4 — Recovery (runs 18-25). Small drops after the big dump. Back to normal cadence. Does the system recover cleanly or does the large run leave junk behind?
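
For concreteness, here is a minimal sketch of how a run plan like this can be encoded as plain data. The names (Run, PLAN) and the structure are mine, reconstructed from the results table below, not OIM’s actual harness code:

```python
from dataclasses import dataclass

# Hypothetical encoding of the 25-run plan; names are illustrative, not OIM's.
@dataclass(frozen=True)
class Run:
    number: int    # 1..25
    phase: str     # "warm-up", "normal cadence", "surprise", "recovery"
    target: int    # rows in the simulated ERP export
    action: str    # "clean" = fresh drop, "keep" = re-run the same file (dedup test)

PLAN = [
    Run(1, "warm-up", 50_000, "clean"),
    Run(2, "warm-up", 50_000, "keep"),    # identical file to run 1
    Run(3, "warm-up", 50_000, "clean"),
    Run(4, "warm-up", 250_000, "clean"),
    Run(5, "warm-up", 250_000, "keep"),
    Run(6, "warm-up", 50_000, "clean"),
    # runs 7-16: weekly cadence at 1M and 5M
    # run 17: the 25M historical backfill
    # runs 18-25: recovery, back to normal cadence
]
```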


Each run checks the same list: the row count out matches the input, deduplication holds (nothing doubles on a re-run), the health score lands where it should, no orphaned files are left behind, and the run log grows by exactly one row.

The harness does not accept partial credit.
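
The gate itself fits in a few lines. This is a sketch with field names I made up for illustration; the point is the single boolean, with no warnings, no weights, no partial passes:

```python
from dataclasses import dataclass, field

# Field names are illustrative, not OIM's API.
@dataclass
class RunResult:
    rows_expected: int
    rows_out: int
    duplicates_found: int              # rows that doubled on a re-run
    health_score: float
    orphaned_files: list = field(default_factory=list)
    log_rows_before: int = 0
    log_rows_after: int = 0

def run_passes(r: RunResult) -> bool:
    """All checks must hold. One failure fails the run: no partial credit."""
    return (
        r.rows_out == r.rows_expected
        and r.duplicates_found == 0
        and r.health_score > 0          # score was actually computed
        and not r.orphaned_files
        and r.log_rows_after == r.log_rows_before + 1
    )
```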


Here are the results.

| Run | Phase | Target | Action | Duration | Rows | Health Score |
|----:|-------|-------:|--------|---------:|-----:|-------------:|
| 1 | Warm-up | 50K | Clean | 14.7s | 50,585 | 89.02 |
| 2 | Warm-up | 50K | Keep (dedup) | 10.6s | 50,585 | 89.02 |
| 3 | Warm-up | 50K | Clean | 11.4s | 50,499 | 89.05 |
| 4 | Warm-up | 250K | Clean | 29.2s | 252,751 | 88.60 |
| 5 | Warm-up | 250K | Keep (dedup) | 26.8s | 252,751 | 88.60 |
| 6 | Warm-up | 50K | Clean | 12.6s | 50,442 | 88.95 |
| 7 | Normal cadence | 1M | Clean | 51.5s | 1,011,819 | 88.44 |
| 8 | Normal cadence | 1M | Keep (dedup) | 45.4s | 1,011,819 | 88.44 |
| 9 | Normal cadence | 1M | Clean | 67.5s | 1,010,862 | 88.35 |
| 10 | Normal cadence | 1M | Clean | 50.8s | 1,011,568 | 88.44 |
| 11 | Normal cadence | 5M | Clean | 181.7s | 5,056,452 | 88.24 |
| 12 | Normal cadence | 5M | Keep (dedup) | 167.9s | 5,056,452 | 88.24 |
| 13 | Normal cadence | 5M | Clean | 166.7s | 5,057,285 | 88.24 |
| 14 | Normal cadence | 1M | Clean | 65.7s | 1,011,760 | 88.36 |
| 15 | Normal cadence | 5M | Clean | 179.9s | 5,057,987 | 88.25 |
| 16 | Normal cadence | 5M | Clean | 175.7s | 5,058,227 | 88.17 |
| 17 | The surprise | 25M | Clean | 975.1s | 25,295,142 | 88.12 |
| 18 | Recovery | 1M | Clean | 74.6s | 1,010,909 | 88.43 |
| 19 | Recovery | 1M | Keep (dedup) | 68.2s | 1,010,909 | 88.43 |
| 20 | Recovery | 250K | Clean | 40.7s | 252,493 | 88.76 |
| 21 | Recovery | 5M | Clean | 178.1s | 5,060,132 | 88.23 |
| 22 | Recovery | 5M | Keep (dedup) | 160.1s | 5,060,132 | 88.23 |
| 23 | Recovery | 1M | Clean | 66.8s | 1,011,114 | 88.45 |
| 24 | Recovery | 5M | Clean | 177.4s | 5,060,288 | 88.22 |
| 25 | Recovery | 1M | Clean | 66.7s | 1,011,185 | 88.47 |

25/25 passed. 75.8 million rows processed. Total elapsed: 76 minutes.


A few things worth noting in that table.

The dedup runs hold. Run 2 re-runs the exact same 50K rows as Run 1. Same row count out. Same health score. Nothing doubles. The same logic holds at 1M (Runs 7/8) and 5M (Runs 11/12, 21/22). Deduplication on work_order_id + operation_sequence, with the latest export timestamp winning, works every time.
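
In pandas terms, that rule looks roughly like this. The column names follow the text above (work_order_id, operation_sequence, plus an assumed export_timestamp column); the code is my sketch of the rule, not OIM’s implementation:

```python
import pandas as pd

def dedupe(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Keep one row per (work_order_id, operation_sequence);
    the row from the latest export wins."""
    return (
        pd.concat(frames, ignore_index=True)
          .sort_values("export_timestamp")             # oldest first...
          .drop_duplicates(
              subset=["work_order_id", "operation_sequence"],
              keep="last",                             # ...so the latest survives
          )
    )
```

Feed the same file through a rule like this twice and you get the same frame back, which is exactly what the identical row counts and health scores in Runs 1/2 show.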

Run 17 is the one that would break a system that wasn’t designed for it. 25 million rows. 975 seconds. Not fast, but correct. The health score comes out at 88.12. Compare that to Run 1 at 89.02. Two years of production-scale history in one pass and the data quality signal barely moves.

Recovery is clean. Run 18 comes after the 25M dump. It runs a 1M clean drop. Seventy-four seconds. Health score 88.43. Nothing from the big run lingered that should have been cleared. No orphaned files. The log grows by exactly one row.


What I was really looking for was drift. Does the runtime creep? Does the health score degrade? Does something leak across runs that poisons the next one?

None of that happened.

The health score range across 25 runs is 88.12 to 89.05. Less than a point of spread across 75 million rows and four different phases. That is not an accident. That is what consistent data quality logic looks like under pressure.
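
That claim is easy to check mechanically. Here are the 25 health scores straight from the table, with the kind of drift check the harness effectively performs (the one-point threshold is my illustration):

```python
# Health scores from runs 1-25 (see the table above).
health = [
    89.02, 89.02, 89.05, 88.60, 88.60, 88.95,                 # warm-up
    88.44, 88.44, 88.35, 88.44, 88.24, 88.24,
    88.24, 88.36, 88.25, 88.17,                               # normal cadence
    88.12,                                                    # the surprise
    88.43, 88.43, 88.76, 88.23, 88.23, 88.45, 88.22, 88.47,   # recovery
]

spread = max(health) - min(health)
assert spread < 1.0, f"health score drifted by {spread:.2f} points"
print(f"{min(health):.2f} to {max(health):.2f}, spread {spread:.2f}")
# -> 88.12 to 89.05, spread 0.93
```

The same check on runtimes shows no creep either: the 5M clean runs stay between 166.7s (Run 13) and 181.7s (Run 11) from first to last, and the recovery-phase 5M runs are indistinguishable from the normal-cadence ones.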


I went back and read the notes from that consulting engagement years later. The problem was not that the consultants were incompetent. The problem was that nobody had ever put the software through conditions it would actually face. It passed the demo. It failed the floor.

The stress harness exists so that does not happen with OIM.

When I hand this to a customer and they ask, “Does this actually work?”, the answer is not “we think so.” The answer is 25/25. Seventy-six million rows. Laptop only. Every check passed.

That is what ready means.