The first successful run was 2.5 million rows.

One clean Parquet file in. Thirteen tables out. Dimension and fact tables. Full Kimball. A mapping table with a business fingerprint to keep the architecture lossless. Bronze to Silver to Gold, the full medallion stack, running locally on a laptop with no server, no cloud, no infrastructure beyond a Python virtual environment.

I watched the terminal print the export manifest and felt something I had not felt in a while. It actually worked.


That first run was proof of concept. The architecture held. The pipeline did what it was supposed to do. It ingested the raw data, built the dimensional model in DuckDB, and exported 13 Parquet files to a gold layer that any BI tool could read.
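
The export step is the simplest part to show. A minimal sketch, assuming a local DuckDB file and a `gold` schema holding the finished tables; both names are illustrative:

```python
from pathlib import Path
import duckdb

# Assumed layout: a local DuckDB file holding the finished dimensional
# model in a schema named "gold". Both names are illustrative.
con = duckdb.connect("warehouse.duckdb")
Path("gold").mkdir(exist_ok=True)

tables = con.execute(
    "SELECT table_name FROM information_schema.tables "
    "WHERE table_schema = 'gold'"
).fetchall()

for (table,) in tables:
    # One Parquet file per table; any BI tool can read these directly.
    con.execute(f"COPY gold.{table} TO 'gold/{table}.parquet' (FORMAT parquet)")
```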

But 2.5 million rows is not a stress test. A mid-market manufacturer running two or three facilities and a few years of history can generate that in a month. The question was whether the architecture would hold at real scale. Not demo scale. Not “it works on my machine” scale. The scale a customer actually brings on day one when they say “we want to load everything.”

That number is 50 million rows. Ten gigabytes of CSVs. De-dupe pressure. Extra columns. Extra header rows. Schema variations across files. The kind of data that comes out of a real ERP export that nobody has cleaned.
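
DuckDB, already in the stack, can absorb most of that mess at read time. A sketch of the read options that handle it, with the path and table name as assumptions rather than the pipeline's real settings:

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE SCHEMA IF NOT EXISTS bronze")

# union_by_name lines up files whose columns differ or arrive in a
# different order; ignore_errors drops rows that fail to parse, which
# catches stray header rows repeated mid-file. The path is illustrative.
con.execute("""
    CREATE OR REPLACE TABLE bronze.raw_orders AS
    SELECT *
    FROM read_csv('exports/*.csv',
                  header = true,
                  union_by_name = true,
                  ignore_errors = true)
""")
```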


The first attempt at 50 million rows took an hour and forty minutes.

That is not a failure. It ran. It produced correct output. Every table built, every dimension populated, every fact row accounted for. But an hour and forty minutes is a long time to watch a progress bar. Long enough that a customer might wonder if something is wrong. Long enough that running a second pass after fixing a mapping issue becomes a half-day exercise.

The bottleneck was the ingestion layer. The original pipeline was written in Python with pandas doing the heavy lifting on the CSV reads and transformations. Pandas is not designed for this. It parses eagerly, holds everything in memory, and grinds single-threaded through the kind of aggregation and deduplication a dimensional model requires.
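
The shape of that original approach, reduced to a sketch; the file pattern and key columns are invented for illustration:

```python
import glob
import pandas as pd

frames = []
for path in glob.glob("exports/*.csv"):
    # Each file is parsed eagerly and materialized in full.
    frames.append(pd.read_csv(path, dtype=str))

# The entire concatenated dataset has to fit in RAM before
# deduplication can even start.
raw = pd.concat(frames, ignore_index=True)
deduped = raw.drop_duplicates(subset=["order_id", "line_no"])
```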

DuckDB was already in the stack. The answer was to push more of the work into DuckDB itself.


The rewrite moved the core ingestion logic from pandas to native DuckDB SQL. Instead of reading CSVs into DataFrames and transforming them in Python, the pipeline now reads CSVs directly into DuckDB, runs the transformation and deduplication logic in SQL, and writes the output in a single pass.
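
A minimal sketch of the new shape. The table, column, and path names are invented; the pattern is what matters:

```python
import duckdb

con = duckdb.connect()

# Read, dedupe, and write in one pass. QUALIFY keeps the newest row per
# business key; COPY streams the result straight to Parquet.
con.execute("""
    COPY (
        SELECT *
        FROM read_csv('exports/*.csv',
                      header = true,
                      union_by_name = true)
        QUALIFY row_number() OVER (
            PARTITION BY order_id, line_no
            ORDER BY exported_at DESC
        ) = 1
    ) TO 'silver/orders.parquet' (FORMAT parquet)
""")
```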

DuckDB is purpose-built for this. Columnar storage, vectorized execution, projection and predicate pushdown. It parses the CSVs in parallel and carries forward only the columns each query actually needs. The aggregation and dedup logic that took pandas minutes runs in seconds.

The same 50 million rows that took an hour and forty minutes came down to under forty minutes. A 2.5x improvement without changing the architecture, without adding infrastructure, without touching the output format.


The numbers from the 50M run: 50 million rows in, ten gigabytes of raw CSVs. Thirteen tables out, exported as Parquet. Under forty minutes end to end. One laptop, one Python virtual environment, no server.

Production-grade analytics off of CSVs.


The reason this matters is not the number itself. It is what the number means for a manufacturer who cannot justify a server.

The conversation with enterprise software vendors always ends the same way. You need more compute. You need a cloud environment. You need a data warehouse. You need a team to manage it.

The OIM’s answer is: no, you don’t. You need a laptop and your ERP export. That’s it.

That is not a compromise. That is the point. The architecture was designed from the beginning to run where the customer already is, on hardware they already own, without asking them to buy anything new.

The 50M run was not a benchmark. It was proof that the promise was real.