Do your users trust your data? More importantly…do YOU trust your data?
How do you handle poor data quality?
I've read Data Vault 2.0 and really liked the idea of a business fingerprint. You don't miss anything. However, at scale that comes at a cost: to build a business fingerprint, you take a DISTINCT across every dimension for every row. At 50M rows, that significantly hurts query speed and efficiency.
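For context, this is my read of what that scan looks like in SQL (the table and column names here are made up for illustration):

```sql
-- Hypothetical illustration of the fingerprint-style scan:
-- one DISTINCT across every dimension at once.
-- Thorough, but expensive at 50M rows.
select distinct
    region,
    product_line,
    sales_channel,
    customer_segment
from sales_fact
```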
With what I was working on, I didn’t want to pay that cost, but still wanted a viable solution. Here’s what I wound up doing:
- Selecting every dimension, grouping by each column independently, then UNION-ing everything together into one clean result (see the first sketch after this list)
- Along with that result, applying a normalizing macro that classifies every value (see the macro sketch below):
  - Mapped | clean, ready to report
  - Unclassified | missing or junk (NULL, 'N/A')
  - Needs Tagged | poor data entry
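To make the first bullet concrete, here's a rough sketch of the pattern, again with hypothetical table and column names. Each GROUP BY scans a single column, which is far cheaper than one DISTINCT across all of them:

```sql
-- Hypothetical sketch: profile each dimension independently,
-- then stack the results into one tidy output.
-- Casting to varchar keeps the UNION types consistent.
select 'region' as dimension, cast(region as varchar) as value, count(*) as row_count
from sales_fact
group by region

union all

select 'product_line', cast(product_line as varchar), count(*)
from sales_fact
group by product_line

union all

select 'sales_channel', cast(sales_channel as varchar), count(*)
from sales_fact
group by sales_channel
```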
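And a minimal version of what the normalizing macro could look like, assuming dbt-style Jinja. The macro name, the junk-value list, and the `dim_value_map` mapping table are all my own placeholders, not the author's actual code:

```sql
{% macro classify_value(column_name) %}
    case
        -- Unclassified: missing or known junk placeholders
        when {{ column_name }} is null
          or trim(cast({{ column_name }} as varchar)) in ('', 'N/A', 'NULL', 'UNKNOWN')
            then 'Unclassified'
        -- Mapped: value exists in an approved mapping table
        when {{ column_name }} in (select mapped_value from dim_value_map)
            then 'Mapped'
        -- Needs Tagged: a real value we don't recognize (poor data entry)
        else 'Needs Tagged'
    end
{% endmacro %}
```

Called from a model as `{{ classify_value('region') }} as classification`, every dimension gets classified with one shared set of rules.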
The result: bad data is easy to identify and fix. The analyst can surface it quickly, and the data owner knows exactly what to go clean up.
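For instance, surfacing what needs attention could be as simple as this, assuming the profiled output lands in a table like `data_quality_profile` (another placeholder name):

```sql
-- Hypothetical: pull everything that isn't clean, worst offenders first
select dimension, value, classification, row_count
from data_quality_profile
where classification <> 'Mapped'
order by row_count desc
```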
Bottom line: You don’t have to follow the fanciest method out there. You just have to find what works for you and your situation.