When importing data into your data warehouse, you will almost certainly encounter data quality errors at many steps of the ETL pipeline.
An extraction step may fail due to a connection error to your source system. The transform step can fail due to conflicting record types. And even if your ETL succeeds, anomalies can emerge in your extracted records – null values throwing off stored procedures, or unconverted currencies inflating revenue.
How do you catch these errors proactively, and ensure data quality in your data warehouse?
Here at Bolt we have experience automating processes to ensure that the data in your warehouse stays trustworthy and clean. In this post we outline 7 simple rules you can use to ensure data quality in your own data warehouse. We used rules like these at Optimizely with great results.
Rule 1: COUNT of new records added each day > 0
A common data anomaly analysts encounter is the output of their report suddenly dropping to 0 (like the chart above).
By far, the most common culprit I found for this was that there were no new records added to the respective table that day.
An easy way to check for this is to ensure the COUNT of new records in a table every day is > 0. If there are missing records on a given day, it’s likely there was an error in your extraction step of the ETL.
Rule 2: Percent of new records with NULL or 0 values < 99%
The second most common culprit I’ve encountered for a drastic drop in a report is due to NULL or 0 values.
To check for this, simply set a recurring script to COUNT the number of new records in a table every day with NULL or 0. If you see a drastic increase in the number of records with such values, it is possible there was a transformation error or some upstream anomaly in your source system.
Rule 3: COUNT of new records added each day = 7-day avg +- 25%
More often than not, you’ll find a huge spike or dip in your traffic or revenue. Rules 1 and 2 check out, and you know the issue isn’t due to missing or NULL records.
There are a multitude of explanations for such fluctuations. There were duplicate records fetched from a source system. Or the transformation step of your ETL failed for a specific segment of users. Or it could be that there was a legitimate increase.
Still, you should put a data quality rule in place to at least check when these fluctuations occur, and diagnose them proactively. A simple SQL procedure you can automate is to check if the COUNT of new records every day is within a margin of error of the 7-day trailing average. The threshold and margin of error may differ by company and product, but +-25% often works as a good rule of thumb.
Rule 4: SUM of new records = 7-day avg +- 25%
Another data quality check to catch fluctuations in record values is to monitor the sum of the values in a table.
If the SUM of new record values increases beyond a margin of error of a 7-day trailing average, it’s possible that there is something wrong. You wouldn’t expect your MRR to spike by > 25% in single day, and checking for such occurrences can help. Most often such errors I encountered were the result of a new data type or duplicate record in the source system – currency conversions or duplicate invoices inflating your revenue numbers.
Rule 5: SUM of new records fluctuates <= 100% previous day SUM
If daily volatility in numbers is more common for your product or company, Rule 4 may not work as well and may provide false positive checks.
An iteration on the rule is to instead check for day over day changes in your reports. Or simply check that the SUM of new record values does not increase more than 100% from the previous day.
Again it is totally possible that you could experience legitimate daily spikes in metrics like traffic or page visits. But as your product scales, fluctuations of >= 100% are less likely, and more indicative of some anomaly in your ETL pipeline.
Rule 6: Unique Identifier A != Unique Identifier B
Another common culprit for drastic increases in metrics are duplicate records.
A simple but effective constraint to put in place is to ensure that there are no records in a table with the same unique identifier. Your users table should never have two records with the same email address or user id. Your invoices table should never have two records with the same invoice number.
For tables with time series data, it is possible that multiple records will share the same unique identifier (user id, account id, etc). In this case, the data quality check will have to be adapted, but some variation of checking for uniqueness on user-object-timestamp usually works well.
Rule 7: created_at OR updated_at <= NOW()
The last rule you can automate to check for is records with future timestamps.
With some exceptions (like invoices), it is highly unlikely that you would have records in your data warehouse with date values in the future. Records like signup_date or activity_date should have already occurred and thus have timestamps with a past date.
Checking that your records have a created_at or updated_at date <= NOW() is a good procedure to ensure that such anomalies do not emerge. Make sure however that you account for time zones in your automated rule (i.e. if your data warehouse is in UTC, and you’re based in PST, your records could be 7 hours ahead of the current time).
These are just 7 of the most common errors we’ve encountered in maintaining a data warehouse. You can easily protect yourself against them by implementing the rules above as daily checks, for example in a nightly batch job that sends out a report of the outcome. These checks will help you catch ETL issues early, proactively identify and analyze anomalies, and strengthen your and your team’s confidence in your data and reports.
In future posts, we’ll outline some more complex anomalies and how to check for them.
Want more insights on data? Check out the 2016 Data Fragmentation Report that uses BuiltWith to survey technology adoption across B2B, Shopping, Travel and other verticals.
- See how data sources have grown over 3 years
- Compare CRM, Analytics, Payment, Ad platform adoption
- Discover how 9 different industries are affected