Simple Data Quality guidelines
Source: The 1:10:100 rule of data quality · Andrew Jones (andrew-jones.com)
Step 1: Measure and Evaluate Data Quality to Form a Baseline.
Data quality dimensions: Accuracy, Reliability (Timeliness), Completeness, and Usability.
Data accuracy is the degree to which data correctly represents the real-world events or objects it is intended to describe (DAMA).
How to measure Accuracy?
- Validate that the pre-agreed, defined enumerations are used. This includes checking spelling and case.
- Validate that the data matches its data type. For example, avoid storing numeric values in a string data type.
- Validate that a value carries the same meaning everywhere it is used. Look for values that are used inconsistently across attributes.
- Validate that dates reflect reality and use an unambiguous format. Avoid country-specific syntax.
- Validate relationships. For example, if an account should always have a customer, the data should reflect that relationship accurately.
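The accuracy checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (status, balance, opened_on, customer_id), the ALLOWED_STATUS enumeration, and the choice of ISO 8601 (YYYY-MM-DD) as the unambiguous date format are all hypothetical and not taken from any particular dataset.

```python
import re
from datetime import date

# Hypothetical agreed enumeration for an account "status" attribute.
ALLOWED_STATUS = {"ACTIVE", "CLOSED", "SUSPENDED"}
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value):
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    if not isinstance(value, str) or not ISO_DATE.fullmatch(value):
        return False
    year, month, day = (int(part) for part in value.split("-"))
    try:
        date(year, month, day)  # rejects impossible dates such as 2024-02-30
    except ValueError:
        return False
    return True


def accuracy_issues(accounts, customer_ids):
    """Return (account_id, field, problem) tuples for every failed check."""
    issues = []
    for row in accounts:
        account_id = row.get("account_id")
        # Enumeration check: exact match, so spelling and case both matter.
        if row.get("status") not in ALLOWED_STATUS:
            issues.append((account_id, "status", "value not in agreed enumeration"))
        # Type check: a numeric attribute should not arrive as a string.
        balance = row.get("balance")
        if isinstance(balance, bool) or not isinstance(balance, (int, float)):
            issues.append((account_id, "balance", "expected a numeric type"))
        # Date check: one unambiguous format, no country-specific syntax.
        if not is_iso_date(row.get("opened_on")):
            issues.append((account_id, "opened_on", "expected YYYY-MM-DD"))
        # Relationship check: every account must point at a known customer.
        if row.get("customer_id") not in customer_ids:
            issues.append((account_id, "customer_id", "no matching customer"))
    return issues
```

Calling accuracy_issues(accounts, customer_ids) with a list of account dictionaries and a set of known customer IDs returns an empty list when every check passes.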
Timeliness is the degree to which data represents reality from the required point in time.
How to measure Timeliness?
- Define the SLA for your dataset.
- Measure how the shared data performs against the SLA.
- Run a data freshness check to confirm that stale data is not being shared.
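As a rough sketch, the timeliness checks above come down to comparing timestamps. The 24-hour SLA_MAX_AGE below is a made-up value, and the function names are illustrative rather than part of any specific tool. Timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA: data must be no older than 24 hours when it is shared.
SLA_MAX_AGE = timedelta(hours=24)


def utc_now():
    """Current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


def is_fresh(last_updated, now, max_age=SLA_MAX_AGE):
    """Freshness check: True if the data is not older than max_age."""
    return now - last_updated <= max_age


def sla_compliance(delivered_at, deadlines):
    """Fraction of deliveries that arrived on or before their deadline.

    Returns None when there are no deliveries to measure.
    """
    pairs = list(zip(delivered_at, deadlines))
    if not pairs:
        return None
    on_time = sum(1 for delivered, deadline in pairs if delivered <= deadline)
    return on_time / len(pairs)
```

Tracking sla_compliance over time gives a baseline figure, while is_fresh can gate each individual share so stale data is caught before consumers see it.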
Completeness is the percentage of data that is populated, measured against the possibility of 100% fulfillment.
How to measure Completeness?
- Null attribute check
- Missing reference data check.
- Check that no data is lost when consuming from an upstream system or when creating a record.
- Check that data is consistent with the value in the upstream system or the value captured while creating the record.
- Check that data matches the schema structure.
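A minimal sketch of the completeness checks above, assuming rows are plain dictionaries and that both None and the empty string count as "not populated" (an assumption you may want to change for your data). The schema here is a hypothetical mapping of field name to a single expected Python type.

```python
def completeness_pct(rows, field):
    """Percentage of rows where field is populated (not None, not empty)."""
    if not rows:
        return None
    populated = sum(1 for row in rows if row.get(field) not in (None, ""))
    return 100.0 * populated / len(rows)


def missing_keys(upstream_keys, downstream_keys):
    """Keys present upstream that never arrived downstream."""
    return set(upstream_keys) - set(downstream_keys)


def schema_mismatches(row, schema):
    """Compare one row against a {field: expected_type} schema."""
    problems = []
    for field, expected in schema.items():
        if field not in row:
            problems.append(f"{field}: missing")
        elif row[field] is not None and not isinstance(row[field], expected):
            actual = type(row[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    # Fields that the schema does not know about are also reported.
    for field in row:
        if field not in schema:
            problems.append(f"{field}: not in schema")
    return problems
```

completeness_pct covers the null attribute check, missing_keys covers records dropped between the upstream system and your dataset, and schema_mismatches covers the schema structure check.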
Usability is the ease of finding information about a dataset.
How to measure Usability?
- Documentation about the dataset, including its description and enumerations.
- Metadata: data owner and schema.
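One way to make usability measurable is to check that the documentation and metadata actually exist. The sketch below assumes a hypothetical metadata dictionary with description, owner, and schema keys, where schema is a list of column dictionaries. The layout is illustrative, not a standard.

```python
# Hypothetical minimum metadata that every dataset should carry.
REQUIRED_METADATA = ("description", "owner", "schema")


def documentation_gaps(metadata):
    """List the metadata entries that are missing or empty."""
    gaps = [key for key in REQUIRED_METADATA if not metadata.get(key)]
    # Every column in the schema should have its own description.
    for column in metadata.get("schema") or []:
        if not column.get("description"):
            gaps.append(f"schema.{column.get('name', '?')}.description")
    return gaps
```

An empty list means the dataset meets this minimum bar; anything else is a concrete to-do list for the data owner.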
Step 2: Create Data Profile.
- Identify critical data sets and elements. These critical data elements are key to providing value to your consumers, whether they are generating insights or building data products. The key point is that you are not governing all data everywhere to the same standard; instead, you focus on the data that is most strategic and most important.
- Generate a data profile and define its refresh cycle.
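A data profile for a critical data element can start as a handful of summary statistics. The following is a minimal sketch using only the standard library; the choice of statistics is an assumption, and it expects hashable values such as strings, numbers, or dates.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: volume, nulls, distinct values, and range."""
    non_null = [value for value in values if value is not None]
    counts = Counter(non_null)
    profile = {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }
    try:
        profile["min"], profile["max"] = min(non_null), max(non_null)
    except (TypeError, ValueError):
        # Empty column or values that cannot be compared with each other.
        profile["min"] = profile["max"] = None
    return profile
```

Regenerating the profile on the agreed refresh cycle and comparing it with the previous run is a simple way to spot drift in null counts or value ranges.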
Step 3: Identify Data Quality Issues
Step 4: Build Data Quality framework and automated DQ tests.
- Automatically create data quality tests.
- Provide the capability to build business rules.
- Define and create DQ scores.
- Share Data Quality status and make it actionable.
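The framework steps above can be sketched as business rules expressed as small functions, a pass rate per rule, and a weighted score. The rule names, weights, and the 95/80 thresholds below are hypothetical; a real framework would choose its own.

```python
def run_rules(rows, rules):
    """Apply each business rule (a row -> bool function) to every row.

    Returns {rule_name: pass_rate}, with None when there are no rows.
    """
    results = {}
    for name, predicate in rules.items():
        passed = sum(1 for row in rows if predicate(row))
        results[name] = passed / len(rows) if rows else None
    return results


def dq_score(pass_rates, weights):
    """Weighted average of pass rates on a 0-100 scale (default weight 1)."""
    scored = {name: rate for name, rate in pass_rates.items() if rate is not None}
    total_weight = sum(weights.get(name, 1.0) for name in scored)
    if total_weight == 0:
        return None
    weighted = sum(rate * weights.get(name, 1.0) for name, rate in scored.items())
    return 100.0 * weighted / total_weight


def dq_status(score, warn_below=95.0, fail_below=80.0):
    """Turn a score into a status that consumers can act on."""
    if score is None:
        return "UNKNOWN"
    if score < fail_below:
        return "FAIL"
    if score < warn_below:
        return "WARN"
    return "PASS"
```

Publishing the status together with the per-rule pass rates tells consumers not only that a dataset is degraded but also which rule to investigate.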
The above are very simple guidelines for managing the data quality of your datasets.