Every organization needs data. It is often said that data is the new oil fuelling the economy, and with good reason. Organizations spend significant portions of their budgets and effort acquiring and managing data, but the return on that investment depends on the quality of the data. If the data is of poor quality, or is stored and managed inefficiently, its value to the organization is diminished.
Risks associated with poor data quality
Data quality, as part of an enterprise's data management programme, should be a concern of both the technology and the business organization. The required quality of the data should be clearly defined by the business, while technology should ensure that it is implemented in an efficient and manageable way.
What are the risks associated with poor data quality? Some of them are listed below:
- Regulatory - Some organizations have an obligation to report on their operations to regulatory bodies. If the data quality does not meet the requirements, regulators can impose hefty fines on the organization.
- Operational - If an organization does not have accurate data on the various aspects of its operation, mistakes will certainly be made, rework will be required, and more staff than necessary will be needed to perform the organization's activities.
- Strategic - Enterprises collect and process data that helps them understand the economic environment they operate in. If the data is inaccurate, an organization may miss opportunities to innovate and adopt strategies that allow it to beat the competition.
The risks described above imply that data quality, and data management in a broader sense, should be a responsibility of the whole organization, not only its technology division. Technology should provide the means to efficiently acquire, store, process and retire data. Depending on the nature of the data stores, various tools can help the organization perform these operations.
Data quality through comparison
At a low, technical level, data quality verification ensures that data attributes match certain criteria, either at a detail or an aggregate level. The verification can be implemented as a comparison of the data against an agreed baseline or template, and any deviation from the expected result of the comparison should be flagged to the appropriate data management professionals.
Some of the checks can be implemented as direct comparisons of data attributes, but data quality is of course a much broader concept. Certain quality criteria require that a particular measure stays within defined boundaries, or that a particular process produces an appropriate number of rows - effectively, metadata parameters are tested. A sketch of such a check follows below.
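As an illustration, here is a minimal sketch of such a metadata check in Python, assuming a hypothetical `orders` table loaded daily into a SQLite database; the table name, columns and row-count bounds are invented for the example:

```python
import sqlite3

# Hypothetical quality rule: the daily load into the "orders" table is
# expected to produce between 1,000 and 50,000 rows (illustrative bounds).
EXPECTED_MIN_ROWS = 1_000
EXPECTED_MAX_ROWS = 50_000

def check_daily_row_count(conn: sqlite3.Connection, load_date: str) -> bool:
    """Return True if the row count for the given load date is within bounds."""
    (row_count,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE load_date = ?", (load_date,)
    ).fetchone()
    within_bounds = EXPECTED_MIN_ROWS <= row_count <= EXPECTED_MAX_ROWS
    if not within_bounds:
        # In a real pipeline this would be routed to the data management team,
        # e.g. written to a control table or raised as an alert.
        print(f"Row count {row_count} for {load_date} is outside expected bounds")
    return within_bounds

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, load_date TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(i, "2024-01-15") for i in range(2_000)],
    )
    print(check_daily_row_count(conn, "2024-01-15"))  # True: 2,000 rows in bounds
```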
Most of these scenarios can be transformed into tabular data sets that can be compared with predefined templates using the declarative syntax of SQL. This approach has several advantages (a sketch of such a comparison follows the list):
- Tabular data are easily stored in databases or CSV files, facilitating further analysis and troubleshooting.
- SQL is a well-known standard for which there is plenty of talent on the market.
- Additional controls can be added to existing data sets without the need to write elaborate code.
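As a sketch of the template comparison itself, the example below uses SQL set operations (run here through Python and SQLite) to flag rows that deviate from an agreed template; the `customer_totals` and `customer_totals_template` tables and their columns are hypothetical:

```python
import sqlite3

# Rows present in the template but missing (or different) in the actual data.
MISSING_QUERY = """
SELECT customer_id, total_amount FROM customer_totals_template
EXCEPT
SELECT customer_id, total_amount FROM customer_totals
"""

# Rows present in the actual data but not expected by the template.
UNEXPECTED_QUERY = """
SELECT customer_id, total_amount FROM customer_totals
EXCEPT
SELECT customer_id, total_amount FROM customer_totals_template
"""

def compare_to_template(conn: sqlite3.Connection) -> dict:
    """Run both set-difference queries and return the deviations found."""
    return {
        "missing_or_changed": conn.execute(MISSING_QUERY).fetchall(),
        "unexpected": conn.execute(UNEXPECTED_QUERY).fetchall(),
    }

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer_totals (customer_id INTEGER, total_amount REAL)")
    conn.execute("CREATE TABLE customer_totals_template (customer_id INTEGER, total_amount REAL)")
    conn.executemany("INSERT INTO customer_totals VALUES (?, ?)", [(1, 100.0), (2, 250.0)])
    conn.executemany("INSERT INTO customer_totals_template VALUES (?, ?)", [(1, 100.0), (2, 200.0)])
    print(compare_to_template(conn))
    # {'missing_or_changed': [(2, 200.0)], 'unexpected': [(2, 250.0)]}
```

Any rows returned by either query represent deviations from the agreed baseline and should be flagged to the data management professionals responsible for the data set.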
An important consideration is that the organization should, wherever possible, avoid duplicating data in order to perform quality checks. Duplication of data is a data transformation process in itself and can introduce flaws into the data being tested. Ownership of the replicated data can also be troublesome, as specific data retention and security controls apply to it. The role of a data quality control is not to persist business data and deal with all the complexities that result from doing so.
The right implementation of data quality controls augments business data flows with reports generated at their various transformation stages, providing a better view of their correctness.