On Data Quality
Every organization needs data. It is often said that data is the new oil fuelling the economy, and organizations spend significant money and effort to acquire and manage it. But the return on those investments depends on the quality of the data: if the data is of poor quality, or is stored and managed inefficiently, its value to the organization is diminished.
Risks associated with poor data quality
Data quality, as part of an enterprise's data management program, should be a concern of both technology and the business. The business should clearly define the quality it requires of the data, while technology should ensure that the verification is implemented in an efficient and manageable way.
What are the risks associated with poor data quality? Some of them are listed below:
- Regulatory - Some organizations must report their operations to regulatory bodies. If the data quality does not meet the requirements, regulators can impose hefty fines.
- Operational - If an organization does not have accurate data on various aspects of its operation, mistakes will certainly be made, rework will be required, and more staff than necessary will be needed to perform the organization's activities.
- Strategic - Enterprises collect and process data that helps them understand the economic environment they operate in. If the data is inaccurate, they may miss opportunities to innovate and to adopt strategies that beat the competition.
The risks described above imply that data quality, and data management in a broader sense, should be the responsibility of the whole organization, not only its technology division. Technology should provide the means to efficiently acquire, store, process, and retire data. Depending on the nature of the data stores, various tools can help the organization perform these operations.
Data quality through comparison
At a very low, technical level, data quality verification ensures that data attributes match certain criteria, either at a detail or an aggregate level. The verification can be implemented as a comparison of the data against an agreed baseline or template, and any deviations from the expected results should be flagged to the appropriate data management professionals.
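As a minimal sketch of such a comparison, the query below lists rows that differ between a dataset and its agreed baseline using standard SQL set operators. The table and column names (customer_current, customer_baseline, and so on) are hypothetical.

```sql
-- Rows present in the baseline but missing from the current data,
-- plus rows present in the current data but absent from the baseline.
SELECT 'missing_from_current' AS deviation, d.*
FROM (
    SELECT customer_id, country, credit_limit FROM customer_baseline
    EXCEPT
    SELECT customer_id, country, credit_limit FROM customer_current
) AS d
UNION ALL
SELECT 'unexpected_in_current' AS deviation, d.*
FROM (
    SELECT customer_id, country, credit_limit FROM customer_current
    EXCEPT
    SELECT customer_id, country, credit_limit FROM customer_baseline
) AS d;
```

An empty result means the dataset matches the baseline; any returned rows are the deviations to be flagged.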
Some of the checks can be implemented as direct comparisons of data attributes, but data quality is, of course, a much broader concept. Certain quality criteria require that a particular measure fall within certain boundaries, or that a particular process produce an appropriate number of rows - effectively, metadata parameters are tested.
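A sketch of such an aggregate check might look as follows; the table name orders_landing and the expected bounds are assumptions for illustration.

```sql
-- Hypothetical metadata check: verify that today's load produced a
-- row count within the agreed bounds.
SELECT CURRENT_DATE AS check_date,
       COUNT(*)     AS row_count,
       CASE WHEN COUNT(*) BETWEEN 90000 AND 110000
            THEN 'PASS' ELSE 'FAIL'
       END          AS status
FROM orders_landing
WHERE load_date = CURRENT_DATE;
```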
Scenarios like these can be transformed into tabular data sets that can be compared with predefined templates using the declarative syntax of SQL. There are several advantages to this approach:
- Tabular data are easily stored in databases or CSV files, facilitating further analysis and troubleshooting.
- SQL is a well-known standard for which there is plenty of talent on the market.
- Additional controls can be added to existing data sets without writing elaborate code, as illustrated in the sketch after this list.
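One possible shape for such a template-driven control is sketched below. All object names are hypothetical: quality_template holds the expected bounds for each control, and quality_measurement is assumed to be populated by the pipeline with the measured values.

```sql
-- Template of expected values for each control.
CREATE TABLE quality_template (
    control_name  VARCHAR(100),
    target_table  VARCHAR(100),
    expected_min  NUMERIC,
    expected_max  NUMERIC
);

INSERT INTO quality_template VALUES
    ('daily_order_count',  'orders_landing',  90000, 110000),
    ('daily_refund_total', 'refunds_landing',     0,  50000);

-- Adding a new control is just another template row; the comparison
-- query below does not change.
SELECT t.control_name,
       m.measured_value,
       CASE WHEN m.measured_value BETWEEN t.expected_min AND t.expected_max
            THEN 'PASS' ELSE 'FAIL'
       END AS status
FROM quality_template t
JOIN quality_measurement m ON m.control_name = t.control_name;
```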
Data replication
An important consideration is that, wherever possible, the organization should avoid duplicating data in order to perform the quality checks. Duplication of data is a data transformation process in itself and can introduce flaws into the data being tested. Ownership of the replicated data can also be troublesome, since specific data retention and security controls must remain in place. The role of a data quality control is not to persist business data and deal with all the complexities that result from doing so.
The right implementation of data quality controls augments business data flows with reports generated at their various transformation stages, providing a better view of their correctness, as in the sketch below.
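One way to honour both points, checking the data in place and persisting only the verdicts, is sketched here: the view reads the source tables directly, and only a short report row is stored. All object names are hypothetical.

```sql
-- Profile the source data in place, without copying it.
CREATE VIEW v_orders_quality AS
SELECT load_date,
       COUNT(*) AS row_count,
       SUM(CASE WHEN amount < 0 THEN 1 ELSE 0 END) AS negative_amounts
FROM source_db.orders
GROUP BY load_date;

-- Persist only the verdict, not the business data itself.
INSERT INTO quality_report (check_time, check_name, detail)
SELECT CURRENT_TIMESTAMP,
       'orders_daily_profile',
       'rows=' || CAST(row_count AS VARCHAR(20))
           || ', negatives=' || CAST(negative_amounts AS VARCHAR(20))
FROM v_orders_quality
WHERE load_date = CURRENT_DATE;
```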