5 ways to measure Data Quality

Many organisations have departments aimed at improving data quality. However, if someone asks whether data quality has actually improved, the generic answer is “It depends”. Does it have to be this way?

The term “Data Quality” means different things to different people; generally, it means that the data meets the CAT (Completeness, Accuracy and Timeliness) requirements of the Consumer.

Some data practitioners extend data quality with other dimensions such as uniqueness and consistency. However, as a minimum, the data should meet the CAT requirements.

Defining the CAT requirements is the fundamental part of measuring Data Quality. Listed below are five ways to measure CAT requirements for different types of data.

1) Whether the data item has any null values at an attribute level

This is the most basic check for gaps in data that is sourced externally or generated internally. Almost all software in the market today has some inbuilt validation that prevents a record from being saved if the data keyed in by a user is incomplete. However, if the data is loaded via bulk transfer or sourced externally, the loading process might continue to run and save the data onto the system while generating an exception report. This report is then reviewed to ensure all the exceptions are cleared. In practice, due to resource and time constraints, it is possible that not all exceptions are attended to before the Consumer receives the data. A separate report showing the incomplete data can be run to identify areas for improvement.
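As an illustration, here is a minimal sketch of a null-value check at attribute level, assuming the data is available as a pandas DataFrame (the column names and sample values are hypothetical):

```python
import pandas as pd

# Hypothetical data set; in practice this would be loaded from the source system.
df = pd.DataFrame({
    "trade_id": [1001, 1002, 1003, 1004],
    "counterparty": ["ABC Corp", None, "XYZ Ltd", "DEF Plc"],
    "settlement_date": ["2024-05-01", "2024-05-02", None, "2024-05-03"],
})

# Completeness report: number and percentage of null values per attribute.
null_counts = df.isna().sum()
null_pct = (null_counts / len(df) * 100).round(2)
report = pd.DataFrame({"null_count": null_counts, "null_pct": null_pct})
print(report[report["null_count"] > 0])
```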

2) Whether all records have been included in the data set

This is another dimension of Completeness; some refer to it as “Coverage”. It does not relate to a particular attribute but to a row, or record, in a data set. For example, if someone is analysing the effectiveness of the sales force, does the data set include all sales reps across all regions, both part-time and full-time employees, including staff engaged on a commission-only basis?
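A minimal sketch of such a coverage check, assuming a reference population of sales reps exists (for example from an HR system); the data and column names are illustrative:

```python
import pandas as pd

# Hypothetical reference population (e.g. from an HR system) and the delivered data set.
reference = pd.DataFrame({"rep_id": ["R01", "R02", "R03", "R04", "R05"]})
delivered = pd.DataFrame({"rep_id": ["R01", "R02", "R04"], "sales": [120, 95, 60]})

# Coverage: which reps in the reference population are missing from the delivery?
missing = reference[~reference["rep_id"].isin(delivered["rep_id"])]
coverage_pct = round(100 * (1 - len(missing) / len(reference)), 2)
print(f"Coverage: {coverage_pct}% — missing reps: {missing['rep_id'].tolist()}")
```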

3) Whether the data is accurate

Of all the data quality dimensions, “Accuracy” is the most contentious, as its definition varies depending on the type of data. Accuracy probably deserves a separate article of its own, but at a high level there are a number of ways it can be defined.

a. Conformity

Whether the data conforms to an existing standard by which it has to be distributed; for example, country codes have to follow the ISO standard, and securities have to be identified with a SEDOL / ISIN.
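As an illustration, a minimal conformity check on country codes and ISIN format; the reference set of ISO 3166-1 alpha-2 codes is truncated and purely illustrative, and the ISIN check digit itself is not verified here:

```python
import pandas as pd

# Truncated, illustrative reference set of ISO 3166-1 alpha-2 country codes.
ISO_COUNTRY_CODES = {"GB", "US", "AU", "DE", "FR", "JP", "SG"}
# Basic ISIN format: 2 letters, 9 alphanumerics, 1 digit (checksum not verified here).
ISIN_PATTERN = r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$"

df = pd.DataFrame({
    "country": ["GB", "UK", "US", "ZZ"],  # "UK" and "ZZ" do not conform to the reference set
    "isin": ["GB0002634946", "US037833100", "XS1234567890", "BADVALUE"],
})

df["country_ok"] = df["country"].isin(ISO_COUNTRY_CODES)
df["isin_ok"] = df["isin"].str.match(ISIN_PATTERN)
# Report only the non-conforming rows.
print(df[~(df["country_ok"] & df["isin_ok"])])
```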

b. Number of manual touches to the data

The number of manual touches does not by itself invalidate the accuracy of the data; however, if there are too many manual amendments, it could prompt a review of why the data is constantly being amended.
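A minimal sketch that counts manual amendments per record from an audit trail; the audit-log structure and the threshold are assumptions:

```python
import pandas as pd

# Hypothetical audit log of amendments; "channel" distinguishes manual edits from system feeds.
audit_log = pd.DataFrame({
    "record_id": ["A1", "A1", "A2", "A3", "A1", "A3"],
    "channel": ["manual", "manual", "feed", "manual", "manual", "manual"],
})

MANUAL_TOUCH_THRESHOLD = 2  # assumed tolerance before a review is prompted

manual_touches = (
    audit_log[audit_log["channel"] == "manual"]
    .groupby("record_id")
    .size()
    .rename("manual_touches")
)
# Records amended manually more often than the tolerated threshold.
print(manual_touches[manual_touches > MANUAL_TOUCH_THRESHOLD])
```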

c. Code changes

There are a number of occasions where the data is calculated in the system through a formula or code. In those instances, the measure could be whether there have been any changes to the code that calculates such data, and whether those changes were validated by the relevant team before they were implemented.
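As an illustration, a sketch that flags changes to calculation code which lack a recorded validation; the change-log fields are assumptions and would normally come from a change-management tool:

```python
import pandas as pd

# Hypothetical change log for calculation code.
changes = pd.DataFrame({
    "change_id": ["C101", "C102", "C103"],
    "component": ["pnl_calc", "fee_calc", "pnl_calc"],
    "validated_by": ["risk_team", None, None],  # None = not validated before implementation
})

unvalidated = changes[changes["validated_by"].isna()]
print(f"{len(unvalidated)} of {len(changes)} code changes were implemented without validation:")
print(unvalidated[["change_id", "component"]])
```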

There are a number of other ways Accuracy can be measured, which we can look at separately.

4) Whether data is delivered on time

This is generally an easy one to understand and explain, but it can sometimes be interpreted differently. Generally, timeliness is measured by whether the producer of the data has delivered within the time requested by the Consumer. However, there are different schools of thought on measurement: in one, quality is marked as “red” whenever the data is delivered after the agreed time, irrespective of whether it is a minute or ten minutes late; in another, quality is marked as “red” only if delivery exceeds a particular grace period after the agreed time. There is no right or wrong way; the downstream impact determines how timeliness is measured.
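A minimal sketch comparing the two approaches (strict cut-off versus a grace period); the agreed time and the 15-minute grace period are purely illustrative:

```python
from datetime import datetime, timedelta

# Illustrative SLA: data is due at 09:00; the grace period is an assumption.
agreed_time = datetime(2024, 5, 1, 9, 0)
grace_period = timedelta(minutes=15)
delivered_at = datetime(2024, 5, 1, 9, 7)

strict_status = "red" if delivered_at > agreed_time else "green"
graceful_status = "red" if delivered_at > agreed_time + grace_period else "green"

print(f"Strict measurement:   {strict_status}")
print(f"With 15-minute grace: {graceful_status}")
```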

5) Number of data corrections made

This is the number of corrections made by the downstream user because the data they received was not of good quality (regardless of whether this was due to completeness, accuracy or a delay in receipt of the file). There are two parts to this metric: firstly, the corrections themselves, i.e. what type of corrections are made and why; and secondly, the turnaround time to apply those corrections in the original source.
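As an illustration, a minimal sketch that counts corrections by type and measures the turnaround time from a corrections log; the log structure is an assumption:

```python
import pandas as pd

# Hypothetical corrections log raised by downstream users.
corrections = pd.DataFrame({
    "correction_id": ["X1", "X2", "X3"],
    "reason": ["accuracy", "completeness", "accuracy"],
    "raised_at": pd.to_datetime(["2024-05-01 10:00", "2024-05-02 09:30", "2024-05-03 14:00"]),
    "fixed_in_source_at": pd.to_datetime(["2024-05-02 16:00", "2024-05-02 17:00", "2024-05-06 09:00"]),
})

# Part 1: what types of corrections are made and how often.
print(corrections["reason"].value_counts())

# Part 2: turnaround time to rectify the corrections in the original source.
corrections["turnaround_hours"] = (
    corrections["fixed_in_source_at"] - corrections["raised_at"]
).dt.total_seconds() / 3600
print(corrections[["correction_id", "turnaround_hours"]])
```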

Different organisations are at different stages of data governance maturity; it is imperative that an organisation does not go for a “big bang” approach to defining and measuring data quality metrics. It is ideal to start small with highly critical data sets and expand the scope further as the organisation matures.

I have intentionally omitted the change-management aspects of measuring, monitoring and escalating data quality metrics. These are highly dependent on organisational culture, existing hierarchy levels, and senior management's vision for improving data quality.

If your organisation is willing to embark on a journey to improve data quality, please do not hesitate to contact us.
