Data quality depends on achieving measurable goals across six critical dimensions
Professionals understand the importance of making data-driven decisions. But merely having and using data does not guarantee efficient, productive results for an organization. What really matters is having quality data.
IBM estimated that bad data costs the U.S. economy $3.1 trillion per year. These costs are incurred when an organization spends resources to clean data or to fix the problems bad data causes. And if left unattended, bad data can damage relationships with customers and partners and create operational inefficiency.
As a result, it has become necessary to feed good quality data to all the organizational processes. But how can an organization measure data quality?
Data quality depends on the purpose the data fulfills in the organization. If the sales team uses customer accounts data to determine the number of customers, then any duplicate records can mislead the team to make inaccurate assumptions. So, in this case, data quality means the uniqueness of data records. For other uses, data quality can mean something different.
To ensure data quality, a data preparation strategy is essential: data must be cleaned before it can be used for processing and analysis. Data preparation software can be extremely useful, but first it is necessary to know how data quality is measured.
Six critical dimensions to measure data quality
Does the organization capture the data it needs?
One of the biggest data challenges encountered is the absence of required data. Many organizations do not analyze their requirements to understand what data their various operations need, so critical attributes, along with the metadata that gives them context, are missing from data records. Even after the needed data is identified, additional data requirements are often introduced later in the data lifecycle, leaving empty values in existing records. Every firm should target data quality from the beginning, which means determining the value desired from the data and what type of data will fulfill that requirement.
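A completeness audit can be as simple as flagging records whose required attributes are empty. The sketch below assumes a hypothetical customer schema (`customer_id`, `name`, `email`); the field names are illustrative, not from any particular system:

```python
# Completeness check: flag records missing required attributes.
# REQUIRED_FIELDS is a hypothetical schema for illustration only.
REQUIRED_FIELDS = {"customer_id", "name", "email"}

def missing_fields(record: dict) -> set:
    """Return the required fields that are absent or empty in a record."""
    return {f for f in REQUIRED_FIELDS if not record.get(f)}

records = [
    {"customer_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"customer_id": 2, "name": "", "email": None},  # incomplete record
]

# Records with at least one missing required field need follow-up.
incomplete = [r for r in records if missing_fields(r)]
```

Running a check like this regularly, rather than once, helps catch the gaps that appear when new data requirements are added mid-lifecycle.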
Is data present in an acceptable format and type?
It is not enough for data to exist; the data type and format should be valid too. Data validity relates to having the data stored in a format that follows the correct validation pattern. Invalid data cannot be processed; if data exists but is not valid, it is of no use to the organization. Examples of data validity include email addresses containing an ‘@’ symbol and dates following a specified format such as MM/DD/YYYY.
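The two examples above can be expressed directly as validation rules. This is a minimal Python sketch; the email pattern is deliberately simplified and real-world email validation is considerably more involved:

```python
import re
from datetime import datetime

# Simplified email rule: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid_email(value: str) -> bool:
    """Check that a value matches the simplified email pattern."""
    return bool(EMAIL_PATTERN.match(value))

def is_valid_date(value: str) -> bool:
    """Check MM/DD/YYYY; strptime also rejects impossible dates."""
    try:
        datetime.strptime(value, "%m/%d/%Y")
        return True
    except ValueError:
        return False
```

Using `strptime` rather than a bare regex means a value like 13/45/2020 fails validation even though it superficially matches the MM/DD/YYYY shape.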
Do data records placed at different stores match each other?
Organizations have a plethora of tools and third-party applications integrated into their business processes. Each tool may store its own copy of the data for certain purposes. Data consistency ensures that the data held at each of these sources represents the same values. Without consistency across sources, processes and teams will access multiple versions of the same data, causing conflict at the operational level and in business intelligence and analytics.
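A basic consistency check compares records that share a key across two stores and reports fields whose values disagree. The sketch below assumes two hypothetical stores (a CRM and a billing system) keyed by customer ID:

```python
def find_inconsistencies(source_a: dict, source_b: dict) -> dict:
    """Report fields that differ for records present in both stores.

    Returns {record_key: {field: (value_in_a, value_in_b)}}.
    """
    conflicts = {}
    for key in source_a.keys() & source_b.keys():
        diffs = {
            field: (source_a[key][field], source_b[key][field])
            for field in source_a[key].keys() & source_b[key].keys()
            if source_a[key][field] != source_b[key][field]
        }
        if diffs:
            conflicts[key] = diffs
    return conflicts

# Hypothetical stores: same customer, two versions of the email field.
crm = {"cust-1": {"email": "ada@example.com", "plan": "pro"}}
billing = {"cust-1": {"email": "ada@old-domain.com", "plan": "pro"}}
```

In practice one store is usually designated the system of record, and conflicts found this way are resolved in its favor.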
How well does the data reflect reality?
One of the most important factors in data preparation is data accuracy. It is not enough to capture the right data; it must remain accurate. When data moves from one system to another, it may be transformed, and different standardization rules may be applied. These transformations often affect data values, and inaccurate information gets stored as a result. For this reason, every organization should have standardized data entry and transformation rules so that data integrity remains intact.
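One way to apply such rules is a single normalization function that every system runs before storing a record, so the same value never ends up in multiple shapes. The rules below (whitespace, casing, digits-only phone numbers) are illustrative assumptions, not a standard:

```python
def standardize_record(record: dict) -> dict:
    """Apply shared normalization rules before a record enters any system."""
    clean = dict(record)
    # Collapse repeated whitespace and use title case for names.
    clean["name"] = " ".join(record["name"].split()).title()
    # Country codes stored as trimmed upper-case strings.
    clean["country"] = record["country"].strip().upper()
    # Keep only digits so "(555) 010-0000" and "555 010 0000" compare equal.
    clean["phone"] = "".join(ch for ch in record["phone"] if ch.isdigit())
    return clean
```

Because every system applies the identical rules, a record transformed twice is unchanged the second time, which keeps values accurate as data moves between systems.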
Is data updated on a timely basis?
It is important that data is available as soon as the event occurs. This enables the timely delivery of critical information for operations and for decision making. For example, if a prospect made a purchase from the organization, their information should be updated in the CRM to reflect the correct numbers before the weekly sales meeting. Otherwise, the outdated data will mislead the team and cause them to base critical decisions on incorrect data. To reduce the time data takes to be available, ensure that the data integration architecture is designed to be as simple as practical.
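Timeliness can be monitored by checking each record's last-update timestamp against a freshness window. The 24-hour threshold below is an arbitrary example; the right window depends on how the data is used:

```python
from datetime import datetime, timedelta, timezone

# Example freshness window; tune per use case (e.g. weekly for a sales report).
MAX_AGE = timedelta(hours=24)

def is_stale(last_updated, now=None):
    """Return True if a record's last update falls outside the window."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_updated > MAX_AGE
```

Flagging stale records before a report is built prevents the scenario above, where a meeting proceeds on numbers the CRM has already outgrown.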
Is data free of duplication?
Uniqueness means that each record reflects a single entity and there are no duplicates present in the data. But often, when data goes through migration or merging, duplicate records are created that represent the same entity. To de-duplicate, select unique identifiers in the database design, and compare and merge the records that belong to the same entity.
Sometimes de-duplication is harder to perform. For example, in healthcare, officials want to guard patient confidentiality, so they remove personally identifiable information (PII). In such scenarios, exact matches may be impossible, so analysts should perform fuzzy matching to find approximate matches.
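Fuzzy matching can be sketched with the standard library's `difflib.SequenceMatcher`, which scores string similarity between 0.0 and 1.0. The 0.85 threshold and the sample names below are illustrative choices, not established values:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicate_pairs(names, threshold=0.85):
    """Return index pairs whose names score at or above the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# "Jon" vs "John" would never survive an exact match, but scores highly here.
names = ["Jon A. Smith", "John A. Smith", "Mary Jones"]
```

Candidate pairs found this way are typically reviewed (or merged automatically above a stricter threshold) rather than deleted outright, since fuzzy matches are approximate by definition.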
As most organizational processes depend on data, a data hygiene strategy is critical for the success of any organization. It may seem like a time-consuming and resource-intensive process, but it does lead to metrics and insights that matter. These data quality dimensions are a great place to start, but make sure to use them to determine the root cause of each problem. Try addressing the reasons that data is not accurate, valid, unique, complete, consistent, and timely. Doing so will help determine the best strategy for keeping data clean over time.