Observations on Data Quality

Article Summary: The importance of high quality data for operations and analytics is not disputed. However, many organizations struggle to understand the value of data quality. The costs of poor data quality can be extremely high, and it can include more than just financial losses. There are a number of indicators of quality data, including accuracy, integrity, consistency, lack of redundancy, adherence to business rules, timeliness, and user satisfaction.

Last Updated: September 21, 2024

The importance of high quality data for operations and business intelligence / analytics is not disputed. However, many organizations continue to struggle with understanding the value, identifying the correct steps to take, and implementing those practices consistently. The costs of poor data quality can be astronomical, and can include more than financial losses.

What Is Data Quality?

There are a number of indicators of quality data.

The Data Is Accurate – The data entered is exact. For example, a customer’ name is spelled correctly and the address for that customer is correct and current. If the Marketing Department doesn’t have the correct profile for the customer, Marketing will attempt to sell them the wrong products and present a disorganized image of the organization. When data on a company vehicle is entered into the system, it may be valid (a vehicle number that is in the database), but it may be inaccurate (the wrong vehicle number associated with that vehicle).
The Data Has Integrity – The data will not be destroyed or altered accidentally. Updates will not be lost due to conflicts among concurrent users. Much of this is the responsibility of the DBMS, but proper implementation of the DBMS should not be assumed. Robust backup and recovery procedures as implemented by the installation are needed to maintain some levels of integrity. In addition, operational procedures that restrict a batch update from being run twice are also necessary.
The Data Is Consistent – The form and content of the data should be consistent. This allows for data to be integrated and to be shared by multiple departments across multiple applications and multiple platforms.
The Databases Are Well Designed – A well designed database with appropriate and complete data models will perform satisfactorily for its intended applications, it is extendible, and it exploits the integrity capabilities of its DBMS.
The Data Is Not Redundant – In actual practice, no organization has eliminated redundant data totally. For certain performance reasons, and in some distributed environments, an organization may choose to maintain data in more than one place and maintain the data in more than one form. Data management best practices must be observed even when redundant data is necessary.

The redundant data to be minimized is the data that has been duplicated for none of the reasons stated above but because:

The creator of the redundant data was unaware of the existence of available data.
The redundant data was created because the availability or performance characteristics of the primary data were unacceptable to the new system. This may be a legitimate reason, or it may be that the performance problem could have been addressed successfully with a new index or a minor tuning effort, and that availability could have been improved by better operating procedures.
The owner of the primary data would not allow the new developer to view or update the data.
The lack of control mechanisms for data update failed to indicate the new version of the data.
The lack of appropriate security controls dictated the need for a redundant subset of the primary data to be stored in a more secure environment.

In these cases, redundant data is only the symptom and not the cause of the problem. Only managerial vision, direction, and a robust data strategy would lead to an environment with less redundant data.

The Data Follows Business Rules – As an example, a loan balance may never be a negative number. This rule comes from the business side and IT is required to establish the edits to be sure the rule is not violated.
The Data Corresponds To Established Domains – These domains are specified by the owners or users of the data. The domain would be the set of allowable values or a specified range of values. In a Human Resource System, the domain of Position Status may be limited to “Current” and “Former.”
The Data Is Timely – Timeliness is subjective and can only be determined by the users or data stewards. The users will specify that monthly, weekly, daily, or real-time data is required. Real-time data is often a requirement of production systems with on-line-transaction processing (OLTP). If monthly is all that is required and monthly is delivered, the data is timely.
The Data Is Well Understood – It does no good to have accurate and timely data if the users don’t know what it means. Naming standards are a necessary (but not sufficient) condition for well-understood data. All appropriate business metadata should accompany all data. Without clear definitions and understanding, the organization will exhaust countless hours trying to determine the meaning of their reports or draw incorrect conclusions from the data displayed on the screens.
The Data Is Integrated – An insurance company needs both agent data and policyholder data. Typically, these are two files, databases, or tables that may have no IT connection. If the data is integrated, meaningful business information can be readily generated from a combination of both the agent and policyholder data.

Database integration generally requires the use of a common DBMS. There is an expectation (often unfulfilled) that all applications using the DBMS will be able to easily access any data residing on the DBMS. An integrated database would be accessible from a number of applications. Many different programs in multiple systems could access and, in a controlled manner, update the database.

Database integration requires the knowledge of the characteristics of the data, what the data means, and where the data resides.

An integrated database would have the following potential benefits:

Less redundant data
Fewer possibilities for data inconsistency
Fewer interface programs (a major resource consumer)
Fewer problems with timing discrepancies
More timely data

The Data Satisfies The Needs Of The Business – The data has value to the enterprise. High quality data is useless if it’s not the data needed to run the business. Marketing needs data on customers and demographic data, Accounts Payable needs data on vendors and product information.
The User Is Satisfied With The Quality of the Data and the Information Derived from that Data – While this is a subjective measure, it is, arguably, the most important indicator of all. If the data is of high quality, but the user is still dissatisfied, there are data management challenges to address.
The Data Is Complete – All the data expected to be stored is available.
There Are No Duplicate Records – A mailing list would carry a subscriber, potential buyer, or charity benefactor only once. You will only receive one letter that gives you the good news that “You may already be a winner!”
Data Anomalies Are Addressed – From the perspective of IT, this may be the worst type of data contamination. A data anomaly occurs when a data field defined for one purpose is used for another. For example, a currently unused, but defined field is used for some purpose totally unrelated to its original intent. A clever programmer may put a negative value in this field (which is always supposed to be positive) as a switch.

Design Reviews

An important set of information to be included in design reviews is the requisite quality of the data under consideration and the actual state of the data. The basic question to be asked is “How clean, timely, etc. must the data be?” In the design review, the team members would consider the data source, the process of update and delete, and the quality controls imposed on those accessing the data, and the relevant data governance policies.

The Design Review would review and validate that standards are followed. The review process may make recommendations to clean up the data, establish strict controls on shared updating, and assure sufficient training for users who would query the data.

Assessment of Existing Data Quality

As people overestimate the intelligence of their grandchildren and the sweet nature of their dogs, organizations overestimate the quality of their own data. Generally, a reality check is needed. Poor quality data can be detected in a number of ways:

Programs that abnormally terminate with data exceptions.
Clients who experience errors or anomalies in their reports and transactions and/or don’t trust their reports or don’t trust the data displayed on their screens.
Clients who don’t know or are confused about what the data actually means.
On-line inquiry transactions and reports that are useless because the data is old.
Data that cannot be shared across departments due to lack of data integration.
Difficulty for clients to get consolidated reports because the data is not integrated.
Programs that don’t balance.
In the consolidation of two systems, the merged data causes the system to fail.

Quality may be free but data quality does require an initial investment. It takes people and resources to bring data to the desired high quality state. If data is allowed to remain in its current (dirty) state, there may be a substantial cost and disruption to the organization. Very few organizations understand the costs and exposures of poor quality data. An assessment shows the current state of data quality as well as other components in data management.

Effects of Poor Quality Data

Data is an asset but it can only be an asset if the data is of high quality. Data can also be a liability if it is inaccurate, untimely, improperly defined, etc. An organization may be in a better position not having certain data than having inaccurate data, especially if those relying on the data do not know of its inaccuracy. A hospital would be better off not knowing a patient’s blood type than believing and trusting it to be “O+.”

Which Data Should Be Improved?

It should be obvious that it’s impossible to improve the quality of all the data in an installation. The prioritization is much like triage. The energy should be spent on data where the quality improvement will bring an important benefit to the business. Other criteria that would suggest data improvement is data that can be fixed and kept clean. Unimportant data can be ignored. Data that will become obsolete can also be bypassed. Examples are:

The business will be bought
The data will be converted because of a new application
A re-engineering of the business will cause the identified poor data to be retired

There will be wide variations in the costs to clean different files and databases. This will enter into any decision about which data to purify. The cost of perfectly accurate data may be prohibitive and may not be cost effective. Based on the source of the data, total accuracy may also be an impossibility.

Users of data may be willing to settle for less than totally accurate data. Even so, it is important that the users know the level of quality they are getting. A greeting card company asked their retailers to measure the number of linear feet devoted to that company’s card products. Those who analyzed the data knew the data to be inaccurate but preferred inaccurate data to no data at all. A large computer manufacturer asked their marketing representatives and technical engineers to report on how they spent their time. It was well known that the respondents were not keeping very good records themselves and their reports reflected the lack of their concern for accuracy. Those who analyzed the data knew of the inaccuracies but were looking for trends and significant changes to indicate shifts in how jobs were performed. The inaccurate data, in both of these cases, was acceptable to the relevant user community.

Data Purification Process

To clean up the data, the following steps should be followed:

Determine the importance of data quality to the organization.
Assign responsibility for data quality, usually to data stewards and other subject matter experts.
Identify the enterprise’s most important data.
Evaluate the quality of the enterprise’s most important data.
Determine users’ and owners’ perception of data quality. Users will convey their understanding of the data’s quality and may indicate why the data has problems.
Prioritize which data to purify first.
Assemble and train a team of data stewards and data quality professionals to clean the data.
Select tools to aid in the purification process.
Review data standards.
Incorporate standards across the organization to ensure that all initiatives and projects deliver high quality data.
Provide feedback and promote the concept of data quality throughout the organization.

Roles and Responsibilities

The creation and maintenance of quality data is not the sole province of any one department. The importance of quality data must be understood by senior management and expressly communicated throughout the organization. Words are not as important as deeds. When quality measures appear in performance plans, reviews, and bonuses, people finally believe that quality is important. It is equally important that time and resources be allocated to development schedules to support management’s commitment to quality.

Data Quality and Data Warehouse

Bad data should never be allowed into a data warehouse or BI/analytics environment unless the problems are recognized and acknowledged by those who will use the data. Whenever possible, the data should be validated and purified prior to extraction. If bad data enters the data warehouse, it may have the effect of undermining the confidence of those who access the data. Business users and IT must be able to rely on the data, regardless of whether it is detailed, summarized, or derived.

The effort to clean up data once it is in the data warehouse becomes a major and never-ending task. It should not be the responsibility of those administering the data warehouse to clean up bad data. The cleanliness standard puts an additional burden on the data stewards to perform validations of the source data.

Assessing the Costs of Poor Quality Data

It will be difficult to assign real dollars to most of these categories. If estimates in real dollars are possible, conservative numbers should always be used. When an organization has experience with any of the following problems and if the costs of fixing those problems have been calculated, those figures can be assigned.

Bad decisions due to incorrect data.
Lost opportunities because the required data was either unavailable or was not credible.
Time and effort to restart and rerun jobs that abnormally terminated due to bad data.
In a buyout situation, accepting too low a price for the business because it cannot properly demonstrate the business potential, or the perception of disarray or incompetence because reports and other business intelligence results are inconsistent.
Fines imposed by regulating authorities for non-compliance or violating a governmental regulation as a result of bad data.
Time and resources to fix a problem identified in an audit.
Hardware, software, and programmer/analyst costs as a result of redundant data.
The costs and repercussions of bad public relations due to bad or inconsistent data. An example would be a public entity unable to answer questions from the press or from their Board of Directors.
Time wasted by managers arguing and discussing inconsistent reports which are the result of bad data.
Poor relations with business partners, suppliers, customers, etc. due to overcharging, underpayment, incorrect correspondence, shipping the wrong product, etc.
The time spent correcting inaccurate data. These corrections may be performed by line personnel or by IT.
The costs of lost business in operational systems because of poor quality data (data was wrong or non-existent). An example is the lost marketing opportunity for an insurance company that does not have accurate information about a client and thus loses the opportunity to market an appropriate insurance product.

Data Quality Feedback to Senior Management

Unlike measurements of performance and availability, the quality of data will not change daily. Quality can be quickly compromised by operating procedures that cause improper batch updates. Those responsible for data will want to make periodic checks to determine trends and progress in improving data quality. The results should be reported to IT management and to the departments that own the data.

The quality of data can be measured, but before any measurement takes place, the following questions should be answered:

Why is the quality of the data being measured? – The classic answer is that without measurement, management of the data is impossible.
What is being measured? – Some possibilities include:
1. trends, i.e. is the data getting cleaner or dirtier?,
2. user satisfaction with the quality of the data
What will be done with the measurements? – Some possibilities include: 1) focus on the data that needs to be purified, 2) provide a basis for cost justifying the purification effort, and 3) give information for prioritizing the cleanup process.

Conclusion

There are various categories of data quality and a variety of ways to identify data quality problems and solutions. Data is a critical asset for every enterprise. The quality of the data must be maintained if the enterprise is to make effective use of this most important asset. Improvements in data quality do not just happen; they are the result of a diligent and continuing process of improvement.

Sid Adelman

Sid Adelman founded Sid Adelman & Associates, an organization specializing in planning and implementing Data Warehouses. He has consulted and written exclusively on data warehouse topics and the management of decision support environments.