Accuracy, more than most of the other data quality characteristics, is often considered to be the foundational dimension of good data quality
Accuracy is one of the 16 characteristics or dimensions of data quality. It is considered to be the foundation for good data quality. It is also the traditional starting point for any continuous improvement program. It is one of the data quality dimensions that is categorized as being intrinsic to the data itself. A piece of data is accurate if it reflects the true nature of the object, event, or concept that it represents. All customers want the data they need and use to be correct. They want the data values they are using to agree with the event that has occurred or is occurring, to reflect the proper physical characteristics of an object such as its height, width, length and weight, or to properly record the financial condition of the organization. Executives expect that the organization’s data is accurate.
Data Accuracy Examples
People have long advocated that the best way to measure data quality is to measure its accuracy. In some instances, this is easy to accomplish. One example is shipping goods from a manufacturer to a retailer. The size and weight of the goods shipped are important for transportation regulations and taxes, storage space restrictions, and the correct execution of self-scanning check-out processes within retail stores. In this case it is easy for all parties to check the accuracy of this data. Each user can independently measure one or more instances of the physical object to determine whether the data they and their computers are using is correct.
It is also easy to quantify the inaccuracy and to communicate occurrences of poor data quality. Another example would be the date of birth for an employee. While there is no direct measurement that can be applied for this data value, the employee can review the data and indicate if it is recorded correctly.
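The shipping example lends itself to a simple calculation. The Python sketch below compares recorded dimensions with independently measured ones and reports the share of values that fall within a tolerance. The field names, the sample values, and the 2% tolerance are all assumptions chosen for illustration; they do not come from any real shipment or standard.

```python
from math import isclose

# Hypothetical values: what the system has on record versus what an
# independent measurement of the same physical item produced.
recorded = {"height_cm": 30.0, "width_cm": 20.0, "length_cm": 40.0, "weight_kg": 2.5}
measured = {"height_cm": 30.1, "width_cm": 20.0, "length_cm": 45.0, "weight_kg": 2.5}


def accuracy_report(recorded, measured, rel_tol=0.02):
    """Return (accuracy rate, fields whose recorded value is outside tolerance)."""
    inaccurate = [
        field
        for field, value in recorded.items()
        if not isclose(value, measured[field], rel_tol=rel_tol)
    ]
    rate = 1 - len(inaccurate) / len(recorded)
    return rate, inaccurate


rate, inaccurate = accuracy_report(recorded, measured)
print(f"accuracy rate: {rate:.0%}, inaccurate fields: {inaccurate}")
```

With these sample values, three of the four fields agree within tolerance and only the length is flagged. A count like this is what makes the inaccuracy easy to communicate. The date-of-birth case has no such measurement and depends on the employee's review instead.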
Issues with Determining Data Accuracy
There are instances where determining the accuracy of the data is more difficult. If the data represents historical facts, there may be no physical evidence with which the data can be corroborated. In some instances, it may be possible to have more than one correct value. One example is the term “net profit.” Currently, there are two major standards used for determining net profit. The United States uses the Generally Accepted Accounting Principles (GAAP), and much of the rest of the world uses the International Financial Reporting Standards (IFRS). Differences between the two accounting systems could make it difficult for investors to compare companies, even firms in the same industry. Under U.S. GAAP, research and development costs, for example, generally are expensed when they occur. Under the international standards, once a project reaches the development stage, qualifying costs are capitalized and spread out over time. Therefore, a company could show different operating income and net profit depending on which system it uses.
Currently, foreign organizations filing financial reports in the United States must reconcile the IFRS data with GAAP and highlight the differences. Under the current rules and reporting restrictions, there are two different values for “net profit” that are correct, because the term can describe two different conditions. If an organization were marginally profitable, the difference could determine whether it reports a profit or a loss. Both representations would be accurate. Interpretation of the results, however, could be vastly different and lead decision-makers who use the data to take different actions.
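A small worked example shows how a marginally profitable organization can land on either side of zero. The Python sketch below is a deliberate simplification, not an implementation of either standard. It assumes that all of the development spending qualifies for capitalization, that amortization is straight-line with a full year charged in the first year, and that taxes are ignored. All figures and function names are hypothetical.

```python
def profit_when_expensed(revenue, other_costs, development_costs):
    """Development costs hit the income statement in full when incurred."""
    return revenue - other_costs - development_costs


def profit_when_capitalized(revenue, other_costs, development_costs, amortization_years):
    """Development costs are spread evenly over the amortization period."""
    annual_charge = development_costs / amortization_years
    return revenue - other_costs - annual_charge


# Hypothetical first-year figures, in thousands of a single currency.
revenue = 1_000
other_costs = 950
development_costs = 100

expensed = profit_when_expensed(revenue, other_costs, development_costs)
capitalized = profit_when_capitalized(revenue, other_costs, development_costs, 5)

print(f"expensed immediately: {expensed}")   # a loss
print(f"capitalized, 5 years: {capitalized}")  # a profit
```

The same underlying activity yields a loss of 50 under the first treatment and a profit of 30 under the second. Neither number is wrong under the rules that produced it.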
Another example of the challenges associated with accuracy involves the difference between accounting data and data used to record specific activity within an organization. When a manufacturing firm was developing an addition to its global data warehouse to cover the purchasing of raw materials on a global basis, it tried to verify the data collected against previous financial statements and their supporting documents. The data for the financial statements were collected in each subsidiary and sent to the corporate offices for preparation of the corporate financial statements.
The intent of the data collection for the global data warehouse was to develop information about the purchase of raw materials from vendors on a global basis. The subsidiary financial data was collected using spreadsheets which were completed in each subsidiary from data processed in the corporate accounting system. The data for the global data warehouse was captured directly from existing computer systems used by the subsidiary.
When checking the data, the team discovered errors that occurred while transforming the data from the spreadsheets into the corporate accounting system. A decimal point had been misplaced, resulting in a difference in the millions of dollars. This error occurred because the base unit of measurement differed between the operational systems and the spreadsheets. In the operational systems, the monetary values were recorded in actual currency units, while the spreadsheets recorded the data in thousands of dollars. The decision was made to record the data in the global data warehouse as collected from the operational systems and not to restate the financial statements, because the difference was judged not to be material for analysis.
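A unit-of-measure mismatch of this kind leaves a recognizable signature: the two figures differ by a clean power of ten. The Python sketch below shows one way such a check could look. The function name, the 1% tolerance, and the sample amounts are assumptions for illustration and do not describe the firm's actual reconciliation process.

```python
import math


def scale_mismatch(operational_amount, reported_amount, rel_tol=0.01):
    """Return the power of ten separating two positive amounts.

    0 means the amounts agree, 3 means one is in units and the other in
    thousands, and None means the gap is not a clean power of ten (so it
    is probably not a unit-of-measure problem).
    """
    if operational_amount <= 0 or reported_amount <= 0:
        return None
    exponent = round(math.log10(operational_amount / reported_amount))
    rescaled = reported_amount * 10 ** exponent
    if math.isclose(operational_amount, rescaled, rel_tol=rel_tol):
        return exponent
    return None


# Hypothetical amounts: the operational system records currency units,
# while the spreadsheet records thousands.
print(scale_mismatch(4_250_000, 4_250))      # 3
print(scale_mismatch(4_250_000, 4_250_000))  # 0
print(scale_mismatch(4_250_000, 3_000_000))  # None
```

A check like this only flags candidates. Deciding which figure is the accurate one still requires knowing the unit each source was meant to use.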
This leads to an interesting data quality question about accuracy. There were now two sets of data, each claiming to be accurate, that existed in the global data warehouse for the same activity in the organization. In this instance, how important was it to the organization that the two numbers be identical? What was the definition of “accuracy” for this activity? Two people using these two sets of data to analyze and recommend a possible course of action for the organization could reach different conclusions. Was the organization comfortable with this situation?
While data accuracy has to be the foundation for high data quality within an organization, its definition can be (and should be) debated and resolved within an organization. Furthermore, it is not the only characteristic that determines the level of data quality. Organizations should examine other data characteristics that affect data quality and the interaction between these different data quality dimensions.