Challenges in Achieving Data Quality Accuracy

Determining “accuracy” is not always an easy task, and not all data lends itself to a single measure of data quality accuracy.

Several recent articles have brought to light how difficult it can be to determine and maintain the data quality characteristic “accuracy” for a piece of data or information. In a recent Wall Street Journal article about global warming, NASA’s Goddard Institute for Space Studies reported that it had revised its records of which year was the warmest on record. The new information indicates that the warmest year on record is 1934, not 1998. In addition, it was discovered that NASA had made a technical error in standardizing the weather/air temperature data for the United States for the period from 2000 to the present. Based on the old data, it was reported that six of the 10 hottest years on record had occurred since 1990. The revised data indicates that six of the 10 hottest years occurred in the 1930s and 1940s. Newer measurement methods allow more precise calibration, and differing definitions of temperature gradations, taken at different locations, can account for the revisions.

Insurers Require Accurate Loss Data

The New York Times has devoted several articles to the methods used to estimate the likelihood of common risks and natural disasters, since an insurance company can function only if it is able to control its exposure to loss.

Auto insurance companies cannot forecast an individual automobile accident, but the total number of accidents over a large population is remarkably predictable. The company knows from past experience what percentage of the drivers it insures will file claims and how much those claims will cost. However, the risk associated with insuring against natural catastrophes is much different. The Times’ articles relate how the insurance industry was unprepared for the claims generated by several major storms, including Hurricane Andrew in 1992, Hurricane Katrina in 2005, and Hurricane Sandy in 2012. Insurance companies had ignored historical data about the frequency of hurricanes hitting the U.S. coast, and they had not accounted for the tremendous growth in property value along the coasts most vulnerable to hurricane landfalls. They had underestimated, and failed to understand, the effects of losses from such catastrophes. Insurance premiums had been set too low because insurers and risk managers had not properly assessed the risk involved in insurance policies related to such disasters, even though accurate data was available to them concerning the relative risk and the magnitude of potential losses. As an example, the losses in Dade County (Miami) from Hurricane Andrew exceeded all the insurance premiums ever collected in that county.

Revenue and Profit Predictions Need Accurate Data

Another example involves global organizations that report trends in sales. Most report year-over-year sales trends, usually presented as a percentage increase or decrease in global sales volume. Many times, the report does not take into account variations due to inflation or currency exchange rates. If an organization is concerned only with a one-year trend for reporting purposes, these variations tend to be small and probably have little impact on any decision based on the calculated change.
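To see how even a one-year comparison can be distorted, consider a minimal sketch in Python; the amounts and exchange rates below are hypothetical. Local-currency sales are flat, yet the trend reported in the conversion currency shows growth that comes entirely from the exchange-rate movement.

```python
# Hypothetical figures: local-currency sales are flat year over year,
# but the reported trend in the conversion currency shows "growth".
sales_local_prior = 100_000_000    # prior-year sales in the local currency (e.g. EUR)
sales_local_current = 100_000_000  # current-year sales in the local currency

fx_prior = 1.05    # conversion rate used for the prior year (reporting units per local unit)
fx_current = 1.15  # conversion rate used for the current year

reported_change = (sales_local_current * fx_current) / (sales_local_prior * fx_prior) - 1
constant_currency_change = sales_local_current / sales_local_prior - 1

print(f"Reported year-over-year change: {reported_change:+.1%}")           # about +9.5%
print(f"Constant-currency change:       {constant_currency_change:+.1%}")  # +0.0%
```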

If the trend is calculated over a longer period of time, these variations can have a large impact on any decision based on the trend. An organization can overcome some of these issues by maintaining a series of related pieces of data. Sales amounts in the original currency would be maintained because that is what the local operation would use. When those original sales amounts are converted to another currency, the exchange rate used to make the conversion should be maintained as well. Additionally, the inflation rates for both currencies should be stored and related to the sales figures so the analysis can be adjusted appropriately in the future. By keeping all the basic data and metadata (context for the data values), organizations retain the ability to calculate and report trends appropriate to the situation.
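A minimal sketch of such a record might look like the following; the field names, sample values, and adjustment functions are illustrative assumptions rather than a prescribed standard. The point is that keeping the original-currency amount together with the exchange rate and inflation rates used lets an analyst recompute the trend on whichever basis a later question requires.

```python
from dataclasses import dataclass

@dataclass
class SalesRecord:
    """One period's sales figure plus the context (metadata) needed to recompute trends later."""
    period: str                   # e.g. "2024"
    amount_local: float           # sales amount in the original (local) currency
    local_currency: str           # ISO code of the original currency, e.g. "EUR"
    fx_rate_to_reporting: float   # exchange rate used for conversion (reporting units per local unit)
    local_inflation: float        # local-currency inflation rate for the period
    reporting_inflation: float    # reporting-currency inflation rate for the period

def nominal_trend(prior: SalesRecord, current: SalesRecord) -> float:
    """Year-over-year change in the reporting currency, as typically published."""
    prior_reported = prior.amount_local * prior.fx_rate_to_reporting
    current_reported = current.amount_local * current.fx_rate_to_reporting
    return current_reported / prior_reported - 1

def constant_currency_real_trend(prior: SalesRecord, current: SalesRecord) -> float:
    """Change in the local currency, deflated by one year of local inflation:
    strips out both exchange-rate and price-level effects (a simplification)."""
    real_current = current.amount_local / (1 + current.local_inflation)
    return real_current / prior.amount_local - 1

prior = SalesRecord("2023", 100_000_000, "EUR", 1.05, 0.030, 0.040)
current = SalesRecord("2024", 104_000_000, "EUR", 1.15, 0.025, 0.030)

print(f"Nominal trend (reporting currency): {nominal_trend(prior, current):+.1%}")
print(f"Constant-currency, inflation-adjusted trend: {constant_currency_real_trend(prior, current):+.1%}")
```

Because the exchange rates and inflation rates travel with the sales figures, either view can be produced on demand instead of being locked into whatever basis was used when the numbers were first reported.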

As an aside, users of trend information from sources outside their own organization should always question how the numbers were calculated so they fully understand what the numbers convey. To support analytics, organizations can add contextual information about the foundational data, allowing users to make more robust trend calculations.

Accuracy Approaches Change over Time

One may ask: what is the point of these anecdotes? There was never a question about the accuracy of the underlying basic data that measured the phenomena associated with the calculations. The NASA revision did not change the basic weather data; it changed the method used to standardize the data so it could be compared over time more reliably. Similarly, improved collection and understanding of risk, loss, and claims data can improve the accuracy of insurance actuarial science and results. Corporate financial data needs attention to an expanded view of accuracy as well. Experience shows that basic data captured in immediate transactions, such as an order, a claim, or a shipment, tends to be very accurate. Most accuracy issues arise for basic data that is not used in such a transaction.

When data is generated that is not subject to a normal feedback cycle, such as a customer reviewing an invoice or a claims agent reviewing an insurance claim, data errors are more prevalent.  When this data is used to calculate another number, it is easy for the errors in the underlying data to get lost, and over time, for those errors to become undetectable. It is also easy for numbers to become less accurate when they are aggregated.
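As a rough illustration (the error rates and amounts below are invented), the following sketch simulates detail records that never pass through a review step; once summed, a material error hides inside an otherwise plausible-looking total.

```python
import random

random.seed(7)

def capture(amount: float) -> float:
    """Simulate data entry with no feedback cycle: most values are recorded correctly,
    but a few carry errors that nothing downstream will flag."""
    r = random.random()
    if r < 0.002:
        return amount * 10                            # rare decimal-shift keying error
    if r < 0.05:
        return amount + random.choice([-50.0, 50.0])  # occasional minor keying error
    return amount

true_amounts = [round(random.uniform(900, 1100), 2) for _ in range(1000)]
captured_amounts = [capture(a) for a in true_amounts]

true_total = sum(true_amounts)
reported_total = sum(captured_amounts)

print(f"True total:     {true_total:,.2f}")
print(f"Reported total: {reported_total:,.2f}")
print(f"Hidden error:   {reported_total - true_total:+,.2f} ({reported_total / true_total - 1:+.2%})")
```

The individual mistakes are invisible once the records are aggregated; only a comparison against an independent source would reveal them.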

Conclusion

There are many dimensions that constitute high-quality data. Accuracy means much more than simply asking “is this value correct?” Organizations must identify the full spectrum of these dimensions and understand each one completely when defining it for use within the enterprise.


Richard Y. Wang, Ph.D.

Richard Y. Wang is Director of the MIT Chief Data Officer and Information Quality (CDOIQ) Program. He is a pioneer and leader in research and practice related to the role of the Chief Data Officer (CDO). Dr. Wang has significant credentials across government, industry, and academia, focused on data and information quality. Dr. Wang was a professor at the MIT Sloan School of Management; he was appointed as a Visiting University Professor of Information Quality at the University of Arkansas at Little Rock. He is an Honorary Professor at Xi’An Jiao Tong University, China.
Wang has written several books on data/information quality and has published numerous professional and research articles on the topic. He received a Ph.D. in Information Technology from the MIT Sloan School of Management.
