ISO 8000 defines the parameters of data quality so that one can ask for quality data and verify that one has received quality data.
It is common to see the relationship between data and information represented as a pyramid, with data as the base rising through information and knowledge to wisdom at the apex. In researching the origin of this hierarchy, one discovers that the original reference may be The Rock (1934), a poem by T. S. Eliot (1888-1965) that includes the lines: “Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?” Of course, in Eliot’s poem data was conspicuous by its absence, but it is not hard to correct the omission of so important a concept; data is commonly added as the solid foundation of a rising and imposing pyramid (see diagram 1).
Diagram 1: Data-Information-Knowledge-Wisdom Pyramid
The Semantic Conceptions of Information was first published in the Stanford Encyclopedia of Philosophy in October 2005, and it remains a seminal work on the subject. The definition of a datum as “a disruption in a continuum” is as brilliant as it is minimalist (a black dot on a white sheet of paper is a datum), especially when extended to the definition of information as “meaningful data.”
While developing ISO 8000, the international standard for data quality, the committee spent days debating the definitions of both data and information before ruling that, as the standard was not about information, there was no requirement to define it. As for “data”, the committee took the position, allowable in standards development, that a term used as it is defined in the Oxford English Dictionary (OED) need not be defined again, and in the end the committee did not define it. “The Professor and the Madman” by Simon Winchester is a great story that also provides a good explanation of what it takes to work on definitions. As the project leader of the ISO 8000 standard, this author agrees with the editor of the standard, Dr. Gerald Radack, on the practical definition of data as “the electronic representation of information”. The term “electronic” was added to avoid having to cover other forms of data representation, of which there are many.
The trouble with the definitions of data as the electronic representation of information and of information as meaningful data is that they are circular. If someone wishes to dispute this modest technicality, please refer to the true definition of a datum as “a disruption in a continuum.” Of course, there is also the chicken-or-egg problem: which came first, data or information? It appears that for all practical purposes one should look upon data not as the base of the pyramid but as the transfer medium for information (so information did come first). Using this model, one can clearly identify three separate processes: (1) the translation of information into data, (2) the transfer of data, and (3) the translation of data into information (see diagram 2).
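The three processes can be sketched in a few lines of Python. This is a minimal illustration, not anything defined by ISO 8000: information is represented here as text, data as bytes, and the transfer step is a stand-in for any real transport.

```python
# A minimal sketch of the three processes in diagram 2: information is
# translated into data (here, UTF-8 bytes), the data is transferred,
# and the receiver translates the data back into information.
# All names are illustrative, not taken from the standard.

def translate_to_data(information: str) -> bytes:
    """Process 1: translate information into data."""
    return information.encode("utf-8")

def transfer(data: bytes) -> bytes:
    """Process 2: transfer the data (a no-op stand-in for a network or file)."""
    return bytes(data)

def translate_to_information(data: bytes) -> str:
    """Process 3: translate data back into information."""
    return data.decode("utf-8")

received = translate_to_information(transfer(translate_to_data("a black dot")))
print(received)  # -> a black dot
```

The point of the model is that the data in the middle carries no meaning by itself; both endpoints must agree on how the translation is done.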
Diagram 2: Data Transfer and Translation into Information Processes
In developing ISO 8000, the committee focused on the characteristics of data quality that it could measure objectively (or as near to objectively as possible). The first of these characteristics was syntax. One of the experts who participated in the early development of the standard unwittingly provided a definitive example. He sent the group a paper arguing that there was no difference between the concepts of information and data, and he chose to send the document in the well-known Adobe Portable Document Format (PDF). The document attached to his email was received by all the members of the committee with a pdx file extension. What was dismissed as a mere “technical glitch”, easily solved by renaming the file, is in fact a clear demonstration of what became the first characteristic of data quality: the identification of syntax. ISO 8000 requires that data clearly identify the syntax used to format it. Suffice it to say that the argument that data and information were the same concept was also proven incorrect, and this conclusion can be reached without even having to read the argument; sometimes one gets lucky.
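The anecdote can be made concrete. A file extension is a label attached to the data, not part of it; the syntax can often be identified from the content itself. The sketch below relies on one well-documented fact, that PDF files begin with the bytes “%PDF-”; the rest is an illustrative fragment, not the normative ISO 8000 test.

```python
# A minimal sketch of syntax identification: inspect leading "magic"
# bytes instead of trusting a file extension. PDF files genuinely begin
# with "%PDF-"; whether the file is named .pdf or .pdx is irrelevant
# to the data itself.

MAGIC_NUMBERS = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",  # also the container for DOCX/XLSX
}

def identify_syntax(payload: bytes) -> str:
    """Return a syntax label for the payload, or 'unknown'."""
    for magic, syntax in MAGIC_NUMBERS.items():
        if payload.startswith(magic):
            return syntax
    return "unknown"

# The mis-named attachment from the anecdote would still be identified:
print(identify_syntax(b"%PDF-1.4 ..."))  # -> application/pdf
```

ISO 8000 asks for something stronger than this kind of guessing: the data should declare its syntax explicitly, so the receiver never has to infer it.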
Having neatly resolved the syntax issue by simply stating that one is required, the committee turned its attention to the second characteristic, semantic encoding. Here again, there are some very strong opinions and not a lot of agreement. On the one side there are the ontologists, and on the other the terminologists, and never the twain shall meet. (Two of the most ardent proponents of the two extreme positions, models versus terminology, are both French but have yet to meet, and this may be for the best: Dr. Gerald Radack believes that putting the two in the same room might have the same result as bringing matter and antimatter together.)
In the end the committee, while not excluding the ontology view, took the terminology view: as currently written, ISO/TS 8000-110:2008 requires that semantic encoding be accomplished by including explicitly defined metadata or, preferably, a reference to an external open technical dictionary.
Having covered the technical characteristics of data quality, the committee was left with the fundamental quality characteristic. Quality is defined in ISO 9000 as the “degree to which a set of inherent characteristics fulfills requirements”, so if data is to be considered “quality data” it must meet defined data requirements.
Luckily, there is a companion standard, ISO 22745, that defines how data requirement statements should be constructed, and the standard provides examples in XML. ISO 22745 also defines a format for the exchange of encoded data, again with examples in XML. In practice, implementing ISO 8000-110:2008 is not as daunting a task as it appears.
An analyst could create an ISO 8000-110:2008 compliant spreadsheet as long as it included definitions for the labels used as column headers. In reality, however, the objective of the guideline is that the request for data and the reply to the request should be automated. This is achieved by using the ISO 22745 compliant ECCMA Open Technical Dictionary (eOTD) to encode both the request and the reply. The process starts by creating a data requirement statement, which in the data cleaning industry is called an Identification Guide or a cataloging template; typically this is created in the eOTD-i-xml format.
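The spreadsheet example above reduces to a simple check: every label used as a column header must have a definition. The sketch below is hypothetical and illustrative, not the normative ISO 8000-110 conformance test.

```python
# A minimal sketch of the spreadsheet example: tabular data plus
# explicit definitions for every label used as a column header.
# The headers and definitions are invented for illustration.

definitions = {
    "part_number": "the identifier assigned to a part by its manufacturer",
    "diameter_mm": "the diameter of the part, measured in millimetres",
}

rows = [
    {"part_number": "A-1001", "diameter_mm": 25},
    {"part_number": "A-1002", "diameter_mm": 30},
]

def undefined_headers(rows, definitions):
    """Return the column headers that lack a definition."""
    used = {header for row in rows for header in row}
    return sorted(used - definitions.keys())

missing = undefined_headers(rows, definitions)
print("compliant" if not missing else f"undefined headers: {missing}")
```

Automating the request and reply, as the standard intends, amounts to replacing the hand-written definitions above with references into the eOTD so that both sides of the exchange resolve every label the same way.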
This identification guide is then used to create a request for data as an eOTD-q-xml formatted email attachment. Finally, the reply is returned in eOTD-r-xml format. Both the query and the reply fulfill the requirements of ISO/TS 8000-110:2008 for quality master data.
The real power behind ISO 8000 is that not only can one define what is and is not quality data, but one can also ask for quality data and verify that one has received it.