The Semantic Web facilitates the discovery, sharing, and use of information. Data Governance and Data Stewardship programs can take advantage of this capability by implementing semantic technologies.
Data management professionals should understand the Semantic Web and semantic technologies, and how they affect different components of Enterprise Information Management (EIM). The Semantic Web and semantic technologies can help enable the most critical aspect of EIM: Data Governance and Data Stewardship.
Overview of the Semantic Web
According to the W3C, “The Semantic Web is a Web of data.” The Semantic Web and semantic technologies make it easier to find, share, and use information, and they increase opportunities for automation. For example, NASA is using semantic technologies to find experts in its organization of over 70,000 civil servants.
The World Wide Web (WWW) is a “web of documents,” whereas the Semantic Web is a web of data – identifying and connecting the data elements within web documents. The WWW is being transformed into the GGG (Giant Global Graph) with interconnected, discrete data elements – basically, a giant database. Many public US government datasets have been serialized in RDF/XML (RDF is a key Semantic Web enabling technology) on data.gov.
The Semantic Web is enabled by semantic technologies, most notably RDF (Resource Description Framework) and OWL (Web Ontology Language). RDF describes data resources using a W3C standard, which makes it possible to connect disparate information from across the internet (or an intranet).
For example, XML or HTML documents can be encoded with the Dublin Core “creator” metadata element to identify the author of the document. Doing so makes it possible to identify the documents authored by a particular writer, rather than running a Google-style keyword search on the author’s name, which retrieves every document that mentions the name and not just the documents the writer authored.
For internal documents, the organization could encode data with Dublin Core “creator” (and other DC elements) so that information can be found more effectively within the enterprise. Dublin Core is an example of a standard vocabulary expressed in RDF. Other RDF-based vocabularies include FOAF (Friend of a Friend, for social networking) and SKOS (Simple Knowledge Organization System, for describing concepts). Organizations should take advantage of the many existing RDF-based vocabularies to describe their data and avoid reinventing the wheel. Of course, each enterprise would probably define its own vocabularies as well.
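As a minimal sketch of the difference between a metadata lookup and a keyword search, the following Python snippet models a few document descriptions as subject-predicate-object triples. Only the Dublin Core namespace is a real identifier; the document URIs, titles, and author names are hypothetical, and a production system would use an RDF library and triple store rather than plain tuples.

```python
# Dublin Core element namespace (real). Everything else is made up for illustration.
DC = "http://purl.org/dc/elements/1.1/"

# Each triple is (subject, predicate, object).
triples = [
    ("http://example.org/docs/1", DC + "creator", "Jane Smith"),
    ("http://example.org/docs/1", DC + "title", "Quarterly Data Quality Report"),
    ("http://example.org/docs/2", DC + "creator", "Raj Patel"),
    ("http://example.org/docs/2", DC + "title", "Interview with Jane Smith"),
]

def documents_created_by(graph, author):
    """Return subjects whose dc:creator is exactly the given author."""
    return sorted({s for s, p, o in graph if p == DC + "creator" and o == author})

def documents_mentioning(graph, text):
    """Naive keyword search: any subject with a value containing the text."""
    return sorted({s for s, p, o in graph if text in o})

authored = documents_created_by(triples, "Jane Smith")
mentioned = documents_mentioning(triples, "Jane Smith")

print("Authored by Jane Smith:", authored)
print("Mentioning Jane Smith: ", mentioned)
```

The keyword search returns both documents because the second one merely mentions the name in its title, while the “creator” lookup returns only the document the writer actually authored.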
Similarly, OWL can be used to relate ontologies (which contain taxonomies and thesauri about a domain) from many domains, providing a rich knowledge base in which information can be analyzed by drawing inferences from the data. Without it, the same conclusions would require time-consuming human investigation, specialized programs, or complex queries. Once data and metadata are described in RDF and tied to ontologies, many inferences follow simply from expressing the information in these languages.
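One of the simplest inferences of this kind is class membership implied by a taxonomy. The sketch below is an illustration only: the class names and the data object are hypothetical, and the plain-Python loop stands in for what an RDFS/OWL reasoner would do with subclass axioms.

```python
# Hypothetical taxonomy, written as (subclass, superclass) pairs.
SUBCLASS_OF = {
    ("DatabaseTable", "DataObject"),
    ("DataModelEntity", "DataObject"),
    ("DataObject", "InformationAsset"),
}

# Asserted facts: (individual, class).
ASSERTED_TYPES = {("crm.customer", "DatabaseTable")}

def superclasses(cls, hierarchy):
    """All direct and indirect superclasses of cls (transitive closure)."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for sub, sup in hierarchy:
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found

def infer_types(asserted, hierarchy):
    """Return asserted types plus every type implied by the hierarchy."""
    result = set(asserted)
    for individual, cls in asserted:
        for sup in superclasses(cls, hierarchy):
            result.add((individual, sup))
    return result

all_types = infer_types(ASSERTED_TYPES, SUBCLASS_OF)
for individual, cls in sorted(all_types):
    print(individual, "is a", cls)
```

Nobody asserted that `crm.customer` is an information asset; that fact follows from the taxonomy alone.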
Organizations can also use SWRL (Semantic Web Rule Language) to define their custom inferencing rules.
Semantic Web and Data Stewardship
A central and critical aspect of Data Stewardship is knowing where data resides, so that its definition, quality, security, and usage can be measured and improved. For example, an organization could define a rule in SWRL which says: “If a data object (e.g., database table, data model entity) has a primary key which includes an attribute equivalent to Customer Key or Customer Id, and the entity contains a Social Security Number or Last Name attribute, then the Data Steward for that data object is John Doe.”
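The logic of that rule can be sketched in plain Python. This is not SWRL syntax and not how an inferencing engine is implemented; it only shows the condition and the conclusion. The data objects, URIs, and the `hasDataSteward` property name are hypothetical, and “equivalent to” is simplified here to an exact name match, where a real ontology would carry explicit equivalence mappings.

```python
from dataclasses import dataclass

@dataclass
class DataObject:
    uri: str
    primary_key: set   # attribute names that make up the primary key
    attributes: set    # all attribute names on the object

# Simplification: equivalence is an exact match against these names.
CUSTOMER_KEYS = {"Customer Key", "Customer Id"}
PERSONAL_ATTRIBUTES = {"Social Security Number", "Last Name"}

def infer_steward(obj):
    """Apply the rule; return an inferred triple, or None if it does not fire."""
    has_customer_key = bool(obj.primary_key & CUSTOMER_KEYS)
    has_personal_attribute = bool(obj.attributes & PERSONAL_ATTRIBUTES)
    if has_customer_key and has_personal_attribute:
        return (obj.uri, "hasDataSteward", "John Doe")
    return None

objects = [
    DataObject("http://example.org/data/crm.customer",
               {"Customer Id"}, {"Customer Id", "Last Name", "Email"}),
    DataObject("http://example.org/data/sales.order",
               {"Order Id"}, {"Order Id", "Customer Id"}),
    DataObject("http://example.org/data/billing.account",
               {"Customer Key"}, {"Customer Key", "Balance"}),
]

inferred = [t for t in map(infer_steward, objects) if t is not None]
for triple in inferred:
    print(triple)
```

Only the first object satisfies both conditions. The second has no customer key in its primary key, and the third has no Social Security Number or Last Name attribute.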
Once the RDF and OWL data is ready, an inferencing engine can derive the inferences, which are in turn expressed in RDF/OWL. The source RDF/OWL triples and the inferred RDF/OWL triples can then be queried together as a single, enriched source of information.
It then becomes easy to find all of the data objects assigned to the business data steward “John Doe” with a SPARQL query.
The source RDF file also contains metadata about these data objects (i.e., resource descriptions). Because the results of this query are all URI references, the analyst can look up each result to find out more about the object, or continue the analysis with additional SPARQL queries. If there are other asserted or inferred RDF triples describing the data assets, the analyst can use SPARQL to measure and monitor the quality, definition, security, compliance with data standards, and usage of these data resources.
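The query-then-drill-down pattern can be illustrated with a small triple-pattern matcher in Python. It is a sketch under stated assumptions: the `ex:` namespace, property names, and data values are hypothetical, and the `match` helper mimics only a single SPARQL triple pattern, not the SPARQL language itself.

```python
# In SPARQL, the first question might look like this (prefix and property are hypothetical):
#   PREFIX ex: <http://example.org/governance#>
#   SELECT ?obj WHERE { ?obj ex:hasDataSteward "John Doe" . }

EX = "http://example.org/governance#"
CUSTOMER = "http://example.org/data/crm.customer"
ORDER = "http://example.org/data/sales.order"

asserted = [
    (CUSTOMER, EX + "storedIn", "CRM database"),
    (CUSTOMER, EX + "securityClass", "Restricted"),
    (ORDER, EX + "storedIn", "Sales database"),
]
inferred = [
    (CUSTOMER, EX + "hasDataSteward", "John Doe"),
]

# Asserted and inferred triples are queried together.
graph = asserted + inferred

def match(graph, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Step 1: which data objects does John Doe steward? The results are URI references.
stewarded = [s for s, _, _ in match(graph, p=EX + "hasDataSteward", o="John Doe")]

# Step 2: follow each URI to the rest of its description.
details = {
    uri: {p.replace(EX, ""): o for _, p, o in match(graph, s=uri)}
    for uri in stewarded
}
print(details)
```

The second step is where the asserted metadata pays off: the steward assignment was inferred, but the location and security class come from the source descriptions.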
Data Stewards and Ontologies/Vocabularies
A Data Steward (who has assigned responsibility, authority, and accountability for oversight of the definition, quality, usage, and security of a data subject area) also needs to be involved when new ontologies or vocabularies are developed, changed, or incorporated. These ontologies and vocabularies may be internally or externally developed. For example, assume that an HR department has an internally developed vocabulary (in RDF) for describing employees, which was approved for use across the enterprise by the Data Governance Committee / Board / Council. Assume one property in the vocabulary is “Employee Title.”
The HR or Employee data steward would want to ensure that, for employee data expressed in RDF, a generic FOAF (Friend of a Friend) property is not used when there are company-specific definitions and rules for “Employee Title,” and that “Employee Title” is part of the HR taxonomy. If a generic property is used instead of the defined “Employee Title” property to describe an employee’s title, the ability to analyze the data or draw inferences from it is impaired.
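A steward could automate part of this check. The sketch below assumes, for illustration, that the generic property in question is FOAF `title` (a real FOAF term) and that the approved enterprise property lives at a hypothetical `http://example.org/hr#employeeTitle`; the employee records are invented.

```python
FOAF_TITLE = "http://xmlns.com/foaf/0.1/title"              # real FOAF property
HR_EMPLOYEE_TITLE = "http://example.org/hr#employeeTitle"   # hypothetical enterprise property

# Generic properties the steward has ruled out, mapped to the approved replacement.
REPLACEMENTS = {FOAF_TITLE: HR_EMPLOYEE_TITLE}

employee_triples = [
    ("http://example.org/emp/1001", HR_EMPLOYEE_TITLE, "Senior Data Analyst"),
    ("http://example.org/emp/1002", FOAF_TITLE, "Data Architect"),
]

def vocabulary_violations(graph, replacements):
    """List (subject, property used, approved property) for each ruled-out property."""
    return [(s, p, replacements[p]) for s, p, _ in graph if p in replacements]

violations = vocabulary_violations(employee_triples, REPLACEMENTS)
for subject, used, approved in violations:
    print(f"{subject}: uses {used}, expected {approved}")
```

A check like this flags records that would otherwise be invisible to any query or rule written against the approved “Employee Title” property.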
Data Stewards should also be involved in, or have oversight of, testing ontologies before they are deployed, to ensure that they do not produce incorrect inferences.
Semantic Web and Data Governance
Data Governance interacts with semantic technologies principally through its governance and oversight function. A Data Governance board or council should make strategic decisions regarding:
- Sponsorship and oversight of semantic technology initiatives for EIM and other purposes
- Sponsorship of projects for the development of enterprise ontologies and vocabularies, and approval for use of these
- Approval of the use of external ontologies with enterprise data. For example, a data governance board for a hospital would make decisions around the use of a healthcare industry ontology, such as the Disease Ontology
- The sharing of corporate data in an RDF/OWL format outside of the enterprise
The last point is significant. Any sharing of data outside of the enterprise is a possible cause for concern, so a data governance oversight policy is necessary. When sharing data in RDF, it might be possible for an external party to draw unintended inferences. For example, if the last 4 digits of a phone number are shared along with other data, an external party could combine them with other pieces of data (e.g., type of disease, admitting hospital) to identify a patient, which might violate PHI (Protected Health Information) regulations.
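Before a dataset is released, a simple check can show whether a combination of shared fields singles out individual records. The following Python sketch uses invented records and field names, and it covers only the most basic form of re-identification risk (unique combinations within the dataset itself); it does not account for what an external party might already know.

```python
from collections import Counter

# Hypothetical records proposed for release.
records = [
    {"phone_last4": "1234", "disease": "Asthma", "hospital": "General"},
    {"phone_last4": "1234", "disease": "Diabetes", "hospital": "General"},
    {"phone_last4": "5678", "disease": "Asthma", "hospital": "General"},
    {"phone_last4": "5678", "disease": "Asthma", "hospital": "General"},
]

def unique_combinations(rows, fields):
    """Return value combinations of the given fields that occur in exactly one row."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]

# Shared together, the three fields single out two of the four patients.
risky = unique_combinations(records, ("phone_last4", "disease", "hospital"))

# The hospital alone singles out nobody.
safe = unique_combinations(records, ("hospital",))

print("Unique when combined:", risky)
print("Unique by hospital alone:", safe)
```

Each field looks harmless in isolation; the risk appears only when the fields are combined, which is exactly what linked RDF data makes easy.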
A Data Governance board might wish to form a sub-committee just to understand and oversee semantic technology initiatives.
Semantic technologies provide powerful capabilities for the governance, stewardship, and knowledge discovery of data and metadata resources. With inferencing, it is possible to uncover relationships which are not physically recorded or asserted. This enables improved understanding, measurement, and management of these data resources.
EIM initiatives that struggle to obtain a holistic view of enterprise data should investigate the use of semantic technologies. Enterprise ontologies and vocabularies are assets which should have the same degree of governance and stewardship as other data assets. Data governance and data stewardship are especially important when RDF datasets are shared outside of the enterprise, because damaging inferences drawn from them could put the organization out of regulatory compliance.