Getting Started On a Semantic Data Web Program

January 1, 2010, Peter Stiglich, CBIP

The Semantic Data Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries.

Introduction

Interest in the Semantic Web has skyrocketed, along with interest in the implications of semantic technologies for traditional data management. The promises of the Semantic Web (for example, finding information regardless of the terms used, getting back meaningful search results in the appropriate context, and improved machine responsiveness) seem to be the Nirvana that will resolve many information problems.

This article attempts to describe the reasons Data Management professionals should pay attention to the Semantic Web (or risk becoming less relevant), identify at a high level some of the key technologies involved, and help identify a roadmap for implementing Semantic Web technologies in any enterprise.

What Is The Semantic Data Web?

According to the W3C, “The Semantic Web is a Web of data.” However, with the dizzying array of technologies and standards involved in the Semantic Web, it is easy to forget that the Semantic Web is all about DATA. Suzanne Acar, Principal Data Architect at the FBI, recommended the title “Semantic Data Web” so that data is not lost from the picture.

Of course, the Semantic Data Web applies across the Internet, but semantic technologies are increasingly used by enterprises, inside organizations through the intranet, to help tie together the vast amounts of data contained in unstructured and semi-structured data sources (web pages, documents, images, email, XML, etc.) and structured data sources (databases, metadata repositories, etc.).

The Semantic Data Web is facilitated in large part by ontologies and Linked Open Data (which will be described later). An ontology is composed of “a collection of taxonomies and thesauri” about a domain. Concisely, an ontology (in this context) relates terms to other terms, so that humans can find and ask questions of the data more effectively, and computers can perform more tasks without human intervention. Indeed, ontologies can serve as the knowledge base for Artificial Intelligence (AI). Usage and deduction rules are also needed to make an ontology functional, but they are beyond the scope of this article. Many of these rules can be arrived at using standard Semantic Web technologies, e.g., OWL, SPARQL, XML.

Importance of the Semantic Data Web

Data Management professionals need to understand that data exists in a plethora of data sources, not just in databases. Indeed, by commonly cited estimates, approximately 80% of all data exists in semi-structured or unstructured data sources. Helping the enterprise leverage all of its data will be increasingly critical to gaining competitive and performance advantages and cost savings.

Semantic Data Web technologies will help piece together all of this information. In addition, data modelers have (or should have) the understanding and experience needed to develop ontologies, though perhaps not the technical knowledge, which can be picked up without too much difficulty. They know how to identify, name, and classify entities, instances, relationships, and properties, and these skills are directly applicable to developing taxonomies and thesauri.

A significant risk of data management professionals not being involved in this important activity is duplication of effort. Data modelers often understand the business, or at least a portion of it, very well. If data management professionals think that semantic web technologies fall only into the application development realm, then their knowledge will not be leveraged, and application developers may go about redefining ontologies that are already embedded and discoverable in the enterprise. In the Information Age, the business will increasingly expect to be able to find information regardless of where or how it is stored, and will expect more and more process automation. Data Management professionals need to understand, be a part of, and help apply mature metadata management practices to Semantic Data Web initiatives.

Starting a Semantic Data Web Initiative

As with any enterprise initiative, it is important to have a phased approach. Do not attempt to model everything in an ontology. Begin by identifying which applications can benefit from semantic technologies, and weigh this against an estimate of the degree of difficulty for each area (both business and technical). Identify one or a few key domains which will need to be modeled in the pilot project.

It should go without saying that implementing the Semantic Data Web in an enterprise should be based on solid business goals and objectives that come from the organization’s data strategy. This initiative should not be driven by IT (as with Data Warehousing, the “if you build it, they will come” approach will not work here either). Of course, a sandbox environment where Semantic Data Web technologies can be explored will be worthwhile to familiarize IT professionals with semantic technologies while a business case/ROI is developed and approved.

There are some related terms at the core of ontologies. RDF and OWL are two W3C standards for describing ontologies; they are complementary and can be used within the same ontology. However, before trying to model any ontologies in RDF/OWL, there is some preparatory work that should take place while the technical staff is learning semantic technologies and standards.

Information Sources for Developing Ontologies

Many enterprises are still struggling with some basic semantic issues. For example, enterprise glossaries are often poorly designed, non-existent, or not maintained. Semantics is all about meaning, and every organization must have a place where key terms and acronyms are defined in business language.

Setting up a glossary sounds like a simple task, but it is important to spend some time planning it. For example, homonyms and synonyms are very common, so it is important to identify the context for each term, associate the organization(s) to which the glossary entry applies, and identify the definition source, the glossary entry steward, etc. More and more tools are coming onto the market to support glossaries, driven by the increasing number of enterprise initiatives such as data governance and stewardship, enterprise data models, master data management, etc.
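
As a rough illustration of the planning points above, the following Python sketch models a glossary entry that records context, owning organization, definition source, and steward, so that a homonym can be resolved by context. The field names, the lookup helper, and the sample entries are all hypothetical; a real glossary tool will have its own structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryEntry:
    """One glossary entry; field names are illustrative, not a standard."""
    term: str
    context: str            # business context that disambiguates homonyms
    definition: str
    organization: str       # organization to which the entry applies
    definition_source: str  # where the definition came from
    steward: str            # person accountable for the entry
    synonyms: tuple = ()    # other terms with the same meaning


def lookup(entries, term):
    """Return all entries whose term or synonyms match, ignoring case."""
    wanted = term.lower()
    return [
        entry for entry in entries
        if entry.term.lower() == wanted
        or wanted in {s.lower() for s in entry.synonyms}
    ]


# Hypothetical sample data: "Account" is a homonym across two contexts.
entries = [
    GlossaryEntry(
        term="Account",
        context="Sales",
        definition="A customer organization that buys from the company.",
        organization="Sales Operations",
        definition_source="Sales policy manual",
        steward="J. Doe",
        synonyms=("Customer",),
    ),
    GlossaryEntry(
        term="Account",
        context="Finance",
        definition="A general ledger record used to classify transactions.",
        organization="Corporate Finance",
        definition_source="Chart of accounts",
        steward="A. Smith",
        synonyms=("GL Account",),
    ),
]

for entry in lookup(entries, "account"):
    print(f"{entry.term} [{entry.context}]: {entry.definition}")
```

Looking up "account" returns both entries, each with its own context, while looking up the synonym "customer" returns only the Sales entry.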

Having formalized glossaries will help the ontologist (read: data modeler, just using new techniques and technology) create the ontology or convert the glossary into SKOS (Simple Knowledge Organization System) for structured, controlled vocabularies (which in turn become part of the larger enterprise ontology). SKOS is described using RDF, RDFS, and OWL. For example, SKOS allows a user to determine whether a term is broader than, narrower than, or related to another term. SKOS, FOAF (Friend of a Friend, for social networking), and other ontologies are built using RDF/RDFS/OWL and can be serialized in XML.
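
To make the broader/narrower/related idea concrete, here is a minimal Python sketch that stores a few SKOS-style statements as plain (subject, predicate, object) tuples and answers those three questions. The property names skos:broader and skos:related come from SKOS; the ex: concepts and the helper functions are assumptions made for illustration, and a real project would more likely use an RDF library than hand-rolled tuples.

```python
BROADER = "skos:broader"
RELATED = "skos:related"

# Hypothetical concepts in an example vocabulary (the ex: prefix is made up).
triples = {
    ("ex:MyocardialInfarction", BROADER, "ex:HeartDisease"),
    ("ex:HeartDisease", BROADER, "ex:CardiovascularDisease"),
    ("ex:MyocardialInfarction", RELATED, "ex:CardiacArrest"),
}


def broader_terms(concept, triples):
    """Follow skos:broader links upward and collect every ancestor."""
    found = []
    frontier = [concept]
    while frontier:
        current = frontier.pop(0)
        for subject, predicate, obj in sorted(triples):
            if subject == current and predicate == BROADER and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


def is_narrower_than(narrow, broad, triples):
    """A concept is narrower than any concept reachable via skos:broader."""
    return broad in broader_terms(narrow, triples)


def related_terms(concept, triples):
    """skos:related is symmetric, so check both directions."""
    related = set()
    for subject, predicate, obj in triples:
        if predicate == RELATED:
            if subject == concept:
                related.add(obj)
            if obj == concept:
                related.add(subject)
    return sorted(related)


print(broader_terms("ex:MyocardialInfarction", triples))
print(related_terms("ex:CardiacArrest", triples))
```

Note that in SKOS itself skos:broader is not declared transitive; walking the chain as done here corresponds to what SKOS calls skos:broaderTransitive.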

There may be other controlled vocabularies in use in the enterprise which should be identified, as these will also help inform enterprise ontologies. For example, in the medical industry there are many coding schemes which can be thought of as controlled vocabularies and taxonomies, e.g. ICD-9/ICD-10 and SNOMED. Each of these coding schemes has a specific purpose (one might be for procedures, another for pharmaceuticals); however, there is often some overlap. One department might use a coding scheme that classifies a heart attack as a myocardial infarction, while another department uses a code that represents “cardiopulmonary arrest.” It is important to understand that many ontologies already exist and are used in many places in every enterprise.

While they may not be modeled in RDF/OWL, ontologies exist in traditional data models: taxonomies (hierarchical relationships) might be represented with identifying relationships and subtype relationships; thesauri (associative relationships) might be expressed with other types of relationships (e.g. many-to-many). Much knowledge is locked away in every data model. Leveraging data rationalization paths (e.g. term > conceptual data entity > logical data entity > physical model table > database table) can give a search engine user the capability to navigate to data stored in a database (assuming appropriate security and access methods are in place). Embedded ontologies often exist in web pages (e.g. navigational taxonomies such as the site map), metadata repositories, BI tools (hierarchies for drill up/down), master data stores, XML data stores, an Enterprise Architecture (EA), and other places.
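
A data rationalization path can be pictured as a chain of "realized by" links from a business term down to a database table. The Python sketch below walks such a chain; the link table, the names in it, and the function are hypothetical and stand in for whatever a metadata repository would actually record.

```python
# Each artifact maps to the artifact that realizes it at the next level down.
# All names here are made up for illustration.
REALIZED_BY = {
    ("term", "Customer"): ("conceptual data entity", "Party"),
    ("conceptual data entity", "Party"): ("logical data entity", "Customer"),
    ("logical data entity", "Customer"): ("physical model table", "CUST"),
    ("physical model table", "CUST"): ("database table", "SALESDB.DBO.CUST"),
}


def rationalization_path(start, links):
    """Follow links from a starting artifact until no further link exists."""
    path = [start]
    seen = {start}
    while path[-1] in links:
        nxt = links[path[-1]]
        if nxt in seen:
            raise ValueError(f"cycle detected at {nxt}")
        path.append(nxt)
        seen.add(nxt)
    return path


path = rationalization_path(("term", "Customer"), REALIZED_BY)
print(" > ".join(f"{level}: {name}" for level, name in path))
```

A search application could use a path like this to take a user from a glossary term to the table that holds the data, subject to the security and access checks mentioned above.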

There are many industry ontologies already in use which are available on the Internet. In addition, an upper ontology such as SUMO (Suggested Upper Merged Ontology) can be leveraged as a foundational ontology, to provide a broader reach (e.g. inferencing) for semantic applications.

Inventory of Ontology Information Sources

An important step to take early in a Semantic Data Web effort is to produce an inventory of possible ontology (or ontology component) information sources. There is no need to reinvent the wheel, and no reason to pester business partners for ontology information which can already be obtained elsewhere.

Of course, the business will need to be involved in deciding how ontologies may be used and shared. Some recommend forming a “Semantics Board” or “Taxonomy Board” – however, leveraging an existing Data Governance and Stewardship organization can fill this need, and so issues related to standing up a new board can be averted.

Most organizations will not suffer from a dearth of embedded ontologies, so rank the ontologies in the inventory to ensure a focus on the ones that will provide high value.

Part of the “ontology inventory” effort will be to identify, at a high level, how these embedded ontologies will be translated into RDF/OWL or into specialized ontologies built on RDF/OWL (SKOS, FOAF, etc.). It is important to have a strategy to utilize RDF/OWL, as these are Semantic Web standards. For example, ODM (Ontology Definition Metamodel) by the OMG (Object Management Group) could be used to convert UML class diagrams into OWL, GRDDL (Gleaning Resource Descriptions from Dialects of Languages) can be used to extract RDF triples from XML documents, and D2R Map can convert data in an RDBMS into RDF triples.
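
The idea behind converting relational data into RDF triples can be sketched in a few lines of Python: each row becomes a subject, and each non-null column becomes a predicate/object pair. This is not D2R Map's mapping language or output format; the base URI, the URI patterns, and the function name are assumptions chosen only to show the shape of the transformation.

```python
BASE = "http://example.org/"  # hypothetical base URI
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def row_to_triples(table, key_column, row):
    """Turn one relational row (a dict) into (subject, predicate, object) triples."""
    subject = f"{BASE}{table}/{row[key_column]}"
    # The table name becomes the class of the resource.
    triples = [(subject, RDF_TYPE, f"{BASE}schema/{table}")]
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key is already in the subject URI; NULLs yield no triple
        triples.append((subject, f"{BASE}schema/{table}#{column}", str(value)))
    return triples


# Hypothetical rows from a "customer" table.
rows = [
    {"id": 7, "name": "Acme", "region": None},
    {"id": 8, "name": "Globex", "region": "West"},
]

for row in rows:
    for triple in row_to_triples("customer", "id", row):
        print(triple)
```

Skipping NULL columns rather than emitting an empty value is a design choice made here for simplicity; in RDF the absence of a statement is the usual way to represent a missing value.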

Some Technical Considerations

Currently (to a limited degree, e.g. Oracle), and more so in the future, RDBMSs and RDF/OWL will converge so that RDF/OWL triples can be stored in an RDBMS. SQL can then be used to query them, which shortens the Semantic Data Web learning curve. The W3C standard for querying RDF/OWL is SPARQL (SPARQL Protocol and RDF Query Language); by being able to leverage SQL as well, ontologies can be more readily queried even though the ontology is still in an RDF/OWL format.
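
As a simple illustration of triples living in a relational database, the Python sketch below loads a few triples into an in-memory SQLite table and queries them with ordinary SQL, with a roughly equivalent SPARQL query shown in a comment for comparison. The three-column table layout, the ex: names, and the data are assumptions; this is a generic pattern, not how Oracle or any particular product stores RDF.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Hypothetical data using a made-up ex: prefix.
        ("ex:acme", "ex:name", "Acme"),
        ("ex:acme", "ex:locatedIn", "ex:Texas"),
        ("ex:globex", "ex:name", "Globex"),
        ("ex:globex", "ex:locatedIn", "ex:Texas"),
        ("ex:initech", "ex:name", "Initech"),
        ("ex:initech", "ex:locatedIn", "ex:Ohio"),
    ],
)

# Roughly equivalent SPARQL, for comparison:
#   PREFIX ex: <http://example.org/>
#   SELECT ?name
#   WHERE { ?c ex:locatedIn ex:Texas ; ex:name ?name . }
#   ORDER BY ?name
#
# In SQL, each triple pattern becomes one reference to the triples table,
# and a shared variable (?c) becomes a join on the subject column.
rows = conn.execute(
    """
    SELECT n.object
    FROM triples AS loc
    JOIN triples AS n ON n.subject = loc.subject
    WHERE loc.predicate = 'ex:locatedIn'
      AND loc.object = 'ex:Texas'
      AND n.predicate = 'ex:name'
    ORDER BY n.object
    """
).fetchall()

names = [row[0] for row in rows]
print(names)
conn.close()
```

The self-join shows both the appeal and the cost of the SQL route: the query uses familiar syntax, but every additional triple pattern adds another join, which SPARQL expresses more compactly.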

The underpinning of the Semantic Data Web is “Linked Open Data,” where data is linked, not just web documents. This is facilitated by the use of URIs to uniquely identify a data resource, so a plan for identifying resources with URIs is an important early step to investigate.
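
One possible starting point for such a plan is a single, consistent rule for minting URIs. The Python sketch below shows one hypothetical convention (base URI, then resource type, then local identifier); the base URI, the path pattern, and the function name are assumptions, and an enterprise would need to agree on its own scheme.

```python
from urllib.parse import quote

BASE_URI = "http://data.example.org"  # hypothetical enterprise base URI


def mint_uri(resource_type, local_id):
    """Build a URI of the form <base>/<type>/<id>, percent-encoding both parts."""
    type_part = str(resource_type).strip().lower()
    id_part = str(local_id).strip()
    if not type_part or not id_part:
        raise ValueError("resource_type and local_id must be non-empty")
    # safe="" also encodes "/" so an identifier cannot add extra path segments.
    return f"{BASE_URI}/{quote(type_part, safe='')}/{quote(id_part, safe='')}"


print(mint_uri("Customer", 42))
print(mint_uri("Product", "A/B 7"))
```

Whatever convention is chosen, the important property is that the same resource always receives the same URI, so that data from different sources can be linked through it.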

“RDF links enable you to navigate from a data item within one data source to related data items within other sources using a Semantic Web browser. RDF links can also be followed by the crawlers of Semantic Web search engines, which may provide sophisticated search and query capabilities over crawled data. As query results are structured data and not just links to HTML pages, they can be used within other applications.”

One issue to research is your current search engine’s ability to utilize ontologies built in RDF/OWL. Swoogle is an example of a Semantic Web search engine. Other search engines may support aspects of thesauri, e.g. synonyms, word stems, acronym resolution, common misspellings, etc. Google has limited support for semantic search, using microformats (for applying a common format to things such as contact information, resumes, etc.) and RDFa (for embedding metadata in XHTML documents, e.g. Dublin Core elements) so that these documents can be more machine-actionable and more understandable by humans.

Conclusion

The Semantic Data Web has huge potential, but starting to leverage semantic technologies in an enterprise (for Internet or intranet usage) requires careful thought, business goals and objectives, and a phased implementation approach in which business value is demonstrated in manageable iterations, given the complexities and new technologies involved.

The World Wide Web is about connecting documents – the Semantic Data Web is all about connecting data for improved human / computer interaction as well as improved machine responsiveness. Data Management professionals need to be part of the discussion when implementing semantic initiatives so that their business and modeling knowledge and skills can be leveraged.

Peter Stiglich, CBIP

Pete Stiglich, CBIP, is a Principal Consultant with Data-Principles, LLC and has written and presented extensively on data architecture, data management, and Big Data. He is an AWS Technical Professional and a Hortonworks Architecture Professional. Pete also is an experienced trainer in data architecture and data modeling, and has a background in data governance and metadata management.
