Many data integration and analytics efforts use source-to-target mapping to identify data and related metadata. Developing a Source-to-Subject mapping can be an alternative approach.
High-level planning is always advisable for major initiatives. Whether planning a vacation or a significant development project, it is a good idea to identify the main goals, the costs, and the benefits that will be received from completing the effort. It is also a good idea to list the major activities or tasks that one expects to perform.
Enterprise data warehousing / business intelligence / analytics efforts can benefit from planning and defining an incremental approach to constructing the solution. It is also helpful to have a tracking mechanism that maps data sources to business subjects as a high-level reference.
Subject Area Model versus Source-to-Target Mapping
One of the foundational elements of sound data design for an enterprise data warehouse is the Subject Area Model (SAM). A SAM is a data classification tool that identifies the top 10 – 30 subjects of data that define an organization. These subjects are used in classification, planning, data governance, metadata, and data warehouse development.
Figure 1: Sample Subject Area Model
A common document used in the design and development of ETL jobs is the Source to Target Map (S2T). An S2T map defines the data that will be pulled from source systems, how it will be validated and transformed, and ultimately where it will reside in the target data warehouse or integrated database. Source to Target maps are more detailed than a Source to Subject map.
Figure 2: Sample Source to Target Map
There are many examples of enterprise data warehouse efforts marred by projects that were well into development, sometimes even testing, before the team realized the data was not available, accurate, or valuable. Most of these issues have their origins in good intentions and the desire to proceed quickly so significant progress can be made. Unfortunately, without a quantified, prioritized list of sources and use cases, these projects often consume inordinate amounts of time and funding and still fail to deliver a solution that realizes significant value. Too often, it is a case of asking one user what they want and starting to build without understanding the viability of the request and its value to the rest of the organization.
Generally, the trend in design is to move straight into the creation of detailed data models and then source to target maps. In many industries, the number of source systems and the amount of data to model and define is small enough that this approach is adequate. However, in industries like healthcare, where dozens or even hundreds of source systems exist in one corporation, it isn’t a feasible solution. Some organizations or industries face a massive landscape of solution demands and needs that stretch across business areas, operational responsibilities, external demands, finances, and research. Unfortunately, this is often where the temptation to boil the ocean through detailed data models rears its head.
Source to Subject (S2S) Mapping
Rather than abandon data mapping entirely, try refining the approach. Use a Source to Subject map (S2S).
If the goal is to build a data warehouse incrementally, there are two choices: build it one source system at a time, or build it one subject at a time based on the target. For data that is truly integrated, experts recommend building one subject at a time, modeling and designing the target for the subjects that deliver initial value. Choosing to model based on individual sources will cause the model to resemble the sources, and integration of additional systems will be extremely challenging or even impossible to achieve.
For example, many healthcare organizations can satisfy significant cross-functional needs by quickly gathering data on these subjects: patients, providers, diagnoses, medications, procedures, and appointments. With a Subject Area Model (SAM) that describes the areas of data in the organization, the data warehouse can be developed incrementally, one subject at a time. Complete the S2S mapping exercise before beginning any detailed data modeling efforts. This helps connect the content and quality of the source data to use cases that demonstrate clear value, and it produces enough information to support a validated prioritization of data and usage scenarios.
An effective source to subject map consists of two main areas – the sources and subjects.
Sources: Should not necessarily comprise every system in the organization, but should include the systems that contain the data that is of value to your organization. Again, this is most applicable to industries where data can reside on many different systems. Identifying the “gold standard” sources can help whittle down the list of key sources dramatically. Information captured should include history, volume, key content, source experts, perceived quality, and any other essential metadata.
Subjects: These are the business subjects identified in the subject area modeling exercise, along with their descriptions, examples of critical data, and any associated metadata.
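The two areas described above can be sketched as a small data structure. This is only an illustrative sketch, not a prescribed format: the class names, field names, and the sample systems ("EHR", "Scheduling", "Pharmacy") are all hypothetical, and a spreadsheet with the same columns would serve equally well.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Source:
    """One source system and the metadata the S2S exercise captures (hypothetical fields)."""
    name: str
    history_years: int          # how far back the data goes
    approx_rows: int            # rough volume
    key_content: list[str]      # main kinds of data held
    experts: list[str]          # people who know the system
    perceived_quality: str      # "high", "medium", or "low"


@dataclass
class Subject:
    """One business subject taken from the Subject Area Model."""
    name: str
    description: str
    critical_data: list[str] = field(default_factory=list)


@dataclass
class SourceToSubjectMap:
    sources: dict[str, Source] = field(default_factory=dict)
    subjects: dict[str, Subject] = field(default_factory=dict)
    links: set[tuple[str, str]] = field(default_factory=set)  # (source, subject)

    def add_source(self, source: Source) -> None:
        self.sources[source.name] = source

    def add_subject(self, subject: Subject) -> None:
        self.subjects[subject.name] = subject

    def link(self, source_name: str, subject_name: str) -> None:
        # Only allow links between entries that have been registered.
        if source_name not in self.sources:
            raise KeyError(f"unknown source: {source_name}")
        if subject_name not in self.subjects:
            raise KeyError(f"unknown subject: {subject_name}")
        self.links.add((source_name, subject_name))

    def sources_for(self, subject_name: str) -> list[str]:
        return sorted(s for s, subj in self.links if subj == subject_name)

    def subjects_for(self, source_name: str) -> list[str]:
        return sorted(subj for s, subj in self.links if s == source_name)


# Hypothetical sample content for a healthcare organization.
s2s = SourceToSubjectMap()
s2s.add_source(Source("EHR", 10, 5_000_000, ["encounters", "orders"], ["A. Analyst"], "high"))
s2s.add_source(Source("Scheduling", 4, 800_000, ["appointments"], ["B. Admin"], "medium"))
s2s.add_source(Source("Pharmacy", 6, 1_200_000, ["dispenses"], ["C. Pharmacist"], "medium"))
for name in ("Patient", "Provider", "Medication", "Appointment"):
    s2s.add_subject(Subject(name, f"All data describing a {name.lower()}"))

s2s.link("EHR", "Patient")
s2s.link("EHR", "Provider")
s2s.link("EHR", "Medication")
s2s.link("Scheduling", "Patient")
s2s.link("Scheduling", "Appointment")
s2s.link("Pharmacy", "Medication")

patient_sources = s2s.sources_for("Patient")
ehr_subjects = s2s.subjects_for("EHR")
```

Reading the map in both directions is the point of the exercise: `sources_for` shows which systems must be integrated to deliver a subject, and `subjects_for` shows how much a single source contributes.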
Validating and Prioritizing
Effective prioritization of the solutions produced by a data warehouse is a responsibility that rests with organizational leadership. However, IT should produce all materials that enable the warehouse oversight group to make educated decisions that are reasonably achievable. To avoid projects that never succeed or are never launched, no prioritization of efforts should occur without having information about the data availability, data quality, costs of acquisition, time to deliver, and project risk.
From an S2S listing of the data sources mapped to the subjects, it is possible to begin identifying which use cases map to data that is available and of sufficient quality. To keep detailed requirements gathering to a minimum, only high-level use cases with a few specific samples are needed at this stage. After mapping use cases to data, the oversight team can weigh the value of each use case, see which data would serve many use cases rather than just one, and examine the time it would take to integrate the data.
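The screening step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the use cases, their value scores, the per-subject quality ratings, and the "medium" quality threshold are all hypothetical, and a real oversight group would also weigh cost, time to deliver, and risk.

```python
from collections import Counter

# Hypothetical quality rating per subject, summarized from the S2S map.
subject_quality = {
    "Patient": "high",
    "Provider": "high",
    "Appointment": "medium",
    "Medication": "low",
}

# Hypothetical high-level use cases with a business value score (higher is better).
use_cases = {
    "No-show analysis": {"value": 3, "subjects": {"Patient", "Appointment"}},
    "Provider productivity": {"value": 2, "subjects": {"Provider", "Appointment"}},
    "Medication adherence": {"value": 3, "subjects": {"Patient", "Medication"}},
}

QUALITY_RANK = {"low": 0, "medium": 1, "high": 2}


def viable_use_cases(use_cases, subject_quality, minimum="medium"):
    """Keep use cases whose subjects all meet the minimum quality level."""
    threshold = QUALITY_RANK[minimum]
    return {
        name: uc
        for name, uc in use_cases.items()
        # A subject missing from the map is treated as low quality.
        if all(
            QUALITY_RANK[subject_quality.get(subject, "low")] >= threshold
            for subject in uc["subjects"]
        )
    }


def subject_demand(use_cases):
    """Count how many use cases depend on each subject."""
    counts = Counter()
    for uc in use_cases.values():
        counts.update(uc["subjects"])
    return counts


viable = viable_use_cases(use_cases, subject_quality)
demand = subject_demand(viable)
# Rank viable use cases by value (descending), then by name for stable ordering.
ranked = sorted(viable, key=lambda name: (-viable[name]["value"], name))
```

With this sample data, the medication use case drops out because its data is rated low quality, and the demand count shows that the Appointment subject supports both remaining use cases, which makes it a strong candidate for an early increment.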
This process is finalized by creating a roadmap of the data, solutions, and processes that are appropriate, spread over the next few years.
Creating information-based prioritization processes that involve the leadership and authority overseeing data warehousing efforts is critical for realizing the most business value. By having a business-driven planning process that leverages a high-level Source-to-Subject (S2S) map before any development begins, a team can significantly increase the odds of project success and deliver business value for future users of the system.