Scalable business intelligence and analytics environments, including data warehouses, support growing numbers of users, increasing data access, and more advanced analysis methods.
“Scalability” is defined as the ability of an infrastructure to grow with the needs and usage of the solution. Most system architects are comfortable with selecting and implementing hardware that scales to meet operational solutions. Data warehousing and analytics, however, involve users accessing information to perform analysis for unpredictable reasons, or against metrics that are not yet known. Typically, once a user sees information and is able to analyze it effectively, that analysis generates many more questions that were not evident during requirements gathering. What approaches are available that can also scale with the needs of the user?
Root Level Data Is Crucial to Success
Root level data is best defined as the native data elements as they should be captured. Ideally, if the data is captured in its native state, users can apply many future views to that data. The simplest example of this concept is the capture of birth date. By capturing someone’s birth date, an analyst can calculate the age of that person at the time of any event that may be tied to the business cycle. A single stored value therefore yields the person’s age as of any number of dates. For example, knowing the date of a patient’s surgery enables comparison to the birth date to calculate the patient’s age at the time of surgery.
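The birth date example can be sketched in a few lines of Python. This is a minimal illustration only: the function name age_on and the sample dates are hypothetical, and it assumes age is counted in whole years.

```python
from datetime import date

def age_on(birth_date: date, event_date: date) -> int:
    """Whole years of age on event_date, derived from the stored birth date."""
    if event_date < birth_date:
        raise ValueError("event_date precedes birth_date")
    # Subtract one year if the birthday has not yet occurred in the event year.
    had_birthday = (event_date.month, event_date.day) >= (birth_date.month, birth_date.day)
    return event_date.year - birth_date.year - (0 if had_birthday else 1)

# Hypothetical patient and event dates, for illustration only.
birth = date(1980, 6, 15)
surgery = date(2020, 3, 1)
follow_up = date(2020, 6, 15)

print(age_on(birth, surgery))    # 39
print(age_on(birth, follow_up))  # 40
```

Because only the birth date is stored, the same function answers the question for any event date that later becomes relevant, which a stored "age" column could not do.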
The goal of root data capture is to enable many ways to look at the data. Get this root data right and many options are available. It is also easier and faster to gather root data. Rather than spending significant amounts of time during requirements trying to define all of the metrics, calculations, derivatives, and views, focus on the capture and integration of root level data and its associated metadata. The metric and view definition should come in the actual usage of the data, not in the initial storage. The design team may decide to store data in a summarized or aggregated manner to facilitate metrics, but never store it summarized or aggregated if the root data is not available. Otherwise, any future needs that arise based on that root data cannot be satisfied without significant effort and rework.
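As a rough sketch of why root data keeps options open, the following Python derives several summaries from the same row-level records. The record layout, field names, and figures are invented for illustration and do not come from any particular warehouse.

```python
from collections import defaultdict

# Hypothetical root-level records: one row per sale, as captured.
sales = [
    {"sale_date": "2023-01-15", "region": "East", "product": "A", "amount": 120.0},
    {"sale_date": "2023-01-20", "region": "West", "product": "A", "amount": 80.0},
    {"sale_date": "2023-02-03", "region": "East", "product": "B", "amount": 200.0},
    {"sale_date": "2023-02-10", "region": "East", "product": "A", "amount": 50.0},
]

def total_by(rows, key_fn):
    """Sum the amount field, grouped by whatever key the caller derives from a row."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_fn(row)] += row["amount"]
    return dict(totals)

# Each view is defined at usage time, not at storage time.
by_month = total_by(sales, lambda r: r["sale_date"][:7])
by_region = total_by(sales, lambda r: r["region"])
by_region_product = total_by(sales, lambda r: (r["region"], r["product"]))

print(by_month)   # {'2023-01': 200.0, '2023-02': 250.0}
print(by_region)  # {'East': 370.0, 'West': 80.0}
```

Had only the monthly totals been stored, the regional and product views could not be reconstructed without going back to the source systems.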
Organizations that have a mature data warehousing environment that leverages comprehensive atomic level data can build many analytics solutions rapidly by focusing on the usage of root data already integrated. If the data is not truly integrated or modeled as root data, it will be impossible to realize true analytical value.
Divide and Conquer Data Modeling
Another key to success lies in having a subject area model as the first level of the data model. This is a requirement for all large or enterprise level data warehouses. Developing a subject area model does not require significant time or advanced effort. A subject area model should never become a massive undertaking, since “massive efforts” can be misrepresented and misunderstood. Without a simple yet complete subject area model, the development can take a “boiling the ocean” approach to data definition, or the team will design parts of the model by project. Either approach invariably results in re-architecting the model project after project (which in turn means rebuilding the database, tool access, and metrics).
Once a subject area model has been completed, the team can build out the data model a subject at a time, or parts of a subject (concepts) at a time. It is reasonable to model a subject by concept, but that approach requires some additional explanation. This divide and conquer approach to data modeling has to be driven by layers of priorities. To be effective, choose what to model based on these factors:
Business Value – Have specific use cases and areas of analysis that would add immediate business value. These opportunities for business growth or improvement are identified at a high level, and the next three factors are then applied to them to set an appropriate priority.
Data Availability – Even though the business value of an opportunity could be sky high, if the data to support it is not available or is not even captured, do not start here.
Data Quality – The quality of the data captured is also critical to prioritizing opportunities. That does not mean the level of data quality must be high, but it must be sufficient to satisfy the business usage. Some efforts may require perfect data, while others just require access to whatever data exists.
Showstoppers – Most organizations have several stories of projects that were started and proceeded before being cancelled, partially completed, or flat out dropped in their tracks. There are many hidden showstoppers around analytics opportunities that must be understood before prioritizing efforts.
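One way to picture how the four factors interact is a simple screen-then-rank pass, sketched below in Python. The Opportunity fields, the 1 to 10 value scale, and the sample candidates are all assumptions made for illustration; a real prioritization would involve far more judgment than a sort.

```python
from dataclasses import dataclass, field

@dataclass
class Opportunity:
    name: str
    business_value: int            # hypothetical 1-10 score agreed with the business
    data_available: bool           # is the supporting data captured and reachable?
    quality_fit_for_use: bool      # good enough for this usage, not necessarily perfect
    showstoppers: list = field(default_factory=list)

def prioritize(opportunities):
    """Screen out candidates blocked by data or showstoppers, then rank by value."""
    viable = [
        o for o in opportunities
        if o.data_available and o.quality_fit_for_use and not o.showstoppers
    ]
    return sorted(viable, key=lambda o: o.business_value, reverse=True)

# Hypothetical candidates, for illustration only.
candidates = [
    Opportunity("Customer churn analysis", 9, data_available=False, quality_fit_for_use=True),
    Opportunity("Supplier spend", 6, data_available=True, quality_fit_for_use=True),
    Opportunity("Sales forecast", 7, True, True, showstoppers=["no executive sponsor"]),
    Opportunity("Surgical outcomes by age", 8, data_available=True, quality_fit_for_use=True),
]

ranked = prioritize(candidates)
print([o.name for o in ranked])  # ['Surgical outcomes by age', 'Supplier spend']
```

Note that the highest-value candidate drops out because its data is not available, which mirrors the advice not to start there.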
Separation of Data Integration and Data Delivery
In an enterprise data warehousing approach, it is extremely important to separate the data modeling and data integration activities from the application development or delivery of data via applications. By separating logical areas of the data by subjects (and concepts), the process can focus on the most valuable or easiest to acquire subject areas first, establishing the procedures and receiving real value without completing the whole project in one shot. While some see this as limiting, since one user will not get everything they want right away, it is actually the opposite. Many users will get more value initially and many more will get full value in the end. In comparison to other approaches, overall cost will go down and value will go up significantly.
This approach allows data integration resources to focus on getting all of the root level data and metadata acquired (one of the most difficult aspects of any data warehouse project), while the BI resources focus on defining the metrics and display mechanisms that the users need. This also helps enable many different applications and even many different user areas to be served from the same data. Each user area or group will be able to have its own view of the data, separate from how other areas view the data.
Scalability in hardware and development approaches is a value proposition that allows a team to implement the starting processes and begin to realize value, knowing that as future needs and resources permit, the effort can continue to grow and serve a broader audience. This also helps in seeking funding for larger efforts. No matter the business reasons for funding, it is always a challenge to justify large efforts. This approach justifies the first component, demonstrates success by delivering as promised, and justifies each future effort by understanding specific use cases around the next components of data that are integrated into the warehouse.