Enterprise data modeling should be an integral part of every organization’s data management practices. There are several effective approaches to enterprise data modeling.
Enterprise data modeling (EDM) earned a negative reputation for many reasons, including the length of time needed to model the entire enterprise and the lack of effect the model had on how systems were built. It was perceived as wasted effort, and most companies abandoned the practice. Eventually, it became clear that practitioners had abandoned sound data management principles in favor of tedious and ineffective approaches to information management.
As a result, new disciplines grew out of the foundations of data administration to address enterprise data needs, such as metadata management, data governance, data stewardship, and master data management. These disciplines all strive for the same goal as EDM, namely to create and maintain an inventory of data assets that are unique, accurate, reusable, and traceable. Does enterprise data modeling still have a place in data management practice? Experts believe it does, perhaps combined with aspects of these newer disciplines and with adjustments to the original EDM approach.
Big-Bang Versus Incremental
Originally, organizations that embarked on EDM followed the best practices of that time and attempted to model the entire enterprise all at once. The business people who participated in the modeling sessions gained invaluable insights into the business processes and the data of the organization, but they had little to show for their efforts other than many entity-relationship diagrams decorating their cubicle walls. Creating the EDM was never part of any system development project because the big-bang approach took too long and delayed the implementation of new systems. By the time the EDM was completed, the systems had already been developed independently and without adherence to the EDM. When funds and other resources became scarce, business people wondered what enterprise data modelers contributed to the bottom line, and many EDM efforts were stopped.
Big-Bang Advantages and Disadvantages
Advocates of the big-bang approach still argue that its benefits outweigh its shortcomings. They note that unless all major business entities and their relationships are identified together, the EDM will contain errors and must be revised. They also declare that unless all the attributes pass all the normalization rules, some attributes are guaranteed to be misplaced and must be moved. In addition, they state that unless the definitions, domains, and business rules are defined for all the attributes, the EDM cannot be considered complete or accurate. Moreover, if an EDM fraught with such errors were used to design one or more application databases, those databases would have to be redesigned as the errors were discovered, with a domino effect on all the programs accessing them. While there is some truth to all of these points, the arguments against the big-bang approach far outweigh these risks.
There are two practical arguments against the big-bang approach. First, it takes far too long to find and resolve all the conflicting data views and business rules that exist among business people, not to mention all the data redundancies and inconsistencies that exist among systems. Second, it requires too many participants from too many business areas (e.g., data owners, data stewards, subject matter experts, and information consumers) in all the modeling sessions. Coordinating the availability of people and meeting rooms alone would be a major challenge, not to mention the facilitation effort needed to keep the participants engaged and the meetings productive. The big-bang approach required that all major participants be involved throughout the data modeling process; otherwise, the EDM could not be considered complete or validated. That requirement proved to be a physical impracticality.
Incremental Advantages and Disadvantages
The arguments against an incremental approach to building the EDM are the same arguments used to defend the benefits of the big-bang approach.
Building the EDM incrementally means to build a project-specific, fully normalized, fully attributed, fully defined, and fully documented logical data model during the development of each system, but to limit the logical data modeling activities to only those data elements that are in the scope of that project. Assuming that no prior EDM exists, the first project-specific logical data model is the first iteration of the EDM. The second project-specific logical data model, which is also limited to the data elements that are in the scope of the second project, is then merged with the first one to produce the second iteration of the EDM. Any discrepancies, ambiguities, or inconsistencies between the two models are presented to the participants of both projects to be resolved. This process is repeated with all subsequent projects. This way, over time, an EDM can be achieved with incremental value realized from each project, because the systems that are built now share the same data names, data definitions, business rules, etc.
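The merge-and-reconcile step described above can be sketched in a few lines of code. This is a minimal illustration, assuming each project-specific logical data model is captured as a simple mapping of entity names to attribute definitions; the entity and attribute names are hypothetical.

```python
# Minimal sketch: folding project-specific logical data models into an
# evolving EDM. Each model maps entity name -> {attribute: definition}.
# All names below are hypothetical illustrations.

def merge_models(edm, project_model):
    """Merge a project-specific model into the EDM, collecting discrepancies
    (the same attribute defined differently) for the participants of both
    projects to resolve."""
    discrepancies = []
    for entity, attrs in project_model.items():
        merged = edm.setdefault(entity, {})
        for attr, definition in attrs.items():
            if attr in merged and merged[attr] != definition:
                # Flag the conflict instead of silently overwriting it.
                discrepancies.append((entity, attr, merged[attr], definition))
            else:
                merged[attr] = definition
    return discrepancies

# The first project-specific model becomes the first iteration of the EDM.
edm = {}
merge_models(edm, {"Account": {"Account Number": "unique id of an account"}})

# Merging the second project's model flags the conflicting definition.
issues = merge_models(edm, {"Account": {"Account Number": "customer-assigned id"}})
print(issues)
```

In practice the discrepancy list would be presented back to the participants of both projects, exactly as the text describes, rather than resolved automatically.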
Note: To avoid excessive rework, the normalization rules must be followed without exception. That is the only way to ensure that the disjointed pieces of the EDM will fit seamlessly as they are assembled from the various project-specific logical data models over time.
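As a small illustration of why the normalization rules matter, consider a de-normalized order record (hypothetical data) in which a customer attribute is carried redundantly; moving that attribute to its owning entity is what allows independently built model pieces to fit together later:

```python
# Illustrative sketch with hypothetical data: customer_name depends on
# customer_id, not on order_id, so carrying it on every order record
# violates the normalization rules.

denormalized_orders = [
    {"order_id": 1, "customer_id": "C1", "customer_name": "Acme", "amount": 100},
    {"order_id": 2, "customer_id": "C1", "customer_name": "Acme", "amount": 250},
]

# Move customer_name to its one owning entity, Customer.
customers = {r["customer_id"]: {"customer_name": r["customer_name"]}
             for r in denormalized_orders}

# Orders keep only the attributes that depend on the order itself,
# plus the key that relates them to Customer.
orders = [{"order_id": r["order_id"], "customer_id": r["customer_id"],
           "amount": r["amount"]} for r in denormalized_orders]
```

After this split, any other project's model that touches customer data will converge on the same Customer entity instead of duplicating its attributes.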
The benefits of incrementally building the EDM can be reaped immediately. The additional effort and time it takes to produce a project-specific logical data model is minuscule in comparison to a big-bang effort. This approach involves fewer participants at one time because of the reduced project scope. The validation time is also shortened due to the reduced number of participants. Business rules and business metadata can be collected, validated, refined, and revised gradually over time, which comfortably accommodates the rate of change in business policies in most organizations. Errors, misunderstandings, omissions, hidden obstacles, and other unavoidable issues can be resolved incrementally, which gives the organization time to mature into a learning organization, one with an increased capacity to adapt to change rapidly.
Top-Down versus Bottom-Up
Another frequent debate concerns top-down versus bottom-up logical data modeling. This is an ill-conceived debate because both techniques must be applied to ensure a robust and realistic logical data model.
Top-down logical data modeling refers to the technique of facilitating fact-finding discussions among participants from various business areas, who own, originate, or use the data. The logical data modeler assigned to the project is usually the facilitator of these discovery sessions. He or she helps the participants to identify the major business entities, the data relationships among them, the unique identifiers of those entities, the most significant and obvious attributes of those entities, and the main business rules for those attributes.
This top-down fact-finding activity produces a first-cut “straw-man” logical data model. The logical data modeler and one of the participants then go through several iterations of revising and fleshing out the details of this project-specific logical data model, while writing the definitions for the entities and the attributes. Throughout this process, they periodically call back the participants to validate their work using follow-up top-down discovery sessions. Once the logical data model for the project is deemed to be finished, the “forward-engineering” function of the data modeling tool, which was used to capture the logical data model, can then produce the first-cut physical data model that the DBA can use to start his or her database design activities.
Using the top-down data modeling technique produces a relatively valid logical data model in a relatively short time. However, this technique alone has several inadequacies.
The project-specific logical data model will most likely not be complete because it is impossible to think of all the details during discovery sessions without performing some amount of “bottom-up” analysis of the operational source files, databases, and programs. That type of analysis often reveals additional attributes, relationships, and business rules, and in some cases even additional entities.
The project-specific logical data model will most likely not reflect the correct representation of all the data because it was based on the participants’ understanding of their business or their industry and on the business rules they thought “should” govern the data. Analyzing the existing source files and databases often reveals that some of the stated business rules are either not enforced or incorrect.
Top-down discovery sessions are extremely effective for finding and resolving many high-level issues with diverse participants in a short time, but they are not helpful for detailed data analysis and research. On the contrary, the more people who try to analyze the same thing, the longer it will take. In addition, participants get anxious and feel they are wasting their time when the research activities bog down in laborious details.
Bottom line, top-down logical data modeling only goes so far before it has to be augmented with bottom-up analysis.
Bottom-up logical data modeling refers to the painstaking task of normalizing existing process-oriented (de-normalized) data structures into a “best-guess” logical data model. The steps include:
- Reviewing record layouts of files and the DDL of relational databases to locate potentially embedded (hidden) entities.
- Tracing through primary and foreign keys to locate inferred entities and their potential data relationships to other entities.
- Painstakingly applying the normalization rules to each data element.
- Converting technical column and table names to fully qualified business names and assigning prime, qualifier, and class word components. (An example of a standardized business name for a data element is Monthly Checking Account Balance. The main component, or prime word, is “Account,” which is further qualified by the word “Checking” to indicate the type of account. The class word, indicating the type of data value contained in the element, is “Balance,” which is further qualified by the word “Monthly” to indicate the type of balance.)
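The naming step above (step 4) might be sketched as follows. The abbreviation glossary and the list of approved class words are assumptions made for illustration, not part of any particular naming standard:

```python
# Hypothetical sketch of the naming convention described in the text:
# a technical column name is expanded via an (assumed) abbreviation
# glossary, and the class word is taken to be the last word of the
# resulting business name if it appears on an (assumed) approved list.

ABBREVIATIONS = {"MTH": "Monthly", "CHK": "Checking",
                 "ACCT": "Account", "BAL": "Balance"}

CLASS_WORDS = {"Balance", "Amount", "Date", "Code", "Name", "Number"}

def to_business_name(column_name):
    """Expand a technical column name into a fully qualified business name."""
    words = [ABBREVIATIONS.get(part, part.title())
             for part in column_name.split("_")]
    return " ".join(words)

def class_word(business_name):
    """Return the class word of a business name, or None if it lacks one."""
    last = business_name.split()[-1]
    return last if last in CLASS_WORDS else None

name = to_business_name("MTH_CHK_ACCT_BAL")
# name is "Monthly Checking Account Balance"; its class word is "Balance".
```

A real glossary would be far larger and maintained by the data administration group, but the mechanics are the same: consistent expansion first, then component assignment.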
Sometimes, this process is referred to as reverse engineering, which is a common feature of data modeling tools. However, the reverse-engineering capabilities of these tools cover only a small portion of the four-step process for bottom-up logical data modeling – and only if the existing data structure is a relational database. Using the reverse-engineering function, the data modeling tools produce a physical data model of the underlying database by tracing the primary and foreign key relationships between tables (step 2), and then simply calling the tables “entities” and showing the traced key relationships as data relationships among those entities. The tools do not automatically perform any analysis, normalization, or business name qualification. In some cases (for example, when the database implementation is already normalized), such a physical data model can be a useful starting point for approximating “best-guess” logical business entities. The logical data modeler can use these inferred potential entities to continue the normalization process. But without further normalization, a reverse-engineered physical data model is never equivalent to a true bottom-up logical data model, and thus could not – and should not – be used “as-is” for EDM. And it is doubtful that tools will ever be able to mimic the human reasoning and judgment that are intrinsic to bottom-up logical data modeling.
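What such a reverse-engineering function does (and, importantly, does not do) can be approximated in a short sketch. The catalog metadata structure and the table names here are hypothetical:

```python
# Sketch of a tool's reverse-engineering output (step 2 only): given
# catalog metadata (hypothetical structure), call each table an "entity"
# and each foreign key a data relationship. Note that no normalization
# and no business-name qualification happens here.

tables = {
    "ORDERS":    {"pk": ["ORDER_ID"], "fks": {"CUST_ID": "CUSTOMERS"}},
    "CUSTOMERS": {"pk": ["CUST_ID"],  "fks": {}},
}

# Each table is simply relabeled as an "entity".
entities = sorted(tables)

# Each foreign key becomes a (child, parent) relationship.
relationships = [(child, parent)
                 for child, meta in tables.items()
                 for parent in meta["fks"].values()]
```

Everything else in the four-step process, the hunt for embedded entities, the normalization of each data element, and the business naming, remains manual work for the logical data modeler.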
As just described, the four steps of the bottom-up logical data modeling technique produce a relatively complete project-specific logical data model. However, this technique alone also has several inadequacies.
The “best-guess” logical data model produced by the bottom-up approach cannot be trusted until it has been meticulously reviewed and validated by all the business people who own, originate, or use the data.
The project-specific logical data model is immediately very large and complex because it includes all the details from the existing source files and databases from the start. This makes it overwhelming for some business people to review the logical data model, and some even refuse to participate in any review sessions. This immediate complexity is also problematic for some less-experienced logical data modelers who can easily get lost in the ambiguities and open questions during the iterative refinement process.
The project-specific logical data model will most likely not reflect the correct representation of all the data because it was based on the built-in structural limitations of the existing files and databases, and on the programmed-in business rules of the existing applications. When business people are asked to validate these derived logical data models, they frequently point out incorrect or missing business rules, data relationships, attributes, and even entities.
However, bottom-up logical data modeling only goes so far before it has to be augmented with facilitated top-down validation. Both the bottom-up and top-down approaches bring solid benefits, but each can be incomplete without the other.
Data reusability requires an inventory of all unique business entities and their unique attributes, which in total represent the “single version of the truth” about the organization. This inventory can be achieved through an enterprise data model built incrementally: the EDM evolves over time as project-specific logical data models are merged one by one, a project at a time. To be complete and correct, those project-specific data models must be built with a combination of top-down and bottom-up logical data modeling techniques. Using these techniques, and applying the six normalization rules so that each unique atomic attribute is located in its one – and only one – owning entity, is the most reliable approach.