Customer data integration can be improved by using generic dimensional modeling techniques
If you are undertaking a Customer Data Integration (CDI) or Customer Master Data Management (MDM) project as part of a dimensional modeling endeavor, how will you tackle the problem of how to store customer addresses?
Using customer addresses for direct marketing, analysis of household penetration, customer retention or other types of Customer Relationship Management (CRM) analytics seems like a straightforward proposition – just add variable length fields such as address line(s), city, state/province, postal code, country along with customer name and identifying information to the Customer dimension.
However, scalability, flexibility and performance issues can be resolved by separating the address from customer (person/business) specific data (they are, after all, two distinct conceptual entities). Storing address (including geographic and address level demographics) data apart from customer level data in a generic (non-customer specific, reusable) dimension becomes especially important if you need to retain historical records, perform name and address matching or maintain multiple addresses per customer.
The benefits for this approach include:
- Address records can be reused
- Segmenting data based on stability
- Facilitating customer integration
- Handling multiple addresses for a customer
I will explore each of these benefits in more detail.
First, address reuse. It is very common for an enterprise to want to capture every member of a household in order to identify and segment customers. An address may be both a billing and shipping address, or may fulfill other roles, e.g. as collateral on a loan. Storing the address separately allows an address to be reused, thereby reducing space requirements. Assume a couple applies for a home equity line of credit (HELOC) at their bank. The bank wants to perform house-holding as well as identify every individual on the loan account. The house is collateral for the loan and so this collateral address needs to be captured. Additionally, the couple has a child with a student loan through the bank. Therefore, in this case five entities are associated with the address:
-1 piece of collateral
If the address is stored in the customer, household and collateral dimensions the address is duplicated five (5) times as in Figure 1.
Figure 1: Dimensional Model – Address Duplicated
Segment data based on stability
The second benefit provides the ability to segment data based upon stability. If you need to retain history of customer addresses, usually Type II Slowly Changing Dimension (prior version record retained, new version record inserted) method would be required. In a large environment, this can be problematic. Assuming the household moved (as approximately 18% of the US population does annually) there would now be 6 records in the Customer dimension and 2 records in the Household dimension in Figure 1. The customer/household level data (name, demographics) is duplicated.
A Data Warehouse best practice is to identify how each attribute in a dimension will handle change. Separating data based on stability means vertically partitioning the record into two (2) or more distinct entities based on the rate of change. Person specific information such as name, taxpayer_id, date of birth is changed infrequently, address may change occasionally but demographics may change very frequently.
The Customer dimension in Figure 1 could be separated into a Customer dimension which contains only relatively static data (e.g. name, taxpayer_id, date of birth) handled by the Type I Slowly Changing Dimension method (overwrite existing values), an Address dimension which is generic in nature (only contains address and address level geographic and demographic data not specific to a particular customer) where no updates are required (except perhaps rarely for changes in address demographic or census data) and a Customer Demographics dimension which could be a Cartesian product of all possible combination of customer level demographics, thereby again eliminating the need to perform updates.
Figure 2: Dimensional Model – Generic Address Dimension
In Figure 2, the Address dimension fulfills two roles in the Customer Account fact table – Mailing Address and Household Address. Address_key is also migrated to the Collateral dimension (there may be multiple pieces of collateral on a loan, hence the need for a bridge table to resolve the many to many relationship).
Facilitating customer integration
The third reason for creating a separate address dimension is to accommodate the numerous address specific components (e.g. Carrier Route Code, Metropolitan Statistical Area (MSA) code, Latitude/Longitude, Census data) that can be generated using databases that provide specific geographical data, and that an organization can match/merge against their records using a data cleansing process.
These address specific geographic and demographic fields generated during the Data Cleansing geo-coding process are very useful for marketing and analysis. Numerous attributes may be generated from Data Cleansing parsing, geo-coding and standardization (discussed in more detail below). Since an address might be unknown (or invalid according to Data Cleansing databases) for a customer, the resulting table could very wide with many columns unpopulated.
To perform name and address matching, a Data Cleansing tool will parse the name and address data into many distinct components, e.g. first name (original and standardized), last name, middle name (original and standardized), street number, direction, street name, street type and so on. Standardization is also performed on some of these components in order to facilitate matching, e.g. Bill becomes William, Beverly Hills becomes Los Angeles, etc.
If you are envisioning requirements of “real-time” customer name and address matching, it is recommended to store the results of parsing, geo-coding and standardization with the data so this work will not have to be repeated each time the customer record may be a potential match. Many of the attributes resulting from parsing, geo-coding and standardization are not needed for analytical purposes in dimensional schemas, but are more suitable to a CDI hub, a normalized Data Warehouse or a persistent data staging area which in turn becomes the data source for the Customer, Address and Customer Demographics dimensions.
Assume we have a customer named Bill R. Smith who lives in a zip code with a dense population (e.g. 10001) which has many other Bill or William Smiths. To perform name and address matching, all of the existing William Smith records in this zip code need to be parsed, geo-coded and standardized. After this occurs, matching is attempted using the incoming and existing records according to business rules.
Customer Matching Example
Incoming customer record
Bill R. Smith 1000 N. 5th Street, NY, NY
Existing potential customer matches
- William J. Smith 50 N. Broadway, New York, NY 10001
- William R. Smith 1000 N. 5th Street, Apartment 5B, New York, NY 10001
- Will X. Smith, PO Box 100, New York, NY 10001
- Wilber Smithson, 35 S. Main Street, New York, NY 10001
Data Cleansing tools may use a “window key” or “break key” in order to group records with similar characteristics in order to reduce the set of potential matches. This key may be comprised of part of the postal code, customer initials, x number of consonants of the first and last name, or whatever the business requirements are. As the number of customers increase, the composition of the window or break key may need to be revisited in order to provide a more granular grouping of potential matches.
For the above records, the “key” might look like 10001SMWL. Notice the last record is not a William Smith but because the zip code, initials, and the subsequent consonant of the last and first name are the same this customer has the same “key.”
Also notice that the incoming record does not have a zip code – if the address supplied is valid for New York, the Data Cleansing tool can look up the zip code and attach it to the record. As you can see, a lot of processing is involved in matching and by storing the results of parsing, geo-coding and standardization in the target database expensive parsing and lookup operations can be avoided, which is particularly critical for a “real-time” environment.
Handle multiple addresses per customer
The fourth benefit of creating a distinct, generic (reusable) address entity is to be able to resolve multiple addresses per customer. An address may be associated with multiple customers, may fulfill many roles, (e.g. billing, mailing, seasonal, shipping, collateral, business or home) and a customer may have multiple addresses, thereby creating a many to many relationship between customer and address and a many to many relationship between address and address type or role . Retaining multiple addresses per customer increases complexity from both a business and technical perspective, but it does provide a more accurate picture of the customer, fosters better relationships with customers and provides more data for customer matching. Since I am paranoid (justly so) regarding having mail stolen and so my mailing address is different than my home address. I received a letter from my bank (one of the largest in the country) stating that they were only able to keep one address on file (which was my home address, per Anti Money Laundering (AML) legislation as part of the Patriot Act) and therefore my bank statements went to my home address instead of my mailing address. I’ve subsequently changed banks. To resolve these many to many relationships, an associative entity is required in the CDI hub or Data Warehouse/Data Staging area, perhaps something like Figure 3.
Figure 3: Normalized CDI hub – Associative Entity to resolve M:M relationships
Notice that effective date is part of the primary key. Also, another entity, Address Source, can help provide some breadcrumbs to identify the customer interaction or other source (e.g. National Change of Address system (NCOA)) from where the address for the customer originated.
If creating a generic address entity in your CDI hub or Data Warehouse environment seems like a lot of work, you are correct. There are compromises to be considered – additional complexity in the ETL and auditing processes versus better performance on the front and back end, additional flexibility and reduced space requirements.
Whether you are considering a Customer Master Data Management solution or have begun the design effort, it may be worthwhile to lay the groundwork for future flexibility and scalability by modeling a generic Address dimension before moving forward with your implementation. Retrofitting an existing environment with a change of this magnitude once the system is in production can be a daunting task.
Even if some of the functionality mentioned above is not utilized (e.g. storing multiple addresses per customer), having the groundwork in place can greatly simplify future implementations if this functionality is needed later.
While a primary objective of a dimensional modeling is the ease of understanding that it fosters, using a generic model for an Address dimension makes sense from a performance perspective – the other primary objective for dimensional modeling – and can actually simplify understanding due to the multiple, symmetrical roles that the Address dimension can play in relation to a fact table. In a CDI hub, a generic Address entity will facilitate integration performance by reducing the Data Cleansing operations required in addition to supporting scalability and flexibility requirements.