Affiliated with:

Improving the ROI of Big Data and Analytics through Leveraging new Sources of Data

feature image 2

Investing in new sources of data can help to unravel complex customer behavior and improve key analytical insights


Big Data and Analytics are all around these days. Most companies already have their first analytical models in production and are thinking about further boosting their performance. Far too often, they focus on the analytical techniques rather than on the key ingredient: data! Perhaps the best way to boost the performance and ROI of an analytical model is by investing in new sources of data which can help to further unravel complex customer behavior and improve key analytical insights. The following is a brief exploration of various types of data sources that could be worthwhile pursuing to squeeze more economic value out of your analytical models.

Network data

A first option concerns the exploration of network data by carefully studying relationships between customers. These relationships can be explicit or implicit. Examples of explicit networks are calls between customers, shared board members between firms, and social connections (e.g., family, friends, etc.). Explicit networks can be readily distilled from underlying data sources (e.g., call logs) and their key characteristics can then be summarized using the development of additional features to a set of procedures that can deliver new characteristics, which can be added to the modeling data set. In previous research, network data was highly predictive for both customer churn prediction and fraud detection. Implicit networks or pseudo networks are a lot more challenging to define and enhance. Martens and Provost (2016) built a network of customers where links were defined based upon which customers transferred money to the same entities (e.g., retailers) using data from a major bank. When combined with non-network data, this innovative way of defining a network based upon similarity instead of explicit social connections gave a better view and generated more profit for almost any targeting budget. In another, award-winning, study they built a geo-similarity network among users based upon location-visitation data in a mobile environment . More specifically, two devices are considered similar and thus connected, when they share at least one visited location. They are more similar if they have more shared locations as these are visited by fewer people. This implicit network can then be leveraged to target advertisements to the same user on different devices or to users with similar tastes, or to improve online interactions by selecting users with similar tastes. Both of these examples clearly illustrate the potential of implicit networks as an important data source. A key challenge here is to think creatively about how to define these networks based upon the goal of the analysis.

External data

Data are often branded as the new oil. Hence, data pooling firms capitalize on this by gathering various types of data, analyzing them in innovative and creative ways, and selling the results thereof. Popular examples are Equifax, Experian, Moody’s, S&P, Nielsen, and Dun & Bradstreet, among many others. These firms consolidate publically available data, data scraped from websites or social media, survey data, and data contributed by other firms. By doing so, they can perform all kinds of aggregated analyses (e.g., geographical distribution of credit default rates in a country, average churn rates across industry sectors), build generic scores (e.g., the FICO in the US) and sell these to interested parties. Because of the low-entry barrier in terms of investment, externally purchased analytical models are sometimes adopted by smaller firms to take their first steps in analytics. Besides commercially available external data, open data can also be a valuable source of external information. Examples of open data are industry and government data, weather data, news data, and search data (e.g., Google Trends). Both commercial and open external data can significantly boost the performance and thus economic return of an analytical model.

Macro-economic data

Macro-economic data are another valuable source of information. Many analytical models are developed using a snapshot of data at a particular moment in time. This is obviously conditional on the external environment at that moment. Macro-economic up- or down-turns can have a significant impact on the performance and thus ROI of the analytical model. The state of the macro-economy can be summarized using measures such as gross domestic product (GDP), inflation and unemployment. Incorporating these effects will allow us to further improve the performance of analytical models and make them more robust against external influences.

Textual data

Textual data are also an interesting type of data to consider. Examples are product reviews, Facebook posts, Twitter tweets, book recommendations, complaints, and legislation. Textual data are difficult to process analytically since they are unstructured and cannot be directly represented into a matrix format. Moreover, these data depend upon the linguistic structure (e.g., type of language, relationship between words, negations, etc.) and are typically quite noisy data due to grammatical or spelling errors, synonyms and homographs. However, they can contain very relevant information for your analytical modeling exercise. Just as with network data (see above), it will be important to find ways to add features to text documents and combine it with your other structured data. A popular way of doing this is by using a document term matrix indicating what terms (similar to variables) appear and how frequently each is evident in which documents (similar to observations). It is clear that this matrix will be large and sparse. Therefore, dimension reduction will be very important as the following activities illustrate:


Even after the above activities have been performed, the number of dimensions may still be too big for practical analysis. Singular Value Decomposition (SVD) offers a more advanced way to do dimension reduction. SVD summarizes the document term matrix into a set of singular vectors (also called latent concepts) which are linear combinations of the original terms. These reduced dimensions can then be added as new features to your existing, structured data set.

Besides textual data, other types of unstructured data such as audio, images, videos, fingerprint, GPS, and RFID data can be considered as well. To leverage these types of data successfully in your analytical models, it is of key importance to think carefully about creative ways of enhancing them. When doing so, it is recommended that any accompanying metadata are taken into account; for example, not only the image itself might be relevant, but also who took it, where, and at what time. This information could be very useful for fraud detection.


To summarize, the best way to boost the performance and ROI of your analytical models is by investing in data first! There are alternative data sources which can contain valuable information about the behavior of your customers that can support analytics to allow your organization to achieve competitive advantage.


Bart Baesens, Ph.D.

Bart Baesens, Ph.D. is a professor of Big Data & Analytics at KU Leuven (Belgium) and a lecturer at the University of Southampton (United Kingdom). He has done extensive research on big data and analytics, credit risk modeling, customer relationship management and fraud detection. His findings have been published in well-known international journals and has been presented at leading international conferences. He is author of various books including Analytics in a Big Data World see ( and Fraud Analytics using Descriptive, Predictive and Social Network Techniques see ( His research is summarized at

© Since 1997 to the present – Enterprise Warehousing Solutions, Inc. (EWSolutions). All Rights Reserved

Subscribe To DMU

Be the first to hear about articles, tips, and opportunities for improving your data management career.