Data manipulation, analysis, and data management combine with scientific methods, processes, and systems to extract knowledge or insights from large amounts of data.
The phrase “data science” is akin to the parable of the blind men and the elephant: each man imagines what an “elephant” is from the small area he can touch. For some, data science is statistical analysis on a large scale; for others, it is a heightened form of data analysis that uses unstructured as well as structured data; and for another group, it may be a combination of statistics and business rules. In fact, data science is all of these activities and more, and it has existed for much longer than most information management professionals realize.
It is important to understand data science, its components, and why the practice of data science requires a robust enterprise data management program to succeed, since the explosion of data will continue and will demand more attention.
Components of Data Science
As defined by Webster, “data science” is a field within computer science or information science that seeks to provide meaningful information from large amounts of complex data. Data science combines different fields of work in statistics and computation to interpret data for decision making.
There are many areas that contribute to the activities of data science. These include, but are not limited to:
- Data mining – applying business rules and algorithms to a data set (or sets) to reveal patterns within the data, so that relevant data can be extracted for decision making and analytics can be performed against the identified data sets
- Statistical methods – techniques such as predictive analytics and other statistical models that use extracted data to determine the probability of future events based on historical occurrences
- Computer science – programming languages and statistical packages such as R, Python, and SPSS, as well as database programming languages (e.g., SQL), which enable data collection, refinement and manipulation, and display according to requirements
- Artificial Intelligence – machine learning capabilities that enable a computer to process extremely large quantities of data, much more than a human could analyze in the same time, for operational or decision making purposes
- Data visualization – the presentation of data in a pictorial or graphical format so it can be analyzed easily
- Actuarial science – a subset of statistics focusing on insurance (auto, building, health, life, etc.) that uses models and various other criteria to determine insurance premiums
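As a rough illustration of the statistical and actuarial items in the list above, the following Python sketch estimates the probability of a future event from historical occurrences and turns it into an indicative premium. The claim history, the function names, and the 25 percent expense loading are all invented for this example; real actuarial models are far more elaborate.

```python
from statistics import mean

# Hypothetical historical records: one entry per policy-year, holding the
# claim amount paid (0.0 means no claim). Invented for illustration only.
claim_history = [0.0, 0.0, 1200.0, 0.0, 0.0, 800.0, 0.0, 0.0, 0.0, 2000.0]


def claim_probability(history):
    """Estimate the probability of a claim as its historical relative frequency."""
    return sum(1 for amount in history if amount > 0) / len(history)


def average_claim(history):
    """Average size of the claims that actually occurred."""
    claims = [amount for amount in history if amount > 0]
    return mean(claims) if claims else 0.0


def indicative_premium(history, loading=0.25):
    """Expected loss per policy-year, increased by an assumed expense loading."""
    expected_loss = claim_probability(history) * average_claim(history)
    return expected_loss * (1 + loading)


probability = claim_probability(claim_history)
premium = indicative_premium(claim_history)
print(f"Estimated claim probability: {probability:.2f}")
print(f"Indicative premium per policy-year: {premium:.2f}")
```

The same pattern, counting how often something happened and weighting it by its typical cost, sits behind many simple predictive estimates. A data scientist would normally go further and test whether ten observations are enough to trust the estimate at all.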
The data scientist interprets, converts, and summarizes the data into a cohesive package that the decision-making group can understand. It is important to note that most data scientists possess deep domain knowledge in the area in which they practice (finance, operations, healthcare, insurance, etc.). Often, the result of a data scientist’s efforts is a transformation of large amounts of raw data into useful insights and compelling stories to support decisions.
Unlike many other professions that require analysis skills, data scientists are expected to have business knowledge, which is one reason the best data science and analytics programs require some courses in business or management. Many senior data scientists and analysts become entrepreneurs or consultants, since they can develop a wide range of skills and experience from their varied projects.
Figure 1. What is Data Science?
Data Science as Part of Big Data
In many organizations, data science is seen as the analytical component of the big data initiative. It can be argued that no big data effort could succeed without a robust data science program that uses the techniques mentioned above. However, many big data programs are lost in the continual extraction, transformation, and loading of data, so the analysis of the data sets lags significantly. Organizations that want to exploit the capabilities of data science fully should separate it from the big data program and give it its own role in the organization. This separation enables the data science program to contribute to all big data efforts as a distinct unit and to offer its analytics capabilities to the entire organization, rather than supporting only big data projects.
Data Science Process
Many processes can make up a data science effort, and some apply to certain aspects of data science more than others do. However, the following diagram provides a useful overview of the steps most organizations employ when practicing general data science.
Figure 2. The Data Science Process
Challenges of Data Science
Data science programs go beyond traditional departments in statistics, finance/accounting, computer science, and business intelligence by requiring that practitioners have a broad set of both technical and soft skills. However, not all those who aspire to or practice data science can strike the necessary balance among analytics, insight, and written and oral communication that effective data science needs.
Some programs focus their hiring on specific tool-based skills and application areas. These programs have data scientists who can start work on well-defined projects quickly, but who struggle to grow into larger, more visible roles in the organization. Other programs that include job requirements for critical thinking and strong communication skills will find practitioners who will be able to work outside their initial comfort zone, and who can learn, grow, and adapt as the field and the organization change.
Data science brings a distinct set of challenges to business communication. How does one explain statistical evidence and analytical results without oversimplifying or creating confusion? Effective data scientists must be able to weave results into a coherent story, to explain statistical assumptions and limitations clearly, and to create data visualizations that give insight to a wide variety of consumers. A major challenge for any data science endeavor is balancing the needed skills across the group of analysts, engineers, designers, and managers.
Enterprise Data Management and Data Science
Although data science and big data require well-governed, clean, and organized data for use in analytics, most organizations do not have an enterprise data management program to support their data science and/or big data initiatives. This lack is a major contributor to the failures experienced by a large percentage of big data efforts.
There are some distinct alignments between data science and the domain areas of enterprise data management. Since the role of a data scientist is to “understand data,” it is important that the data is well governed, organized, clean, and managed so that it is usable for analysis and decision making. That organization, cleanliness, management, and governance are the responsibilities of an enterprise data management function.
Figure 3. Enterprise Data Management and Data Science Interactions
Without properly organized, sustained, and robust enterprise data management, no data science or big data effort can fully succeed. Starting with data governance, metadata management, and enterprise data architecture, every organization that decides it needs data science and/or big data should have a solid enterprise data management program based on best practices.
Many data scientists are connected indirectly to enterprise data management through their use of the organization’s data warehouse and business intelligence (DW/BI) environment. However, many DW/BI initiatives were not designed with fully defined and governed business metadata, so data scientists and analysts may encounter challenges when working with these less-than-optimal sources. Through their need for high-quality data, data scientists may influence the development and sustainment of enterprise data management (data governance, metadata management, data quality management).
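To make the dependence on data quality concrete, the following Python sketch shows the kind of basic check a data scientist might run before trusting a source. The customer records, field names, and the two measures chosen (completeness and duplicate keys) are assumptions made for illustration; an enterprise data quality program would define its own rules and thresholds.

```python
# Hypothetical customer records, invented for illustration only.
records = [
    {"customer_id": "C001", "region": "East", "revenue": 1250.0},
    {"customer_id": "C002", "region": None, "revenue": 980.0},
    {"customer_id": "C002", "region": "West", "revenue": 980.0},
    {"customer_id": "C003", "region": "North", "revenue": None},
]


def completeness(rows, field):
    """Share of rows in which the field is populated (not None or empty)."""
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def duplicate_keys(rows, key):
    """Key values that appear in more than one row."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row.get(key)
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return sorted(duplicates)


quality_report = {
    "region_completeness": completeness(records, "region"),
    "revenue_completeness": completeness(records, "revenue"),
    "duplicate_customer_ids": duplicate_keys(records, "customer_id"),
}
print(quality_report)
```

Findings such as a repeated customer identifier or a sparsely populated field are the sort of evidence a data scientist can take back to the data governance and data quality functions.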
Although the term “data scientist” may not have been in the Human Resources lexicon for as long as the words “accountant” or “programmer,” it has become a valuable role in many organizations. Data science is important if enterprises in every industry, of every size, want to learn more about themselves, their competition, and their markets or span of influence, and about how their past will affect their future.