The Nobel Prize-winning scientist Albert Szent-Gyorgyi once said, “Discovery consists of seeing what everybody has seen, and thinking what nobody has thought.” It’s a quote emphasizing the importance of both observation and original thought within the discovery process. It implies that while the raw material for discovery may be readily available to all, the ability to interpret that material in unique ways is what leads to important breakthroughs. Discovery is the very essence of data mining, its raison d’être. And the discoveries that come from seeing familiar data in new ways are usually the ones that prove the most lucrative.

The term “data mining” emerged in the late 1980s and early 1990s as computer scientists and statisticians developed new methods for extracting patterns and insights from large datasets. However, early on, some statisticians viewed the practice skeptically, criticizing it as an exercise in “data dredging” or a “fishing expedition,” implying that researchers were searching through data without hypotheses until they found something significant—regardless of the result’s validity.

Statisticians, tongue in cheek, accused analysts of “mining the data until it confessed,” an accusation that may have been overly harsh but accurately reflected concerns about analysts’ tendency to overfit and massage their data into spurious patterns when analyzing large datasets without proper statistical rigor.

What is Data Science?

According to Domo, the U.S. cloud-based business intelligence (BI) and data visualization platform, “Data science is the process of extracting actionable insights from large amounts of data using tools like the scientific method, statistics, analytics, programming, machine learning, and deep learning. The goal is to see patterns in the data that might be missed at a glance, pull useful information from that data, generate predictive insights, and use that information to increase business intelligence (BI) and make better business decisions.”

The field of data science combines data mining, machine learning, and statistical analysis to extract insights and meaning from data sets. Data science relies on effective data collection, warehousing, and processing for analysis and is a key step in the knowledge discovery in databases (KDD) process, which involves data preparation, mining, and post-processing.

The Data Mining Process

Data mining is the process of analyzing large datasets to discover patterns, correlations, and insights that drive decision-making. It combines techniques from statistics, machine learning, and database systems to extract hidden knowledge from raw data. The data mining process typically involves seven key steps (a brief code sketch of the modeling and evaluation steps follows the list):

  1. Business Understanding: Identify the problem or question, assess feasibility, and set success metrics.
  2. Data Collection: Identify data sources and gather structured or unstructured data.
  3. Data Preprocessing: Cleanse, transform, and reduce the data.
  4. Exploratory Analysis & Modeling: Explore the prepared data, then apply statistical, machine learning, or AI algorithms to discover patterns, build predictive models, or identify relationships.
  5. Model Evaluation: Assess the effectiveness and validity of all models against the business objectives and test datasets; ensure the findings are reliable and actionable before deployment.
  6. Deployment & Monitoring: Implement the results and insights in real-world business processes, such as decision systems or reporting dashboards, to drive value or solve the original business problem.
  7. Validation: Test models for accuracy.
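
As a minimal illustration of the modeling and evaluation steps, the sketch below builds and scores a simple classifier with Python and scikit-learn. The file name, column names, and the 0.80 accuracy threshold are illustrative assumptions, not part of any particular methodology.

```python
# Minimal sketch of the modeling and evaluation steps using scikit-learn.
# The file name, column names, and accuracy threshold are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("prepared_customers.csv")            # output of the preprocessing step
X, y = df.drop(columns=["churned"]), df["churned"]    # features and target label

# Hold out a test set so evaluation uses data the model has never seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate against the success metric defined during business understanding.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Holdout accuracy: {accuracy:.3f}")
if accuracy >= 0.80:   # illustrative business threshold
    print("Model meets the success metric and can move to deployment.")
```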

The Five Eras of Data Mining

From the publication of Bayes’ Theorem in 1763, which established probabilistic inference in the pre-computer age, through the development of the first relational database in the 1970s, to the emergence of data mining in the 1980s, to the commercial boom of the 1990s and 2000s, and ending (for now) with the AI revolution of today, data mining has had a long and impressive history.
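
For reference, the theorem relates the probability of a hypothesis H given observed data D to the reverse conditional probability, the form in which it still underpins probabilistic inference today:

\[ P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)} \]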

Pre-Computer Age

In 1854, Boole’s “Laws of Thought” laid the groundwork for algebraic logic systems, specifically Boolean algebra. Boole’s work established a system where logical operations (like AND, OR, NOT) could be represented and manipulated using algebraic equations. This system uses binary values (0 and 1) to represent truth and falsity, and its rules mirror those of ordinary algebra, making it a powerful tool for logical reasoning and computation. 

Database Foundations

The foundations of today’s database systems go back to the 1950s, when Alan Turing proposed the “learning machine” concept: the idea of a machine that can improve its performance over time through a process akin to learning, without being explicitly programmed for every eventuality. In his 1950 paper “Computing Machinery and Intelligence,” Turing proposed that such a machine (a “child machine”) would start with a simple structure but could be trained and educated, much like a human child, using methods of “scientific induction” to acquire knowledge and improve its behavior over time.

In 1962, John Tukey pioneered exploratory data analysis, revolutionizing how insight was extracted from raw data. Besides being a brilliant mathematician, Tukey had a wit about him as well, as seen in quotes like “The best thing about being a statistician is you get to play in everyone’s backyard” and “Far better an approximate answer to the right question than an exact answer to the wrong one.” The latter is something to keep in mind today when working with analytics, especially machine learning and deep learning.

In 1973, IBM began developing the first relational database (System R). As the first implementation of the relational database model, it introduced SQL (Structured Query Language), which became the industry standard for querying relational databases. System R demonstrated that relational databases could offer efficient transaction processing and featured innovations like a cost-based query optimizer, multi-user support, locking mechanisms for concurrency control, and storage/indexing techniques. Although never commercialized itself, System R laid the foundation for IBM’s later commercial products like SQL/DS and Db2, and its design heavily influenced modern relational database systems.

Emergence of Data Mining

As the field matured, the term “data mining” gradually shed its negative connotation. The book Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten, Eibe Frank, and Mark Hall, first published in 1999, played a significant role in popularizing the term and legitimizing the field within both academia and industry. Data mining’s value in business, science, and technology went from an interesting diversion to a corporate profit center.

In 1989, the first Knowledge Discovery in Databases (KDD) workshop was held as part of the International Joint Conference on Artificial Intelligence (IJCAI) in Detroit. The event was a foundational moment in the development of the field now known as knowledge discovery in databases, laying the groundwork for what would become a global research community and conference series. The term “data mining,” while often used interchangeably with KDD today, actually refers to a specific phase within the overall KDD process.

The KDD process is a comprehensive workflow for extracting useful knowledge from large datasets. The process typically consists of the following phases (a brief data-preparation sketch follows below):

  • Selection: Identify a subset of relevant data for analysis.
  • Pre-processing: Cleanse and prepare the data, reducing noise, and dealing with missing information.
  • Transformation: Convert data into appropriate formats.
  • Data Mining: Apply algorithms and statistical techniques to uncover patterns and relationships.
  • Interpretation/Evaluation: Make sense of the results and translate findings into actionable knowledge.

The process aims to derive actionable, novel, and useful intelligence, not merely create statistical or algorithmic patterns.
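
As a small illustration of the selection, pre-processing, and transformation phases, the Python/pandas sketch below prepares a hypothetical transactions file for mining. The file name, column names, and thresholds are made-up assumptions.

```python
# Sketch of the selection, pre-processing, and transformation phases with pandas.
# The file name, column names, and thresholds are illustrative assumptions.
import pandas as pd

raw = pd.read_csv("transactions_raw.csv")

# Selection: keep only the columns relevant to the analysis.
data = raw[["customer_id", "purchase_date", "amount", "category"]]

# Pre-processing: reduce noise and deal with missing information.
data = data.drop_duplicates()
data["amount"] = data["amount"].fillna(data["amount"].median())
data = data[data["amount"].between(0, data["amount"].quantile(0.99))]  # trim extreme outliers

# Transformation: convert data into formats the mining algorithms expect.
data["purchase_date"] = pd.to_datetime(data["purchase_date"])
data = pd.get_dummies(data, columns=["category"])   # one-hot encode categorical values

data.to_csv("transactions_prepared.csv", index=False)
```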

By the late 1990s, “data mining” was a widely accepted term, used to describe the process of discovering useful patterns and knowledge from data, especially in the context of large databases and machine learning.

Market Basket Analysis

At the 1993 ACM SIGMOD conference, Agrawal et al. presented Mining Association Rules Between Sets of Items in Large Databases, which defined the market basket problem and proposed efficient algorithms for discovering association rules in transaction data, such as supermarket sales records. Progress in bar-code technology made it possible to store the “so-called basket data that stores items purchased on a per-transaction basis. Basket data type transactions do not necessarily consist of items bought together at the same point of time. It may consist of items bought by a customer over a period of time. Examples include monthly purchases by members of a book club or a music club,” explain Agrawal et al.
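
To make the idea concrete, the toy Python sketch below computes the two measures at the heart of association rule mining, support and confidence, for a handful of made-up baskets. Real systems use algorithms such as Apriori to avoid enumerating every possible itemset.

```python
# Toy sketch of the support and confidence measures behind association rules.
from itertools import combinations

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Rule "diapers -> beer": confidence = support(diapers and beer) / support(diapers).
antecedent, consequent = {"diapers"}, {"beer"}
confidence = support(antecedent | consequent) / support(antecedent)
print(f"diapers -> beer: support={support(antecedent | consequent):.2f}, confidence={confidence:.2f}")

# List every two-item rule with at least 60% confidence.
items = sorted(set().union(*baskets))
for a, b in combinations(items, 2):
    if support({a, b}) / support({a}) >= 0.6:
        print(f"{a} -> {b}: support={support({a, b}):.2f}, confidence={support({a, b}) / support({a}):.2f}")
```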

Market basket analysis is particularly prevalent in grocery stores, where it is mostly used for strategic product placement. Walmart found that strawberry Pop-Tarts sell seven times more often than normal before hurricanes, so the retailer stocked them near the flashlights, something no home wants to be without during a hurricane.

Fast food companies like McDonald’s created “Extra Value Meals” because customers tended to buy those items together. When the items were bundled into a single meal, the average order value increased by 22%.

Ecommerce companies use market basket analysis to help with cross-selling. Shopify stores use it to recommend complementary products. The retail and healthcare company CVS found that patients on antidepressants often bought omega-3s, so it started promoting the items together. Anyone who has booked a flight on an airline has seen how hotels and rental cars are promoted as much as seat upgrades. These sales can add billions of dollars a year to a company’s bottom line.

Commercial Software Boom

In 1999, SAS released Enterprise Miner, a game-changing software solution for data mining. It was one of the first platforms to deliver an integrated, user-friendly, and scalable solution for advanced analytics at enterprise scale. At a time when organizations were struggling to extract actionable insights from rapidly growing and complex data, SAS Enterprise Miner allowed users to efficiently build, validate, and deploy predictive and descriptive models utilizing large datasets.

After the release of SAS Enterprise Miner, the commercial data mining software market exploded. According to Verified Market Reports, the “data mining software market size was valued at USD 12.3 Billion in 2024 and is projected to reach USD 25.7 Billion by 2033, exhibiting a CAGR of 8.9% from 2026 to 2033.” This increase is due to the explosion of AI, big data, the proliferation of social media platforms, the rise of IoT, and the need for data-driven decision-making and actionable BI in sectors like finance, healthcare, hospitality, retail, and telecommunications.

In 1996, Amazon launched collaborative filtering, which produced the well-known “If you like this…then you’ll like” customer recommendations. Two years later, Google was founded, and the company revolutionized web search, something that would have been impossible without data mining. The 2006 Netflix Prize attempted to crowdsource better recommendation algorithms.
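
A minimal sketch of the item-to-item collaborative filtering idea (“customers who liked this also liked that”) is shown below; the ratings matrix and item names are made up, and real recommenders work at vastly larger scale.

```python
# Item-to-item collaborative filtering sketch using cosine similarity.
# The ratings matrix and item names are made-up examples.
import numpy as np

# Rows are users, columns are items; 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)
items = ["Book A", "Book B", "Book C", "Book D"]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

# Item-to-item similarity is computed over the column vectors of the matrix.
similarity = np.array([[cosine(ratings[:, i], ratings[:, j]) for j in range(len(items))]
                       for i in range(len(items))])

# Recommend the item most similar to one the customer already liked.
liked = items.index("Book A")
candidates = [(items[j], similarity[liked, j]) for j in range(len(items)) if j != liked]
print(max(candidates, key=lambda pair: pair[1]))   # the nearest neighbor of Book A
```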

Hadoop

In 2007 and 2008, large-scale Hadoop clusters were deployed, demonstrating scalability that surpassed supercomputers at sorting terabyte-scale data. By 2009, Facebook used Hadoop and Hive in production to manage datasets that grew from tens of terabytes to 2 petabytes, confirming the possibility of petabyte-scale mining. Hadoop’s MapReduce programming model facilitates the parallel processing of data mining algorithms (such as clustering, classification, or pattern detection), dramatically accelerating time-to-insight over classic, sequential methods. By open-sourcing Hadoop, Apache lowered the entry barrier for data-driven organizations, making enterprise-scale data mining accessible to smaller companies, not just the billion-dollar tech giants.
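
The sketch below imitates the MapReduce model in plain Python (real Hadoop jobs are typically written in Java and run across a cluster): records are mapped to key-value pairs independently, grouped by key, and then reduced to a result.

```python
# Plain-Python imitation of the MapReduce programming model (word count).
from collections import defaultdict

records = ["beer diapers", "bread milk beer", "milk diapers beer"]

# Map phase: each record is processed independently into (key, value) pairs,
# which is what lets the work spread across many machines.
mapped = [(word, 1) for record in records for word in record.split()]

# Shuffle phase: group all values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: combine each key's values into a final result.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'beer': 3, 'diapers': 2, 'bread': 1, 'milk': 2}
```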

Today’s Competitive Market

Today, the analytics software market is highly competitive and diverse, with several major trends and players shaping its direction. Upstarts like Databricks, Dataiku, DataRobot, RapidMiner, KNIME, Alteryx, and Altair are muscling into space formerly controlled by legacy tools like SAS Enterprise Miner, IBM SPSS Modeler, Oracle Data Mining, and Microsoft Azure Machine Learning. Cloud-based tools from AWS, Alibaba Cloud, and Domo are capturing interest from users open to keeping their data in the cloud. Open-source tools like R and, in particular, Python have garnered considerable attention lately because they are free to use. All of these tools compete on ease of use, scalability, data integration capabilities, and the breadth of machine learning and deep learning algorithms offered. LLMs and generative AI, which have taken off over the past few years, are now becoming important features of these tools as well.

AI Revolution

The 2010s marked the AI revolution with the rise of deep learning, driven by breakthroughs in neural networks and the accessibility of big data and powerful GPUs from companies like Nvidia. Major advances in applications like image recognition, natural language processing, and autonomous systems gave rise to technologies like virtual assistants, real-time translation, and generative AI models, fundamentally transforming both consumer and enterprise technology landscapes.

The integration of AI and machine learning also transformed data mining, enabling real-time analysis, more accurate predictions, and automation of complex workflows. Cloud-based solutions increased accessibility while reducing costs, allowing even small and medium-sized enterprises to leverage the power of advanced analytics.

Figure 1: Breakdown of Artificial Intelligence

Machine Learning

According to SAS, “Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look.” Instead of following fixed rules written by humans, ML systems analyze large datasets, detect patterns, and use those patterns to make decisions or predictions. The ML algorithms improve their accuracy as they are exposed to more data or feedback. For example, they can recognize objects in images or predict future trends by learning from labeled examples. ML systems can classify, predict, and group new and unseen data based on what they’ve previously learned, such as identifying spam emails or recommending products.

Types of machine learning (see Figure 2; a brief code sketch follows the figure):

  • Supervised learning: Learns from labeled data to predict or classify future data points. These break into classification and regression models that help with customer retention and customer worth respectively.
  • Unsupervised learning: Finds patterns or groupings in unlabeled datasets. These break into clustering and dimensionality reduction, which can be useful for recommender systems and big data visualization respectively.
  • Reinforcement learning: Learns optimal actions through trial and error, using feedback from its environment. This can be useful for real-time decisioning, robot navigation, and game AI.
Figure 2: Machine Learning breaks into three types of learning use cases.
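
The short scikit-learn sketch below contrasts the first two types on a small built-in dataset: a supervised classifier learns from labels, while an unsupervised clustering algorithm finds groupings without them (reinforcement learning is omitted for brevity).

```python
# Contrast of supervised and unsupervised learning using scikit-learn's iris data.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised learning: train on labeled examples, then classify unseen data.
classifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("classification accuracy:", classifier.score(X_test, y_test))

# Unsupervised learning: find groupings in the same data without using the labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("first ten cluster assignments:", clusters[:10])
```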

Deep Learning

SAS defines deep learning as “a subset of machine learning that trains a computer to perform human-like tasks, such as speech recognition, image identification and prediction making. It improves the ability to classify, recognize, detect and describe using data. The current interest in deep learning is due, in part, to the buzz surrounding artificial intelligence (AI).” Siri, Cortana, ChatGPT, Claude AI, and DeepSeek are all powered, in part, by deep learning.

Deep learning is a branch of machine learning that seeks to imitate the neural activity of the human brain. Deep learning architectures underpin the fields of computer vision, speech recognition, NLP, audio recognition, social network filtering, machine translation, bioinformatics, and drug design, among others. In many cases, results have proven to be comparable to, if not better than, those of human experts.

Deep Learning Frameworks

In 2015, the Google Brain team released TensorFlow, an open-source software library for ML and AI that provides tools to build, train, and deploy a wide range of models, helping democratize machine learning. TensorFlow is particularly good for tasks like image recognition, speech recognition, natural language processing, time series forecasting, and reinforcement learning, and its libraries can be embedded in almost any application. Other frameworks include Caffe2, Torch, Keras, PyTorch, and CNTK, an open-source deep-learning framework from Microsoft.
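
As a minimal illustration of the framework, the sketch below defines and trains a small fully connected network on the MNIST digit images that ship with Keras; the layer sizes and epoch count are arbitrary choices, not a recommended architecture.

```python
# Minimal TensorFlow/Keras sketch: a small dense network for handwritten digits.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixel values to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),     # 28x28 image -> 784 values
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),   # one output per digit class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=3, validation_split=0.1)
print("test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])
```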

Large Language Models (LLMs)

Large Language Models are a class of deep learning systems specialized in understanding, generating, and manipulating human language text. They are based on advanced deep learning techniques, particularly transformer architectures, and are trained on extremely large and diverse text datasets—often collected from the internet, books, articles, and other sources. They are powerful tools, particularly good at:

  • Searching, translating, and summarizing text.
  • Responding to questions.
  • Generating new content including text, images, music, and software code.

LLMs’ ability to combine information, analyze data, and spot trends enables them to adapt to specific use cases beyond text creation. Their abilities span a broad range of fields, roles, and tasks, from genetic sequencing to code generation, robot programming, investment advising, and fraud detection.
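
As a small example of putting an LLM to work on one of these tasks, the sketch below uses the Hugging Face transformers library’s summarization pipeline, which downloads a default pretrained model the first time it runs; the input text is made up.

```python
# Summarizing text with a pretrained model via the Hugging Face transformers library.
from transformers import pipeline

summarizer = pipeline("summarization")   # downloads a default model on first use

text = (
    "Data mining emerged in the late 1980s as researchers developed methods for "
    "extracting patterns from large datasets. Once dismissed as data dredging, "
    "it is now a cornerstone of modern business intelligence."
)
summary = summarizer(text, max_length=40, min_length=10)
print(summary[0]["summary_text"])
```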

Benefits of AI

In its AI Momentum, Maturity, & Models for Success report, Accenture polled 300 corporate executives and found that we appear to be on the verge of a radical momentum shift for the technology. “Think about your first ride in a car sharing service, or the first time you used online banking,” says Oliver Schabenberger, Chief Operating Officer and Chief Technology Officer for SAS. “In a sense, those represented a leap of faith in newer technologies. That is where we are with AI right now. But even for many sophisticated users, AI is still a black box—they put data in, they get an output, and they do not understand the connections between the inputs and the outputs of AI systems. That is a fundamental challenge that has implications on everything from regulatory compliance to the customer experience; it even affects how we respond to examining biases in our models. Organizations that have adopted AI can illuminate the black box by observing how the model responds to variations in the inputs, and adjusting accordingly,” Schabenberger added.

Answers to the question, “What benefits are you seeing with AI?” included more accurate forecasting and decision making, improved customer acquisition, better resource allocation, reduced operating costs, and even better anomaly detection (see Figure 3).

Figure 3: What benefits are you seeing with AI? Source: Accenture

Data Science and Machine Learning Platforms

In its May 28, 2025, publication, Magic Quadrant for Data Science and Machine Learning Platforms (See Figure 4), Gartner defines “a data science and machine learning platform as an integrated set of code-based libraries and low-code tooling. These platforms support the independent use and collaboration among data scientists and their business and IT counterparts, with automation and AI assistance through all stages of the data science life cycle, including business understanding, data access and preparation, model creation and sharing of insights. They also support engineering workflows, including the creation of data, feature, deployment and testing pipelines. The platforms are provided via desktop client or browser with supporting compute instances or as a fully managed cloud offering.”

Figure 4: Gartner Data Science and Machine Learning Platforms. Source: Gartner

These tools enable data scientists and domain experts to “identify patterns in data that can be used to forecast financial metrics, understand customer behavior, predict supply and demand, and many other use cases. Models can be built on all types of data, including tabular data, images, video and text, for applications that require computer vision or natural language processing,” says Gartner.

According to Gartner, these “DSML platforms can significantly reduce the cycle time and barriers to entry for creating predictive and prescriptive models, generating insights, and distributing results.” They enable collaboration, establish data lineage, increase code and model reuse, and help orchestrate large workloads. Ultimately, they enhance productivity and make data more valuable throughout an organization. Additionally, low-code, natural language interfaces and AI assistance enable domain experts and business users to create predictive models with simple interactions. DSML platforms support Machine Learning Operations (MLOps) practices such as deploying models in production, orchestrating both batch and real-time workloads, and ongoing monitoring of model metrics and compliance.

The Algorithms Impel, They Do Not Compel

“The stars impel, they do not compel” is a saying from astrology that emphasizes the idea that celestial bodies influence, but do not force, human actions. It means that while astrological influences can suggest tendencies or inclinations, individuals ultimately have free will to make their own choices and determine their course of action. The stars might “push” you in a certain direction, but they don’t chain you to it. One could argue that something similar happens in analytics.

In data science, predictive models suggest probabilities and tendencies; they do not dictate the final decision. Data mining can support decision making in a variety of industries, including healthcare, banking, retail, telecommunications, insurance, and energy, among many others. The results of these data mining processes can be used to inform business decisions and drive various business outcomes.

Massaging the Data Until It Confesses

The main goal of data science is to provide actionable insights that can inform business decisions. Examples include:

  1. Healthcare
     – Personalized Medicine: Analyze patient records and genetic data to tailor treatments.
     – Disease Prediction: Forecast epidemic outbreaks and patient risks.
     – Resource Optimization: Improve hospital operations and resource allocation.
  2. Retail & E-commerce
     – Customer Insights: Segment customers and predict buying trends.
     – Supply Chain Optimization: Walmart reduced out-of-stocks by 16% through demand forecasting.
     – Marketing: Personalize marketing campaigns and promotions for specific customer segments.
  3. Insurance
     – Credit Scoring: LendingClub uses alternative data mining to approve 27% more worthy borrowers.
     – Customer Profiling: Assess risk and tailor insurance offerings.
     – Fraud Detection: Identify suspicious claims through pattern analysis.
  4. Finance & Banking
     – Fraud Detection: Spot anomalous transactions and prevent financial crimes.
     – Risk Assessment: Credit scoring and loan default prediction.
     – Investment Optimization: Analyze market trends to inform trading strategies.
  5. Telecommunications
     – Churn Prediction: Identify customers likely to switch providers.
     – Network Optimization: Enhance service quality and predict failures.
     – Fraud Detection: Analyze call records for unusual activity.
  6. Manufacturing & Supply Chain
     – Process Optimization: Enhance efficiency by detecting bottlenecks.
     – Predictive Maintenance: Forecast equipment failures to reduce downtime.
     – Supply Chain Logistics: Improve inventory and delivery efficiency.
  7. Energy & Utilities
     – Demand Forecasting: Predict energy consumption and manage grid resources.
     – Fault Detection: Prevent outages through monitoring and analytics.
     – Optimization of Renewables: Maximize the efficiency of wind and solar power.

Balancing High-Performance Computation with Sustainability

Data science and analytics have revolutionized multiple industries, but as the field matures, new hurdles emerge while old ones generally evolve rather than disappear. Large amounts of high-quality data are required to produce accurate and useful results. Handling, processing, and extracting insights from these increasingly large and complex datasets requires more powerful computer systems, which aren’t cheap and need skilled programmers to ensure they are working properly.

Explainability and trust in AI are also major challenges companies are wrestling with because many advanced AI models operate as “black boxes” whose internal decision-making processes are difficult for humans to understand. This opacity raises serious concerns, especially in high-stakes fields like healthcare and finance, where stakeholders need to understand why an AI model made a particular recommendation or prediction to ensure outcomes are fair, ethical, and compliant with government regulations.

The lack of transparency can erode trust. If users, regulators, or impacted individuals cannot see or interpret the reasoning behind AI decisions, they are less likely to rely on or approve those systems for critical use. Furthermore, as AI capabilities rapidly advance, research into interpretability lags behind, increasing the risk of deploying powerful but uncontrolled and unpredictable systems, and forcing organizations to choose between innovation and responsible oversight.
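
One common way to begin illuminating the black box, echoing Schabenberger’s point about observing how a model responds to variations in its inputs, is permutation importance: shuffle one feature at a time and measure how much the model’s holdout score drops. The scikit-learn sketch below shows the idea on a built-in dataset; it is an illustration, not a complete interpretability solution.

```python
# Probing a "black box" model with permutation importance in scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and record how much the holdout score drops.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"{data.feature_names[idx]}: {result.importances_mean[idx]:.3f}")
```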

Going forward, organizations need to balance high-performance computation with sustainability and real-time analytics needs while managing their infrastructure costs. Scaling AI from pilot projects to reliable, governed, and industrialized production systems will require skilled analysts. All of these challenges require organizations and their data science teams to be adaptable, ethical, technically skilled, and business-savvy to maximize impact and minimize risks as the data science field continues to evolve.

Uncovering Patterns that Drive Innovation

From its earliest days as a contentious practice dismissed as “data dredging” to its current position as a cornerstone of modern business intelligence, data science has undergone a remarkable evolution. Albert Szent-Gyorgyi’s belief that discovery lies not just in observation but in thinking differently is particularly important in data science. What began with rudimentary statistical methods has blossomed into a sophisticated discipline powered by AI, ML, and deep learning, capable of uncovering patterns that drive innovation across every industry it touches.

The journey of data mining reflects humanity’s relentless pursuit of knowledge. Market basket analysis, recommendation engines, and predictive modeling have transformed retail, healthcare, finance, and beyond, proving that data, when mined correctly, holds unparalleled strategic value. Yet many challenges remain. Data quality must be assured, privacy concerns addressed, and biases stamped out. This is especially true for AI, which must be perceived as fair, transparent, and unbiased. Complex models and the insights gleaned from them must be understandable to decision-makers, regulators, and users alike, especially as AI-driven solutions increasingly influence so many companies’ core business decisions.

Looking ahead, emerging technologies like quantum computing, federated learning, and neuromorphic chips promise to push the boundaries of what’s possible. The rise of generative AI and explainable models will further democratize data, making data mining more accessible and transparent than ever. However, the core principle endures: data alone is inert. It is the human capacity to ask the right questions and interpret findings creatively that unlocks a company’s true potential.