Introduction
Data governance has become increasingly important in the age of artificial intelligence (AI) because organizations rely more heavily on data-driven insights and automated decision-making than ever before. One could even argue that data is the most important element of the decision-making process. For example, strong data governance allows Saks Fifth Avenue to access and assess its data in real time, giving it the ability to quickly find business insights that help its bottom line. Strong data governance allows Honeywell, the large multinational corporation, to offer its customers a cloud-based, software-as-a-service platform designed to enhance operational efficiency and productivity across various industry sectors.
What is Data Governance?
In 10 Key Components of Data Governance Program, I argued that “Data governance is a critical component of an organization’s overall data management strategy. It refers to the set of processes, policies, and standards that ensure the quality, security, and integrity of an organization’s data. Effective data governance is essential for ensuring that data is accurate, complete, and consistent, and that it is used in a way that supports the organization’s goals and objectives.”
Data governance is essential for Big Data and AI. In his article, New AI survey: Poor data quality leads to $406 million in losses, Mark Van de Wiel, Field CTO of Fivetran, states, “Ninety-seven percent of surveyed respondents acknowledge their organizations are investing in generative AI in the next 1-2 years. However, the survey also found that models trained on inaccurate, incomplete and low-quality data caused misinformed business decisions that impacted an organization’s global annual revenue by 6%, or $406 million on average, based on respondents from organizations with an average global annual revenue of $5.6 billion.”
What is Big Data?
Big Data refers to extremely large, complex datasets that traditional data processing tools cannot efficiently manage or analyze. It is characterized by the “5 V’s,” an extension of the three V’s (volume, velocity, and variety) that analyst Doug Laney introduced in a 2001 research note; veracity and value were added later. The five V’s are:
Volume – Massive amounts of data (e.g., social media, IoT sensors, transactions).
Velocity – High-speed generation and processing (e.g., real-time analytics, stock trading).
Variety – Diverse data types (structured, unstructured—text, images, videos).
Veracity – Data quality and reliability (addressing noise, bias, inconsistencies).
Value – Extracting actionable insights for decision-making.
AI Data Governance
Data Volume
Although Big Data brought with it the potential to utilize data in ways few could have imagined several decades ago, it also came with serious challenges, the first of which was the sheer size of the data being produced. This raised the question of storage: where would all this data go? The sheer volume of data generated every day from mobile phones, social media, IoT devices, customer transactions, and other sources requires massive amounts of storage. Storage systems also have to scale efficiently to handle the growing deluge of data without performance degradation.
Data Velocity
Data velocity refers to the speed at which data is generated, collected, and processed. High-speed data streams, including stock market data, smart city sensors, factory machine sensors, social media feeds, clickstream analysis, ad bidding platforms, IoT sensor data, and even point-of-sale (POS) purchase data, require real-time or near-real-time processing. While high-speed data streams enable real-time analytics, they also introduce significant challenges, including the following (a brief stream-processing sketch follows the list):
Real-time processing bottlenecks: Traditional batch-processing systems such as Hadoop can’t keep up with streaming data.
Storage overload: High-velocity data can flood storage systems, increasing costs and complexity.
Data quality degradation: Fast-moving data is harder to clean, leading to errors (e.g., duplicate sensor readings).
Network and infrastructure strain: High-velocity data demands low-latency networks and edge computing.
Compliance and security risks: Rapid data flows make it hard to enforce governance.
Resource-intensive analytics: Real-time analytics require massive compute power (e.g., algorithmic trading).
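To make the real-time requirement concrete, below is a minimal Python sketch of stream processing: a rolling one-minute average computed as each event arrives rather than in a nightly batch job. The sensor stream, event spacing, and window size are hypothetical, and a production system would use a streaming platform rather than plain Python.

```python
# Minimal sketch of near-real-time stream processing: a rolling one-minute
# average over simulated sensor events (all names and values are hypothetical).
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=1)

def rolling_average(events):
    """Yield (timestamp, rolling average) as each event arrives,
    keeping only readings from the last WINDOW."""
    window = deque()  # (timestamp, value) pairs currently inside the window
    total = 0.0
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Evict readings that have fallen out of the window.
        while window and ts - window[0][0] > WINDOW:
            _, old = window.popleft()
            total -= old
        yield ts, total / len(window)

if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    # Simulated stream: one reading every 10 seconds.
    stream = ((start + timedelta(seconds=10 * i), float(i % 7)) for i in range(20))
    for ts, avg in rolling_average(stream):
        print(ts.isoformat(), round(avg, 2))
```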
Data Variety
Big Data comes in many varieties, including structured, semi-structured, and unstructured data: text, images, videos, logs, and more. This diversity of formats, structures, and sources (e.g., text, images, databases, IoT streams, social media) introduces considerable integration, scalability, and regulatory issues. Structured data (SQL tables), semi-structured data (JSON, XML), and unstructured data (emails, videos) each require different processing. Merging CRM customer data with social media sentiment data, although a good idea because it provides a fuller picture of the customer, can be error-prone if there is no data standardization. Inconsistent metadata and data definitions can result in misaligned analytics, reporting errors, and compliance risks. Scalability issues are common as well. Unstructured data may contain hidden personal information, which could violate GDPR, HIPAA, or other federal and/or international laws.
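To illustrate the standardization problem, the sketch below normalizes structured CRM rows and semi-structured social-media JSON into one common customer schema before merging them. All field names, sources, and the matching key are assumptions made for illustration, not a reference implementation.

```python
# Minimal sketch of merging structured CRM records with semi-structured
# social-media sentiment into one standardized customer view.
# Field names and the shared customer key are hypothetical.
import json

def normalize_crm(row: dict) -> dict:
    """Map a structured CRM row onto the common schema."""
    return {
        "customer_id": str(row["CustomerID"]).strip(),
        "email": row["Email"].strip().lower(),
        "segment": row.get("Segment", "unknown"),
    }

def normalize_social(raw_json: str) -> dict:
    """Map a semi-structured social-media payload onto the common schema."""
    record = json.loads(raw_json)
    return {
        "customer_id": str(record["user"]["crm_ref"]),
        "sentiment": float(record.get("sentiment_score", 0.0)),
    }

def merge(crm_rows, social_payloads):
    """Combine both sources into a single record per customer."""
    customers = {r["customer_id"]: r for r in map(normalize_crm, crm_rows)}
    for payload in social_payloads:
        social = normalize_social(payload)
        target = customers.setdefault(social["customer_id"],
                                      {"customer_id": social["customer_id"]})
        target["sentiment"] = social["sentiment"]
    return list(customers.values())

if __name__ == "__main__":
    crm = [{"CustomerID": 42, "Email": " Pat@Example.com ", "Segment": "loyalty"}]
    social = ['{"user": {"crm_ref": 42}, "sentiment_score": 0.8}']
    print(merge(crm, social))
```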
Data Veracity
Big Data can be noisy, incomplete, inaccurate, biased, or inconsistent, which can lead to poor corporate decision-making. Biased datasets, in particular, can produce flawed AI and machine learning (ML) models. When massive volumes of information are processed, veracity is critical because flawed data can lead to flawed decisions.
The 1-10-100 rule of data quality is a principle of data quality management that emphasizes the escalating cost of addressing errors as they progress throughout the data process. While catching poor data quality at the source might cost $1 per data record, the cost of remediating the issues at the next data stage might be $10 per record. The cost of failure (i.e., doing nothing) at an ensuing stage will be $100 per record. This rule suggests that the cost to fix an issue increases tenfold through each subsequent step.
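A quick worked example makes the escalation concrete. The sketch below applies the 1-10-100 rule to a hypothetical batch of 10,000 flawed records; the per-record dollar amounts are the rule's illustrative values, not actual remediation costs.

```python
# Worked example of the 1-10-100 rule: the assumed per-record cost rises
# tenfold at each later stage of the data lifecycle.
STAGE_COST_PER_RECORD = {
    "caught_at_entry": 1,        # verify/fix at the source
    "downstream_cleanup": 10,    # remediate at a later stage
    "cost_of_failure": 100,      # do nothing and absorb the consequences
}

def remediation_cost(records_with_errors: int, stage: str) -> int:
    return records_with_errors * STAGE_COST_PER_RECORD[stage]

if __name__ == "__main__":
    bad_records = 10_000
    for stage in STAGE_COST_PER_RECORD:
        print(f"{stage}: ${remediation_cost(bad_records, stage):,}")
    # caught_at_entry: $10,000
    # downstream_cleanup: $100,000
    # cost_of_failure: $1,000,000
```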
Data Value
Managing large datasets requires expensive hardware, software, and sometimes cloud solutions, all of which create high data storage and compute costs. Big Data systems, such as Hadoop and Spark, require skilled personnel for setup and maintenance. Although analytical software prices are coming down, the “citizen data scientist” revolution is still in its infancy, and good modelers are hard to come by and not cheap. This means CTOs must ensure the data they are keeping and analyzing has value beyond the cost of storing, analyzing, and utilizing it.
Data Quality Management
In his article, The 42 V’s of Big Data and Data Science, Tom Shafer, Principal Data Scientist at Elder Research, notes that many new V’s were added after the initial five; by his count, the total now stands at 42. These include additional characteristics such as variability, validity, and visualization, but also a few that seem facetious. For instance, “Vexed” (“Some of the excitement around data science is based on its potential to shed light on large, complicated problems”) seems tongue-in-cheek. Shafer also includes “voodoo,” which, he argues, big data isn’t, but he recognizes the need to convince potential customers of “data science’s value to deliver results with real-world impact.” Shafer may also be cheekily referencing the enormous supercomputer in Douglas Adams’s seminal work, The Hitchhiker’s Guide to the Galaxy, which answered “42” to the ultimate question of life, the universe, and everything.
AI to the Rescue
No matter how many V’s there are, we do know Big Data has exacerbated the data quality problem. Thankfully, though, Big Data might, through AI, also be part of the solution. For instance, AI can:
Play a significant role in enhancing data governance.
Automatically tag and classify data based on content, context, and usage, making data easier to manage and retrieve (see the tagging sketch after this list).
Maintain comprehensive data catalogs that provide insights into data lineage, ownership, and usage patterns.
Reduce manual analysis efforts.
Identify anomalies or inconsistencies in data sets, flagging potential errors for review.
Automate the data cleansing processes, ensuring data accuracy and completeness.
Monitor data usage and access patterns to ensure compliance with regulations, such as GDPR, HIPAA, or CCPA.
Enforce important data governance policies.
Analyze data access patterns to identify potential security risks or unauthorized access.
Detect unusual behavior that may indicate a data breach or a security threat.
Improve data quality, data compliance, data security, and overall data management.
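As a simplified illustration of the automatic tagging item above, the following Python sketch applies rule-based patterns to flag likely personal information so records can be classified before feeding an AI model. The regular expressions and tag names are assumptions for illustration; real AI-driven classifiers combine rules like these with trained models.

```python
# Minimal sketch of automated, content-based tagging: a rule-based pass that
# flags likely PII so records can be classified before AI/ML use.
# The patterns and tag names are illustrative, not a complete PII catalog.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def tag_record(record: dict) -> set:
    """Return the set of PII tags detected anywhere in the record's values."""
    tags = set()
    for value in record.values():
        if not isinstance(value, str):
            continue
        for tag, pattern in PII_PATTERNS.items():
            if pattern.search(value):
                tags.add(tag)
    return tags

if __name__ == "__main__":
    row = {"name": "Pat", "contact": "pat@example.com, 555-867-5309"}
    print(tag_record(row))  # {'email', 'us_phone'} (set order may vary)
```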
The Foundation for Technological Success
In the MIT Technology Review article, AI-readiness for C-suite leaders, the MIT Technology Review Insights team claims, “AI is showing executives that a compelling data strategy is the foundation for technological success.” In its survey of 300 C-suite executives and senior technology leaders, as well as interviews with four leading experts, the MIT Technology Review Insights team found that 82% of those surveyed agreed that scaling AI or generative AI use cases to create business value was a top priority for their organization. For those surveyed, the four main challenges in achieving AI readiness included “managing data volume, moving data from on-premises to the cloud, enabling real-time access, and managing changes to data.”
Among those surveyed by the MIT Technology Review Insights team, 83% said their organization had identified numerous sources of data that they needed to bring together in order to enable their AI initiatives, and they were looking for software solutions that were both scalable and use-case agnostic. Although data-dependent technologies in recent decades have driven data integration and aggregation programs, these are typically tailored to specific use cases and not useful when it comes to AI data governance.
Now, 82% of respondents are prioritizing solutions “that will continue to work in the future, regardless of other changes to our data strategy and partners. Data governance and security should be addressed from the beginning of any AI strategy to ensure data is used and accessed properly,” says the MIT Technology Review Insights team. Adaptability is key.
The Data Challenges of AI
The current state of IT systems at many companies means data governance is far from a straightforward endeavor. Stewart Bond, vice president of IDC’s Data Intelligence and Integration Software Service, says that enterprise information is often highly fragmented. A recent IDC survey found businesses use “over a dozen different technologies just to harvest all the intelligence about their data and the same number to integrate, transform, and replicate it.” CTOs are very well aware of the situation. They know they must invest in technology that helps them get control over their data.
George Fraser, CEO at Fivetran, concurs, noting the importance of investing in data foundations before deploying AI. “You want to make sure that you have an enterprise data warehouse with clean, curated data, which should be supporting all of your traditional BI and analytics workloads, before you go and start hiring a lot of data scientists and initiating a lot of generative AI projects,” contends Fraser. Organizations that don’t build strong data foundations first will find their data scientists wasting valuable time on basic duties like data cleansing and/or data integration work, warns Fraser. As any CTO will tell you, the last thing they want is their highly paid data modelers cleansing data when AI can do it faster and cheaper.
The High Cost of Bad Data
Bad data is costly. A study by Vanson Bourne and Fivetran found that AI models built on low-quality data cost companies an average of 6% of annual revenue. In retail, out-of-stocks due to bad data accounted for $1.2 trillion and overstocks totaled $562 billion in 2022, according to BlueYonder. Experian found that “85 percent of organizations indicate that poor-quality contact data for customers negatively impacts their operational processes and efficiency and, in turn, hinders the chances of being flexible and agile. Poor-quality data has a ripple effect as do operational issues.”
The challenge of data quality, cited by 43% of those surveyed, is multi-faceted. “Data needs to be clean, consistent, correct, and timely, and you have to be able to use it in the appropriate context,” says Bond. He adds, “The reality is, though, that data will never be all of these things at the same time.” It’s not like companies haven’t tried. As Bond explains, over the last three decades, “the aspiration to collect and combine data into a single view of reality has been a constant one—if never fully achieved.”
The “single customer view” is a personalization goal CTOs have been chasing, and never quite realizing, for decades. It is like a mirage inside a digital dream, yet it would be so valuable to attain that no CTO can give up on it. Bad data hampers the journey; good data governance is the only way to get there.
The Difficulties of Preparing Data for AI
The data needed for AI is voluminous, often changing rapidly, of a multitude of types, and can be in different degrees of cleanliness and completeness. As Bond puts it, “data is all over the place. It’s very diverse, with so many different kinds, types, formats, and modes. It’s also very dynamic, always moving and changing.”
When it comes to the primary data readiness challenges (see Figure 1), the MIT Technology Review Insights team found that roughly four in ten respondents cited each of the following data issues:
Data integration or pipelines (45%).
Data governance or security concerns (44%).
Data quality (43%).
Organizational culture around data (39%).
Figure 1: Difficulties in preparing data for AI
Survey respondents agreed that the necessary data integration foundations should be in place before building any AI tools.
The “breadth and complexity of the data required for AI explains the wide range of data integration difficulties companies face,” says the MIT Technology Review Insights team. Figure 2 shows the primary data integration challenges. Managing the volume of data, driven largely by Big Data, is the top challenge at 45%. The embrace of cloud technology also adds issues, with “moving data from on-premises to cloud” the second choice at 43%. Companies looking to do analytics really have no choice but to put their data in the cloud, as the sophisticated analytical systems needed for AI and ML modeling require it.
Figure 2: Data integration challenges
Data Centralization
George Fraser, CEO of Fivetran, says, “Data needs to be all in one place, accurate, up-to-date, and in a schema that the people at your company understand.” Bond agrees that a centralized approach to data storage is needed, as it allows businesses “to put proper controls on it.”
Mike Hite, Saks’ chief technology officer, meanwhile, stresses that the changing economics of data storage should reshape thinking about its management. He explains that, in the past, additional data storage came at a cost. This made it important to transform information within data pipelines so that it could be stored in a consolidated form. Technological advances, however, have eliminated this marginal cost, he argues. Instead, companies now pay only for utilization. “The days of having to massage data to land in a shape where it’s already useful are gone. You can land it in as raw a form as possible,” adds Hite.
In a world where data has become the lifeblood of an organization, the importance of strong data lineage cannot be overstated. AI tools can track data lineage from creation to use to destruction, allowing organizations to understand where data comes from and how it is used and transformed. AI can also simulate the impact of changes to data structures or governance policies, helping teams make informed decisions.
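The sketch below shows the lineage idea in miniature: a small directed graph records how each dataset was derived, so any downstream metric can be traced back to its sources. The dataset names and transformations are hypothetical, and commercial lineage tools capture far more detail automatically.

```python
# Minimal sketch of data lineage tracking: a directed graph recording how each
# dataset was derived. Dataset names and transformations are hypothetical.
from collections import defaultdict

class LineageGraph:
    def __init__(self):
        # target dataset -> list of (source dataset, transformation applied)
        self.parents = defaultdict(list)

    def record(self, source: str, target: str, transformation: str):
        self.parents[target].append((source, transformation))

    def trace(self, dataset: str, depth: int = 0):
        """Print the full upstream lineage of a dataset."""
        for source, transformation in self.parents.get(dataset, []):
            print("  " * depth + f"{dataset} <- {source} ({transformation})")
            self.trace(source, depth + 1)

if __name__ == "__main__":
    g = LineageGraph()
    g.record("crm.orders_raw", "staging.orders_clean", "dedupe + type casting")
    g.record("staging.orders_clean", "analytics.daily_revenue", "aggregate by day")
    g.trace("analytics.daily_revenue")
```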
Case Studies
Saks Fifth Avenue
In its AI-readiness for C-suite leaders report, the MIT Technology Review Insights team states, “The luxury retailer Saks has transformed its data integration processes to generate more and more useful business insight. Faster access to real-time data gives the data team both the tools and the time to experiment with finding business value, including in deploying AI models and applications.”
Saks implemented a data lake that ingests a plethora of data types into its data warehouse. Strong data governance allows the retailer to move quickly and easily when analyzing the data. New data sources that might be interesting to the data team can be quickly added to the data lake without the creation of manual data pipelines. Saks rapidly pulls data into its data warehouse and runs its models to find out whether the data is useful. Saks doesn’t burn a lot of time integrating the data. “Any failures are fast and low cost, allowing for more experimentation, and enabling the top- and bottom-line benefits such trials can bring,” adds the MIT Technology Review Insights team.
Simplifying the Data Integration Process
To reshape how it aggregates information, Saks has greatly simplified its data integration process. Previously, data had to be extracted from an application, translated into a common format, transformed, cleansed, and loaded into a repository. Today’s approach keeps everything in its raw form. “Our single purposes for data integration are making sure that the data from each application is complete and our data lake is as real time as possible,” says Hite. Then the company can transform the raw data sitting in its data lake into a KPI or a sales metric, plug it directly into an LLM for generative AI uses, or even feed it into a predictive AI model, says Hite.
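The following Python sketch illustrates the ELT pattern Hite describes (land the payload raw, transform only when a metric is needed), using an in-memory list as a stand-in for the data lake. The payload fields and KPI are hypothetical; this is not Saks’ actual pipeline.

```python
# Minimal sketch of the ELT pattern: load raw, transform on demand.
# RAW_LAKE is an in-memory stand-in for a real data lake landing zone.
import json
from datetime import date

RAW_LAKE = []

def land_raw(payload: str):
    """Load step: store the source payload untouched, plus load metadata."""
    RAW_LAKE.append({"loaded_on": date.today().isoformat(), "raw": payload})

def daily_sales_kpi() -> float:
    """Transform step, run only when the metric is actually needed."""
    total = 0.0
    for entry in RAW_LAKE:
        record = json.loads(entry["raw"])
        total += float(record.get("sale_amount", 0.0))
    return total

if __name__ == "__main__":
    land_raw('{"order_id": 1, "sale_amount": 120.0}')
    land_raw('{"order_id": 2, "sale_amount": 79.5}')
    print(daily_sales_kpi())  # 199.5
```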
Data governance has increased the speed and efficiency with which Saks utilizes data and has reduced costs. “Where it once would have required several months to onboard data, it now takes as little as an hour,” claims Hite. This sea change in the way Saks thinks about and utilizes its data allows analysts and modelers to spend the bulk of their time understanding data rather than prepping it, which, Hite contends, is far more valuable for the company in the long run.
Honeywell
For Honeywell, a multinational corporation working in the aerospace, industrial automation, and energy industries, good governance helps it implement generative AI across its enterprise performance management platform. Suresh Venkatarayalu, Honeywell’s chief technology and innovation officer, explains that the company has long embedded deterministic AI within its core products, such as Honeywell Forge, a cloud-based, software-as-a-service platform designed to enhance operational efficiency and productivity across various industry sectors. Forge provides AI-enabled applications and services for intelligent, efficient, and secure operations, particularly in industrial settings, buildings, and aerospace. Because of the strong data governance Honeywell has implemented, questions of governance and information security are largely settled.
“By bringing together decades worth of data from disparate systems across its support business, including service tickets, product manuals, maintenance records, knowledge articles, and technical publications, Honeywell aims to train an AI model that can assist the company’s maintenance engineers, service engineers, and operators,” says the MIT Technology Review Insights team. Honeywell believes this opportunity has the potential to augment the knowledge of its technicians while reducing the risk of production and operations losses, adds the MIT Technology Review Insights team.
Honeywell’s Generative AI Strategy
Generative AI also brings both compliance challenges and opportunities. At the core of Honeywell’s generative AI strategy is a foundation of responsible AI governance. A Data and AI Steering Council made up of leaders across the organization meets monthly to drive the overall governance, development, and deployment strategy. The maintenance project described above gives a good example of what this means in practice. Even though those involved in it are still examining fundamental questions, such as which data to include, what to build in-house, and how to measure accuracy and benefits, the governance is largely in place. “At this juncture,” Venkatarayalu reports, the team, along with its generative AI partner, has “put enough guardrails and controls in place so that we are comfortable that we are not revealing the overall IP, including proprietary data and domain knowledge.”
Key Components of an Effective Data Governance Strategy for AI and Big Data
Governance Framework
1. Establish a clear governance structure with defined roles and responsibilities.
2. Create a dedicated governance body (e.g., data governance council) to oversee policies and standards.
Data Quality Management
1. Implement regular data quality assessments to ensure accuracy and reliability.
2. Utilize AI tools for automated data cleansing and anomaly detection.
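As a minimal sketch of the second item, the code below runs a simple statistical quality check that flags values far from the mean for review. The threshold and sample data are illustrative assumptions; AI-based detectors are typically more sophisticated, but the workflow (score, flag, review) is the same.

```python
# Minimal sketch of an automated anomaly check: flag values whose z-score
# exceeds a threshold so they can be reviewed or cleansed.
from statistics import mean, stdev

def flag_outliers(values, threshold: float = 3.0):
    """Return indices of values more than `threshold` standard deviations
    from the mean."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

if __name__ == "__main__":
    # 30 ordinary order totals plus one suspicious entry at the end.
    order_totals = [50.0 + (i % 5) for i in range(30)] + [4_999.0]
    print(flag_outliers(order_totals))  # [30]
```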
Compliance & Risk Management
1. Ensure adherence to relevant regulations (e.g., GDPR, HIPAA).
2. Develop policies for risk assessment related to data usage in AI applications.
3. Map data flows and identify regulatory requirements.
4. Classify data by sensitivity and jurisdiction.
5. Enforce role-based access controls (RBAC); a short sketch follows this list.
6. Prepare for breach notifications.
7. Document compliance for audits.
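The sketch below illustrates item 5, role-based access control, by mapping each role to the data-sensitivity levels it may read and refusing everything else. The roles, sensitivity levels, and dataset are hypothetical.

```python
# Minimal sketch of role-based access control (RBAC) for data access.
# Roles, sensitivity levels, and the dataset are hypothetical.
ROLE_CLEARANCE = {
    "analyst": {"public", "internal"},
    "data_steward": {"public", "internal", "confidential"},
    "compliance_officer": {"public", "internal", "confidential", "restricted"},
}

def can_read(role: str, data_sensitivity: str) -> bool:
    """True if the role is cleared for data at the given sensitivity level."""
    return data_sensitivity in ROLE_CLEARANCE.get(role, set())

def read_dataset(user_role: str, dataset: dict):
    """Return the rows only if the caller's role is cleared; otherwise refuse."""
    if not can_read(user_role, dataset["sensitivity"]):
        raise PermissionError(f"{user_role} may not read {dataset['name']}")
    return dataset["rows"]

if __name__ == "__main__":
    patients = {"name": "patient_records", "sensitivity": "restricted", "rows": []}
    print(can_read("compliance_officer", "restricted"))  # True
    print(can_read("analyst", "restricted"))             # False
```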
Promote Data Literacy
1. Assess current data literacy levels.
2. Tailor training programs.
3. Train staff on data governance principles, ethical considerations, and responsible AI use.
4. Integrate AI literacy into daily workflows.
5. Create an open culture about AI.
6. Measure progress and iterate.
Stakeholder Engagement
1. Personalize stakeholder interactions.
2. Automate routine communications.
3. Enhance decision-making with real-time insights.
4. Improve transparency with AI-powered reporting.
5. Proactively address concerns.
Standardize Data Classification
1. Implement classification schemes to categorize data based on sensitivity and compliance requirements.
2. Create a hierarchical classification system (e.g., by sensitivity, domain, or regulatory requirement).
3. Automate classification with AI by deploying AI models to scan and label data.
4. Enforce metadata standards by mandating metadata fields (e.g., owner, creation date, retention period).
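To show how the items above fit together, the following sketch validates a dataset's metadata against a mandated field list and an allowed sensitivity scheme, rejecting anything that does not conform. The required fields and labels are illustrative assumptions rather than a standard.

```python
# Minimal sketch of enforcing mandatory metadata and a sensitivity scheme.
# The required fields and allowed labels are illustrative assumptions.
from datetime import date

ALLOWED_SENSITIVITY = {"public", "internal", "confidential", "restricted"}
REQUIRED_METADATA = {"owner", "created_on", "retention_days", "sensitivity"}

def validate_metadata(meta: dict) -> list:
    """Return a list of metadata problems; an empty list means the record passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_METADATA - meta.keys()]
    if meta.get("sensitivity") not in ALLOWED_SENSITIVITY:
        problems.append(f"unknown sensitivity: {meta.get('sensitivity')}")
    return problems

if __name__ == "__main__":
    dataset_meta = {
        "owner": "marketing_analytics",
        "created_on": date(2024, 3, 1).isoformat(),
        "sensitivity": "secret",  # not part of the allowed scheme
    }
    print(validate_metadata(dataset_meta))
    # ['missing field: retention_days', 'unknown sensitivity: secret']
```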
Today, effective data governance is no longer optional—it is the backbone of successful corporations. As organizations like Saks and Honeywell demonstrate, robust governance frameworks enable real-time data access, enhance decision-making, and unlock transformative business value. Conversely, poor data quality costs enterprises millions in lost revenue. Even worse, AI models built on flawed data can perpetuate bias, compliance risks, and operational failures.
The Five V’s of Big Data (Volume, Velocity, Variety, Veracity, and Value) underscore the challenges—exploding data scales, integration complexities, and the critical need for accuracy. However, AI also offers solutions: automated data cleansing, anomaly detection, and lineage tracking can turn governance from a burden into a competitive advantage.
AI can analyze historical data to predict trends and outcomes, aiding in decision-making processes related to data governance. It can generate insights and reports on data usage, quality, and governance compliance, helping stakeholders make better data-driven decisions. To thrive in the AI era, companies must prioritize data foundations—clean, integrated, and well-governed data—before deploying advanced analytics.
While Big Data unlocks transformative insights, addressing the data challenges that come with it requires a combination of technology investments, skilled talent, data governance policies, and ethical considerations. Organizations that navigate these hurdles effectively can gain a competitive edge through data-driven decision-making.
MIT Technology Review Insights warns that “merely deploying open-source foundational AI models is unlikely to provide differentiation. Saks chief technology officer Mike Hite explains that ‘off-the-shelf’ AI tools will give all users the same answers. Organizations that are pulling ahead do not necessarily have large AI teams or better large language model (LLM) technology.” Instead, Hite argues, the people with the most interesting data sets on a particular subject will have the real opportunity. Creativity will, even with LLMs and generative AI, continue to be king.