Introduction
Shortly after his victory at the Battle of Hastings in 1066, William the Conqueror ordered a comprehensive survey of England, which became known as the Domesday Book because its judgments were as final as Judgment Day. Completed in 1086, the Domesday Book comprehensively surveyed much of England and parts of Wales and became the first national census in European history. The survey documented over 13,000 places, detailing who owned the land, how it was used, its population, and all assets on the property, such as livestock and buildings.
The survey revealed that William directly controlled about 20% of the land, Normans controlled around 50%, the Church about 25%, and English nobility only 5%, highlighting the shift in power following the Norman conquest. The document’s primary purpose was to provide a detailed record of land ownership, resources, and wealth. It enabled William to assess and collect taxes effectively. Although the tome was filled with dry facts and figures, it was something not on the written page that motivated William — he was one of the first people to realize political power lay more in numbers, less in swords. He innately understood whoever controlled the data controlled the kingdom.
The Start of the Data Revolution
William would certainly have appreciated how his system of collection kicked off the data revolution. Much of what he did in the eleventh century parallels what we do with data today. His royal clerks recorded every ox, land acre, and serf across 13,000 settlements, creating a standardized metric for wealth. He established data validation through sworn testimonies from local jurors. Stewardship breakthroughs included a centralized authority where data flowed from local sheriffs to the royal treasury. The system was purpose-driven, with tight access control: only the King’s auditors had the power to amend records, creating a medieval version of admin privileges. The results were impressive; within a decade, tax revenues increased by 20%, and considerable hidden assets were uncovered. William’s system was so impressive, it remained the legal land registry in England for 800+ years. Talk about data longevity.
Data is used for everything today. Implementing the best practices for data governance requires realizing that data stewardship is its foundational pillar. The successful alignment of data governance, artificial intelligence, machine learning, and emerging technologies ensures an organization’s data assets remain accurate, secure, and highly actionable. AI-driven tools, such as anomaly detection, automated data quality management, intelligent metadata handling, self-service data prep, scalable data cleansing, and proactive issue resolution, are making data stewardship more scalable, efficient, and precise.
What is Data Stewardship?
The French data management company Semarchy provides a description of data stewardship using a clever country analogy. All nations have residents living, acting, and influencing their state. “Usually, they are bound to rules and laws which are decided on by a government or by social agreement, and enforced by bodies like the police force, or a responsible parent. If you understand this, you understand data governance, stewardship, and management,” claims Semarchy.
Data governance is how the government is arranged (aka the Constitution or country statutes) as well as the decisions and laws it legislates. Data stewardship, however, is what the police do, i.e., the enforcement of the laws, while data management is the daily work of citizens. They go about their business as usual, living and acting as happily as possible within the limits of the law. Semarchy concludes, “Data stewardship is the practice of managing and overseeing an organization’s data assets to ensure their quality, accessibility, and security. It’s like having a dedicated caretaker for your data, making sure it’s always in top shape and ready to use.”
The key components of data stewardship include data management, data roles and the precise responsibilities for those roles, data quality, data compliance, data security, and metadata management. Data stewardship encompasses the entire data lifecycle, from creation to storage to usage, including complex architectures like a dimensional data modeling fact qualifier matrix, archiving, and ultimately deletion. For a company to function properly, its data must be accurate and reliable no matter how long it is used, and this is where data stewardship is essential.
A Data Steward’s Responsibilities
As I lay out in my article, Data Steward Roles and Responsibilities: A Complete Guide, “Data stewards play a pivotal role in effective data stewardship, working across various business functions to uphold the organization’s data governance framework. Their responsibilities encompass a range of activities essential for maintaining data quality, data standardization, and governance over the entire data lifecycle.” Key responsibilities include acting as an intermediary between technical teams and business executives, monitoring data from creation to deletion, conducting regular data audits to ensure data quality, data cataloging, and ensuring data integrity across all of the company’s data assets. Data stewards drive all corporate-wide data literacy initiatives as well.
Artificial Intelligence
Artificial intelligence (AI) is a technology that enables computers and machines to simulate human cognitive functions, such as learning, reasoning, problem-solving, decision-making, creativity, and autonomy. AI systems can perceive their environment, understand and respond to human language, recognize objects, learn from new data and experiences, and act independently to perform complex tasks that historically required human intelligence.
“The steam engine gave us a tremendous number of physical superpowers in manufacturing, transport, and construction by ultimately creating machinery that was more powerful and mobile than simple watermills,” says tech titan Reid Hoffman in Gen AI: A cognitive industrial revolution. “The same thing is happening now with cognitive capabilities in anything that we do that uses language, be it communication, reasoning, analysis, selling, marketing, support, and services,” he adds.
In their article, Artificial Intelligence for the Real World: Don’t Start with Moon Shots, Thomas H. Davenport and Rajeev Ronanki believe companies should “look at AI through the lens of business capabilities rather than technologies. Broadly speaking, AI can support three important business needs: automating business processes, gaining insight through data analysis, and engaging with customers and employees.”
So, what do the numbers really look like? In the same article, Davenport and Ronanki report that GE used AI and ML technology “to integrate supplier data and has saved $80 million in its first year by eliminating redundancies and negotiating contracts that were previously managed at the business unit level.” Nothing to sniff at, obviously.
Leveraging AI to Strengthen Data Stewardship
Anomaly Detection
Anomaly detection is the process of identifying unusual patterns in data straying significantly from expected behavior. These anomalies may indicate errors, fraud, defects, or critical insights. AI, in particular machine learning and deep learning, is helping organizations spot these anomalies. ML can identify unusual patterns in logs, metrics, or network traffic, even detecting a DDoS attack before it causes operational downtime.
In its article, How does machine learning improve anomaly detection?, Milvus claims, “Machine learning improves anomaly detection by enabling systems to automatically learn patterns from data and identify deviations without relying solely on hard-coded rules. Traditional methods often depend on predefined thresholds or static heuristics, which struggle to adapt to complex or evolving data. Machine learning models, by contrast, analyze historical data to understand normal behavior and flag outliers based on learned patterns.” Milvus gives the example of a supervised learning model trained on labeled datasets, which can learn the features that distinguish legitimate activity from fraudulent activity. In addition, unsupervised techniques, like clustering, can separate out anomalies that don’t fit any cluster. “This adaptability makes machine learning particularly effective in dynamic environments where anomalies evolve over time,” contends Milvus.
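The idea of learning “normal” behavior from history and flagging deviations can be sketched in a few lines of pure Python. This is a deliberately minimal illustration using a fixed z-score threshold, not a production detector; real systems would use learned models of the kind Milvus describes.

```python
from statistics import mean, stdev

def fit_baseline(history):
    """Learn what 'normal' looks like from historical values."""
    return mean(history), stdev(history)

def find_anomalies(values, baseline, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hourly request counts: mostly stable, one DDoS-like spike.
history = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98]
baseline = fit_baseline(history)
print(find_anomalies([101, 99, 950, 100], baseline))  # [950]
```

The baseline is refit as new history accumulates, which is the crude analogue of a model adapting to evolving data.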
In his article, Anomaly Detection: A Key to Better Data Quality, Dutta Angsuman, CTO of FirstEigen, adds to Milvus’s claims, stating, “Another ongoing trend in anomaly detection is the use of predictability. This involves using ML and AI technology to predict where outliers are likely to occur, allowing systems to quickly and efficiently identify anomalous data – including malicious code – before it affects data quality.”
Real World Anomaly Detection
Machine learning breaks down into three types of modeling: supervised learning, unsupervised learning, and semi-supervised learning. Each technique has its own real-world anomaly detection use case. Supervised learning, as the name implies, requires supervision. Data analysts must label data points as necessary, and this labeled data is used as a training dataset. This type of learning model can detect outliers similar to the examples given, but it won’t spot unknown anomalies or predict future issues.
Supervised Learning
K-nearest neighbor (KNN) algorithms, linear regression, and local outlier factor (LOF) are common supervised learning algorithms. KNN and LOF identify outliers based on how isolated a data point is relative to its local neighborhood (rather than the entire dataset). The main difference between KNN and LOF is that “while KNN makes assumptions based on data points that are closest together, LOF uses the points that are furthest apart to draw its conclusions,” explains Camilo Quiroz-Vázquez in the IBM article, Anomaly detection in machine learning: Finding outliers for optimization of business functions.
Quiroz-Vázquez says that retailers using labeled data from a previous year’s sales figures can predict future sales goals. “It can also help set benchmarks for specific sales employees based on their past performance and overall company needs. Because all sales data is known, patterns can be analyzed for insights into products, marketing and seasonality,” adds Quiroz-Vázquez.
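As a toy illustration of the supervised case, here is a minimal nearest-neighbor classifier in pure Python. The single feature (transaction amount) and the labels are invented for the sketch; a real fraud model would use many features and a library implementation.

```python
def knn_predict(train, query, k=3):
    """Classify `query` by majority vote of its k nearest labeled neighbors."""
    by_dist = sorted(train, key=lambda p: abs(p[0] - query))
    votes = [label for _, label in by_dist[:k]]
    return max(set(votes), key=votes.count)

# Labeled transaction amounts (toy 1-D feature): "ok" vs. "fraud".
train = [(20, "ok"), (35, "ok"), (50, "ok"), (42, "ok"),
         (900, "fraud"), (1200, "fraud"), (980, "fraud")]
print(knn_predict(train, 45))    # ok
print(knn_predict(train, 1000))  # fraud
```

Note the limitation the article describes: this model only recognizes outliers that resemble the labeled examples it was trained on.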
Unsupervised Learning
Unsupervised learning finds hidden patterns, groupings, or structures in unlabeled data without explicit guidance; its deep learning variants do so by mimicking the way biological neurons signal each other. “These techniques can go a long way in discovering unknown anomalies and reducing the work of manually sifting through large data sets,” says Quiroz-Vázquez. However, because these techniques make assumptions about the data input, they can incorrectly label anomalies, warns Quiroz-Vázquez.
K-means, isolation forests, and one-class support vector machines (SVMs) are unsupervised machine learning algorithms used for clustering and identifying outliers.
In manufacturing, unsupervised learning algorithms can inspect unlabeled data from sensors attached to equipment and forecast any potential malfunctions in a predictive maintenance way. Companies can proactively make repairs before any critical breakdowns, thereby reducing potential machine downtime, says Quiroz-Vázquez.
For corporate cybersecurity, unsupervised learning algorithms spot potential network attacks in real time. “These algorithms can create a visualization of normal performance based on time series data, which analyzes data points at set intervals for a prolonged amount of time. Spikes in network traffic or unexpected patterns can be flagged and examined as potential security breaches,” says Quiroz-Vázquez.
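A minimal sketch of the unsupervised idea: score each point by its distance to its nearest neighbors, so a point that fits no cluster scores high, with no labels involved. The sensor readings below are invented; production systems would use algorithms like isolation forests at scale.

```python
def outlier_scores(points, k=2):
    """Unsupervised anomaly score: mean distance to the k nearest other points.
    Points far from every neighbor (fitting no cluster) score highest."""
    scores = {}
    for p in points:
        dists = sorted(abs(p - q) for q in points if q != p)
        scores[p] = sum(dists[:k]) / k
    return scores

# Sensor readings cluster around 10 and 50; 200 fits neither cluster.
readings = [9, 10, 11, 49, 50, 51, 200]
scores = outlier_scores(readings)
print(max(scores, key=scores.get))  # 200
```

Nothing here was told which readings are “bad”; the anomaly falls out of the data’s own structure, which is exactly the appeal of the unsupervised approach.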
Semi-supervised Learning
Semi-supervised learning combines the best of both learning methods. As Quiroz-Vázquez explains, “Engineers can apply unsupervised learning methods to automate feature learning and work with unstructured data. However, by combining it with human supervision, they have an opportunity to monitor and control what kind of patterns the model learns.” This usually increases the model’s predictive capabilities.
“Predictive algorithms can use semi-supervised learning that require both labeled and unlabeled data to detect fraud. Because a user’s credit card activity is labeled, it can be used to detect unusual spending patterns. However, fraud detection solutions do not rely solely on transactions previously labeled as fraud; they can also make assumptions based on user behavior, including current location, log-in device and other factors that require unlabeled data,” concludes Quiroz-Vázquez.
According to the Global Anomaly Detection Industry report, the global market for anomaly detection solutions is expected to reach $8.6 billion by 2026, with a compound annual growth rate of 15.8%.
Automated Data Quality Management
AI algorithms can identify errors, inconsistencies, and duplicates within a dataset and automatically correct them without any manual intervention. Machine learning models detect patterns and outliers indicating data quality issues. This enables efficient and accurate cleansing and standardization across diverse data sources. Unlike traditional periodical monitoring checks, AI enables continuous, real-time inspection of data streams to instantly detect anomalies or inconsistencies that might affect data quality. Organizations can quickly address issues before they negatively impact the business.
AI can cull through historical data trends and forecast where and when data quality problems might arise in the future. This foresight allows organizations to implement preventive measures, which can reduce errors and improve data reliability over time.
AI-powered profiling tools automatically analyze datasets to understand their structure, content, and quality dimensions. Augmented data catalogs identify key data domains (e.g., names, addresses) and help apply consistent quality rules across all sources, making integration seamless and scalable. AI can classify unstructured data like emails and images.
Using natural language processing (NLP) and pattern recognition models, AI can identify duplicates, spot outliers, and uncover inconsistencies (e.g., the use of “LA” instead of the correct “Los Angeles”). Suggestions for fixes or auto-corrects can be made based on rule-based validation systems, which can also flag duplicates, missing values, and inconsistent formats.
AI provides intelligent deduplication. ML models can detect and merge duplicate records even when they are not exact matches, improving data integrity and veracity. Probabilistic matching algorithms can unify customer profiles, supplier data, and transaction entries across a system.
Key AI Techniques
NLP: Parsing unstructured text (emails, logs)
Clustering (k-means): Grouping similar records for deduplication
Autoencoders: Detecting anomalies in complex datasets
Fuzzy Matching: Identifying near-duplicates (e.g., “Jon Doe” vs. “John Doe”)
Reinforcement Learning: Optimizing cleansing rules dynamically
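Fuzzy matching of near-duplicates can be illustrated with Python’s standard-library SequenceMatcher. The 0.85 threshold here is an arbitrary choice for the sketch; real deduplication pipelines tune thresholds per field and per data source.

```python
from difflib import SequenceMatcher

def is_near_duplicate(a, b, threshold=0.85):
    """Fuzzy match: a similarity ratio above `threshold` suggests duplicates."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(is_near_duplicate("Jon Doe", "John Doe"))    # True
print(is_near_duplicate("Jon Doe", "Jane Smith"))  # False
```

ML-based dedup goes further by learning which fields matter most (a matching SSN outweighs a matching first name, for example), but the ratio comparison above is the core mechanic.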
Intelligent Metadata Handling
Intelligent metadata handling refers to the use of AI, automation, and advanced analytics to manage, enrich, and govern metadata—the “data about data”—to improve data quality, discoverability, and compliance. For data stewardship, this means ensuring metadata is accurate, actionable, and aligned with governance policies.
In its article, Metadata Management Gets Smarter with AI and NLP, the Datahub Analytics Team claims, “Effective metadata management ensures that data assets are not just stored, but are also searchable, trusted, and ready for use across business functions.” AI-driven tools can identify data types, usage patterns, relationships, and even lineage, slashing the need for manual input. AI maps data flows, automatically tracing how data moves from source to dashboard, and helps architects weigh the benefits and limitations of abstract data models during system design. AI-powered tagging can auto-classify data (e.g., labeling a column as “Personally Identifiable Information” or “CRM Data”). AI can understand contextual information, pulling metadata from file names, content, and usage patterns. Usage-based recommendations can suggest relevant metadata based on how data is queried.
Governance, Compliance, and Regulatory Automation
Navigating AI regulations offers key insights and impacts for businesses, and modern AI helps with real-time compliance monitoring by providing up-to-the-minute dashboards that reflect your current risk posture, control effectiveness, and regulatory alignment. Real-time insights enable compliance leaders to act proactively, not reactively, improving decision-making speed and accuracy while reducing the risk of reputational damage and monetary fines, which can be quite significant these days.
AI continuously monitors data and processes to automatically enforce compliance policies. It adapts dynamically to regulatory changes by updating rules and procedures without manual intervention. This ensures ongoing alignment with constantly changing laws and standards.
AI tools automatically identify, classify, and tag sensitive data, applying encryption and access controls to protect privacy and comply with global frameworks as well as specific regulations in the United States, like HIPAA. This reduces human error and even enhances data security.
Robotic Process Automation (RPA) combined with AI automates repetitive tasks such as data entry, report generation, and audit trail creation, reducing manual labor and human errors while freeing compliance teams to focus on strategic activities.
Other ways AI can help with governance and compliance include:
AI Governance Policy Enforcement: AI flags non-compliant metadata (e.g., “Unlabeled Personally Identifiable Information (PII) detected in employee_data”).
Privacy Risk Scoring: Ranks datasets by sensitivity (e.g., “This table has three high-risk GDPR fields”).
Access Monitoring: AI monitors access patterns to detect misuse (e.g., marketing teams querying healthcare data).
Auto-Redaction: Masks sensitive fields in queries (e.g., masking SSNs).
Risk Prediction: Predicts GDPR/HIPAA risks by analyzing data usage trends.
Alert Filtering: Filters out irrelevant alerts.
Fairness Checks: AI detects skewed data (e.g., hiring datasets favoring one demographic) and suggests mitigations.
Audit Trail AI: Tracks who altered data and why, with anomaly alerts.
Compliance Scanning: Scans for PII/GDPR violations (e.g., unsecured SSNs).
Bias Mitigation: Flags discriminatory patterns.
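Auto-redaction of sensitive fields, such as masking SSNs, can be as simple as a pattern substitution. This sketch handles only the standard NNN-NN-NNNN format; real compliance tools combine such patterns with ML-based entity detection to catch variants.

```python
import re

# Matches the standard U.S. SSN format NNN-NN-NNNN as a whole token.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_ssns(text):
    """Mask anything matching the SSN format before it leaves the system."""
    return SSN_PATTERN.sub("***-**-****", text)

print(redact_ssns("Employee 4821, SSN 123-45-6789, cleared for export."))
# Employee 4821, SSN ***-**-****, cleared for export.
```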
Self-Service Data Prep
As Mariam Anwar explains in her article, Self-Service Data Preparation: The Pathway to Business Growth, self-service data prep is about “putting the power of data straight into the hands of those who need it most: business users, analysts, managers, and others who might not have technical expertise in data handling.” Adding AI to the self-service data prep process empowers business users and data analysts to clean, transform, and prepare data for analysis without a heavy reliance on their IT or data engineering departments. AI enhances this process by automating many complex and time-consuming tasks, making data prep faster, more accurate, and more accessible to non-technical users.
AI algorithms automatically detect and correct errors, inconsistencies, duplicates, and missing values in datasets, reducing manual data quality checks and minimizing errors.
Tools like Qlik Insight Bot, an AI-driven conversational analytics tool, allow users to ask natural language questions about their data and receive instant insights—without needing to build charts or write queries manually.
Other natural language processing (NLP) tools allow users to interact with data preparation tools using plain language queries or commands, making it easier to perform complex transformations without any coding skills. A request like “Show me all customer datasets updated last month” can be handed to a powerful LLM like GPT-4, which translates it into the query that retrieves the data the user asked for, demonstrating how generative AI helps data governance scale its user accessibility. Because AI learns from a user’s behavior, it can quickly offer personalized recommendations. It will also automate repetitive tasks, allowing users to focus their valuable time on analysis and decision-making.
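Under the hood, a request like the one above has to end up as a structured catalog query. The sketch below fakes that last step with a hypothetical in-memory catalog and a hand-written filter standing in for what an LLM would generate; every dataset name and field in it is invented for illustration.

```python
from datetime import date, timedelta

# Hypothetical in-memory catalog; a real system would query a metadata store.
CATALOG = [
    {"name": "customers_2024", "domain": "customer", "updated": date(2025, 5, 20)},
    {"name": "suppliers",      "domain": "supplier", "updated": date(2024, 1, 3)},
]

def find_datasets(domain, updated_within_days, today=date(2025, 6, 1)):
    """The structured query an LLM-backed assistant might produce for
    'Show me all customer datasets updated last month'."""
    cutoff = today - timedelta(days=updated_within_days)
    return [d["name"] for d in CATALOG
            if d["domain"] == domain and d["updated"] >= cutoff]

print(find_datasets("customer", 30))  # ['customers_2024']
```

The accessibility win is that the business user only ever writes the plain-language sentence; the structured call is generated and executed for them.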
AI can also analyze the data’s structure and content to suggest appropriate transformations, data enrichments, and formatting changes, speeding up the data preparation process. Finally, when it comes time to document projects, AI auto-documentation will generate data dictionaries and usage guides from usage patterns.
Accuracy, Efficiency, Scalability and Adaptability
In his paper, AI-Powered Data Cleansing: Innovative Approaches for Ensuring Database Integrity and Accuracy, Vijay Panwar compared the performance of AI models with standard data processes, claiming the former were significantly superior to traditional methods in several key areas, including accuracy, efficiency, scalability, and adaptability. For example, “supervised learning models, when applied to duplicate detection tasks, showed a 20% higher accuracy rate compared to rule-based approaches. This improvement is attributed to the model’s ability to learn complex patterns and anomalies beyond predefined rules,” contends Panwar.
The data cleansing process was considerably enhanced as well, with unsupervised and semi-supervised models reducing the time to cleanse large datasets by approximately 50%, says Panwar. “This efficiency gain is due to the algorithms’ ability to process and analyze data at scale. This task is labor-intensive and time-consuming with manual or semi-automated methods,” adds Panwar. “AI models, especially those employing unsupervised learning, exhibited remarkable scalability and adaptability to different types of data and inconsistencies. They were able to maintain high accuracy levels even as the volume of data increased, a critical advantage for organizations dealing with growing data repositories,” says Panwar. “AI-powered data cleansing represents a significant leap forward in ensuring database integrity and accuracy,” he concludes.
Scalable, AI-Powered Data Cleansing
Traditional data cleansing methods often miss the mark in addressing the complexity and scale of today’s data environments. AI-driven data cleansing adapts dynamically to evolving data patterns, enabling continuous improvement and scalability. The system autonomously detects errors such as duplicates, missing values, inconsistencies, and anomalies, then applies corrections based on learned rules and patterns. This reduces reliance on manual data cleaning and static rule sets that can become outdated.
NLP interprets and standardizes unstructured text data such as addresses, product descriptions, and customer feedback, even across multiple languages. Clustering and fuzzy matching algorithms help detect and merge duplicate records by recognizing data patterns despite typos, abbreviations, or formatting differences. These systems can achieve over 95% accuracy in duplicate detection.
The flipside of “Junk in, junk out” is “Better data in, better results out.” Not quite as catchy, but just as important. Predictive models utilizing AI can estimate missing values and spot data inconsistencies based on learned data patterns. This lets the system proactively correct itself. AI can automatically classify data by sensitivity (PII, PCI) or domain (sales, logistics), reducing the need for manual labeling.
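One simple form of pattern-based estimation is group-mean imputation: fill a missing value with the average observed for similar records. The order data below is invented for the sketch; real predictive imputation would use trained models rather than plain means, but the principle of estimating from learned patterns is the same.

```python
from statistics import mean
from collections import defaultdict

def impute_by_group(rows, group_key, value_key):
    """Fill missing values with the mean observed for that row's group,
    i.e., estimate them from patterns in the existing data."""
    groups = defaultdict(list)
    for r in rows:
        if r[value_key] is not None:
            groups[r[group_key]].append(r[value_key])
    for r in rows:
        if r[value_key] is None:
            r[value_key] = mean(groups[r[group_key]])
    return rows

orders = [
    {"region": "west", "amount": 100},
    {"region": "west", "amount": 120},
    {"region": "west", "amount": None},   # missing: imputed from west orders
    {"region": "east", "amount": 80},
]
impute_by_group(orders, "region", "amount")
print(orders[2]["amount"])  # 110
```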
AI Techniques: NLP, clustering, fuzzy matching, anomaly detection, predictive modeling
Adaptive Learning: Self-improving algorithms that refine cleansing rules based on ongoing data processing
Real-Time & Batch Processing: Instant data validation and large-scale historical data cleansing
Automated Error Detection: Identifies duplicates, missing values, inconsistencies, and anomalies
Scalability: Handles massive datasets efficiently using scalable computing frameworks like Apache Spark
Integration & Workflow: Seamless API integration with existing systems and unified cleansing workflows
Proactive Issue Resolution
As any AI software salesman will tell you, “An ounce of AI-powered prevention is worth a terabyte of cure.” Although many a software salesman’s promise wilts in the harsh light of implementation, this one is probably true. Preventative care for your data consists of regular checkups (continuous monitoring), vaccines (proactive cleansing), and early diagnosis (predictive analytics). AI does all of this preventive data governance work. It scans data 24/7 for anomalies, flagging issues before they escalate. It “immunizes” datasets by auto-correcting errors while algorithms forecast data quality risks, just like blood tests predict high cholesterol or a hundred other diseases.
AI systems like AIOps work proactively to fix systems before they become problematic. Coined by Gartner in 2017, ‘AIOps’ is a portmanteau of ‘artificial intelligence’ (AI) and ‘operations’ (Ops), referring to the way an IT environment’s data and information are managed by an IT team, specifically using AI processes like machine learning and deep learning.
“AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination,” says Gartner. Following in the footsteps of performance monitoring solutions, like Business Process Management (BPM) and IT Operations Analytics (ITOA), AIOps adds AI and machine learning to the process, allowing it to consume vastly more data types than previous process management technologies. An AIOps solution measures and evaluates everything running through an organization’s IT system and attempts to improve it to such a point that the system almost self-heals.
Automating Tech Support
In his article, How AIOps is already transforming IT, Atul Soneja explains how AIOps can help with a company’s technical support: “The AIOps solution automatically opens the ticket and enriches it with log information, events, and metrics before directing it to the right person. Now, all the information is already there, and IT knows what to do with it. All of this is handled automatically behind the scenes, so teams never have to close a ticket manually again.”
Today, many organizations have 20+ IT tools monitoring their systems, a complexity that reflects the evolution of the corporate information factory. AIOps overlays these systems, consuming events and data from all of the monitoring tools, correlating them, and deriving global insights about the company’s entire IT system. Every server has its own unique operational fingerprint and its own distinctive set of processes, structures, schemas, governance, and classifications. AIOps tools create a “trust index” on the company’s data by profiling and cataloging data sources based on standard information quality characteristics, including accuracy, completeness, consistency, timeliness, uniqueness, and validity. Any piece of data falling outside a set standard criterion triggers an alert, informing operations or even IT security personnel about an issue.
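A crude version of such a “trust index” can be computed by scoring records against completeness and validity checks, two of the quality characteristics listed above. The fields, validators, and sample records below are invented for illustration; real profiling tools run many more checks per dimension.

```python
def trust_index(records, required_fields, validators):
    """Score a data source 0-1 on completeness (value present) and
    validity (value passes its field's validator, if one exists)."""
    checks = passed = 0
    for rec in records:
        for field in required_fields:
            checks += 1
            value = rec.get(field)
            if value is not None and validators.get(field, lambda v: True)(value):
                passed += 1
    return passed / checks if checks else 0.0

records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "not-an-email"},   # fails validity
    {"id": 3, "email": None},             # fails completeness
]
score = trust_index(records, ["id", "email"], {"email": lambda v: "@" in v})
print(round(score, 2))  # 0.67
```

A source whose score drops below a set standard criterion would trigger the alert described above.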
He Who Controls the Data Controls the Kingdom
One certainty about analytics is that better data gives you better results. With bad data, however, businesses run the risk of creating analytical processes that produce worthless results. Today, AI isn’t just assisting data stewardship—it’s redefining it, turning reactive oversight into proactive, intelligent data management. AI can supercharge data stewardship by automating tedious tasks, enhancing governance, and improving data quality, while letting human stewards focus on strategy.
The role of a data steward will evolve into more of a strategic business consultant. Data stewards will leverage AI-generated insights to provide contextualized recommendations that improve business processes, requiring uniquely human skills like intuition, problem-solving, and relationship-building that AI cannot replicate. The future of data stewardship and AI is one of integration and evolution. AI will automate many operational tasks, enabling data stewards to focus on higher-value consulting roles that drive business improvements. Modern AI governance solutions will become more real-time, scalable, privacy-conscious, and ethically grounded. These solutions will also transform how organizations manage data as a strategic asset while navigating regulatory complexities.
William the Conqueror wasn’t the first king to count his subjects, but he was among the first to realize that data isn’t just information, it’s authority. By commissioning the Domesday Book, he proved a timeless rule — whoever controls the ledger controls the kingdom. Obviously, not every business needs a Domesday Book. However, a book written almost a thousand years ago by a conquering king teaches us a lot about the necessity of data governance, the importance of accountability, and the value of turning raw information into strategic power. Data isn’t just for record-keeping; it can be a powerful competitive weapon. If a Norman king with parchment paper and ink quills can leverage data to rule an empire, what’s your excuse?