Methods for Managing Unstructured Data

As the number of types and amount of unstructured data increase, organizations must develop and implement a consistent plan to manage content and records across the enterprise.

According to a recent study by the University of California at Berkeley, about 5 exabytes (5 billion gigabytes) of unique analog and digital information are produced worldwide annually.  That is a data explosion equivalent to half a million new libraries the size of the print collection of the Library of Congress, and this number will continue to expand exponentially.  IBM estimates that about 85 percent of all data is unstructured and about 50 percent of the unstructured data is duplicated.  Therefore, any discussion about a data strategy is incomplete without formulating a tactic for maintaining unstructured data.

Reason for Examining Unstructured Data

The primary reason for the focus on unstructured data is that its huge volume, much of it duplicated, incurs high costs for storage and backup operations, not to mention the hidden cost of lost productivity.  In addition, because there is no central strategy for managing and retaining unstructured data, it is duplicated many times.  This problem is one of the most common issues in enterprise data management.

For example, an engineering organization (an ISO-9000 shop) designed the same part 19 times, and re-engineered other parts between 2 and 17 times, because it did not realize it had already manufactured those parts.  The organization had manufactured over 5,000 parts for various clients, and although the specifications and designs were maintained in digital format, there was no centralized strategy for maintaining and searching designed parts.  Therefore, when an order came from a client to create a similar part, the engineering department did a cursory search only through that client’s previous orders.  When they did not find any matching specifications, they re-engineered the same part – even though it was in the inventory under another name for another client.
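A repository-wide check for this kind of duplication can be sketched in a few lines of Python.  This is a minimal illustration, not the method the organization used: the file names and specification contents are hypothetical, and the sketch assumes duplicates are byte-for-byte identical.  Designs that are merely similar would need normalized attributes or a similarity search instead.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that depends only on the bytes, not the name."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(objects: dict[str, bytes]) -> list[list[str]]:
    """Group object names whose contents are byte-for-byte identical."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, data in objects.items():
        groups[content_fingerprint(data)].append(name)
    # Keep only fingerprints shared by more than one name.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical design files: the same specification filed under two clients.
designs = {
    "client_a/bracket_17.spec": b"material=steel;length_mm=120;holes=4",
    "client_b/mount_plate.spec": b"material=steel;length_mm=120;holes=4",
    "client_b/hinge.spec": b"material=brass;length_mm=40;holes=2",
}
duplicate_groups = find_duplicates(designs)
```

Because the fingerprint ignores names and folders, the search covers every client’s orders at once rather than only the requesting client’s.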

The second major reason is that organizations have learned to listen to and enhance the chatter on blogs, forums, and social networks to improve their brand recognition and customer service.  The use of techniques such as viral and conversational marketing, which motivate brand evangelists and tame unhappy customers, has yielded massive ROI to marketing departments.  There is hardly an ounce of structured data in social media.  The ability to harness, track, and analyze this media is a significant advantage to an organization.

Another major reason for the push to develop a strategy for managing unstructured data is the Sarbanes-Oxley Act of 2002.  Two sections in the Act relate to reports and documents.  One is Title VIII, which makes it a felony to knowingly destroy any documents to “impede, obstruct, or influence” any existing or contemplated federal investigation.  The other section makes it a crime for any person to corruptly alter, destroy, mutilate, or conceal any document with the intent to impair the object’s integrity or availability for use in an official proceeding.  Therefore, all public corporations must ensure that relevant documents are maintained in electronic format for 5-7 years and must be readily available for scrutiny by various governmental entities.

In addition to Sarbanes-Oxley, other legislation in the US, EU, UK, Australia, and elsewhere has forced public organizations to maintain all electronic data (including documents, reports, publications, and even email) for up to 7 years.  In one case, a major financial institution estimated that it had 350 terabytes of electronic mail alone, in multiple languages, that needed to be stored, retained, and searched when regulators requested.  Such a request might ask for all emails produced by “Joe Smith” between January and March 20xx that contain the word “bribe.”  Imagine the time it would take to respond to such a request with no content management in place except basic backups of shared drives and email servers.  Even if the search involved one email server and one shared drive, it would take months to restore, search, and find every piece of email produced by a specific person in that period containing the word “bribe.”

CTOs, and more importantly, legal departments, cannot wait for a lengthy search.  Yet, in most organizations, unstructured data is created and maintained in content silos where content authors work in isolation.  Such separation causes redundancy, poor communication, lack of standards, and higher costs for creating the content.  In addition, searching and consuming such content is harder and more costly for both internal and external users because the content sitting in separate silos is not necessarily inventoried or accessible to consumers.

Unified Content Strategy

Ann Rockley, one of the leading consultants in the area of enterprise content management, describes a unified content strategy as “a repeatable method of identifying all content requirements upfront, creating consistently structured content for reuse, managing that content in a definitive source, and assembling content on demand to meet your customers’ needs.”  An organization’s data strategy should anticipate that new data types, content requirements, and data sources – both structured and unstructured – will be added in the future.  As such, the strategy should be standardized to remove the risk of content silos and yet be flexible enough to embrace various new types of data.

Storage and Administration

In the past, the world of structured data evolved from disparate files into central databases administered by database management systems (DBMS).  All DBMS have been designed to perform the same types of tasks regardless of format.  They have methods for storing, manipulating, searching, and retrieving records.  In addition, all DBMS have functions to perform common database operations, such as systematic backups and restores, data reorganizations, index management, performance management, and security against unauthorized access.

As unstructured data proliferates, the same evolution should happen in managing unstructured data objects.  The recognition of unstructured data as a vital corporate asset has forced leading IT departments to utilize enterprise content management systems (ECMS) for managing all non-structured data objects.  Such systems allow for storing, administering, searching, securing, and retrieving content from a centralized base in the same way that a DBMS allows for administering structured data.

ECMSs provide a repository for unstructured objects, allow for capturing metadata for each object for future searches, and create relationships among objects.  They also provide centralized check-in and check-out functionality and backup, restore, and disaster recovery capabilities; enable users to search across various objects; and secure objects from unauthorized access.  Such products often provide functions that fall outside the scope of structured DBMSs, such as versioning, retention, and archiving.

They also provide electronic workflow management capabilities to end users.  Electronic workflows simplify the routing of documents and other electronic objects.  A great example can be found at a mortgage company when an applicant submits an application online, which is reviewed by a mortgage processor and then electronically forwarded to the manager for approval.

Why not use a DBMS to manage unstructured data objects?  After all, most DBMS products allow for capturing binary large objects (BLOBs) or character large objects (CLOBs).  There are several reasons for not utilizing structured DBMSs.  First, these DBMSs are not designed to manage content.  They are structured data management tools designed to relate sets of tables and columns.  Unstructured data objects do not fit into tables and columns.  How would one store the many sections of a manual, including pictures, diagrams, indexes, footnotes, fonts, and formats, in tables and columns?

Secondly, DBMSs are not designed to manage versioning, retention, and archiving of objects.  Another reason is the size of the database and its management.  There are usually two ways to store large objects (LOBs) in a DBMS: store them in the database, or store them on the file system and keep only a link to each LOB in the DBMS.  With the first technique, the database becomes very large, and basic functions such as backup and restore take a huge amount of time.  With the second, the integrity of the database is compromised because the typical DBMS places no restriction on viewing, changing, or deleting the object file; an unauthorized person can therefore view or remove a secured file.

Archiving and Retention

Archiving is an important function in the unstructured data world.  The administrators of unstructured data should plan for a time when the active objects (or old versions) should be archived and/or removed from the repository.  Archiving methods for unstructured data differ slightly from those for structured data.  On the structured side, archiving is usually an afterthought.  Several years after deployment of the original database, a request arrives to archive data in a format that can be retrieved later.  With a typical DBMS, there is no out-of-the-box way to archive the basic records.  In an ECMS, an archive flag associated with each object identifies whether the object is to be archived.

Retention requirements have applied to non-electronic records for years.  Records retention varies from organization to organization, department to department, and even document (object) type to document type.  The same requirements have been expanded to include electronic documents.  To reduce the risk of deleting required records, the current approach removes the responsibility for retention of an object from all employees and centralizes it in a small group with only a few administrators.

Utilizing specialized records retention software, the administrators create various categories and subcategories that are several layers deep.  They can then set up retention rules for each layer.  The rules vary from a certain period after the creation of the object to a number of years after the termination of its author to dependency on another object.  The same records retention software is used to identify the new and changed files in the network drive (or email server) and then to group them into categories and subcategories (automatically or with the help of administrators).  This software utilizes internal security to prohibit everybody except authorized users from deleting the object.  Any modification to the object will take place in the form of a new version and is subject to all mentioned scrutiny.  All the major ECMSs have introduced their own version of records retention that allows for centrally managing the content and managing the retention rules.
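The layered rules described above can be sketched as follows.  This is an illustrative model only, not the behavior of any particular records retention product: the category names, retention periods, and the `earliest_deletion` function are all assumptions made for the example, and dependency on another object is left out for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class RetentionRule:
    years: int
    trigger: str  # "created" or "author_terminated"


# Hypothetical category tree; a subcategory rule overrides its parent's rule.
RULES = {
    "finance": RetentionRule(7, "created"),
    "finance/invoices": RetentionRule(5, "created"),
    "hr": RetentionRule(3, "author_terminated"),
}


def rule_for(category: str) -> Optional[RetentionRule]:
    """Walk up from the most specific category to the nearest one with a rule."""
    parts = category.split("/")
    while parts:
        rule = RULES.get("/".join(parts))
        if rule is not None:
            return rule
        parts.pop()
    return None


def add_years(start: date, years: int) -> date:
    """Add whole years, mapping Feb 29 to Feb 28 in non-leap years."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def earliest_deletion(
    category: str, created: date, author_terminated: Optional[date] = None
) -> Optional[date]:
    """Return the first date the object may be deleted, or None if not yet known."""
    rule = rule_for(category)
    if rule is None:
        return None  # no rule applies: never eligible for automatic deletion
    start = created if rule.trigger == "created" else author_terminated
    if start is None:
        return None  # the triggering event has not happened yet
    return add_years(start, rule.years)
```

Returning `None` when no rule matches errs on the side of keeping the object, which matches the goal of reducing the risk of deleting required records.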

Content Reusability

One of the most important aspects of managing unstructured data is the need and ability to reuse an existing content component in a new object.  For example, an organization needs to create a website and printed marketing material; however, it does not need to create content for each of the objects separately.  It can create a set of content for one, and then use paragraphs, pictures, graphs, and sentences from one in the other.  Utilizing this method, a content change to a specific object automatically changes the content on the other, ensuring the consistency of the message across the organization. Applying relevant, clear, and comprehensive metadata to each record or object can support content reusability.
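The idea of assembling deliverables from shared components can be sketched as below.  The component names, document names, and text are hypothetical, and a real ECMS would add versioning and access control on top of this; the sketch only shows why a single change reaches every deliverable.

```python
# Hypothetical shared components, each stored once in the repository.
components = {
    "tagline": "Parts engineered once, reused everywhere.",
    "intro": "We have manufactured over 5,000 parts for our clients.",
}

# Each deliverable is an ordered list of component IDs, not a copy of the text.
documents = {
    "website_home": ["tagline", "intro"],
    "print_brochure": ["tagline"],
}


def assemble(document_name: str) -> str:
    """Build a deliverable on demand from the current component text."""
    return "\n".join(components[cid] for cid in documents[document_name])


before = assemble("print_brochure")

# One edit to the shared component...
components["tagline"] = "Design it once. Find it again."

# ...shows up in every deliverable that references it.
website = assemble("website_home")
brochure = assemble("print_brochure")
```

Because the website and the brochure store references rather than copies, neither can drift out of date when the shared text changes.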

Search and Delivery

Structured data is stored to be searched, manipulated, and viewed.  Unstructured data objects are no different.  Therefore, a unified content strategy should include methodologies for easy access, search, and delivery of content from various ECMS tools in the organization.  In addition, all enterprise-level ECMS tools allow explicit or implicit capture of metadata to enhance future searches.  They also provide application programming interfaces (APIs) for developers to search the metadata of a specific object.  Most ECMS tools also provide caching methods for faster search and delivery of the content to end users.  Since users access data not only on their PCs but also on their home computers, laptops, PDAs, and cell phones, ensure that regardless of the ECMS tool used, its search and delivery methods are standardized.
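A metadata-driven search of the kind regulators request (a named author, a date range, and a keyword) might look like the sketch below.  It is an in-memory illustration with made-up records and a hypothetical `search` function, not the API of any ECMS product; a production system would query an index rather than scan every object.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ContentObject:
    object_id: str
    author: str    # captured as metadata when the object is checked in
    created: date  # captured as metadata when the object is checked in
    text: str      # extracted body text


def search(
    objects: list[ContentObject],
    author: str,
    start: date,
    end: date,
    keyword: str,
) -> list[str]:
    """Return IDs of objects by `author`, created within [start, end], containing `keyword`."""
    needle = keyword.lower()
    return [
        obj.object_id
        for obj in objects
        if obj.author == author
        and start <= obj.created <= end
        and needle in obj.text.lower()
    ]


# Hypothetical repository contents (the year is chosen only for illustration).
repository = [
    ContentObject("mail-001", "Joe Smith", date(2004, 2, 10), "No Bribe was offered."),
    ContentObject("mail-002", "Joe Smith", date(2004, 5, 2), "About that bribe..."),
    ContentObject("mail-003", "Ann Lee", date(2004, 2, 11), "Is this a bribe?"),
    ContentObject("mail-004", "Joe Smith", date(2004, 3, 31), "Quarterly numbers attached."),
]
hits = search(repository, "Joe Smith", date(2004, 1, 1), date(2004, 3, 31), "bribe")
```

With author and creation date captured as metadata at check-in, the request becomes a filter over the repository instead of a restore-and-scan of backups.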

Emerging Technologies

Although content has been around for a long time, some organizations have just started to consider it data.  Some emerging markets promise a bright future for products and services across the enterprise data management market.

Digital Asset Management Software

The relationship between digital asset management (DAM) software and ECMS is like the relationship of application software (e.g. ERP, CRM) to its DBMS.  ECMS is the heart and soul of a DAM system.  DAM contains applications that resolve a specific operational – and in the future, analytical – need of an organization by providing automatic batch entry, manual entry, workflow, and of course search, retrieval, and archiving of digital content from an ECMS.

A great example of DAM is software developed for the entertainment industry in which the detailed aspects of a movie, including screenplay, storyboards, still photographs, daily shoots, edited versions, soundtracks, songs, marketing posters, interviews, junket clips, and so on, are loaded into a central repository regardless of where the shoot takes place around the world.  In this case, the director, the producers, and the executives can track and view the daily work performed by all teams associated with the movie from the comfort of their homes or offices.  In addition, the software allows for a complete archive of all digital information related to a movie in one central place, assuring availability of all the data associated with a movie for future generations.

Another application of DAM is electronic medical records (EMR) software.  This application allows medical offices and hospitals to maintain information about patients in a digital format and communicate that information to doctors and other clinics via an electronic format.  In addition, the information can allow researchers to access detailed information about a patient and analyze patterns in treatments to suggest better methods of treatment for future patients with the same disease.

Digital Rights Management Software

Copyright laws in the US and most of the world allow the author or rights holder of a document to control or participate in the reproduction of their material, allowing the author to be paid royalties and licensing fees associated with its use.  Digital rights management (DRM) software uses technological methods to enforce these rights.  DRM software focuses on allowing only users who hold the right to use specific software, listen to a song, or watch a video to do so, and only for as long as they maintain that right.  DRM products also protect any digital asset that must be kept from unauthorized usage, including competition-sensitive corporate assets.  Imagine a corporate executive who downloads a confidential report onto his laptop and then loses the laptop.  The person who finds or steals the laptop can break into the system and view the contents of its files.  How does DRM protect against these problems?  In this scenario, the contract between the executive and the object (the confidential report) is maintained by, and always checked through, a corporate license server.  When the laptop is lost, the executive notifies a corporate security officer, who removes the rights on all sensitive files from the executive’s laptop.


Unstructured data plays a major role in an organization, and utilizing it correctly can improve processes, save lives, and directly enhance the bottom line.  As such, the need to capture unstructured data and make it available to other parts of the organization must be an important part of any organization’s data strategy.

This article is an excerpt from the book Data Strategy (Addison-Wesley, 2004) by Sid Adelman, Larissa T. Moss and Majid Abai.  It is based on Chapter 11 “Strategies for Managing Unstructured Data” written by Majid Abai and reprinted with permission.


Larissa Moss

Larissa Moss is founder and president of Method Focus Inc., a company specializing in improving the quality of business information systems. She has extensive IT experience with information asset management, data warehousing, Business Intelligence, CRM, data integration and cross-organizational development, as well as project management, data modeling, data quality assessment, data transformation and cleansing, and metadata management. Ms. Moss is the author of several books and numerous articles and white papers on a variety of subjects in her areas of expertise.

© Since 1997 to the present – Enterprise Warehousing Solutions, Inc. (EWSolutions). All Rights Reserved