Technicians generally consider text, the basis of language, to be unstructured. For others, however, text and language have a definite structure.

It is the simplest of questions. Ask any technician if text – language – is structured or unstructured. If your technician is knowledgeable, the technician will assure you that text is unstructured. In fact the technician will think that you are sort of silly for even asking such a trivial question. Everybody knows that text is unstructured. Don’t they?

Now ask an English teacher the same question: is text – language – structured or unstructured? The English teacher, too, will think this a silly question. The English teacher will assure you that text is structured. Actually, text is highly structured. Many organizations consider text to be part of data management, which generally deals with structured data.

As proof of the sophisticated structure of text, the English teacher refers you to the classic work, Strunk’s The Elements of Style. The English teacher has PROOF that text – language – is structured.

The technician and the English teacher are saying OPPOSITE things. And they are both bright people. So who is wrong and who is right?

In order to unravel this twisted knot we need to step back and examine the question again. Maybe the question is the problem, not the answer to the question.

So let’s look at the question – what does the word “unstructured” really mean?

For the English teacher, the question of structure comes down to this: is there a body of thought – a body of rules – that governs what we mean by text? And there is NO QUESTION that there is such a body. We have rules of spelling. We have rules of grammar. We have rules for definitions of words. We have rules of punctuation. We have rules of thought development. It is simply true that text – language – is full of rules. And it is the application of those rules that allows text to have structure.

So the English teacher MUST be right. Text – language – is highly structured.

Now let’s talk to the technician. When we start talking to the technician we find that putting text into a standard relational database management system is a very difficult thing to do. Trying to put text into a standard database management system is like trying to put a square peg into a round hole. A VERY square peg into a VERY round hole.

In the best of cases it is an awkward fit. In the worst of cases it is no fit at all.

There are a thousand reasons why a standard database management system has a hard time holding and managing text. Some of those reasons are –

Text is erose, that is, irregular and uneven. Some sentences are short. Some sentences are long. Some sentences are declarative. Some sentences are questions. There is no fixed pattern to the way that sentences are constructed, at least not one that a database can rely on.

The same text means different things in different places. In one place the word “court” refers to a place where people play basketball. In another place “court” refers to a place where a trial is held. How confusing is that? The same word has entirely different meanings, as we learn when we study metadata management.

Text is in different languages. In one place, text is in English. In another place, text is in Spanish. In yet another place text is in Mandarin. In Mandarin, the way text is represented is extremely different from the way that text is represented in English or in Spanish.

Some parts of text are extraneous. The words “a”, “and”, “the”, “for”, “that” etc. are best removed before the text is analyzed. Those words contribute little to the meaning of what is being said.

And the list of differences goes on.
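The point about extraneous words can be illustrated with a minimal sketch. It assumes a tiny, hypothetical stop-word list taken from the examples above; real lists are much longer and differ by language. The function name `remove_stop_words` is also just an illustration, not part of any particular product.

```python
# Hypothetical stop-word list built from the article's own examples.
# Real stop-word lists are longer and language-specific.
STOP_WORDS = {"a", "and", "the", "for", "that"}


def remove_stop_words(sentence: str) -> list[str]:
    """Lowercase each word, strip surrounding punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'").lower() for w in sentence.split())
    return [w for w in words if w and w not in STOP_WORDS]


print(remove_stop_words("The judge left the court for a recess."))
# ['judge', 'left', 'court', 'recess']
```

Notice that the word "court" survives the filtering, but nothing in the result says which kind of court is meant. Removing extraneous words is only one small step.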

The technician points out that you can stuff text into a database management system, but it won’t mean much if you do. Even if you can stuff text into a database management system, you can’t really do anything with it when you finish stuffing it in.
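The technician's complaint can be sketched with a small example. It assumes Python's built-in `sqlite3` module and a hypothetical `notes` table with a single `body` column; the table, the column, and the sample sentences are invented for illustration only.

```python
import sqlite3

# An in-memory database with one hypothetical table holding raw text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO notes (body) VALUES (?)",
    [
        ("The players ran onto the court before the game.",),
        ("The court adjourned until Monday.",),
    ],
)


def find_notes(term: str) -> list[str]:
    """Return every note whose body contains the term as a substring."""
    rows = conn.execute(
        "SELECT body FROM notes WHERE body LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


# Both rows come back: the database matches characters, not meaning.
for body in find_notes("court"):
    print(body)
```

The text went in without complaint, but the only question the database can answer is whether a string of characters appears. It cannot tell the basketball court from the court of law.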

So the technician has a point. Text is highly unstructured.

Now we get to the crux of the matter. When you ask the English teacher if text is unstructured, the English teacher has one idea of what is meant by the word “unstructured”. When you ask the technician if text is unstructured, the technician has an entirely different idea of what is meant by the word “unstructured”.

In reality BOTH the English teacher and the technician are right. Both of these people are smart people. It is just that they have an entirely different interpretation of what is meant by the word “unstructured”.


Bill Inmon

Bill Inmon is best-known as the “Father of Data Warehousing” and textual data integration. He has become the most prolific and well-known author worldwide in the data warehousing and business intelligence arena, and has opened the field of textual data integration. In addition to authoring more than 50 books and 650 articles, Bill lectures on data warehousing, textual data integration and related topics. Bill consults with a large number of Fortune 1000 clients, and supports IT executives on data warehousing, business intelligence, and database management issues around the world.

© Since 1997 to the present – Enterprise Warehousing Solutions, Inc. (EWSolutions). All Rights Reserved
