Textual Search and Analysis

Textual search and textual analysis are related, but they are not synonymous. Textual search and analysis require the use of accurate metadata.

For many years, text has been the great “black hole” of computer technology. There have been many reasons why text has proven to be such a great mystery. Text does not fit well with the high degree of structure required by a standard database management system (DBMS) or data management programs. Text has many nuances that are hidden “behind the curtain”. There are many languages and each language has its own rules and idiosyncrasies.

Challenge of Language and Text

To appreciate the challenge that language and text pose for the computer scientist, consider how language is learned. Children start to learn language at an early age, around 1 or 2, generally from their mothers, fathers, and siblings. As they learn, they absorb some basic rules: what an alphabet is, what a word is, how a word is spelled, what it means, how it is pronounced, what punctuation is, how a sentence is formed, what a question is, and so forth. Every language has its own set of rules.

These “background” rules of language are embedded inside the brain of the child. And eventually the child can speak a language.

But when a computer has to deal with text, it has no such background education in a language. A computer must be taught everything, and the rules of language are very, very complex.

So it is no wonder that text has been the “black hole” of computers.

Computers, Text and Language

There has been a progression in the attempts to make text more intelligible and more useful to the computer. Perhaps the greatest stride in making text friendly to the computer came in the form of search processing. In search processing a computer application searches text for the appearance of a word. In a more sophisticated version of search, portions of a word can be searched.
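As a rough illustration of the two kinds of search just described, the sketch below uses Python's standard `re` module. The function names `find_word` and `find_partial` and the sample sentence are made up for this example; they are not taken from any particular search product.

```python
import re

def find_word(text, word):
    """Return the start offset of every whole-word, case-insensitive match."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [m.start() for m in pattern.finditer(text)]

def find_partial(text, fragment):
    """Return (offset, word) for every word that contains the fragment."""
    pattern = re.compile(r"\w*" + re.escape(fragment) + r"\w*", re.IGNORECASE)
    return [(m.start(), m.group()) for m in pattern.finditer(text)]

# Hypothetical sample text, invented for illustration.
sample = "The bill arrived. Bill paid the billing office."

print(find_word(sample, "bill"))     # whole words only: skips "billing"
print(find_partial(sample, "bill"))  # portions of words: includes "billing"
```

Both functions report only where a match occurs. Neither says anything about what the matched word means.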

There is no question that being able to search a document for the existence and location of a word is useful. But searching a document has its limitations. A search can only find the location of something. It cannot know whether what it found is what the analyst really intended. Take, for instance, the word “bill” (which happens to be the name of the author!).

One can search a document for the word “bill.” But scanning the document for “bill” does not tell you whether it is the beak of a bird, an amount of money owed to someone, the cast of characters in a play, or a person. Or perhaps it is ALL of those things. A simple search indicates only the existence and location of a word. It does not identify whether the search results are what was desired. An accurate search requires the inclusion of context for the word, which is one aspect of metadata.
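One simple way to attach context to each hit is a keyword-in-context listing, which returns the neighboring words along with the position. The sketch below is a minimal version under assumed names: `keyword_in_context`, the `window` parameter, and the sample text are all hypothetical.

```python
import re

def keyword_in_context(text, word, window=3):
    """Return (token_index, words_before, words_after) for each hit."""
    tokens = re.findall(r"\w+", text.lower())
    target = word.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            hits.append((i, before, after))
    return hits

# Hypothetical sample text, invented for illustration.
sample = "The duck opened its bill. The bill for dinner was unpaid."

for index, before, after in keyword_in_context(sample, "bill"):
    print(index, before, after)
```

The search still finds both occurrences of “bill”, but the surrounding words (“duck” in one case, “dinner” and “unpaid” in the other) are what let a reader, or a later analytical step, tell the two apart.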

Textual Search vs. Textual Analysis

A step up from search is analysis. In analysis, one can tell whether the text that was found is the right text. Analysis technology resembles search but is considerably more sophisticated, because analytical processing involves BOTH text and context. To make sense of text, the analyst must understand, among other things, the context in which it appears. As a consequence, analytical processing is much more useful and powerful than simple search processing.
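To make “text plus context” concrete, here is a deliberately naive sketch that guesses which sense of “bill” a sentence uses by counting nearby cue words. The cue lists, the sense labels, and the function name `classify_sense` are all assumptions invented for this illustration. Real textual analysis is far more involved, and this is not a description of any vendor's method.

```python
import re

# Hypothetical cue words per sense of "bill"; invented for illustration only.
SENSE_CUES = {
    "bird beak": {"duck", "bird", "beak", "feathers"},
    "invoice": {"paid", "unpaid", "dinner", "owed", "amount"},
    "playbill": {"play", "theater", "cast", "stage"},
    "person": {"mr", "said", "author", "wrote"},
}

def classify_sense(sentence, word="bill"):
    """Guess the sense of `word` from the other words in the sentence.

    Returns None if the word is absent, "unknown" if no cue word matches.
    """
    tokens = set(re.findall(r"\w+", sentence.lower()))
    if word not in tokens:
        return None
    # Score each sense by how many of its cue words appear in the sentence.
    scores = {sense: len(tokens & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(classify_sense("The duck dipped its bill in the water."))
print(classify_sense("The bill for dinner was unpaid."))
print(classify_sense("A bill."))
```

Even this toy version needs information that is not in the search term itself: a curated set of context cues, which is a small instance of the metadata that analysis depends on.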

As a simple example of the difference between analytical processing and search processing, consider the following situation. Suppose someone wants to search a standard Bible, with an Old Testament and a New Testament. Certainly, the Bible has many words, approximately 720,000.

One can search the Bible for everyone with the name “james”. This search will turn up many people who happen to be named “james”. At the end of the search, the searcher will know all the instances of the name “james.” However, that is all that one will know about “james” in the Bible.

Now consider analytical processing. Suppose there is a desire to find everyone in the Bible who was “righteous”. Searching for the word “righteous” will not provide the desired information. The term “righteous” encompasses many qualities and traits. In order to understand who in the Bible was righteous, one must have a much deeper understanding of what is meant by “righteous.” Furthermore, some people were righteous at one moment in their lives and unrighteous at other moments. So answering the question of who, in the Bible, was a righteous person requires something a lot more complex than a simple search.

Figure 1 depicts the differences between search and analysis.

Figure 1: Differences between search and analysis

What makes things confusing (and vendors of technology use confusion on a daily basis!) is that analysis starts out as search. But very quickly the complexities and nuances of language elevate the activity of analysis to something much greater than a search, and that effort requires the use of accurate metadata.

Conclusion

It is the job of the vendor of technology to convince you that the vendor’s product is what you need. Who is going to know if the vendor of technology confuses things a little to make you think that the vendor’s product can do something that it really doesn’t do? What is a little confusion and legerdemain among old friends?

Bill Inmon

Bill Inmon is best-known as the “Father of Data Warehousing” and textual data integration. He has become the most prolific and well-known author worldwide in the data warehousing and business intelligence arena, and has opened the field of textual data integration. In addition to authoring more than 50 books and 650 articles, Bill lectures on data warehousing, textual data integration and related topics. Bill consults with a large number of Fortune 1000 clients, and supports IT executives on data warehousing, business intelligence, and database management issues around the world.

© Since 1997 to the present – Enterprise Warehousing Solutions, Inc. (EWSolutions). All Rights Reserved
