Textual data comes in a variety of formats, which information technology (IT) specialists and end-users must be aware of and address when developing and using textual solutions.
Ask someone about processing text and the usual reaction is, “Well, you just read text.” Most people do not give reading text a second thought. They think that text is just text and that is all there is to it.
But that is like saying that a person is just a person. When you get down to it, there are many different types of people. There are old people and there are young people. There are men and there are women. There are short people and there are tall people. There are slender people and there are husky people. There are educated people and there are less educated people. There are people from the city and there are people from the countryside. So it is not enough simply to say that people are people. To make sense of what you are saying, you need to specify what kind of person you are talking about. In the same way, text must be put into context with metadata.
The same is true of text. There are as many varieties of text as there are varieties of people. And when you get ready to do text processing, you need to be prepared to handle all kinds of text.
So what exactly are the different kinds of text? As described in the book “Turning Text into Gold,” some of the different kinds of text include:
- Formal text, where words and thoughts are expressed according to a well-defined grammar. Formal text is what your teacher gives you an A for in school.
- Informal text, where words include slang and street expressions. In this case, the text does not follow any particular form. Informal text may be what one gang member says to another on the street.
- Unpredictable text, where there is no prescribed order from one word to the next. A classic example of unpredictable text is email. A person can write anything they want in an email, in any order they desire.
- Predictable text, where there is a prescribed order to the text. A classic example of predictable text is legal “boilerplate,” where the same contract (or a very similar contract) appears repeatedly. Another example is laboratory results, where a hospital describes the results of tests performed on a patient.
And this is just the tip of the iceberg. There are LOTS of other variations of text. It is important to understand the various forms of textual data.
So when you get ready to read and process text, you need to be prepared to handle ALL of these variations, not just some of them. Text should become part of your enterprise data management program.
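To make the difference between predictable and unpredictable text concrete, here is a minimal Python sketch. It assumes a hypothetical laboratory-result layout of "test name: number unit"; the pattern, the function name `parse_lab_result`, and the sample lines are illustrative assumptions, not taken from any real hospital system. Text that follows the prescribed order can be pulled apart into fields, while free-form text cannot.

```python
import re

# Hypothetical layout for one lab-result line: "<test name>: <number> <unit>".
# This pattern is an assumption made for illustration only.
LAB_RESULT = re.compile(
    r"^(?P<test>[A-Za-z ]+):\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\S+)$"
)


def parse_lab_result(line):
    """Return a dict of fields if the line follows the assumed layout, else None."""
    match = LAB_RESULT.match(line.strip())
    if match is None:
        # Unpredictable text: no prescribed order, so the fixed pattern does not apply.
        return None
    return {
        "test": match.group("test").strip(),
        "value": float(match.group("value")),
        "unit": match.group("unit"),
    }


predictable = "Hemoglobin: 13.5 g/dL"
unpredictable = "hey, got the results back, looks fine i think"

print(parse_lab_result(predictable))    # fields extracted from the fixed layout
print(parse_lab_result(unpredictable))  # None: needs a different kind of processing
```

A fixed pattern like this only works because the text is predictable. Email, slang, and other unpredictable text need a different approach, which is why it helps to know what kind of text you have before you start processing it.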
It is debatable whether there are more variations in people or more variations in text. That is an imponderable for which there is probably no answer.
Text is not text is not text.