Text Mining

Text mining, also referred to as text data mining, is the process of deriving high-quality information from text. High-quality information is typically derived by devising patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. ‘High quality’ in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).

The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts. Databases are designed for programs to process automatically; text is written for people to read. We do not have programs that can “read” text and will not have them for the foreseeable future. Many researchers think it will require a full simulation of how the mind works before we can write programs that read the way people do.

However, there is a field called computational linguistics (also known as natural language processing) that is making considerable progress on small subtasks in text analysis. For example, it is relatively easy to write a program to extract phrases from an article or book that, when shown to a human reader, seem to summarize its contents. (The most frequent words and phrases in this article, minus the really common words like “the”, are: text mining, information, programs, and example, which is not a bad five-word summary of its contents.)
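
The frequency-based summary described above can be sketched in a few lines of Python. This is only a minimal illustration: the stop-word list, the regex tokenizer, and the function name top_terms are assumptions made for the example, and it counts single words rather than phrases.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption); real systems use far longer lists.
STOP_WORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is",
    "it", "that", "for", "from", "are", "on", "we",
}

def top_terms(text, n=5):
    """Return the n most frequent non-stop-words in text, most frequent first."""
    # Lowercase the text and keep runs of letters as tokens.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

sample = (
    "Text mining derives information from text. "
    "Programs extract information from text, and programs count words."
)
print(top_terms(sample, 3))  # ['text', 'information', 'programs']
```

Dropping the very common words first matters: without that step the top of the list would be dominated by words like “the” and “from”, which say nothing about the content.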

In text mining, the goal is to discover heretofore unknown information, something that no one yet knows and so could not have yet written down.

People are using the output of such programs to try to link together information in interesting ways. For example, one can extract all the names of people and companies that occur in news text surrounding the topic of wireless technology to try to infer who the players are in that field. There are a number of companies that are investigating this kind of application.
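
One simple way to approximate this kind of linking is to count how often two names appear in the same article. The sketch below assumes a fixed, hypothetical list of names (KNOWN_NAMES) and invented sample articles; a real system would more likely find names with a named-entity recognizer than with plain substring matching.

```python
from collections import Counter
from itertools import combinations

# Hypothetical names for illustration only; not taken from any real data set.
KNOWN_NAMES = ["Acme Wireless", "Globex", "Jane Doe"]

def cooccurrence_counts(documents, names=KNOWN_NAMES):
    """Count how many documents mention each pair of names together."""
    pair_counts = Counter()
    for doc in documents:
        # Sort so each pair is always stored in the same order.
        present = sorted(name for name in names if name in doc)
        for pair in combinations(present, 2):
            pair_counts[pair] += 1
    return pair_counts

# Invented sample articles about wireless technology.
articles = [
    "Acme Wireless and Globex announced a joint wireless standard.",
    "Jane Doe joined Globex to lead its wireless division.",
    "Globex licensed new antenna patents from Acme Wireless.",
]
links = cooccurrence_counts(articles)
for pair, count in links.most_common():
    print(pair, count)
```

Each pair with a nonzero count is a candidate link between two players; the count is only a rough measure of how strongly the two are associated in the text.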

One problem with these approaches is that it is difficult to recognize which of the many relations they surface are truly interesting. You’ll immediately see who the big players are, but anyone who knows the business will already be aware of this. You’ll also see hundreds or thousands of weak links between various players, and you can’t tell which of them are the really interesting ones that deserve your attention.

