A Raid on the Inarticulate: Text Mining Hits an Inflection Point
Data Strategy Adviser
In early 2007, FirstData scooped up Intelligent Results, an innovator in the use of text mining for predictive analytics. In May 2007, Business Objects acquired Inxight, described as a text mining and analysis company, and Reuters then acquired ClearForest, solidifying the trend. Meanwhile, Bill Inmon is helping to catalyze the trend by launching a startup with nine patent filings and an innovative approach to textual extract, transform, and load (ETL). The approach leverages the standard relational database and token parsing, avoiding the pitfalls and complexities of natural language processing while delivering results similar in scope and value. This does not make Business Objects a data aggregator or newswire service any more than it makes Reuters or FirstData a business intelligence (BI) company. However, it does show that the creation of information out of unstructured data represents the zone of proximal development in redefining the limits of what is possible using IT.
In both BI and content aggregation, it is what you don't know that can hurt you. The business cannot even express the terms of the relevant issue and remains inarticulate - until trouble strikes. Really big business problems have occurred when enterprises did not have the right answers because they were not even asking the right questions. It is this second-order ignorance - I do not know what I do not know - that is most dramatically demonstrated in front-page business meltdowns. Catastrophic failure in automotive tire tread separation, the backdating of options in the context of executive compensation and distress in the sub-prime mortgage lending market are all similar in that the answers as well as the questions were outside the focus of business awareness, analysis and problem solving. The resulting surprises have been both costly and painful.
The point is not to say that text mining is a silver bullet that will guard against any random business risk. Rather, the point is that text mining is a powerful method of managing business risks as well as opening up new opportunities for profiting from discovery of the underlying mechanisms and causes that determine buying behavior, rule following and leading indicators of trends. Unstructured data is a vast realm where "I do not know what I do not know." Analysis of this data using text mining methods can provide early and leading indications of trends, actionable predictions about customer behavior and confirmation for structured variables in the environment such as cost, product returns or complaints. (This article will not cover standards in the emerging market for text mining; in the interest of completeness, however, Unstructured Information Management Architecture (UIMA) is one that has legs and deserves mention. See reference 1.)
However, before you hand off this article to your executive administrator with the instructions "Get me one of those," it would make sense to take a closer look under the hood at the challenges, trade-offs and promises of text mining.
Generally, "text mining" works with a data store of written statements. These may be case notes from a call center. The text may be helpdesk problem ticket narratives or email correspondence with customers and clients, either external or internal. For example, Intelligent Results developed a solution for a collection agent writing up what happened when the collector called the person in arrears to invite them to pay something on the overdue account. You get abbreviations such as "HG" - hung up - or more verbose explanations such as "lost job due to ill health, but now back in the market - recommended payment plan A." If the debtor is in jail, then he is not a good candidate for a payment plan for obvious reasons - no prospects of income. Further calls will be a waste of time and effort, and the debt is a bad one to be written off. On the other hand, if the person is a college graduate, but just down on his luck due to loss of employment, illness or other life misfortune, then the prospects of collecting in the future are good. Action is required.
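The collection example above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Intelligent Results built its product: the abbreviation table (apart from "HG"), the keyword rules and the action names are all hypothetical.

```python
# Hypothetical abbreviation table; only "HG" comes from the article.
ABBREVIATIONS = {"HG": "hung up"}

# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("jail", "write off"),
    ("lost job", "offer payment plan"),
    ("ill health", "offer payment plan"),
    ("hung up", "call again"),
]

def expand(note):
    """Replace known abbreviations with their long form, word by word."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in note.split())

def triage(note):
    """Return a suggested action for a collector's note, or flag it for review."""
    text = expand(note).lower()
    for keyword, action in RULES:
        if keyword in text:
            return action
    return "review manually"

suggested = triage("lost job due to ill health, but now back in the market")
```

Simple substring rules like these miss misspellings and negation ("not in jail"), which is one reason real products layer dictionaries and statistics on top.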
While it is an oversimplification, many text mining technologies go through the following series of steps to bring order and significance to what is otherwise a jumble of unstructured data. It remains true that if you can't structure it, you can't manage it. This process provides structure to the data, though not necessarily the kind of structure characteristic of a relational database. The first step is usually to determine the language of the text data. This makes a difference, for example, since in Spanish the adjectives sometimes follow the noun whereas in English they usually precede it. The next step consists of identifying and eliminating "noise words" such as "the" and "a" and a multiplicity of pronouns and adverbs, as well as proper names and places that, while significant, do not contribute to the generation of meaning at the appropriate level for the problem. This elementary data scrubbing is followed by tokenization, which breaks up the text into identifiable entities and actions by means of automated stringing and unstringing based on common delimiters between words. At this point, the tokens may be further analyzed by mapping them to a dictionary of key terms relating to a particular problem domain ("semantic ontology") such as debt collection, complaint hotlines in a given industry such as automotive or retail, intellectual property ("patent") descriptions, biochemical reactions or law enforcement issues. The resulting semistructured information is subjected to further analysis by means of statistical probability of occurrence of tokens, classification and clustering algorithms. These functions associate related terms based on frequency and nearness of occurrence in the text. Tagging or indexing of the tokens or associated clusters is useful for further analysis, including search and discovery.
Visualization of the resulting clusters is a common enhancement and usability differentiator. The clusters often map to specific concepts such as customer, product and promotion in retail, or disease, symptom and treatment in health care.
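The steps described above can be sketched as a toy pipeline. This is a minimal illustration under stated assumptions, not any vendor's implementation: the noise-word list and the domain dictionary are hypothetical and tiny, language detection is skipped (English is assumed), and noise words are filtered right after splitting rather than before it.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical noise words and domain dictionary; real ones are far larger.
NOISE_WORDS = {"the", "a", "an", "to", "on", "of", "and", "but", "now", "in", "due"}
DOMAIN_TERMS = {
    "job": "EMPLOYMENT",
    "employment": "EMPLOYMENT",
    "health": "HEALTH",
    "illness": "HEALTH",
    "payment": "PAYMENT",
    "plan": "PAYMENT",
}

def tokenize(text):
    """Split on non-letter delimiters, lowercase, and drop noise words."""
    words = re.split(r"[^a-z]+", text.lower())
    return [w for w in words if w and w not in NOISE_WORDS]

def annotate(tokens):
    """Map tokens to domain concepts; tokens outside the dictionary are skipped."""
    return [DOMAIN_TERMS[t] for t in tokens if t in DOMAIN_TERMS]

def cooccurrence(documents):
    """Count how often two concepts appear together in the same document."""
    pairs = Counter()
    for doc in documents:
        concepts = sorted(set(annotate(tokenize(doc))))
        pairs.update(combinations(concepts, 2))
    return pairs

notes = [
    "Lost job due to ill health, but now back in the market",
    "Illness and loss of employment; recommended payment plan A",
]
token_counts = Counter(t for n in notes for t in tokenize(n))
concept_pairs = cooccurrence(notes)
```

Here `concept_pairs` plays the role of the statistical step: concepts that co-occur often, such as employment and health in these notes, are candidates for the same cluster.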
One active debate among text mining researchers and vendors is how far this approach ought to go in the direction of natural language processing (NLP) and understanding. Rules or a combination of dictionaries and rules can analyze quantifiers such as "any," "some," "all" and adverbs such as "now," "then," "not yet" that provide meaning to identified entities and events. Another approach avoids the complexities of natural language processing, which after all is a computing grand challenge, and sticks with simple data scrubbing, tokenization and clustering. This reduces overall complexity, computing cycles and semantic overhead, but at the cost of requiring more expert analysis at the back end to make sense of what is generated. The interesting thing about the approach that avoids NLP is that it works, at least at an entry level, regardless of the language in which the tokens are encoded. In other words, since we are not concerned with the meaning ("semantics"), the language might as well be German where the word for customer is "Kunde." The results of clustering will be the same as in English, though obviously the meaning will have to be added in at the back end by an informed user speaking the language.
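The language-independence point can be illustrated with a small sketch. It groups texts purely by the tokens they share, with no dictionary and no grammar. The similarity measure (Jaccard overlap), the greedy grouping, the 0.3 threshold and the sample sentences are all assumptions chosen for illustration, not a description of any product.

```python
def jaccard(a, b):
    """Share of distinct tokens two texts have in common (0.0 to 1.0)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(texts, threshold=0.3):
    """Greedy single pass: join the first group whose seed text is similar enough."""
    clusters = []
    for text in texts:
        for group in clusters:
            if jaccard(text, group[0]) >= threshold:
                group.append(text)
                break
        else:
            # No existing group was close enough, so start a new one.
            clusters.append([text])
    return clusters

english = [
    "customer sent product back",
    "customer sent defective product back",
    "invoice overdue",
]
german = [
    "kunde schickte produkt zurück",
    "kunde schickte defektes produkt zurück",
    "rechnung überfällig",
]
english_groups = cluster(english)
german_groups = cluster(german)
```

Both lists fall into the same shape, two related texts plus one outlier, even though the code knows nothing about either language. Deciding what each group means is left to a reader who does.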
The state of the art of text mining is characterized by the innovations at a small French company called Temis. Temis offers what it describes as "cartridges" in various concept domains such as biology, chemistry, life sciences, human resources and competitive intelligence. These provide a basic vocabulary and concepts in the relevant domains to establish organization of the tokenization process as the data is extracted. These "plug-ins" perform semantic functions and are more commonly known as "annotators" in the text mining community. Rather than have to start with a complete blank slate as to what the text is about - the universe of possible conversations is enormous - the analysis gets a "head start" on what is the relevance of the raw data. It remains a bold statement of the obvious that text mining provides new methods to conduct a raid on the inarticulate and, using information technology, redefines the boundaries of what is possible. (See reference 2.)
References:
1. For a high-level overview, see http://researchweb.watson.ibm.com/journal/50th/methodologies/ferrucci.html
2. I acknowledge Richard Hale of IBM Worldwide Business Intelligence for useful comments on an earlier version of this article. Thanks Richard!
Lou Agosta is an independent industry analyst in data warehousing. A former industry analyst at Giga Information Group, Agosta has published extensively on industry trends in data warehousing, data mining and data quality. He can be reached at LAgosta@acm.org.