Text Mining: Finding Nuggets in Mountains of Textual Data
Jochen Dörre, Peter Gerstl, and Roland Seiffert
Overview
• Introduction to Mining Text
• How Text Mining Differs from Data Mining
• Mining Within a Document: Feature Extraction
• Mining in Collections of Documents: Clustering and Categorization
• Text Mining Applications
• Exam Questions/Answers
Introduction to Mining Text
Reasons for Text Mining
[Figure: bar chart comparing Collections of Text with Structured Data; axis: Percentage, 0 to 90]
Corporate Knowledge “Ore”
• Email
• Insurance claims
• News articles
• Web pages
• Patent portfolios
• Customer complaint letters
• Contracts
• Transcripts of phone calls with customers
• Technical documents
Challenges in Text Mining
• Information is in unstructured textual form.
• Not readily accessible to be used by computers.
• Dealing with huge collections of documents
Two Mining Phases
• Knowledge Discovery: Extraction of codified information (features)
• Information Distillation: Analysis of the feature distribution
How Text Mining Differs from Data Mining
Comparison of Procedures
Data Mining
• Identify data sets
• Select features
• Prepare data
• Analyze distribution
Text Mining
• Identify documents
• Extract features
• Select features by algorithm
• Prepare data
• Analyze distribution
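The text mining steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the IBM implementation: the word tokenizer, the document-frequency selection rule, and the function names (`extract_features`, `select_features`, `prepare_vectors`) are all hypothetical choices made for this example.

```python
import re
from collections import Counter

def extract_features(doc):
    """Extract features: here simply lowercase word tokens (an assumption)."""
    return re.findall(r"[a-z]+", doc.lower())

def select_features(docs_tokens, min_df=2):
    """Select features by algorithm: keep terms found in at least min_df documents."""
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return sorted(t for t, n in df.items() if n >= min_df)

def prepare_vectors(docs_tokens, vocab):
    """Prepare data: one term-count vector per document over the selected features."""
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        vectors.append([counts[t] for t in vocab])
    return vectors

# Identify documents (three toy documents).
docs = [
    "The bank raised the credit line.",
    "Credit facility extended by the bank.",
    "The restaurant opened in March.",
]
tokens = [extract_features(d) for d in docs]
vocab = select_features(tokens)
vectors = prepare_vectors(tokens, vocab)

# Analyze distribution: total count of each selected feature in the collection.
totals = {t: sum(v[i] for v in vectors) for i, t in enumerate(vocab)}
```

The point of the sketch is the extra step: features are extracted and then selected by a rule, rather than picked by a person as in data mining.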
IBM Intelligent Miner for Text
• SDK: Software Development Kit
• Contains necessary components for “real text mining”
• Also contains more traditional components:
  • IBM Text Search Engine
  • IBM Web Crawler
  • Drop-in Intranet search solutions
Mining Within a Document: Feature Extraction
Feature Extraction
• To recognize and classify significant vocabulary items in unrestricted natural language texts.
• Let’s see an example…
Example of Vocabulary found
• Certificate of deposit
• CMOs
• Commercial bank
• Commercial paper
• Commercial Union Assurance
• Commodity Futures Trading Commission
• Consul Restaurant
• Convertible bond
• Credit facility
• Credit line
• Debt security
• Debtor country
• Detroit Edison
• Digital Equipment
• Dollars of debt
• End-March
• Enserch
• Equity warrant
• Eurodollar
• …
Implementation of Feature Extraction relies on
• Linguistically motivated heuristics
• Pattern matching
• Limited amounts of lexical information, such as part-of-speech information.
• Not used: huge amounts of lexicalized information
• Not used: in-depth syntactic and semantic analyses of texts
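A rough sketch of what heuristic pattern matching could look like, assuming one simple rule: a run of two or more capitalized words is a candidate name, and a leading sentence-initial word such as "The" is dropped. The pattern `CAP_RUN`, the word list `LEADING_STOPWORDS`, and the function `candidate_names` are hypothetical and are not taken from Intelligent Miner for Text.

```python
import re

# A run of two or more capitalized words separated by single spaces.
CAP_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+")
# Capitalized only because they start a sentence (tiny illustrative list).
LEADING_STOPWORDS = {"The", "A", "An", "In", "On"}

def candidate_names(text):
    """Return multiword capitalized sequences, without any semantic analysis."""
    names = []
    for match in CAP_RUN.finditer(text):
        words = match.group().split()
        while words and words[0] in LEADING_STOPWORDS:
            words = words[1:]
        if len(words) >= 2:
            names.append(" ".join(words))
    return names

text = ("Detroit Edison issued commercial paper. "
        "The Commodity Futures Trading Commission met Digital Equipment.")
names = candidate_names(text)
```

No dictionary lookup or parsing is involved, which is what keeps this kind of approach fast and domain-independent, at the cost of missing lowercase terms such as "commercial paper".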
Goals of Feature Extraction
• Very fast processing to be able to deal with mass data
• Domain-independence for general applicability
Extracted information categories
• Names of persons, organizations and places
• Multiword terms
• Abbreviations
• Relations
• Other useful stuff
Canonical Forms
• Normalized forms of dates, numbers, …
• Allows applications to use information very easily
• Abstracts from different morphological variants of a single term
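As an illustration of canonical forms, the sketch below maps several written forms of a date to one ISO string and several written forms of a number to one value. The accepted formats in `DATE_FORMATS` and the helper names are assumptions for this example; month names are parsed according to the current locale, assumed here to be English.

```python
from datetime import datetime

# Surface formats this sketch accepts (an illustrative, incomplete list).
DATE_FORMATS = ["%B %d, %Y", "%d %B %Y", "%m/%d/%Y"]

def canonical_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def canonical_number(text):
    """Return the numeric value, ignoring thousands separators."""
    return float(text.replace(",", ""))

dates = [canonical_date(s) for s in ["March 5, 1999", "5 March 1999", "03/05/1999"]]
numbers = [canonical_number(s) for s in ["1,250", "1250.0"]]
```

Once every variant maps to the same canonical value, an application can compare or count mentions without knowing how each one was written.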
Canonical Names
Variants: President Bush, Mr. Bush, George Bush
Canonical Name: George Bush
The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document.
• Reduces ambiguity of variants
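One way to picture the selection of a canonical name, assuming the variants are already known to refer to the same person: strip titles and keep the variant with the most name words left. The `TITLES` set and `canonical_name` function are hypothetical; deciding which variants belong together is the harder part and is not shown here.

```python
# Titles carry no identifying name words (small illustrative list).
TITLES = {"President", "Mr.", "Mrs.", "Ms.", "Dr."}

def canonical_name(variants):
    """Pick the most explicit variant: the one with the most non-title words."""
    stripped = [[w for w in v.split() if w not in TITLES] for v in variants]
    return " ".join(max(stripped, key=len))

variants = ["President Bush", "Mr. Bush", "George Bush"]
name = canonical_name(variants)  # "George Bush"
```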
Disambiguating Proper Names: Nominator Program
Principles of Nominator Design
• Apply heuristics to strings, instead of interpreting semantics.
• The unit of context for extraction is a document.
• The unit of context for aggregation is a corpus.
• The heuristics represent English naming conventions.
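The two units of context can be illustrated with a toy example. This is not how Nominator itself works; it only assumes that a short mention is resolved against longer mentions in the same document, and that the resolved names are then counted across the corpus. Both function names are hypothetical.

```python
from collections import Counter

def resolve_in_document(mentions):
    """Document as context: map a mention to the longest mention ending with it."""
    resolved = []
    for m in mentions:
        longer = [c for c in mentions if c.endswith(" " + m)]
        resolved.append(max(longer, key=len) if longer else m)
    return resolved

def aggregate_over_corpus(docs_mentions):
    """Corpus as context: count resolved names over all documents."""
    counts = Counter()
    for mentions in docs_mentions:
        counts.update(resolve_in_document(mentions))
    return counts

corpus = [
    ["George Bush", "Bush"],    # mentions found in document 1
    ["Barbara Bush", "Bush"],   # mentions found in document 2
]
counts = aggregate_over_corpus(corpus)
```

The same string "Bush" resolves to a different name in each document, which is why extraction is scoped to the document while counting is scoped to the corpus.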
Mining in Collections of Documents: Clustering and Categorization
1. Clustering
• Partitions a given collection into groups of documents similar in contents, i.e., in their feature vectors.
• Two clustering engines
  • Hierarchical Clustering tool
  • Binary Relational Clustering tool
• Both tools help to identify the topic of a group by listing terms or words that are common in the documents in the group.
• Thus, provides overview of the contents of a collection of documents
[Figure: clustering groups documents similar in their feature vectors]
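A minimal sketch of clustering on feature vectors, assuming word-count vectors, cosine similarity, and a simple agglomerative merge rule with a fixed threshold. None of this is taken from the IBM Hierarchical Clustering or Binary Relational Clustering tools; the functions `cosine`, `cluster`, and `common_terms` and the threshold value are illustrative assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(vectors, threshold=0.3):
    """Repeatedly merge the two most similar clusters until none is similar enough."""
    clusters = [[i] for i in range(len(vectors))]
    centroids = [Counter(v) for v in vectors]  # summed vectors represent clusters
    while len(clusters) > 1:
        pairs = [(cosine(centroids[i], centroids[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        sim, i, j = max(pairs)
        if sim < threshold:
            break
        clusters[i] += clusters.pop(j)
        centroids[i] += centroids.pop(j)
    return clusters

def common_terms(cluster_ids, vectors):
    """Terms shared by every document in a cluster, used as a topic hint."""
    shared = set(vectors[cluster_ids[0]])
    for i in cluster_ids[1:]:
        shared &= set(vectors[i])
    return sorted(shared)

docs = [
    "credit line bank",
    "bank credit facility",
    "restaurant menu dinner",
    "dinner restaurant wine",
]
vectors = [Counter(d.split()) for d in docs]
groups = cluster(vectors)
labels = [common_terms(g, vectors) for g in groups]
```

Listing the shared terms of each group mirrors the idea of identifying a group's topic by the words its documents have in common.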
2. Categorization
• Topic Categorization Tool
• Assign documents to preexisting categories (“topics” or “themes”)
• Categories are chosen to match the intended use of the collection
• Categories defined by providing a set of sample documents for each category
2. Categorization (cont.)
• This “training” phase produces a special index, called the categorization schema
• Categorization tool returns a list of category names and confidence levels for each document
• If the confidence level is low, document is put aside for human categorizer
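The training and categorization steps can be sketched as follows. This is an assumption-laden illustration, not the Topic Categorization tool: the "schema" is modeled as one summed word-count vector per category, the confidence level as cosine similarity, and the cutoff `min_confidence` is an arbitrary value.

```python
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def train(samples):
    """Training phase: build a schema with one summed term vector per category."""
    schema = {}
    for category, docs in samples.items():
        total = Counter()
        for doc in docs:
            total += vectorize(doc)
        schema[category] = total
    return schema

def categorize(doc, schema, min_confidence=0.2):
    """Return (category, confidence) pairs, best first, plus a human-review flag."""
    vec = vectorize(doc)
    ranked = sorted(((cosine(vec, c), name) for name, c in schema.items()),
                    reverse=True)
    results = [(name, round(score, 2)) for score, name in ranked]
    needs_human = results[0][1] < min_confidence
    return results, needs_human

samples = {
    "finance": ["bank credit line", "credit facility bank loan"],
    "dining": ["restaurant dinner menu", "wine dinner restaurant"],
}
schema = train(samples)
results, needs_human = categorize("new credit line from the bank", schema)
unknown_results, unknown_needs_human = categorize("weather forecast rain", schema)
```

A document that matches no category well gets a low top confidence and is flagged for a human categorizer instead of being assigned automatically.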
2. Categorization (cont.)
• Effectiveness: Tests have shown that the Topic Categorization tool agrees with human categorizers to the same degree as human categorizers agree with one another.
[Figure: set of sample documents → training phase → special index used to categorize new documents → returns list of category names and confidence levels for each document]
Text Mining Applications
Main Advantages of mining technology over traditional ‘information broker’ business
• Ability to quickly process large amounts of textual data
• “Objectivity” and customizability
• Automation
Applications used to:
• Gain insights about trends, relations between people/places/organizations
• Classify and organize documents according to their content
• Organize repositories of document-related meta-information for search and retrieval
• Retrieve documents
Main Applications
• Knowledge Discovery
• Information Distillation
CRI: Customer Relationship Intelligence
• Appropriate documents are selected
• Documents are converted to a common format
• Feature extraction and clustering tools are used to create a database
• User may select parameters for the preprocessing and clustering steps
• Clustering produces groups of feedback that share important linguistic elements
• Categorization tool is used to assign new incoming feedback to the identified categories
CRI (continued)
• Knowledge Discovery
  • Clustering used to create a structure that can be interpreted
• Information Distillation
  • Refinement and extension of the clustering results
    • Interpreting the results
    • Tuning of the clustering process
    • Selecting meaningful clusters
Exam Question #1
• Name an example of each of the two main classes of applications of text mining.
  • Knowledge Discovery: Discovering a common customer complaint among much feedback.
  • Information Distillation: Filtering future comments into pre-defined categories
Exam Question #2
• How does the procedure for text mining differ from the procedure for data mining?
  • Adds feature extraction function
  • Not feasible to have humans select features
  • Highly dimensional, sparsely populated feature vectors
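To make "highly dimensional, sparsely populated" concrete, the toy example below builds one dimension per distinct word and measures how many entries are zero. The three documents and the variable names are invented for illustration.

```python
docs = [
    "customer reports late delivery of order",
    "invoice total does not match contract",
    "web page fails to load on login",
]

# One dimension per distinct word in the collection.
vocab = sorted({w for d in docs for w in d.split()})
vectors = [[1 if w in d.split() else 0 for w in vocab] for d in docs]

dimensions = len(vocab)
nonzero_per_doc = [sum(v) for v in vectors]
# Fraction of vector entries that are zero.
sparsity = 1 - sum(nonzero_per_doc) / (dimensions * len(docs))

# A sparse representation stores only the non-zero positions.
sparse = [{i: x for i, x in enumerate(v) if x} for v in vectors]
```

Even with three short documents, two thirds of the entries are zero; with a real collection the vocabulary grows much faster than the length of any single document, which is why features must be selected by an algorithm.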
Exam Question #3
• In the Nominator program of IBM’s Intelligent Miner for Text, an objective of the design is to enable rapid extraction of names from large amounts of text. How does this decision affect the ability of the program to interpret the semantics of text?
  • Does not perform in-depth syntactic or semantic analyses of texts
THE END
http://www-3.ibm.com/software/data/iminer/fortext/