Knowledge Discovery with
Data and Text Mining
Dursun Delen, Ph.D.
Associate Professor of MSIS
William S. Spears School of Business
Oklahoma State University - Tulsa
Introduction
- Why data mining?
- What is data mining?
- Data Mining Process
- Data Mining Techniques
- Text Mining
- Data & Text Mining Examples
- A Demonstration of Data Mining in Action
© D. Delen
KPM Symposium – Tulsa, OK – October 3-4, 2007
Data Deluge: Why Now?
- Barcodes
- RFIDs
- Magnetic Strips
- Sensors…
Motivation:
“Necessity is the Mother of Invention”
- Competitive pressure to make better decisions
- Data explosion problem (or opportunity)
- Why do we have more data now than we had before?
  - Technology driven reasons: Automated data collection tools and techniques…
  - Software driven reasons: Mature database technology…
  - Cost driven reasons (both hardware and software)
We are drowning in data, but starving for knowledge!
Solution: Data and data mining
What Is Data Mining?
- Data mining:
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information (or patterns) from data
  - Data mining: a misnomer?
- Alternative names…
  - Knowledge discovery, knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Standardized DM Processes
- CRISP-DM
  - Cross-Industry Standard Process for Data Mining
  - www.crisp-dm.org
- SEMMA
  - Sample the data
  - Explore the data
  - Modify the data
  - Model
  - Assess
  - http://www.sas.com/technologies/analytics/datamining/miner/semma.html
Standardized Data Mining Processes
Cross-Industry Standard Process for Data Mining (CRISP-DM)
- Established in 1996
- By Daimler-Benz, Integral Solutions Ltd. (ISL), NCR, and OHRA
CRISP-DM Step 1: Business Understanding
- Determine business objectives
- Assess the situation
- Determine the data mining goals
- Produce a project plan
Standardized Data Mining Processes
CRISP-DM Step 2: Data Understanding
- Collect the initial data
- Describe the data
- Explore the data
- Verify the data
Standardized Data Mining Processes
CRISP-DM Step 3: Data Preparation
- Select data
- Clean data
- Construct data
- Integrate data
- Format data
Standardized Data Mining Processes
CRISP-DM Step 4: Modeling
- Select the modeling technique
- Generate test design
- Build the model
- Assess the model
Standardized Data Mining Processes
CRISP-DM Step 5: Evaluation
- Evaluate results
- Review process
- Determine next step
Standardized Data Mining Processes
CRISP-DM Step 6: Deployment
- Plan deployment
- Plan monitoring and maintenance
- Produce final report
- Review the project
Data Mining Techniques (1)
- Association (correlation and causality)
  - age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]
  - contains(T, “computer”) → contains(T, “software”) [support = 1%, confidence = 75%]
- Prediction (Classification & Regression)
  - Classification: developing models (functions) that describe and distinguish classes or concepts for future prediction
  - Predicting whether it will rain tomorrow
  - Classifying a loan applicant as “good” or “bad”
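As a rough illustration of the support and confidence figures quoted for the association rules above, the sketch below computes both for a "computer → software" rule. The transactions and the helper names `support` and `confidence` are hypothetical and are not taken from the talk.

```python
# Toy market-basket data; the transactions are hypothetical, not from the talk.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "mouse"},
    {"printer", "paper"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"computer -> software: support={rule_support:.0%}, confidence={rule_confidence:.0%}")
```

Support measures how often the whole rule occurs in the data, while confidence measures how often the right-hand side holds when the left-hand side does.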
Data Mining Techniques (2)
- Prediction (Classification & Regression)
  - Regression: developing models (functions) that forecast the value of a continuous numerical variable
  - Predicting what the temperature will be tomorrow
  - Forecasting the future value of a stock a year from now
- Cluster analysis
  - Class label is unknown: Group the samples of objects based on their quantifiable characteristics
  - Cluster the customers who share the same characteristics: interest, income level, spending habits, etc.
  - Clustering is done by maximizing the intra-class similarity and minimizing the interclass similarity
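The clustering idea above can be sketched with k-means, which is one common algorithm for it; the slide does not name a specific method, so treat this choice as an assumption. The customer records and the function names are hypothetical.

```python
# Hypothetical customers as (income in $K, monthly spending in $K); not from the talk.
customers = [(20, 1.0), (22, 1.2), (25, 0.9), (80, 5.0), (85, 5.5), (90, 4.8)]


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids


# Seed with the first and last customer so the run is deterministic.
clusters, centroids = k_means(customers, [customers[0], customers[-1]])
for cluster, centroid in zip(clusters, centroids):
    print(centroid, cluster)
```

With this toy data the low-income and high-income customers end up in separate groups, although no class label was supplied.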
Text Mining
- Text mining is a process that employs
  - Statistical Natural Language Processing: a set of algorithms for converting unstructured text into structured data objects, and
  - Data Mining: the quantitative methods that analyze these data objects to discover knowledge
- Text Mining = Statistical NLP + DM
Text Mining Process Map
Problems / Opportunities
- Environmental scanning
Unstructured Text
- Documents (.doc, .pdf, .ps, etc.)
- Fragments (sentences, paragraphs, speech encodings, etc.)
- Web pages (textual sections, XML files, etc.)
- Etc.
Document Structuring
- Extract
- Transform
- Load
Structured Text
- Tabular representation
- Collection of XML files
- Etc.
Term Generation
- Natural Language Processing
- Term Filtering: use of domain knowledge (stemming, stop word list, synonym list, acronym list, etc.)
Term by Document Matrix (rows: Doc 1 … Doc M; columns: T_1 … T_N)
Information / Knowledge
- Clustering
- Classification
- Association
- Link analysis
Feedback
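A minimal sketch of the term-generation and term-by-document steps in the process map is shown below. The documents, the stop-word list, and the function names are hypothetical; stemming, synonym lists, and acronym lists are left out to keep the example short.

```python
import re
from collections import Counter

# Hypothetical documents and stop-word list, for illustration only.
documents = [
    "Data mining finds patterns in data.",
    "Text mining turns text into data.",
]
STOP_WORDS = {"in", "into", "the", "a", "of"}


def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def term_document_matrix(docs):
    """Return (sorted terms, one row of term counts per document)."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    terms = sorted(set().union(*counts))
    rows = [[c[t] for t in terms] for c in counts]
    return terms, rows


terms, matrix = term_document_matrix(documents)
print(terms)
for row in matrix:
    print(row)
```

Each row of the resulting matrix is a structured record for one document, which is the form that clustering, classification, or association methods can then work on.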
Data Mining Applications…
Military Health System (1/3)
- KBSI’s Phase II SBIR research project
  - Funded through SBIR program by the Offices of the Secretary of Defense
  - SBIR: Phase I ⇒ Phase II ⇒ Phase III
- DM in Healthcare
  - Managerial
  - Clinical
- Reference
  - Delen, D. and S. Ramachandran (2003). A Hybrid Approach to Knowledge Discovery from Military Health Systems. Journal of Neural, Parallel and Scientific Computing, Volume 11, Number 1&2, pp. 161-183.
Data Mining Applications…
Military Health System (2/3)
- One of the largest health systems in the US
  - 90+ hospitals
  - 100s of outpatient clinics, treatment facilities
  - 180,000 employees (doctors, nurses, other staff)
  - 8 million beneficiaries
  - $20 Billion/year budget
- Purpose/mission is to
  - Provide healthcare to eligible veterans
  - Provide education and training opportunities to health profession (residency, practical training, etc.)
  - Conduct medical research, create innovation
  - Provide public health service at the time of natural disasters
- Problem: ⇑ demand, ⇓ budget
Data Mining Applications…
Military Health System (3/3)
- Classification/prediction
  - Demand forecasting
  - Resource allocation
- Association rules
  - Patient diagnosis

Load forecast hierarchy (figure):
- Regional level (aggregate): Load Forecast, DMIS Region
- Facility level: Load Forecast, DMIS Facility 1; Load Forecast, DMIS Facility 2
- Service level (granular):
  - Load Forecast, DMIS Facility 1, Diagnosis Type A, B
  - Load Forecast, DMIS Facility 1, Diagnosis Type C, D
  - Load Forecast, DMIS Facility 2, Diagnosis Type X, Y
  - Load Forecast, DMIS Facility 2, Diagnosis Type P, Q, R

Primary/Secondary Diagnosis
Patient ID   D1   D2   D3   ...
10001        0    0    1    …
10002        0    1    0    …
10003        1    1    1    …
.            .    .    .    .

Association Rules
Rule                                             Support   Conf.   Lift
Disease 724, 653, 366 => 343, 453                60%       40%     79%
Disease 741, 606, 42.01 => 42.05, 35.0, 23.0     30%       50%     49%
Disease 724, 653, 366 => 343, 453                50%       50%     45%
Disease 724, 653, 366 => 343, 453                30%       40%     49%
...
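To show how rule metrics can be read off a binary patient-by-diagnosis table like the one above, here is a small sketch that computes support, confidence, and lift for a rule D1 => D2. The patient flags and function names are hypothetical and are not the MHS data. Lift is computed as the usual ratio (1.0 means the two sides are independent); the slide reports lift as a percentage, and the convention behind those figures is not stated.

```python
# Hypothetical patient-by-diagnosis flags (1 = diagnosis present); not the MHS data.
patients = [
    {"D1": 1, "D2": 1},
    {"D1": 1, "D2": 1},
    {"D1": 1, "D2": 0},
    {"D1": 0, "D2": 1},
    {"D1": 0, "D2": 0},
    {"D1": 0, "D2": 0},
    {"D1": 1, "D2": 1},
    {"D1": 0, "D2": 0},
]


def rate(rows, columns):
    """Fraction of rows in which every listed column is flagged 1."""
    return sum(all(row[c] == 1 for c in columns) for row in rows) / len(rows)


def rule_metrics(rows, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    both = rate(rows, antecedent + consequent)
    conf = both / rate(rows, antecedent)
    lift = conf / rate(rows, consequent)
    return both, conf, lift


rule_support, rule_conf, rule_lift = rule_metrics(patients, ["D1"], ["D2"])
print(f"D1 => D2: support={rule_support:.3f}, confidence={rule_conf:.2f}, lift={rule_lift:.2f}")
```

A lift above 1 suggests that the consequent diagnosis is more likely among patients with the antecedent diagnosis than in the population as a whole.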
Data Mining Applications…
Predicting Breast Cancer Survivability
- Objective: Predicting breast cancer survivability
- Methods/Materials:
  - Used artificial neural networks (MLP), decision trees (C5) and logistic regression
  - Used 10-fold cross validation and more than 200,000 cases
- Results:
  - Decision tree (C5) was the best predictor with 93.6% accuracy, artificial neural networks (MLP) came second with 91.2% accuracy, and logistic regression was the worst of the three with 89.2% accuracy
- Reference:
  - Delen, D., G. Walker and A. Kadam (2004). Predicting Breast Cancer Survivability: A Comparison of Three Data Mining Methods, to appear in Artificial Intelligence in Medicine.
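The 10-fold cross validation mentioned above can be sketched as follows. This is a generic outline, not the study's code: the majority-class "model" is only a stand-in for the neural network, decision tree, and logistic regression models, and all names and data are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_accuracy(features, labels, fit, predict, k=10):
    """Average test accuracy of a model over k folds."""
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def fit_majority(train_features, train_labels):
    """Stand-in model: remember the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, feature):
    """Always predict the remembered label, ignoring the feature."""
    return model


# Hypothetical data: 40 cases labeled 1 and 10 cases labeled 0.
features = list(range(50))
labels = [1] * 40 + [0] * 10
accuracy = cross_validated_accuracy(features, labels, fit_majority, predict_majority)
print(f"10-fold accuracy: {accuracy:.1%}")
```

Because every case is held out exactly once, the averaged accuracy uses all of the data for testing without ever scoring a model on cases it was trained on.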
A Brief List of Text Mining Applications
- Exploring trends in research literature
  - We did this for IS literature
  - Identifying functional relationships between different genes using large collections of interdisciplinary research literature
- Using text mining to detect deception
  - One of our Ph.D. students is working on this
- Analyzing records for customer contact centers
  - Both audio as well as text (SAS does this)
- “Reading”, summarizing, classifying emails
- Text mining the Web content
- Anti-terrorism initiatives (CIA is using text mining)
Forecasting Box Office Success
of Hollywood Movies
Ramesh Sharda and Dursun Delen
Institute for Research in Information Systems
Oklahoma State University
Forecasting Box-Office Receipts: A Tough Problem!
“… No one can tell you how a movie is going to do in the marketplace… not until the film opens in darkened theatre and sparks fly up between the screen and the audience”
Mr. Jack Valenti
President and CEO of the Motion Picture Association of America
Current work on prediction…
- A lot of research has been done
  - Behavioral models
  - Analytical models
  - Predict after the initial release
- Our approach
  - Use a data mining approach
  - Use as much historical data as possible
  - Make it web-enabled
  - Predict before the initial release
Our Approach – Movie Forecast Guru
- DATA – 849 Movies released between 1998-2006
- Movie Decision Parameters:
  - Intensity of competition rating
  - MPAA Rating
  - Star power
  - Genre
  - Technical Effects
  - Sequel ?
  - Estimated screens at opening
  - …
- Output: Box office gross receipts (flop → blockbuster)

Class No.   Range (in Millions)
1           < 1 (Flop)
2           > 1, < 10
3           > 10, < 20
4           > 20, < 40
5           > 40, < 65
6           > 65, < 100
7           > 100, < 150
8           > 150, < 200
9           > 200 (Blockbuster)
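The class table above amounts to binning a continuous gross figure into nine ordered classes. A minimal sketch of that mapping is given below; the function name `gross_to_class` is hypothetical, and the handling of values that fall exactly on a boundary is an assumption, since the slide does not specify it.

```python
from bisect import bisect_right

# Upper edges of classes 1-8 in millions of dollars, as listed on the slide.
CLASS_EDGES = [1, 10, 20, 40, 65, 100, 150, 200]


def gross_to_class(gross_millions):
    """Map box-office gross (in millions) to a class number from 1 to 9.

    Assumption: a gross exactly on an edge falls into the higher class;
    the slide does not say how ties are handled.
    """
    return bisect_right(CLASS_EDGES, gross_millions) + 1


for gross in (0.5, 5, 65, 199.9, 250):
    print(gross, "->", gross_to_class(gross))
```

Turning the dollar amount into an ordered class is what lets the forecasting task be treated as classification rather than regression.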
Our Approach – Movie Forecast Guru
PREDICTION MODELS
- Statistical Models:
  - Discriminant Analysis
  - Ordinal Multiple Logistic Regression
- Machine Learning Models:
  - Artificial Neural Networks
  - Decision Tree Induction
    - CART - Classification & Regression Trees
    - C5 - Decision Tree
  - Support Vector Machines
  - Rough Sets
  - …
Prediction Results…
Models include plots via text-mining
Text Mining Models - Tested on 2006 Data

          No Category   War      Life     Organization   Character
Bingo     49.86%        53.03%   52.16%   51.59%         53.60%
1-Away    88.18%        87.32%   88.18%   89.63%         89.63%
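The slide does not define the two measures. The sketch below assumes that "Bingo" is the share of movies whose predicted class matches the actual class exactly and that "1-Away" is the share predicted within one class of the actual one. The class labels used here are hypothetical, as are the function names.

```python
def bingo_rate(actual, predicted):
    """Share of cases where the predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def one_away_rate(actual, predicted):
    """Share of cases where the prediction is at most one class off."""
    return sum(abs(a - p) <= 1 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical class labels (1 = flop ... 9 = blockbuster); not the 2006 test set.
actual = [9, 8, 7, 6, 5, 4, 3, 2]
predicted = [9, 6, 7, 5, 5, 4, 4, 4]

print(f"Bingo:  {bingo_rate(actual, predicted):.2%}")
print(f"1-Away: {one_away_rate(actual, predicted):.2%}")
```

Under these assumed definitions the 1-Away rate can never be lower than the Bingo rate, which is consistent with the figures reported in the table.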
Sample Predictions…

Movie                    Actual   No Category   War   Organization   Life   Character
Spider-Man 3             9        9             9     9              9      9
300                      9        5             7     5              5      5
The Simpsons Movie       8        9             9     9              9      9
The Bourne Ultimatum     8        6             7     9              9      7
Knocked Up               7        5             5     5              5      7
Live Free or Die Hard    7        9             7     9              9      9
Evan Almighty            6        6             7     7              7      7
1408                     6        5             5     4              5      5
TMNT                     5        4             3     4              5      5
Music and Lyrics         5        4             5     5              5      5
Freedom Writers          4        4             3     4              4      5
Reign Over Me            3        4             4     4              4      4
Black Snake Moan         2        3             3     3              3      3
Ta Ra Rum Pum            1        1             1     1              1      1
Web-based DSS
Figure: Movie Forecast Guru (MFG) architecture. Components and connection labels:
- User (Manager)
- GUI (Internet Browser): HTML, TCP/IP
- MFG Engine (Web Server)
- Prediction Models: Remote Models, Local Models
- Remote Data Sources: Web Services, XML / SOAP, ETL
- MFG Database: ODBC & ETL
- Knowledge Base (Business Rules): XML
Demonstration of MFG