DATA MINING:
Algorithms, Applications and Beyond
Chandan K. Reddy
Department of Computer Science
Wayne State University, Detroit, MI – 48202.
Organization

- Introduction
- Basic components
- Fundamental Topics
  - Classification
  - Clustering
  - Association Analysis
- Research Topics
  - Probabilistic Graphical Models
  - Boosting Algorithms
  - Active Learning
  - Mining under Constraints
- Teaching
Lots of Data ….

- Customer Transactions
- Bioinformatics
- Banking
- Internet / Web
- Biomedical Imaging
So What ?????

- Computers have become cheaper and more powerful, so storage is not an issue
- There is often information “hidden” in the data that is not readily evident
- Human analysts may take weeks to discover useful information
- Much of the data is never analyzed at all

We are drowning in data, but starving for knowledge!!!
Data Mining is …

- “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data”
- “the science of extracting useful information from large data sets or databases”  - Wikipedia.org

A more appropriate term would be … Knowledge Discovery in Databases (KDD)
Steps in Knowledge Discovery

Steps in the KDD Procedure:
- Data Cleaning (removal of noise and inconsistent records)
- Data Integration (combining multiple sources)
- Data Selection (only data relevant for the task are retrieved from the database)
- Data Transformation (converting data into a form more appropriate for mining)
- Data Mining (application of intelligent methods in order to extract data patterns)
- Model Evaluation (identification of truly interesting patterns representing knowledge)
- Knowledge Presentation (visualization or other knowledge presentation techniques)
What can Data Mining do?

- Devise intelligent ways of handling the data
- Find valuable information hidden in large volumes of data
- Analyze the data and find patterns and regularities in it
- Mining analogy: in a mining operation, large amounts of low-grade material are sifted through in order to find something of value
- Identify abnormal or suspicious activities
- Provide guidelines to humans on what to look for in a dataset
Related CS Topics

(Diagram) Data Mining sits at the intersection of: Pattern Recognition, Database Systems, Artificial Intelligence, Machine Learning, Visualization, Optimization, Algorithms, and Statistics.
Typical Data Mining Tasks are …

- Prediction Methods (you know what to look for): use some variables to predict unknown or future values of other variables.
- Description Methods (you don’t know what to look for): find human-interpretable patterns that describe the data.

From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Basic components

- Data Pre-processing
- Data Visualization
- Model Evaluation
- Classification
- Clustering
- Association Analysis
Different kinds of Data

- Record Data
  - Data Matrix
  - Document Data
  - Transaction Data
- Graph Data
- Ordered Data
  - Temporal Data
  - Sequence Data
  - Spatio-Temporal Data
Record Data

Data that consists of a collection of records, each of which consists of a fixed set of attributes.

Document Data

Each document becomes a `term' vector:
- each term is a component (attribute) of the vector,
- the value of each component is the number of times the corresponding term occurs in the document.
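To make the term-vector idea concrete, here is a minimal sketch (not from the original slides) that builds term-count vectors for two made-up documents; the tokenization is deliberately naive (lower-casing and whitespace splitting).

```python
from collections import Counter

def term_vectors(documents):
    """Turn each document into a term-count vector (bag of words)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One component per vocabulary term; value = occurrence count.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

vocab, vecs = term_vectors(["the ball was hit by the player",
                            "the coach timed the play"])
print(vocab)
print(vecs)
```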
Transaction Data

A special type of record data, where each record (transaction) involves a set of items. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk
Graph Data

Data with relationships among objects.

Examples (figures): (a) Generic Web Data, (b) Citation Data Analysis.
Ordered Data

Time Series data – a series of measurements taken over a certain time frame, e.g. financial data.
Ordered Data

Sequence data – no time stamps, but order is still important, e.g. genome data:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data

Spatio-Temporal Data (figure): average monthly temperature of land and ocean collected for a variety of geographical locations (a total of 250,000 data points).
Data Pre-Processing

- Removal of noise and outliers (will improve the performance of mining)
- Sampling is employed for data selection (processing the entire data set might be expensive)
- Dealing with high-dimensional data (curse of dimensionality)
- Data Normalization (different features have different value ranges, e.g. human age, height, weight)
- Feature Selection (remove unnecessary features – redundant or irrelevant)
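As an illustration of the normalization step, a minimal sketch of min-max rescaling; the age/height/weight values are invented for the example, and min-max scaling is only one of several common normalization schemes.

```python
def min_max_normalize(column):
    """Rescale a numeric feature to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]          # constant feature
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical records: age in years, height in cm, weight in kg.
ages    = [23, 45, 31, 60]
heights = [180, 165, 172, 158]
weights = [75, 68, 80, 61]

for name, col in [("age", ages), ("height", heights), ("weight", weights)]:
    print(name, min_max_normalize(col))
```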
Data Visualization

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

Example figures: histograms, pie chart, scatter plot array of Iris attributes, contour plot (temperature in Celsius), parallel coordinates plot for Iris data, Chernoff faces for Iris data (Setosa, Versicolour, Virginica).
A Sample Data Cube

(Figure) A three-dimensional cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum); one aggregate cell gives the total annual sales of TVs in the U.S.A.
Organization (agenda slide repeated)
Classification

(Diagram) Training Phase: existing data + training algorithm → learn model. Testing Phase: new data ??? → apply model → result.
Classification models

Example decision tree (figure):
- Outlook = Sunny → Humidity: High → No, Normal → Yes
- Outlook = Overcast → Yes
- Outlook = Rainy → Windy: True → No, False → Yes
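The decision tree above can be read as a sequence of tests; a minimal sketch reconstructed from the figure (the dictionary encoding of a record is invented for the example):

```python
def classify(record):
    """Apply the play/no-play decision tree reconstructed from the slide."""
    if record["Outlook"] == "Sunny":
        return "No" if record["Humidity"] == "High" else "Yes"
    if record["Outlook"] == "Overcast":
        return "Yes"
    # Rainy branch: the decision depends on whether it is windy.
    return "No" if record["Windy"] else "Yes"

print(classify({"Outlook": "Sunny", "Humidity": "Normal", "Windy": False}))  # Yes
print(classify({"Outlook": "Rainy", "Humidity": "High", "Windy": True}))     # No
```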
Metrics for Performance Evaluation

Confusion matrix:

                         PREDICTED CLASS
                         Class=Yes    Class=No
ACTUAL    Class=Yes      a (TP)       b (FN)
CLASS     Class=No       c (FP)       d (TN)

Most widely-used metric:

    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Evaluating Data Mining techniques

- Predictive Accuracy (ability of a model to predict future values) or Descriptive Quality (ability of a model to find meaningful descriptions of the data, e.g. clusters)
- Speed (computation cost involved in generating and using the model)
- Robustness (ability of a model to work well even with noisy or missing data)
- Scalability (ability of a model to scale up well with large amounts of data)
- Interpretability (level of understanding and insight provided by the model)
Clustering

- No class labels – so, no prediction
- Finds groupings in the data (descriptive)
- Can be used to summarize the data
- Can help in removing outliers and noise
- Applications: image segmentation, document clustering, gene expression data, etc.
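The slides do not name a specific algorithm; as one common example, a minimal k-means sketch on made-up 2-D points (no class labels are used, and the groupings it finds are purely descriptive):

```python
import random

def kmeans(points, k, iterations=20):
    """Minimal k-means: alternate assignment and centroid-update steps."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared distance).
            j = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        for i, members in enumerate(clusters):
            if members:  # recompute the centroid as the mean of its members
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

points = [(1.0, 1.1), (0.9, 1.0), (5.0, 5.2), (5.1, 4.9), (0.8, 1.2), (5.2, 5.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```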
Association Analysis

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
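A minimal sketch (assumed, not from the slides) of how support and confidence would be computed for the {Diaper} → {Beer} rule on the market-basket table above:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Estimated P(rhs appears in a transaction | lhs appears in it)."""
    return support(lhs | rhs) / support(lhs)

rule_lhs, rule_rhs = {"Diaper"}, {"Beer"}
print(support(rule_lhs | rule_rhs))    # 3/5 = 0.6
print(confidence(rule_lhs, rule_rhs))  # 3/4 = 0.75
```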
Organization (agenda slide repeated)
Probabilistic Graphical Models

- Real-world data is very complicated
- We would like to understand the underlying distribution that generated the data
- If the distribution is unimodal, the problem is easy to solve
- But usually the distribution is multimodal, not unimodal
Parameter Estimation

Modeling with Probabilistic Graphical Models:
- Mixture Models
- Hidden Markov Models
- Mixture-of-Experts
- Bayesian Networks
- Mixture of Factor Analyzers
- Neural Networks
- and so on…

We don’t want sub-optimal models.
Example (figure)

Motivation (figure)

“Searching for a needle in a haystack”
Problems with Local Optimization

Local methods are limited to “fine-tuning” within a local neighborhood; there is a need for a method that explores a subspace in a systematic manner.

TRUST-TECH Approach

Systematic tier-by-tier search.
Mixture Models

- Let x = [x_1, x_2, …, x_d]^T be the d-dimensional feature vector
- Assumption: K components in the mixture model:

    p(x \mid \Theta) = \sum_{i=1}^{k} \alpha_i \, p(x \mid \theta_i)

- Let \Theta = \{ \alpha_1, \alpha_2, …, \alpha_k, \theta_1, \theta_2, …, \theta_k \} represent the collection of parameters
- For Gaussian components,

    p(x \mid \theta_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left( -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right)

    0 \le \alpha_i \le 1, \quad i = 1, 2, …, k, \quad \sum_{i=1}^{k} \alpha_i = 1
Maximum Likelihood Estimation

- Let X = { x^(1), x^(2), …, x^(n) } be the set of n i.i.d. samples:

    \log p(X \mid \Theta) = \sum_{j=1}^{n} \log p(x^{(j)} \mid \Theta) = \sum_{j=1}^{n} \log \sum_{i=1}^{k} \alpha_i \, p(x^{(j)} \mid \theta_i)

- Goal: find \Theta that maximizes the likelihood function:

    \hat{\Theta}_{MLE} = \arg\max_{\Theta} \log p(X \mid \Theta)

- Difficulty: (i) no closed-form solution, and (ii) the likelihood surface is highly nonlinear
EM Algorithm

- Initialization: set the initial parameters \Theta^{(0)}
- Iteration: iterate the following until convergence
  - E-Step: compute the Q-function, i.e. the expectation of the complete-data log-likelihood given the current parameters:

      Q(\Theta, \Theta^{(t)}) = E_Z[ \log p(X, Z \mid \Theta) \mid X, \Theta^{(t)} ]

  - M-Step: maximize the Q-function with respect to \Theta:

      \Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t)})
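A minimal EM sketch for a one-dimensional, two-component Gaussian mixture, following the E-step/M-step structure above; the data, the random initialization, and the stopping rule (a fixed iteration count) are simplifying assumptions for illustration:

```python
import math
import random

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(data, k=2, iterations=50):
    """EM for a 1-D Gaussian mixture: E-step responsibilities, M-step updates."""
    alphas = [1.0 / k] * k
    mus = random.sample(data, k)
    variances = [1.0] * k
    for _ in range(iterations):
        # E-step: responsibility of component i for each sample x.
        resp = [[alphas[i] * gaussian(x, mus[i], variances[i]) for i in range(k)] for x in data]
        resp = [[r / sum(row) for r in row] for row in resp]
        # M-step: re-estimate mixing weights, means, and variances.
        for i in range(k):
            n_i = sum(row[i] for row in resp)
            alphas[i] = n_i / len(data)
            mus[i] = sum(row[i] * x for row, x in zip(resp, data)) / n_i
            variances[i] = sum(row[i] * (x - mus[i]) ** 2 for row, x in zip(resp, data)) / n_i + 1e-6
    return alphas, mus, variances

data = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(5, 1) for _ in range(100)]
print(em_gmm(data))
```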

Nonlinear Transformation

Minimize f(x), where f : R^N → R, f ∈ C^2 (original function).

Corresponding dynamical system:  \dot{x}(t) = -\nabla f(x)

One-to-one correspondence of the critical points:
- Local Minimum  ↔  Stable Equilibrium Point
- Saddle Point  ↔  Decomposition Point
- Local Maximum  ↔  Source
- Likelihood Function  ↔  Energy Function

[ JCB ’06 ]
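To illustrate the correspondence, a small sketch that Euler-integrates the gradient system on a made-up double-well function: trajectories settle at the stable equilibria, which are exactly the local minima of f. The function and step sizes are assumptions for the example; TRUST-TECH's tier-by-tier exit-point search itself is not reproduced here.

```python
def grad_flow(grad, x0, step=0.01, steps=5000):
    """Euler-integrate the gradient system x'(t) = -grad f(x);
    trajectories settle at stable equilibria, i.e. local minima of f."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical double-well f(x) = (x^2 - 1)^2 with gradient 4x(x^2 - 1);
# its local minima at x = -1 and x = +1 are the stable equilibria.
grad_f = lambda x: [4 * x[0] * (x[0] ** 2 - 1)]
print(grad_flow(grad_f, [0.3]))   # converges near +1
print(grad_flow(grad_f, [-0.3]))  # converges near -1
```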
Experimental Results
[ IEEE PAMI ’08 ]
Finding Motifs using Probabilistic Models

Count matrix C over the alphabet {A, T, G, C} for positions k = b, 1, 2, …, l:

        k=b    k=1    k=2    k=3    k=4    …    k=l
{A}     C0,1   C1,1   C2,1   C3,1   C4,1   …    Cl,1
{T}     C0,2   C1,2   C2,2   C3,2   C4,2   …    Cl,2
{G}     C0,3   C1,3   C2,3   C3,3   C4,3   …    Cl,3
{C}     C0,4   C1,4   C2,4   C3,4   C4,4   …    Cl,4
Results

(Figure) Alignment scores for different motifs – (11,2), (13,3), (15,4), (17,5), (20,6) – showing the average score using random starts (Original) and the first-tier (Tier-1) and second-tier (Tier-2) improvements.  [ BMC AMB ’06 ]
Neural Network Diagram

(Figure) A feed-forward network with inputs x_1, x_2, x_3, …, x_n, one hidden layer, and a single output y.

- Inputs: x_i
- Output: y
- Weights: w_ij (w_11, …, w_nk, w_01, w_02, …, w_0k)
- Biases: b_1, b_2, …, b_k, b_{k+1}
- Targets: t
- # of Input Nodes: n
- # of Hidden Layers: 1
- # of Hidden Nodes: k
- # of Output Nodes: 1

Cost function:

    C(w) = \frac{1}{Q} \sum_{i=1}^{Q} \left[ t(i) - y(i, w, x) \right]^2
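A minimal sketch of such a network and the quadratic cost C(w) above; the activation choices (sigmoid hidden units, linear output), the weights, biases, and data are assumptions made up for the example:

```python
import math

def forward(x, W1, b1, w2, b2):
    """One-hidden-layer network: sigmoid hidden units, linear output."""
    hidden = [1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(row, x)) + b)))
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def cost(samples, targets, W1, b1, w2, b2):
    """Quadratic cost C(w) = (1/Q) * sum_i (t(i) - y(i))^2 from the slide."""
    errors = [(t - forward(x, W1, b1, w2, b2)) ** 2 for x, t in zip(samples, targets)]
    return sum(errors) / len(errors)

# Hypothetical network: n = 2 inputs, k = 3 hidden nodes, 1 output.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
w2 = [0.7, -0.5, 0.2]
b2 = 0.05
print(cost([[1.0, 2.0], [0.5, -1.0]], [1.0, 0.0], W1, b1, w2, b2))
```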
Results – Classification Error (%)  [ IJCNN ’07 ]

              Train                                              Test
Dataset       Best BP   TRUST-TECH+BP   Improvement(%)           Best BP   TRUST-TECH+BP   Improvement(%)
Cancer        2.21      1.74            27.01                    3.95      2.63            50.19
Image         9.37      8.04            16.54                    11.08     9.74            13.76
Ionosphere    2.35      0.57            312.28                   10.25     7.96            28.77
Iris          1.25      1.00            25.00                    3.33      2.67            24.72
Diabetes      22.04     20.69           6.52                     23.83     20.58           15.79
Sonar         1.56      0.72            116.67                   19.17     12.98           47.69
Wine          4.56      3.58            27.37                    14.94     6.73            121.99
Boosting Algorithms for Biomedical Imaging

(Diagram) Training phase: from the training set T, derived sets T1, T2, …, TS are formed and a model is learned from each, giving h1, h2, …, hS. Testing phase: for a new sample (x, ?), the combined model h* = F(h1, h2, …, hS) produces the prediction (x, y*).

Tumor detection and tumor tracking must be performed in almost real-time.
Wavelet features are useful, but the individual classifiers built on them are not very accurate on their own.
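The combination step h* = F(h1, h2, …, hS) in the diagram can be illustrated as a weighted vote over the learned models; the stump-like weak learners and their weights below are invented for the example (the boosting procedure that would produce them is not shown):

```python
def combine(hypotheses, weights):
    """h* = F(h1, ..., hS): weighted majority vote of the learned models."""
    def h_star(x):
        score = sum(w * h(x) for h, w in zip(hypotheses, weights))
        return 1 if score >= 0 else -1
    return h_star

# Hypothetical weak learners: decision stumps on single feature values.
h1 = lambda x: 1 if x[0] > 0.5 else -1
h2 = lambda x: 1 if x[1] > 0.3 else -1
h3 = lambda x: 1 if x[0] + x[1] > 1.0 else -1

h_star = combine([h1, h2, h3], weights=[0.6, 0.3, 0.8])
print(h_star([0.7, 0.2]))   # each stump votes; prediction is the weighted sign
```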
Medical Image Retrieval using Boosting Methods

- Retrieving similar medical images is very valuable for diagnosis (automated diagnosis systems)
- Each category is trained separately and different models are learned
- Given a query image, the most similar images are displayed
Identification of Microbes

- Segment the objects by accurately identifying the boundaries
- Semi-automated methods perform very well
- Apply Active Learning methods for labeling the pixels

Results  [ JMA ’04 ]
Active Learning for Biomedical Imaging

- Labeling/annotating images is a daunting task
- We need to help medical doctors label the images efficiently
- Rather than showing images in random order, active learning can pick the hardest (most informative) ones
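One simple way to "pick the hardest ones" is uncertainty sampling: query the images whose predicted probability is closest to 0.5. A minimal sketch, assuming a binary classifier that returns a probability (the toy model and images below are invented for the example):

```python
def pick_hardest(unlabeled, predict_proba, budget=5):
    """Select the samples whose predicted class probability is closest to 0.5,
    i.e. the ones the current model is least sure about."""
    scored = [(abs(predict_proba(x) - 0.5), idx) for idx, x in enumerate(unlabeled)]
    scored.sort()                       # least-confident samples first
    return [unlabeled[idx] for _, idx in scored[:budget]]

# Hypothetical model: probability of "tumor" grows with mean pixel intensity.
predict_proba = lambda image: min(1.0, sum(image) / (255.0 * len(image)))

images = [[10, 20, 30], [120, 130, 140], [250, 240, 230], [60, 70, 80]]
print(pick_hardest(images, predict_proba, budget=2))
```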
Mining Under Constraints

- Business problems pose many real-world constraints
- Obviously, models trained without knowledge of these constraints do not perform well  [ submitted ]

(Diagram) Training Phase: constraints → learn model.  Testing Phase: apply model.
Mining Under Constraints

(Diagram) Constraints are incorporated both when learning the model (training phase) and when applying the model (testing phase).
Conclusion

- Different data mining related tasks were discussed in general
- Core data mining algorithms were illustrated
- Data Mining helps existing technologies, but it doesn’t override them
- A few challenges still remain unsolved:
  - Problems like parameter estimation and automated parameter selection are still ongoing research tasks
  - Handling real-world constraints
  - Incorporating domain knowledge during the training phase
Teaching

Fall 2007 : CSC 5991 Data Mining I – Fundamentals of Data Mining
http://www.cs.wayne.edu/~reddy/Courses/CS5991/

Winter 2008 : CSC 7991 Data Mining II – Topics in Data Mining
http://www.cs.wayne.edu/~reddy/Courses/CSC7991/
Data Mining I ( Fall 2007 )

This course introduces the fundamental principles, algorithms and applications of data mining.

Topics covered in this course include:
- data pre-processing
- data visualization
- model evaluation
- predictive modeling
- association analysis
- clustering
- anomaly detection
Data Mining II ( Winter 2008 )

This will be a continuation course. Data mining problems that arise in various application domains will be discussed. (No special prerequisite classes.)

The following topics will be covered:
- Data Warehousing
- Mining Data Streams
- Probabilistic Graphical Models
- Frequent Pattern Mining
- Multi-relational Data Mining
- Graph Mining
- Text Mining
- Visual Data Mining
- Sequence Pattern Mining
- Mining Time-Series Data
- Privacy-preserving Data Mining
- High-Dimensional Data Clustering
Thank You
Questions and Comments!!!!!!
Contact Information :
Office : 452 State Hall
Email : [email protected]
WWW : http://www.cs.wayne.edu/~reddy/