Data Analysis
Santiago González
<[email protected]>
Contents
 Introduction
 CRISP-DM
 Tools
 Data understanding
 Data preparation
 Modeling (2)
   Association rules?
   Supervised classification
   Clustering
 Assessment & Evaluation (1)
 Examples: (2)
   Neuron Classification
   Alzheimer disease
   Medulloblastoma
   CliDaPa
   …
 Special Guest: Prof. Ernestina Menasalvas, "Stream Mining"
Data Analysis
Data Mining: Modeling
Data Analysis
Data Mining Tasks
 Prediction Methods
   Use some variables to predict unknown or future values of other variables.
 Description Methods
   Find human-interpretable patterns that describe the data.
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Analysis
Data Mining Tasks...
 Association Rule Discovery [Descriptive]
 Classification [Predictive] (supervised classification)
 Regression [Predictive]
 Clustering [Descriptive] (unsupervised classification)
Data Analysis
Data Mining Tasks...
 Association Rule Discovery [Descriptive]
 Classification [Predictive]
 Regression [Predictive]
 Clustering [Descriptive]
Data Analysis
Association Rule Discovery
 Given a set of records, each of which contains some number of items from a given collection,
 produce dependency rules that will predict the occurrence of an item based on the occurrences of other items.

TID | Items
 1  | Bread, Coke, Milk
 2  | Beer, Bread
 3  | Beer, Coke, Diaper, Milk
 4  | Beer, Bread, Diaper, Milk
 5  | Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Data Analysis
Association Rule Discovery
 Example: let the rule discovered be {Bagels, …} --> {Potato Chips}
   Potato Chips as consequent => can be used to determine what should be done to boost its sales.
   Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
   Bagels in the antecedent and Potato Chips in the consequent => can be used to see what products should be sold with bagels to promote the sale of Potato Chips!
Data Analysis
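As a supplement to the example above: a minimal Python sketch of how candidate rules can be scored by support and confidence over the five-transaction table from the previous slide. The slides only show the discovered rules; the scoring code itself is an illustration.

```python
# Score candidate rules by support and confidence over the
# five-transaction example from the slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent), estimated from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(lhs, "-->", rhs,
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```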
Data Mining Tasks...
 Association Rule Discovery [Descriptive]
 Classification [Predictive]
 Regression [Predictive]
 Clustering [Descriptive]
Data Analysis
Classification: Definition
 Given a collection of records (training set):
   Each record contains a set of attributes; one of the attributes is the class (categorical).
   The class may be binary or not.
 Find a model for the class attribute as a function of the values of the other attributes.
 Goal: previously unseen records should be assigned a class as accurately as possible.
   A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Data Analysis
Classification Example

Training Set:
Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes | Single   | 125K | No
 2  | No  | Married  | 100K | No
 3  | No  | Single   |  70K | No
 4  | Yes | Married  | 120K | No
 5  | No  | Divorced |  95K | Yes
 6  | No  | Married  |  60K | No
 7  | Yes | Divorced | 220K | No
 8  | No  | Single   |  85K | Yes
 9  | No  | Married  |  75K | No
10  | No  | Single   |  90K | Yes

Test Set (class unknown):
Refund | Marital Status | Taxable Income | Cheat
No  | Single   |  75K | ?
Yes | Married  |  50K | ?
No  | Married  | 150K | ?
Yes | Divorced |  90K | ?
No  | Single   |  40K | ?
No  | Married  |  80K | ?

Training Set --> Learn Classifier --> Model; the Model is then applied to the Test Set
Classifying Galaxies
Courtesy: http://aps.umn.edu
Class:
 • Stages of Formation: Early, Intermediate, Late
Attributes:
 • Image features
 • Characteristics of light waves received, etc.
Data Size:
 • 72 million stars, 20 million galaxies
 • Object Catalog: 9 GB
 • Image Database: 150 GB
Data Analysis
Classification
Data Analysis
Cross validation
 Well classified: (a+d)/Sum
 Wrongly classified: (b+c)/Sum
 True positive rate (sensitivity): a/(a+c)
 True negative rate (specificity): d/(b+d)
 False positive rate: b/(b+d)
 False negative rate: c/(a+c)
Data Analysis
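A small sketch of these metrics in Python, assuming the usual confusion-matrix layout in which a = true positives, b = false positives, c = false negatives and d = true negatives (the layout itself is not spelled out on the slide):

```python
# Confusion-matrix metrics for a binary classifier.
# Assumed convention: a = TP, b = FP, c = FN, d = TN.
def confusion_metrics(a, b, c, d):
    total = a + b + c + d
    return {
        "well classified":     (a + d) / total,
        "wrongly classified":  (b + c) / total,
        "sensitivity (TPR)":   a / (a + c),
        "specificity (TNR)":   d / (b + d),
        "false positive rate": b / (b + d),
        "false negative rate": c / (a + c),
    }

# The 6-sample example from the next slide: a=2, b=1, c=1, d=2
print(confusion_metrics(2, 1, 1, 2))  # 4/6, 2/6, 2/3, 2/3, 1/3, 1/3
```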
Classification: example
 Well classified:
 Wrongly classified:
 True positive rate (sensitivity):
 True negative rate (specificity):
 False positive rate:
 False negative rate:
Data Analysis
Classification: example
 Well classified: 4/6
 Wrongly classified: 2/6
 True positive rate (sensitivity): 2/3
 True negative rate (specificity): 2/3
 False positive rate: 1/3
 False negative rate: 1/3
Data Analysis
Classification
Data Analysis
KNN
 Idea: use the information of the k nearest neighbours.
 We need to calculate the distance between samples in order to know which are nearest (Euclidean, Manhattan, etc.).
 Prior info:
   Number of neighbours: K
   Distance function: d(x,y)
   Learning data
   Testing data
Data Analysis
KNN
 Euclidean distance: d(x,y) = sqrt( Σi (xi - yi)² )
 Manhattan distance: d(x,y) = Σi |xi - yi|
 Quite similar; the difference is the absolute value instead of the squared value.
Data Analysis
KNN
 Example with K = 3, two attributes, and Euclidean distance
Data Analysis
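A compact KNN sketch matching the slide's setting (K = 3, two attributes, Euclidean distance); the toy training points are invented for illustration:

```python
from collections import Counter
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train, query, k=3, dist=euclidean):
    """Classify `query` by majority vote among its k nearest
    training samples. `train` is a list of (point, label) pairs."""
    neighbours = sorted(train, key=lambda s: dist(s[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Invented data with two attributes, mirroring the slide's K = 3 example
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((3.8, 4.0), "B"), ((4.1, 3.9), "B")]
print(knn_predict(train, (1.1, 0.9), k=3))  # -> "A"
```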
ID3
 Objective: create a decision tree as a method to approximate a target function over discrete values.
   Resistant to noise in the data
   Able to find or learn a disjunction of expressions
   The result can be expressed as if-then rules
   Tries to find the simplest tree that best separates the samples
   It is a recursive algorithm
   Uses information gain
Data Analysis
ID3
Data Analysis
ID3
 The most discriminative feature is the one with the highest Information Gain:

   G(C, Attr1) = E(C) - Σi P(Attr1 = Vi) · E(C | Attr1 = Vi)

 where the entropy is

   E(Attr1) = - Σi P(Attr1 = Vi) · log2 P(Attr1 = Vi)
            = - Σi P(Attr1 = Vi) · ln P(Attr1 = Vi) / ln 2
Data Analysis
ID3: example
Is this feature important?
Supervised Classification
Data Analysis
ID3: example
G(AdministrarTratamiento, Gota) = G(AT, G)

G(AT, G) = E(AT) - P(G=Si) × E(AT | G=Si) - P(G=No) × E(AT | G=No)

E(AT | G=Si) = -P(AT=Si|G=Si) · log2 P(AT=Si|G=Si) - P(AT=No|G=Si) · log2 P(AT=No|G=Si)
             = -3/7 · log2(3/7) - 4/7 · log2(4/7) = 0.985

E(AT | G=No) = -P(AT=Si|G=No) · log2 P(AT=Si|G=No) - P(AT=No|G=No) · log2 P(AT=No|G=No)
             = -6/7 · log2(6/7) - 1/7 · log2(1/7) = 0.592

E(AT) = -P(AT=Si) · log2 P(AT=Si) - P(AT=No) · log2 P(AT=No)
      = -9/14 · log2(9/14) - 5/14 · log2(5/14) = 0.940

G(AT, G) = 0.940 - (7/14) × 0.985 - (7/14) × 0.592 = 0.151
Data Analysis
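The information-gain computation above can be checked with a few lines of Python; the counts (9/5 class split, 3/4 and 6/1 within the two values of G) are taken directly from the worked example:

```python
import math

def entropy(pos, neg):
    """Binary entropy of a class split, in bits."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            e -= p * math.log2(p)
    return e

# Counts from the worked example: 14 samples, class AT has 9 Si / 5 No;
# attribute G splits them into G=Si (3 Si, 4 No) and G=No (6 Si, 1 No).
e_total = entropy(9, 5)                        # 0.940
e_g_si, e_g_no = entropy(3, 4), entropy(6, 1)  # 0.985, 0.592
gain = e_total - (7 / 14) * e_g_si - (7 / 14) * e_g_no
print(round(gain, 3))  # 0.152; matches the slide's 0.151 up to rounding
```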
ID3: example
[Figure slides: successive steps of the decision-tree construction]
Data Analysis
Bayes Classifier
 A probabilistic framework for solving classification problems
 Conditional probability:

   P(C | A) = P(A, C) / P(A)
   P(A | C) = P(A, C) / P(C)

 Bayes theorem:

   P(C | A) = P(A | C) · P(C) / P(A)
Data Analysis
Example of Bayes Theorem
 Given:
   A doctor knows that meningitis causes stiff neck 50% of the time
   The prior probability of any patient having meningitis is 1/50,000
   The prior probability of any patient having stiff neck is 1/20
 If a patient has a stiff neck, what is the probability he/she has meningitis?

   P(M | S) = P(S | M) · P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002

Data Analysis
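The same arithmetic in Python, using only the three quantities given on the slide:

```python
# Direct application of Bayes' theorem to the stiff-neck example.
p_s_given_m = 0.5    # P(S|M): meningitis causes stiff neck 50% of the time
p_m = 1 / 50_000     # prior P(M)
p_s = 1 / 20         # prior P(S)
print(p_s_given_m * p_m / p_s)  # 0.0002
```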
Bayesian Classifiers
 Consider each attribute and class label as random variables
 Given a record with attributes (A1, A2, …, An):
   The goal is to predict class C
   Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
 Can we estimate P(C | A1, A2, …, An) directly from data?
Data Analysis
Bayesian Classifiers
 Approach:
   Compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem:

     P(C | A1 A2 … An) = P(A1 A2 … An | C) · P(C) / P(A1 A2 … An)

   Choose the value of C that maximizes P(C | A1, A2, …, An)
   This is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) · P(C)
 How to estimate P(A1, A2, …, An | C)?
Data Analysis
Naïve Bayes Classifier
 Assume independence among the attributes Ai when the class is given:

   P(A1, A2, …, An | Cj) = P(A1 | Cj) · P(A2 | Cj) · … · P(An | Cj)

 We can estimate P(Ai | Cj) for all Ai and Cj.
 A new point is classified as Cj if P(Cj) · Πi P(Ai | Cj) is maximal.
Data Analysis
How to Estimate Probabilities from Data?

Tid | Refund | Marital Status | Taxable Income | Evade
 1  | Yes | Single   | 125K | No
 2  | No  | Married  | 100K | No
 3  | No  | Single   |  70K | No
 4  | Yes | Married  | 120K | No
 5  | No  | Divorced |  95K | Yes
 6  | No  | Married  |  60K | No
 7  | Yes | Divorced | 220K | No
 8  | No  | Single   |  85K | Yes
 9  | No  | Married  |  75K | No
10  | No  | Single   |  90K | Yes

 Class: P(Cj) = Ncj / N
   e.g., P(No) = 7/10, P(Yes) = 3/10
 For discrete attributes:
   P(Ai | Ck) = |Aik| / Nck
   where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
 Examples:
   P(Status=Married | No) = 4/7
   P(Refund=Yes | Yes) = 0
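A sketch of these maximum-likelihood estimates in Python, using the 10-record table above restricted to the two discrete attributes (Refund, Marital Status) plus the class:

```python
from collections import Counter

# The 10-record training set from the slide as (Refund, Status, Evade).
records = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

class_counts = Counter(r[-1] for r in records)  # {'No': 7, 'Yes': 3}

def p_class(c):
    """P(C) = Nc / N."""
    return class_counts[c] / len(records)

def p_attr_given_class(attr_index, value, c):
    """P(Ai = value | C = c) = |Aik| / Nc for discrete attributes."""
    matches = sum(1 for r in records if r[attr_index] == value and r[-1] == c)
    return matches / class_counts[c]

print(p_class("No"))                           # 0.7  (7/10)
print(p_attr_given_class(1, "Married", "No"))  # 4/7
print(p_attr_given_class(0, "Yes", "Yes"))     # 0.0
```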
How to Estimate Probabilities from Data?
 For continuous attributes:
   Discretize the range into bins
     one ordinal attribute per bin
     violates the independence assumption
   Two-way split: (A < v) or (A > v)
     choose only one of the two splits as the new attribute
   Probability density estimation:
     Assume the attribute follows a normal distribution
     Use the data to estimate the parameters of the distribution (e.g., mean and standard deviation)
     Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai | c)
Data Analysis
How to Estimate Probabilities from Data?
(same training data as in the previous table)

 Normal distribution:

   P(Ai | cj) = 1 / sqrt(2π σij²) · exp( -(Ai - μij)² / (2 σij²) )

   one distribution for each (Ai, cj) pair
 For (Income, Class=No):
   If Class=No: sample mean = 110, sample variance = 2975

   P(Income=120 | No) = 1 / (sqrt(2π) · 54.54) · exp( -(120 - 110)² / (2 · 2975) ) = 0.0072
Data Analysis
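A one-function check of the normal-density estimate above; mean 110 and variance 2975 are the sample statistics from the slide:

```python
import math

def normal_pdf(x, mean, variance):
    """Density of N(mean, variance) at x, used as P(Ai | c)."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# For (Income, Class=No): sample mean 110, sample variance 2975
print(round(normal_pdf(120, 110, 2975), 4))  # 0.0072
```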
Example of Naïve Bayes Classifier
Given a test record:

   X = (Refund=No, Married, Income=120K)

Naïve Bayes classifier estimates:
 P(Refund=Yes | No) = 3/7              P(Refund=Yes | Yes) = 0
 P(Refund=No | No) = 4/7               P(Refund=No | Yes) = 1
 P(Marital Status=Single | No) = 2/7   P(Marital Status=Single | Yes) = 2/7
 P(Marital Status=Divorced | No) = 1/7 P(Marital Status=Divorced | Yes) = 1/7
 P(Marital Status=Married | No) = 4/7  P(Marital Status=Married | Yes) = 0
 For taxable income:
   If class=No: sample mean = 110, sample variance = 2975
   If class=Yes: sample mean = 90, sample variance = 25

 P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No)
                 = 4/7 × 4/7 × 0.0072 = 0.0024
 P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes)
                  = 1 × 0 × 1.2×10^-9 = 0

Since P(X | No) · P(No) > P(X | Yes) · P(Yes), we have P(No | X) > P(Yes | X) => Class = No
Data Analysis
Naïve Bayes Classifier
 If one of the conditional probabilities is zero, then the entire expression becomes zero
 Probability estimation:

   Original:    P(Ai | C) = Nic / Nc
   Laplace:     P(Ai | C) = (Nic + 1) / (Nc + c)
   m-estimate:  P(Ai | C) = (Nic + m·p) / (Nc + m)

   c: number of classes
   p: prior probability
   m: parameter
Data Analysis
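A sketch of the corrected estimates; the m and p values passed to the m-estimate are illustrative assumptions, not values from the slide:

```python
def laplace(n_ic, n_c, n_classes):
    """Laplace-corrected estimate of P(Ai | C), per the slide's formula."""
    return (n_ic + 1) / (n_c + n_classes)

def m_estimate(n_ic, n_c, m, p):
    """m-estimate of P(Ai | C) with prior p and weight m."""
    return (n_ic + m * p) / (n_c + m)

# P(Refund=Yes|Yes) was 0/3; with Laplace (2 classes) it becomes 1/5,
# so a single zero no longer wipes out the whole product.
print(laplace(0, 3, 2))           # 0.2
print(m_estimate(0, 3, 4, 0.25))  # illustrative m and p, not from the slide
```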
Example of Naïve Bayes Classifier
A: attributes

Name          | Give Birth | Can Fly | Live in Water | Have Legs | Class
human         | yes | no  | no        | yes | mammals
python        | no  | no  | no        | no  | non-mammals
salmon        | no  | no  | yes       | no  | non-mammals
whale         | yes | no  | yes       | no  | mammals
frog          | no  | no  | sometimes | yes | non-mammals
komodo        | no  | no  | no        | yes | non-mammals
bat           | yes | yes | no        | yes | mammals
pigeon        | no  | yes | no        | yes | non-mammals
cat           | yes | no  | no        | yes | mammals
leopard shark | yes | no  | yes       | no  | non-mammals
turtle        | no  | no  | sometimes | yes | non-mammals
penguin       | no  | no  | sometimes | yes | non-mammals
porcupine     | yes | no  | no        | yes | mammals
eel           | no  | no  | yes       | no  | non-mammals
salamander    | no  | no  | sometimes | yes | non-mammals
gila monster  | no  | no  | no        | yes | non-mammals
platypus      | no  | no  | no        | yes | mammals
owl           | no  | yes | no        | yes | non-mammals
dolphin       | yes | no  | yes       | no  | mammals
eagle         | no  | yes | no        | yes | non-mammals

Test record A: Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no, Class = ?

M: mammals, N: non-mammals

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

P(A | M) · P(M) = 0.06 × 7/20 = 0.021
P(A | N) · P(N) = 0.0042 × 13/20 = 0.0027

P(A | M) · P(M) > P(A | N) · P(N) => Mammals
Data Mining Tasks...
 Association Rule Discovery [Descriptive]
 Classification [Predictive]
 Regression [Predictive]
 Clustering [Descriptive]
Data Analysis
Regression
 Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
 Greatly studied in statistics and in the neural network field.
 Examples:
   Predicting sales amounts of a new product based on advertising expenditure.
   Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
   Time series prediction of stock market indices.
Data Analysis
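As a hedged illustration of the first example (sales vs. advertising), a least-squares linear fit in Python; the data points are invented:

```python
# Least-squares fit of a linear model: sales ≈ a·advertising + b.
# The numbers below are invented for illustration only.
import numpy as np

advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # expenditure
sales = np.array([2.1, 3.9, 6.2, 8.0, 9.8])        # observed sales

a, b = np.polyfit(advertising, sales, deg=1)       # degree-1 polynomial fit
print(f"sales ≈ {a:.2f} * advertising + {b:.2f}")
print("predicted sales at 6.0:", a * 6.0 + b)
```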
Regression
Data Analysis
Data Mining Tasks...
 Association Rule Discovery [Descriptive]
 Classification [Predictive]
 Regression [Predictive]
 Clustering [Descriptive]
Data Analysis
Clustering Definition
 A clustering is a set of clusters.
 Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
   Data points in one cluster are more similar to one another.
   Data points in separate clusters are less similar to one another.
 Similarity measures:
   Euclidean distance if attributes are continuous.
   Other problem-specific measures.
Data Analysis
Illustrating Clustering
 Intracluster distances are minimized
 Intercluster distances are maximized
[Figure: Euclidean distance based clustering in 3-D space]
Data Analysis
Clusters can be Ambiguous
How many clusters?
[Figure: the same points grouped as two clusters, four clusters, or six clusters]
Data Analysis
Clustering
Data Analysis
Types of Clusterings
 Important distinction between hierarchical, partitional, and density-based sets of clusters:
   Partitional clustering (K-Means): a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
   Hierarchical clustering (Agglomerative): a set of nested clusters organized as a hierarchical tree.
   Density clustering (DBSCAN): clusters are regarded as regions in the data space in which the objects are dense, separated by regions of low object density (noise).
Data Analysis
Partitional Clustering
[Figure: original points and a partitional clustering of them]
Data Analysis
K-Means
 Partitional clustering approach
 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest centroid
 The number of clusters, K, must be specified
 The basic algorithm is very simple
Data Analysis
K-Means
 Initial centroids are often chosen randomly.
   Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the cluster.
 'Closeness' is usually measured by Euclidean distance, cosine similarity, correlation, etc.
 K-means will converge for the common similarity measures mentioned above.
 Most of the convergence happens in the first few iterations.
   Often the stopping condition is changed to 'until relatively few points change clusters'.
Data Analysis
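A minimal sketch of the basic algorithm described above (random initial centroids, assign each point to the closest centroid, recompute centroids as cluster means, repeat); the toy points are invented:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic K-means on a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids chosen randomly
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: closest centroid wins
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # update step: centroid = mean of the points in the cluster
        new_centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: assignments stable
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.1, 0.2), (0.2, 0.1), (0.9, 1.0), (1.0, 0.9), (0.0, 0.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```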
Importance of Choosing Initial Centroids
[Animated figure: cluster assignments of the sample points over iterations 1-6]
Data Analysis
Importance of Choosing Initial Centroids
[Figure: six panels showing the clustering at iterations 1 through 6 as the centroids converge]
Hierarchical Clustering
 Produces a set of nested clusters organized as a hierarchical tree
 Can be visualized as a dendrogram
   A tree-like diagram that records the sequences of merges or splits
[Figure: six points grouped into nested clusters, with the corresponding dendrogram]
Data Analysis
Hierarchical Clustering
[Figure: a traditional hierarchical clustering of points p1-p4 with its traditional dendrogram, and a non-traditional hierarchical clustering of the same points with its non-traditional dendrogram]
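A naive agglomerative (single-linkage) sketch that produces the merge sequence a dendrogram records; the points are invented and no plotting library is used:

```python
import math

def single_linkage(a, b):
    """Distance between the closest members of clusters a and b."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, target=1):
    """Start with singleton clusters and repeatedly merge the two
    closest ones (single linkage) until `target` clusters remain."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > target:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))  # the dendrogram's record
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

pts = [(0, 0), (0, 1), (4, 0), (4, 1), (2, 5)]
print(agglomerate(pts, target=2)[0])
```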
DBSCAN
[Figure: original points and the clusters found by DBSCAN]
 • Resistant to noise
 • Can handle clusters of different shapes and sizes
Data Analysis
Data Mining: Assessment
Data Analysis
Assessment
 Supervised:
   Validation algorithms
   Metrics
 Unsupervised:
   Validation algorithms
   Metrics
Data Analysis
Supervised validation alg.
 Resubstitution
Data Analysis
Supervised validation alg.
 Hold-out
Data Analysis
Supervised validation alg.
 N-fold cross validation
Data Analysis
Supervised validation alg.
 Leave-one-out (N max folds)
   N-fold cross validation when N = dim(Data)
Data Analysis
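A sketch of how the N folds can be generated; with n_folds equal to the number of samples this reduces to leave-one-out, as the slide notes:

```python
def n_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) for N-fold cross validation.
    With n_folds == n_samples this degenerates to leave-one-out."""
    indices = list(range(n_samples))
    fold_size, rem = divmod(n_samples, n_folds)
    start = 0
    for f in range(n_folds):
        size = fold_size + (1 if f < rem else 0)  # spread the remainder
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test

for train, test in n_fold_splits(6, 3):
    print("train:", train, "test:", test)
```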
Supervised validation alg.
 0.632
Bootstrap
Clasificación Supervisada
Supervised metrics
 Calibration
   Distance between the real class and the predicted class.
   Continuous, in [0, ∞)
 Discrimination
   Probability of correct classification
   Continuous, in [0, 1]
 In classification, we want to get the lowest calibration possible and the highest discrimination possible.
Data Analysis
Supervised metrics
 Example:
   Real class: 1
   Predicted class: 0.6 (using regression)
   Discrimination: 1, supposing that if Class_predicted > 0.5 then Class_predicted = 1
   Calibration: 0.4 (1 - 0.6)
Data Analysis
Supervised metrics
 Accuracy (well classified) [Discrimination]
 Log Likelihood [Calibration]
 AUC [Discrimination]
 Brier Score [Calibration + Discrimination]
 …
Hosmer DW, Lemeshow S (2000) Applied Logistic Regression, 2nd edn. Wiley, New York
Data Analysis
AUC
 Area Under the ROC Curve
   Continuous, in [0, 1]
Data Analysis
Unsupervised validation
Data Analysis
Unsupervised alg.
 Compactness: the members of each cluster should be as close to each other as possible. A common measure of compactness is the variance, which should be minimized.
 Separation: the clusters themselves should be widely spaced. There are three common approaches to measuring the distance between two different clusters:
   Single linkage: measures the distance between the closest members of the clusters.
   Complete linkage: measures the distance between the most distant members.
   Comparison of centroids: measures the distance between the centers of the clusters.
Maria Halkidi, Yannis Batistakis and Michalis Vazirgiannis, On Clustering Validation Techniques, Journal of IIS, 2001
Data Analysis
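A sketch of compactness and the three separation measures in Python; the two toy clusters are invented:

```python
import math

def compactness(cluster):
    """Mean squared distance to the cluster centroid (variance-style)."""
    centroid = tuple(sum(xs) / len(xs) for xs in zip(*cluster))
    return sum(math.dist(p, centroid) ** 2 for p in cluster) / len(cluster)

def separation(a, b, mode="single"):
    """Distance between clusters a and b under the three approaches."""
    if mode == "single":    # closest members
        return min(math.dist(p, q) for p in a for q in b)
    if mode == "complete":  # most distant members
        return max(math.dist(p, q) for p in a for q in b)
    # comparison of centroids
    ca = tuple(sum(xs) / len(xs) for xs in zip(*a))
    cb = tuple(sum(xs) / len(xs) for xs in zip(*b))
    return math.dist(ca, cb)

a = [(0, 0), (0, 1), (1, 0)]
b = [(5, 5), (6, 5), (5, 6)]
print(compactness(a), separation(a, b, "single"), separation(a, b, "centroid"))
```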
Measures of Cluster Validity
 Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
   External index: used to measure the extent to which cluster labels match externally supplied class labels.
     Entropy
   Internal index: used to measure the goodness of a clustering structure without respect to external information.
     Sum of Squared Error (SSE)
   Relative index: used to compare two different clusterings.
     Often an external or internal index is used for this function, e.g., SSE or entropy
Maria Halkidi, Yannis Batistakis and Michalis Vazirgiannis, On Clustering Validation Techniques, Journal of IIS, 2001
Data Analysis
Using Similarity Matrix for Cluster Validation
 Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: well-separated clusters in the scatter plot produce sharp blocks in the sorted similarity matrix (Complete Link)]
Data Analysis
Using Similarity Matrix for Cluster Validation
 Clusters in random data are not so crisp
[Figure: random points and their sorted similarity matrix, with much fuzzier blocks (Complete Link)]
Data Analysis
Using Similarity Matrix for Cluster Validation
[Figure: DBSCAN clusters (1-7) and the corresponding block structure in the similarity matrix]
Data Analysis
Data Analysis
Santiago González
<[email protected]>