Biomedical Information Systems (Sistemas de Información Biomédica)
Degree in Biomedical Engineering (Grado en Ingeniería Biomédica)
Universidad Politécnica de Madrid
KDD & Data Mining
Prof. David Pérez del Rey
[email protected]
School of Computer Science - UPM
Room 2104
Tel: +34 91 336 74 45
Outline – 2 + 1 hours

KDD and DM – 1 hour (first day)
 KDD and Data Mining
 Simple Examples
 Biomedical Applications
 Further resources

DM Exercises and Assignment selection – 1 hour (first day)
 Groups
 Assignment
 Start working…

Presentations – up to 1 hour (in 2 weeks)
 A 10-15 minute presentation per group
Motivation – Data growth

Nowadays the amount of data available is increasing dramatically:
 Banking, telecom…
 Astronomy, biology, medicine…
 Web, text, e-commerce…

Only a small fraction of this data is ever examined by humans
Motivation – Data growth in Biomedicine
More information = better decisions?

Having more data available does not by itself mean more knowledge to support decisions

We need automatic Knowledge Discovery in Databases (KDD) methods
What is (not) Data Mining?

What is not Data Mining?
 – Look up a phone number in a phone directory
 – Query a Web search engine for information about “Amazon”

What is Data Mining?
 – Discover that certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
 – Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

© Tan,Steinbach, Kumar
Introduction to Data Mining
Knowledge Discovery Definition

Knowledge Discovery in Data is the non-trivial process of identifying
 valid
 novel
 potentially useful
 and ultimately understandable patterns in data

from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
Phases of the KDD Process (Fayyad et al., 1996)

“Discovery process of non-trivial and useful knowledge”

[Figure: the KDD process chain]
Original Data → (Selection) → Target Data → (Data Cleaning & Integration) → Integrated Data → (Data Reduction & Transformation) → Transformed Data → (Data Mining) → Patterns → (Interpretation) → Knowledge
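To make the chain concrete, here is a minimal sketch of these phases in Python with pandas and scikit-learn; the file name patients.csv and the column names are assumptions made purely for illustration, not course data.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

original = pd.read_csv("patients.csv")                        # original data (hypothetical file)
target = original[["age", "glucose", "bmi"]]                  # selection -> target data
clean = target.dropna()                                       # data cleaning
transformed = StandardScaler().fit_transform(clean)           # transformation / reduction
patterns = KMeans(n_clusters=3, n_init=10).fit(transformed)   # data mining -> patterns
print(patterns.cluster_centers_)                              # interpretation -> knowledge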
Data Mining Tasks

 Prediction Methods
  Use some variables to predict unknown or future values of other variables
 Description Methods
  Find human-interpretable patterns that describe the data

From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
Related Fields

[Figure: Data Mining and Knowledge Discovery at the intersection of Machine Learning, Statistics, Visualization, and Databases & Integration]
BIG Data!!!

Big data = collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications

Traditional technologies have limitations in:
 Analysis
 Capture
 Curation
 Search
 Sharing
 Storage
 Transfer
 Visualization
 Privacy violations
 …

Example domains:
 Internet search
 Physics simulations
 Meteorology
 Genomics
 Finance
 …
Types of Attributes

There are different types of attributes
 Nominal
○ Examples: ID numbers, eye color, zip codes
 Ordinal
○ Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
 Interval
○ Examples: calendar dates, temperatures in Celsius or Fahrenheit
 Ratio
○ Examples: temperature in Kelvin, length, time, counts
© Tan,Steinbach, Kumar
Introduction to Data Mining
Properties of Attribute Values

The type of an attribute depends on which of the following properties it possesses:
 Distinctness: = ≠
 Order: < >
 Addition: + −
 Multiplication: * /

 Nominal attribute: distinctness
 Ordinal attribute: distinctness & order
 Interval attribute: distinctness, order & addition
 Ratio attribute: all 4 properties
© Tan,Steinbach, Kumar
Introduction to Data Mining
Discrete and Continuous Attributes

Discrete Attribute
 Has only a finite or countably infinite set of values
 Examples: zip codes, counts, or the set of words in a
collection of documents
 Often represented as integer variables
 Note: binary attributes are a special case of discrete
attributes

Continuous Attribute
 Has real numbers as attribute values
 Examples: temperature, height, or weight
 Practically, real values can only be measured and
represented using a finite number of digits
 Continuous attributes are typically represented as floating-point variables
© Tan,Steinbach, Kumar
Introduction to Data Mining
Missing Values

Reasons for missing values
 Information is not collected
(e.g., people decline to give their age and weight)
 Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

Handling missing values
 Eliminate data objects
 Estimate missing values
 Ignore the missing value during analysis
 Replace with all possible values (weighted by their probabilities)
© Tan,Steinbach, Kumar
Introduction to Data Mining
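As a hedged illustration of the first two handling strategies listed above, the following Python/pandas sketch uses a made-up two-column DataFrame (the values are assumptions, not course data):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25.0, np.nan, 40.0, 31.0],
                   "weight": [70.0, 82.5, np.nan, 64.0]})

dropped   = df.dropna()                              # eliminate data objects with missing values
estimated = df.fillna(df.mean(numeric_only=True))    # estimate missing values (mean imputation)
print(dropped)
print(estimated)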
Duplicate Data

Data set may include data objects that are
duplicates, or almost duplicates of one another
 Major issue when merging data from heterogeneous sources

Examples:
 Same person with multiple email addresses

Data cleaning
 Process of dealing with duplicate data issues
© Tan,Steinbach, Kumar
Introduction to Data Mining
Previous Phases to Data Mining

High-quality data preparation is key to producing valid and reliable models:
 Data Understanding
 Integration
 Sampling
 Dimensionality Reduction
 Feature subset selection
 Feature creation
 Discretization and Binarization
 Attribute Transformation
 …
Data Mining

 Unsupervised Techniques
  Cluster Analysis, Principal Components
  Association Rules, Collaborative Filtering

 Supervised Techniques
  Prediction (Estimation):
○ Regression, Regression Trees, k-Nearest Neighbors
  Classification:
○ k-Nearest Neighbors, Naïve Bayes, Classification Trees, Logistic Regression, Neural Nets
Clustering Definition

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
 Data points in one cluster are more similar to one another
 Data points in separate clusters are less similar to one another

Similarity Measures:
 Euclidean Distance if attributes are continuous
 Other problem-specific measures
© Tan,Steinbach, Kumar
Introduction to Data Mining
Illustrating Clustering

[Figure: Euclidean-distance-based clustering in 3-D space — intracluster distances are minimized, intercluster distances are maximized]
© Tan,Steinbach, Kumar
Introduction to Data Mining
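A small sketch of Euclidean-distance clustering like the one illustrated above, using scikit-learn's k-means on synthetic 3-D points (the generated blobs and the choice k = 3 are assumptions for illustration):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 3))
                    for c in ([0, 0, 0], [5, 5, 5], [0, 5, 0])])    # three 3-D blobs

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)    # Euclidean-distance-based
print(km.cluster_centers_)   # points in a cluster lie close to its center (small intracluster distance)
print(km.labels_[:10])       # cluster assignment of the first points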
Clustering: Application 1

Market Segmentation:
 Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix
 Approach:
○ Collect different attributes of customers based on their
geographical and lifestyle related information
○ Find clusters of similar customers
○ Measure the clustering quality by observing buying patterns
of customers in same cluster vs. those from different clusters
© Tan,Steinbach, Kumar
Introduction to Data Mining
Regression

Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
 Widely studied in statistics and in the neural network field
 Examples:
  Predicting sales amounts of a new product based on advertising expenditure
  Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  Time series prediction of stock market indices
© Tan,Steinbach, Kumar
Introduction to Data Mining
Data Mining Tasks: Regression

[Figure: fitting the line y = x + 1 to data; for an input X1 the model predicts Y1’ while the observed value is Y1]
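A minimal sketch of this regression task in Python, fitting a linear model to points generated roughly along y = x + 1 and predicting a value for a new input (the data and the input 4.2 are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(10, dtype=float).reshape(-1, 1)
y = x.ravel() + 1 + rng.normal(scale=0.1, size=10)   # data generated roughly along y = x + 1

model = LinearRegression().fit(x, y)                 # assume a linear model of dependency
print(model.coef_[0], model.intercept_)              # close to slope 1 and intercept 1
print(model.predict([[4.2]]))                        # predicted value Y1' for a new input X1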
Classification: Definition

Given a collection of records (training set)
 Each record contains a set of attributes; one of the attributes is the class

Find a model for the class attribute as a function of the values of the other attributes

Goal: previously unseen records should be assigned a class as accurately as possible
 A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
© Tan,Steinbach, Kumar
Introduction to Data Mining
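The training/test protocol described above can be sketched as follows in Python; the bundled scikit-learn breast-cancer dataset is used only as a stand-in for the course data (an assumption):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)   # induction: learn the model
y_pred = model.predict(X_test)                                         # deduction: classify unseen records
print(accuracy_score(y_test, y_pred))                                  # accuracy on the test set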
Examples of Classification Task

 Predicting relapse of cancer
 Classifying credit card transactions as legitimate or fraudulent
 Classifying structures of proteins
 Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification: Application

Direct Marketing
 Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product
 Approach:
○ Use the data for a similar product introduced before
○ We know which customers decided to buy and which
decided otherwise - This {buy, don’t buy} decision forms
the class attribute.
○ Collect various demographic, lifestyle, and company-interaction related information about all such customers
 Type of business, where they stay, how much they earn, etc.
○ Use this information as input attributes to learn a classifier
model
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application

Sky Survey Cataloging
 Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the
telescopic survey images (from Palomar
Observatory).
 3000 images with 23,040 x 23,040 pixels per image.
 Approach:
○ Segment the image
○ Measure image attributes (features) - 40 of them per object
○ Model the class based on these features
○ Success Story: Could find 16 new high red-shift quasars,
some of the farthest objects that are difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies

Courtesy: http://aps.umn.edu

Class:
• Stages of formation: Early, Intermediate, Late

Attributes:
• Image features
• Characteristics of light waves received, etc.

Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Classification: Application

Fraud Detection
 Goal: Predict fraudulent cases in credit card transactions
 Approach:
○ Use credit card transactions and information on the account holder as attributes
 When does a customer buy, what do they buy, how often do they pay on time, etc.
○ Label past transactions as fraud or fair transactions – this forms the class attribute
○ Learn a model for the class of the transactions
○ Use this model to detect fraud by observing credit card transactions on an account
© Tan,Steinbach, Kumar
Introduction to Data Mining
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Induction: a learning algorithm is applied to the training set to learn a model.
Deduction: the model is then applied to the test set to assign a class to each record.
© Tan,Steinbach, Kumar
Introduction to Data Mining
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
 Refund = Yes → NO
 Refund = No:
  MarSt = Married → NO
  MarSt = Single, Divorced:
   TaxInc < 80K → NO
   TaxInc > 80K → YES
Another Example of a Decision Tree

Same training data as before:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree
 MarSt = Married → NO
 MarSt = Single, Divorced:
  Refund = Yes → NO
  Refund = No:
   TaxInc < 80K → NO
   TaxInc > 80K → YES

There could be more than one tree that fits the same data!
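A sketch of inducing a decision tree from the toy data above with scikit-learn; the numeric encoding of Refund and Marital Status is an assumption, and, as the slide notes, the induced tree need not match either tree shown here:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Refund":  [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],                  # Yes=1, No=0
    "Married": [0, 1, 0, 1, 0, 1, 0, 0, 1, 0],                  # Married=1, Single/Divorced=0
    "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],    # taxable income in K
    "Cheat":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

X = data[["Refund", "Married", "Income"]].to_numpy()
tree = DecisionTreeClassifier(random_state=0).fit(X, data["Cheat"])
print(export_text(tree, feature_names=["Refund", "Married", "Income"]))

# Classify the record used later on the slides (Refund=No, Married, 80K);
# the slides' trees assign "No", a differently induced tree may disagree.
print(tree.predict([[0, 1, 80]]))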
Decision Tree Classification Task

Same training and test sets as in the Illustrating Classification Task slide.
Induction: a tree induction algorithm is applied to the training set to learn a decision tree model.
Deduction: the decision tree is then applied to the test set to assign a class to each record.
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branches that match the record:
 Refund = No → take the “No” branch
 MarSt = Married → take the “Married” branch, which leads to the leaf NO

Assign Cheat to “No”
Decision Tree Classification Task

Once the decision tree has been learned from the training set (induction), it is applied to the whole test set to assign a class to each record (deduction).
Decision Tree Induction

Many algorithms:
 Hunt’s Algorithm (one of the earliest)
 CART
 ID3, C4.5
 SLIQ, SPRINT
Simple Examples

[A worked example developed over several slides, shown as figures]
From Witten, I.H., and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
Decision tree Example
Let a computer do it for us: WEKA
From Witten, I.H., and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
Model Evaluation

Metrics for Performance Evaluation
 How to evaluate the performance of a model?

Methods for Performance Evaluation
 How to obtain reliable estimates?

Methods for Model Comparison
 How to compare the relative performance among
competing models?
© Tan,Steinbach, Kumar
Introduction to Data Mining
Metrics for Performance Evaluation

Focus on the predictive capability of a model
 Rather than how fast it classifies or builds models, scalability, etc.

Confusion Matrix:
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Metrics for Performance Evaluation…

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a (TP)      b (FN)
CLASS     Class=No     c (FP)      d (TN)

Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
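A short sketch of computing the confusion matrix cells and accuracy with scikit-learn (the label vectors are made up for illustration):

from sklearn.metrics import confusion_matrix, accuracy_score

y_true = ["Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"]
y_pred = ["Yes", "No",  "No", "No", "Yes", "Yes", "No", "Yes"]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["No", "Yes"]).ravel()
print(tp, fn, fp, tn)                    # a, b, c, d in the matrix above
print((tp + tn) / (tp + tn + fp + fn))   # same value as accuracy_score(y_true, y_pred)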
Limitation of Accuracy

Consider a 2-class problem (e.g. Ebola or not)
 Number of Class 0 examples = 999
 Number of Class 1 examples = 1

If model predicts everything to be class 0,
accuracy is 999/1000 = 99.9 %
 Accuracy is misleading because model does not detect
any class 1 example
Cost Matrix

                       PREDICTED CLASS
C(i|j)                 Class=Yes     Class=No
ACTUAL    Class=Yes    C(Yes|Yes)    C(No|Yes)
CLASS     Class=No     C(Yes|No)     C(No|No)

C(i|j): Cost of misclassifying a class j example as class i
Computing Cost of Classification

Cost Matrix:
                    PREDICTED CLASS
C(i|j)              +      -
ACTUAL     +        -1     100
CLASS      -        1      0

Model M1:
                    PREDICTED CLASS
                    +      -
ACTUAL     +        150    40
CLASS      -        60     250
Accuracy = 80%
Cost = 3910

Model M2:
                    PREDICTED CLASS
                    +      -
ACTUAL     +        250    45
CLASS      -        5      200
Accuracy = 90%
Cost = 4255
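The cost computation for model M1 can be reproduced with a few lines of NumPy; the matrices are copied from the slide, with rows for the actual class and columns for the predicted class:

import numpy as np

cost = np.array([[-1, 100],     # actual +: C(+|+), C(-|+)
                 [  1,   0]])   # actual -: C(+|-), C(-|-)
m1   = np.array([[150,  40],
                 [ 60, 250]])   # model M1 counts in the same layout

print((cost * m1).sum())        # 3910, as on the slide
print(np.trace(m1) / m1.sum())  # accuracy 0.80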
Cost vs Accuracy

Count:
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a           b
CLASS     Class=No     c           d

Cost:
                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    p           q
CLASS     Class=No     q           p

Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p

N = a + b + c + d
Accuracy = (a + d) / N
Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N – a – d)
     = q N – (q – p)(a + d)
     = N [q – (q – p) × Accuracy]
Cost-Sensitive Measures

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
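These measures are available directly in scikit-learn; a small sketch with made-up 0/1 labels (an illustrative assumption, not course data):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 0, 1]

print(precision_score(y_true, y_pred))   # p = a / (a + c)
print(recall_score(y_true, y_pred))      # r = a / (a + b)
print(f1_score(y_true, y_pred))          # F = 2rp / (r + p)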
Methods for Performance Evaluation

How to obtain a reliable estimate of performance?

Performance of a model may depend on other
factors besides the learning algorithm:
 Class distribution
 Cost of misclassification
 Size of training and test sets
Learning Curve

A learning curve shows how accuracy changes with varying sample size

Effects of a small sample size:
- Bias in the estimate
- Variance of the estimate
Methods of Estimation

 Holdout
  Reserve 2/3 for training and 1/3 for testing
 Random subsampling
  Repeated holdout
 Cross validation
  Partition data into k disjoint subsets
  k-fold: train on k-1 partitions, test on the remaining one
  Leave-one-out: k = n
 Stratified sampling
  Oversampling vs undersampling
 Bootstrap
  Sampling with replacement
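A sketch comparing a holdout estimate with a 10-fold cross-validation estimate in scikit-learn (the classifier and dataset are placeholder choices, not part of the assignment):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Holdout: reserve 2/3 for training and 1/3 for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
print(DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te))

# 10-fold cross validation: train on 9 partitions, test on the remaining one
print(cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10).mean())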
ROC (Receiver Operating Characteristic)

 Developed in the 1950s for signal detection theory to analyze noisy signals
  Characterizes the trade-off between positive hits and false alarms
 ROC curve plots TP (on the y-axis) against FP (on the x-axis)
 Performance of each classifier is represented as a point on the ROC curve
  Changing the threshold of the algorithm, the sample distribution or the cost matrix changes the location of the point
ROC Curve

(TP, FP):
 (0,0): declare everything to be negative class
 (1,1): declare everything to be positive class
 (1,0): ideal

Diagonal line:
 Random guessing
 Below the diagonal line: prediction is opposite of the true class
Using ROC for Model Comparison

 Neither model consistently outperforms the other
  M1 is better for small FPR
  M2 is better for large FPR

 Area Under the ROC Curve (AUC)
  Ideal: Area = 1
  Random guess: Area = 0.5
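A sketch of computing the ROC points and the area under the curve for one classifier with scikit-learn (dataset and model are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # each threshold gives one (FPR, TPR) point
print(roc_auc_score(y_te, scores))               # area under the curve: 1 ideal, 0.5 random guessing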
Data Mining in Biomedicine

Health Care
 Disease diagnosis
 Drug discovery
 Symptom clustering
 Decision Support Systems
 …

Bioinformatics / Genomics
 Gene expression
 Microarray analysis
○ Many columns (variables) – moderate number of rows (observation units)
 Protein structure prediction
 …

Major challenge: integration of multi-scale data
Example: ALL/AML data

 38 training cases, 34 test cases, ~7,000 genes
 2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
 Use training data to build a diagnostic model

[Figure: expression data for the ALL and AML samples]

Results on test data: 33/34 correct; the 1 error may be a mislabeled case
Protein Structure

 SPIDER Data Mining Project: Scalable, Parallel and Interactive Data Mining and Exploration at RPI
  http://www.cs.rpi.edu/~zaki

From http://www.cs.rpi.edu/~zaki
Resources

References
 Witten, I.H., and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
 Han, J., and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
 Pardalos, P., Boginski, V., and Vazakopoulos, A., Data Mining in Biomedicine, Springer, 2007. (Google Books)
 Wang, J.T.L., Zaki, M.J., Toivonen, H.T.T., et al. (eds.), Data Mining in Bioinformatics, Springer-Verlag, 2004. (Google Books)

Online
 University of Minnesota: http://www-users.cs.umn.edu/~kumar/dmbook/index.php
 University of Regina: http://www2.cs.uregina.ca/~hamilton/courses/831/index.html
 University of Waikato: Weka Software
 University of Ljubljana: Orange Software
Assignment – Weka / Orange

Using the Weka (http://www.cs.waikato.ac.nz/ml/weka/) or Orange (http://orange.biolab.si/) framework

Data mining analysis, comparing performance results for a classification problem:
 breast-cancer.arff – Train a model to predict recurrence of cancer
 At most 3 different classifiers for each dataset (including ZeroR and J48)
 “cross-validation” vs “only training set”
 (optional) Manual missing value preprocessing (delete or estimate)
 (optional) Manual field selection
 (optional) Cost estimate

Data mining analysis, comparing performance results for a regression problem:
 breastTumor.arff – Train a model to predict tumor size
 At most 3 different classifiers for each dataset (including ZeroR and Neural Networks)
 (optional) Investigate evaluation measures

Report
 Length: up to 2-3 pages in .pdf, to [email protected] by 27th October at 12:00
 10-15 minute presentation and discussion on 29th October at 17:30