Data Mining:
Discovering Information
From Bio-Data
Presented by:
Hongli Li & Nianya Liu
University of Massachusetts Lowell
Introduction

- Data Mining Background
  – Process
  – Functionalities
  – Techniques
- Two examples
  – Short Peptide
  – Clinical Records
- Conclusion
Data Mining Background – Process

Data Collection & Selection → Data Cleaning → Data Enrichment → Encoding → Data Mining → Representations
Functionalities

- Classification
- Cluster Analysis
- Outlier Analysis
- Trend Analysis
- Association Analysis
Techniques

- Decision Tree
- Bayesian Classification
- Hidden Markov Models
- Support Vector Machines
- Artificial Neural Networks
Technique 1 – Decision Tree

[Decision tree diagram: the root node tests A1 > a; its Yes/No branches lead to tests A2 > b and A2 > c; the four leaves are Class 1, Class 2, Class 3, and Class 4.]
Technique 2 – Bayesian Classification

- Based on Bayes' theorem:
  P(H | X) = P(X | H) P(H) / P(X)
- Simple, but comparable to decision tree and neural network classifiers in many applications.
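As a minimal sketch, the theorem can be evaluated directly; all probabilities below are made up for illustration:

```python
def posterior(p_x_given_h, p_h, p_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

# Example: hypothesis H = "instance belongs to class C",
# evidence X = some observed feature (invented numbers).
p = posterior(p_x_given_h=0.8, p_h=0.01, p_x=0.1)
print(p)  # ≈ 0.08
```

A Bayesian classifier evaluates this posterior for each class H and picks the class with the largest value.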
Technique 3 – Hidden Markov Model

[Diagram: a hidden Markov model with Start and End states; hidden states along the paths emit symbols such as y, a, c, c.]
Technique 4 – Support Vector Machine

- SVM finds the maximum-margin hyperplane that separates the classes.
  – The hyperplane can be represented as a linear combination of training points.
  – The algorithm that finds a separating hyperplane in the feature space can be stated entirely in terms of vectors in the input space and dot products in the feature space.
  – So we can locate a separating hyperplane in the feature space and classify points in that space simply by defining a kernel function.
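The kernel idea can be illustrated with a toy check: a quadratic kernel computed purely from the input-space dot product equals a dot product in an explicit degree-2 feature space. The feature map and sample points below are chosen only for illustration:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    # Explicit degree-2 feature map for a 2-D input.
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kernel(x, y):
    # Quadratic kernel, computed without ever building phi(x).
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 0.5)
print(kernel(x, y))            # ≈ 16.0
print(dot(phi(x), phi(y)))     # same value, via the explicit feature space
```

This is why an SVM never needs the feature-space coordinates: every dot product in the feature space is obtained from the kernel function on input-space vectors.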
Example 1 – Short Peptides

- Problem
  – Identify T-cell epitopes from melanoma antigens
  – Training set: 602 HLA-DR4 binding peptides, 713 non-binding peptides
- Solution: neural networks
Neural Networks – Single Computing Element

[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 feed one unit that outputs y = f(net).]

y = f(net) = 1 / (1 + e^(−net)), where net = Σi xi wi
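The single computing element above can be sketched in a few lines; the inputs and weights are made up:

```python
import math

def neuron(xs, ws):
    # net = sum_i x_i * w_i ; y = f(net) = 1 / (1 + e^(-net))
    net = sum(x * w for x, w in zip(xs, ws))
    return 1.0 / (1.0 + math.exp(-net))

# net = 0.2 + 0.2 - 0.1 = 0.3, so y = sigmoid(0.3)
print(neuron([1.0, 0.5, -1.0], [0.2, 0.4, 0.1]))  # ≈ 0.574
```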
Neural Networks – Classifier

[Diagram: inputs X1, X2, X3, X4, X5, …, Xn−1, Xn feed a network with output Y.]

- Sparse coding: each amino acid is a 20-bit vector, e.g. Alanine = 10000000000000000000
- 9 residues × 20 bits = 180 bits per input
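A sketch of the sparse (one-hot) coding, assuming the conventional 20-letter amino-acid alphabet; the alphabet ordering and the 9-residue peptide below are invented for illustration:

```python
# 20 standard amino acids, one-letter codes (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode(peptide):
    """Encode each residue as a 20-bit one-hot vector; concatenate them."""
    bits = []
    for residue in peptide:
        vec = [0] * 20
        vec[AMINO_ACIDS.index(residue)] = 1
        bits.extend(vec)
    return bits

code = encode("AAYLQHIKF")  # a made-up 9-residue peptide
print(len(code))            # 9 * 20 = 180 bits per input
print(code[:20])            # Alanine -> 1 followed by nineteen 0s
```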
Neural Networks – Error Back-Propagation

[Diagram: inputs x1, x2, x3 feed a first layer through weights v1,1 … v3,2; the first-layer outputs z_j feed the output unit through weights w1, w2, producing y.]

Output: y = f(Σj wj f(Σi xi vij))
Squared error: E = (t − y)² / 2
Adjustment:
Δwj = −η ∂E/∂wj = η (δ)(y)(1 − y)(zj)
Δvij = η (δ)(y)(1 − y)(wj)(zj)(1 − zj)(xi)
where η is a fixed learning rate, zj is the output of the j-th computing element of the first layer, and δ is the difference between the correct output t and the output y.
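The update rules above can be sketched for one hidden layer trained on a single made-up example; the network sizes, inputs, target, and learning rate are all illustrative:

```python
import math
import random

def f(net):
    # Logistic activation, as in the single-element slide.
    return 1.0 / (1.0 + math.exp(-net))

def forward(x, V, w):
    # z_j = f(sum_i x_i v_ij); y = f(sum_j w_j z_j)
    z = [f(sum(x[i] * V[i][j] for i in range(len(x)))) for j in range(len(w))]
    y = f(sum(w[j] * z[j] for j in range(len(w))))
    return z, y

def train_step(x, t, V, w, eta=0.5):
    z, y = forward(x, V, w)
    delta = t - y                         # delta = t - y
    grad_out = delta * y * (1 - y)        # shared factor delta * y(1-y)
    for j in range(len(w)):
        # Delta v_ij = eta * delta * y(1-y) * w_j * z_j(1-z_j) * x_i
        for i in range(len(x)):
            V[i][j] += eta * grad_out * w[j] * z[j] * (1 - z[j]) * x[i]
        # Delta w_j = eta * delta * y(1-y) * z_j
        w[j] += eta * grad_out * z[j]
    return (t - y) ** 2 / 2               # squared error before this update

random.seed(0)
V = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
w = [random.uniform(-1, 1) for _ in range(2)]
errors = [train_step([1.0, 0.5], 1.0, V, w) for _ in range(200)]
print(errors[0] > errors[-1])  # the squared error shrinks as training proceeds
```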
Result & Remarks

- Success rate: 60%
- A systematic experimental study is very expensive
- A highly accurate prediction method can reduce the cost
- Other alternatives exist
Data Mining: Discovering Information from Clinical Records

Problem

- Problem: use already known data (clinical records) → predict unknown data
- How to analyze the known data? → training data
- How to test unknown data? → prediction data
Problem

- The data has many attributes.
  Example: 2300 combinations of attributes with 8 attributes for one class.
- It is impossible to calculate all of them manually.
Problem

- One example: eight attributes for diabetic patients:
  (1) Number of times pregnant
  (2) Plasma glucose
  (3) Diastolic blood pressure
  (4) Triceps skin fold thickness
  (5) Two-hour serum insulin
  (6) Body mass index
  (7) Diabetes pedigree
  (8) Age
CAEP – Classification by Aggregating Emerging Patterns

A classification (known data) and prediction (unknown data) algorithm.

- Definition:
  (1) Training data → discover all the emerging patterns.
  (2) Training data → sum and normalize the differentiating weight of these emerging patterns.
  (3) Training data → choose the class with the largest normalized score as the winner.
  (4) Test data → compute the score of the test instance and make a prediction.
CAEP: Emerging Pattern

- Emerging pattern
  Definition: an emerging pattern is a pattern over some attributes whose frequency increases significantly from one class to another.
  Example (mushrooms):

  Attribute     Poisonous   Edible
  Smell         odor        none
  Surface       wrinkled    smooth
  Ring number   1           3
CAEP: Classification

Definition:
(1) Discover the factors that differentiate the two groups.
(2) Find a way to use these factors to predict to which group a new patient should belong.
CAEP: Method

Method:
- Discretize the dataset into a binary one.
  – item: an (attribute, interval) pair, e.g. (age, >45)
  – instance t: a set of items such that an item (A, v) is in t if and only if the value of attribute A of t is within the interval v
- Clinical records:
  – 768 women
  – 21% diabetic instances: 161
  – 71% non-diabetic instances: 546
CAEP: Support

Support of X (an item set)
Definition: the ratio of the number of instances in a class that contain X to the total number of instances in that class.
Formula: suppD(X) = |{t ∈ D | X ⊆ t}| / |D|
Meaning: if suppD(X) is high, the item set X occurs in many instances of this class.
Example: how many people in the diabetic class are older than 60? (item: age > 60)
148/161 = 91%
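The support formula can be checked on a toy set of instances; all items and values below are invented for illustration:

```python
def supp(X, D):
    """supp_D(X) = |{t in D : X subset of t}| / |D|."""
    return sum(1 for t in D if X <= t) / len(D)

# Each instance is a set of (attribute, interval) items.
diabetic = [
    {("age", ">60"), ("glucose", ">140")},
    {("age", ">60")},
    {("age", ">60"), ("bmi", ">30")},
    {("glucose", ">140")},
]
print(supp({("age", ">60")}, diabetic))  # 3 of 4 instances -> 0.75
```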
CAEP: Growth Rate

The growth rate of X
Definition: the ratio of the supports of the same item set in two classes.
Formula: growthD(X) = suppD(X) / suppD'(X)
Meaning: if growthD(X) is high, X is far more likely to occur in class D than in class D'.
Example: 91% of the diabetic class is older than 60, versus 10% of the non-diabetic class:
growth(>60) = 91% / 10% ≈ 9
CAEP: Likelihood

likelihoodD(X)
Definition: the ratio of the number of instances containing X in one class to the number of instances containing X in both classes.
Formula 1: likelihoodD(X) = suppD(X) · |D| / (suppD(X) · |D| + suppD'(X) · |D'|)
Formula 2 (if D and D' are roughly equal in size): likelihoodD(X) = suppD(X) / (suppD(X) + suppD'(X))
Example (Formula 1): (91% · 223) / (91% · 223 + 10% · 545) = 203 / 257 ≈ 79%
Example (Formula 2): 91% / (91% + 10%) = 91% / 101% ≈ 90.1%
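Growth rate and likelihood follow directly from the two classes' supports; the sketch below reuses the slide's >60 percentages and class sizes:

```python
def growth(supp_d, supp_d2):
    """growth_D(X) = supp_D(X) / supp_D'(X)."""
    return supp_d / supp_d2

def likelihood(supp_d, n_d, supp_d2, n_d2):
    """Formula 1: supports weighted by the class sizes |D| and |D'|."""
    return supp_d * n_d / (supp_d * n_d + supp_d2 * n_d2)

print(growth(0.91, 0.10))                # ≈ 9.1
print(likelihood(0.91, 223, 0.10, 545))  # ≈ 0.79
```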
CAEP: Evaluation

Sensitivity: the ratio of the number of correctly predicted diabetic instances to the number of diabetic instances.
Example: 60 correctly predicted / 100 diabetic = 60%
Specificity: the ratio of the number of correctly predicted diabetic instances to the number of instances predicted diabetic.
Example: 60 correctly predicted / 120 predicted = 50%
Accuracy: the percentage of instances correctly classified.
Example: 60 correctly predicted / 180 = 33%
CAEP: Evaluation

- Using one attribute for class prediction:
  high accuracy, but low sensitivity: it identifies only 30% of the diabetic instances.
CAEP: Prediction

- Consider all attributes: accumulate the scores of all emerging patterns of class D that the instance contains.
  Formula: score(t, D) = Σ(X ⊆ t) likelihoodD(X) · suppD(X)
- Prediction: score(t, D) > score(t, D') → t belongs to class D.
CAEP: Normalize

- If the numbers of emerging patterns differ significantly between classes (one class D has more emerging patterns than another class D'), instances tend to receive higher scores for D than for D'.
- score(t, D) = Σ(X ⊆ t) likelihoodD(X) · suppD(X)
- Normalize the score: norm_score(t, D) = score(t, D) / base_score(D)
- Prediction: if norm_score(t, D) > norm_score(t, D') → t belongs to class D.
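A sketch of the aggregate scoring and normalized prediction described above. The pattern lists, likelihoods, supports, and base scores below are all invented, and base_score stands in for whatever per-class baseline (e.g. a median training score) is used:

```python
def score(t, patterns):
    """score(t, D): sum likelihood * support over patterns of D contained in t.

    patterns: list of (item_set, likelihood, support) triples.
    """
    return sum(lik * sup for items, lik, sup in patterns if items <= t)

def predict(t, eps_by_class, base):
    """Pick the class with the largest normalized score."""
    norm = {c: score(t, eps) / base[c] for c, eps in eps_by_class.items()}
    return max(norm, key=norm.get)

# Hypothetical emerging patterns per class (all numbers made up).
eps_by_class = {
    "diabetic":     [({("age", ">60")}, 0.79, 0.91), ({("bmi", ">30")}, 0.70, 0.50)],
    "non-diabetic": [({("age", "<=60")}, 0.85, 0.80)],
}
base = {"diabetic": 1.2, "non-diabetic": 0.8}  # hypothetical base scores
t = {("age", ">60"), ("bmi", ">30"), ("glucose", ">140")}
print(predict(t, eps_by_class, base))  # "diabetic"
```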
CAEP: Comparison

Comparison with C4.5 and CBA (diabetic / non-diabetic):

Method   Sensitivity      Specificity      Accuracy
C4.5     –                –                71.1%
CBA      –                –                73.0%
CAEP     70.5% / 63.3%    77.4% / 83.1%    75%
CAEP: Modify

- Problem: CAEP produces a very large number of emerging patterns.
  Example: with 8 attributes, 2300 emerging patterns.
CAEP: Modify

- Reduce the number of emerging patterns.
  Method: prefer strong emerging patterns over their weaker relatives.
  Example: X1 has infinite growth but very small support; X2 has less growth but much larger support, say 30 times that of X1. In such a case X2 is preferred because it covers many more cases than X1.
- There is no loss in prediction performance from this reduction of emerging patterns.
CAEP: Variations

- JEP: uses exclusively emerging patterns whose supports increase from zero to nonzero, called jump emerging patterns. Performs well when there are many jump emerging patterns.
- DeEP: an instance-based variant whose training phase is customized for each test instance. Slightly better accuracy, and it incorporates new training data easily.
Relevance Analysis

- Data mining algorithms are in general exponential in complexity.
- Relevance analysis: exclude the attributes that do not contribute to the classification process.
- Makes it possible to deal with much higher-dimensional datasets.
- Not always useful for lower-ranking dimensions.
Conclusion

- Classification and prediction aspects of data mining.
- Methods include decision trees, mathematical formulas, artificial neural networks, and emerging patterns.
- They are applicable in a large variety of classification applications.
- CAEP has good predictive accuracy on all data sets.