Medical Data Mining
Carlos Ordonez
University of Houston
Department of Computer Science
Outline
• Motivation
• Main data mining techniques:
– Constrained Association Rules
– OLAP Exploration and Analysis
• Other classical techniques:
– Linear Regression
– PCA
– Naïve Bayes
– K-Means
– Bayesian Classifier
2/45
Motivation: why inside a DBMS?
• DBMSs offer a level of security unavailable with flat files.
• Databases have built-in features that optimize extraction and simple analysis of data sets.
• We can increase the complexity of these analysis methods while keeping the benefits offered by the DBMS.
• We can analyze large amounts of data efficiently.
3/45
Our approach
• Avoid exporting data outside the DBMS
• Exploit SQL and UDFs
• Accelerate computations with query optimization and by pushing processing into main memory (a minimal example follows)
4/45
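To make the idea concrete, here is a minimal sketch (not code from the talk) of pushing analysis into the DBMS: a single aggregation query returns the sufficient statistics for the mean and variance of a column, so only a few numbers leave the database instead of the whole table. The table patients and column age are hypothetical.

```sql
-- Hypothetical table: patients(age FLOAT).
-- One scan returns sufficient statistics; no rows are exported.
SELECT COUNT(*)  AS n,
       AVG(age)  AS mean_age,
       -- population variance as E[X^2] - E[X]^2
       SUM(age * age) / COUNT(*) - AVG(age) * AVG(age) AS var_age
FROM patients;
```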
Constrained Association Rules
• Association rules: a technique for identifying patterns in data sets using confidence
• Looks for relationships between the variables
• Detects groups of items that frequently occur together in a given data set
• Rules have the form X => Y: the set of items X is often found in conjunction with the set of items Y
5/45
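As an illustration of these definitions (not the authors' code), the support and confidence of a single rule X => Y can be computed directly with SQL counts. The table items(tid, item) and the items 'A' and 'B' are hypothetical.

```sql
-- Hypothetical table: items(tid INT, item VARCHAR), one row per item
-- per transaction. Support and confidence of the rule {A} => {B}.
WITH t  AS (SELECT COUNT(DISTINCT tid) AS n   FROM items),
     x  AS (SELECT COUNT(DISTINCT tid) AS cnt FROM items WHERE item = 'A'),
     xy AS (SELECT COUNT(DISTINCT i1.tid) AS cnt
            FROM items i1 JOIN items i2 ON i1.tid = i2.tid
            WHERE i1.item = 'A' AND i2.item = 'B')
SELECT xy.cnt * 1.0 / t.n   AS support,     -- P(X and Y)
       xy.cnt * 1.0 / x.cnt AS confidence   -- P(Y | X)
FROM t, x, xy;
```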
The Constraints
• Group Constraint
  – Determines which variables can occur together in the final rules
• Item Constraint
  – Determines which variables will be used in the study
  – Allows the user to ignore some variables
• Antecedent / Consequent Constraint
  – Determines the side of the rule on which a variable can appear
6/45
Experiment
• Input data set: p=25, n=655
• Three types of attributes:
  – P: perfusion measurements
  – R: risk factors
  – D: heart disease measurements
7/45
Experiments
• This table summarizes the impact of the constraints on the number of patterns and on the running time.
8/45
Experiments
• This figure shows groups of rules predicting no heart disease.
9/45
Experiments
• This figure shows groups of rules predicting heart
disease.
10/45
Experiments
• These figures show selected cover rules, predicting the absence or presence of disease.
11/45
OLAP Exploration and Analysis
• Definition:
  – Input table F with n records
  – Cube dimensions: D = {D1, D2, …, Dd}
  – Measure dimensions: A = {A1, A2, …, Ae}
  – In OLAP processing, the basic idea is to compute aggregations on a measure Ai by subsets of dimensions G, G ⊆ D.
12/45
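In SQL terms, aggregating a measure by every subset G of the cube dimensions is what GROUP BY CUBE computes. The sketch below is only meant to illustrate the definition; it assumes a hypothetical fact table f(d1, d2, d3, a1) and a DBMS that supports the SQL:1999 CUBE operator.

```sql
-- Hypothetical fact table: f(d1 INT, d2 INT, d3 INT, a1 FLOAT).
-- CUBE aggregates the measure a1 over every subset G of {d1,d2,d3},
-- from the full group-by down to the grand total.
SELECT d1, d2, d3,
       COUNT(*) AS n,
       AVG(a1)  AS avg_a1
FROM f
GROUP BY CUBE (d1, d2, d3);
```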
OLAP Exploration and Analysis
• Example:
  – Cube with three dimensions (D1, D2, D3)
  – Each face represents a subcube on two dimensions
  – Each cell represents a subcube on one dimension
13/45
OLAP Statistical Tests
• We proposed the use of statistical tests on pairs of OLAP subcubes to analyze their relationship
• Statistical tests allow us to show mathematically that a pair of subcubes are significantly different from each other
14/45
OLAP Statistical Tests
• The null hypothesis H0 states μ1 = μ2, and the goal is to find groups where H0 can be rejected with high confidence 1−p.
• The so-called alternative hypothesis H1 states μ1 ≠ μ2.
• We use a two-tailed test, which allows finding a significant difference on both tails of the Gaussian distribution, in order to compare means in either order (μ1 ≥ μ2 or μ2 ≥ μ1).
• The test relies on the following equation to compute the random variable z:
z = (μ1 − μ2) / √(σ1²/n1 + σ2²/n2)
15/45
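The z statistic above can itself be assembled inside the DBMS. The following sketch is an illustration under assumed names, not the authors' code: a hypothetical column grp in a fact table f(grp, a1) identifies the two subcubes being compared.

```sql
-- Hypothetical table: f(grp INT, a1 FLOAT); grp = 1, 2 marks the two
-- subcubes. One scan yields n, mean, and variance per group; the
-- outer query assembles z = |mu1 - mu2| / sqrt(s1/n1 + s2/n2).
WITH s AS (
  SELECT grp,
         COUNT(*)                                    AS n,
         AVG(a1)                                     AS mu,
         SUM(a1 * a1) / COUNT(*) - AVG(a1) * AVG(a1) AS sig2
  FROM f
  WHERE grp IN (1, 2)
  GROUP BY grp
)
SELECT ABS(s1.mu - s2.mu) / SQRT(s1.sig2 / s1.n + s2.sig2 / s2.n) AS z
FROM s s1, s s2
WHERE s1.grp = 1 AND s2.grp = 2;
```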
Experiments
• n = 655
• d = 21
• e = 4
• Includes patient information, habits, and perfusion measurements as dimensions
• Measures are the stenosis, or amount of narrowing, of the four main arteries of the human heart
16/45
Experiment Evaluation
• Heart data set: Group pairs with significant
measure differences at p=0.01
17/45
Experiment Evaluation
• Summary of medical results at p=0.01
• The most important dimensions are OLDYN, SEX, and SMOKE.
18/45
Comparing Reliability of OLAP Statistical
Tests and Association Rules
• Both techniques were adapted to put them on the same plane for comparison
  – Association rules: added post-processing to pair rules
  – OLAP statistical tests: added constraints
• Cases under study
  – Association rules (HH): both rules have high confidence
    • AdmissionAfterOpen(1), AorticDiagnosis(0/1) => NetMargin(0/1)
    • High confidence, but also high p-value
    • Data is crowded around the AR boundary point
19/45
Comparing Reliability of OLAP Statistical
Tests and Association Rules
• Association Rules: High/High
  – We can see that the data is crowded around the boundary point for association rules
  – The two Gaussians are not significantly different
  – Conclusion: both techniques agree; OLAP statistical tests are more reliable
20/45
Comparing Reliability of OLAP Statistical
Tests and Association Rules
• Association Rules: Low/Low
  – Once again the boundary point comes into play
  – The two Gaussians are not significantly different
  – Conclusion: both techniques agree
21/45
Comparing Reliability of OLAP Statistical
Tests and Association Rules
• Association Rules: High/Low
– Ambiguous
22/45
Results from TMHS dataset
• Mainly a financial data set
  – Revolves around the opening of a new medical center for treating heart patients
• Results from association rules
  – Found 4051 rules with confidence >= 0.7 and support >= 5%
  – AfterOpen=1, Elder=1 => Low Charges
    • After the center opened, the elderly enjoyed low charges
  – AfterOpen=0, Elder=1 => High Charges
    • Before the center opened, the elderly were associated with high charges
• Results from OLAP statistical tests
  – Found 1761 pairs with p-value < 0.01 and support >= 5%
  – Walk-in, insurance (commercial/medicare) => charges (high/low)
    • The amount of total charges to a patient depends on his/her insurance when the admission source is a walk-in
  – AorticDiagnosis=0, AdmissionSource (Walk-in / Transfer) => lengthOfStay (low / high)
    • If the diagnosis is not aortic disease, then the length of stay depends on how the patient was admitted.
23/45
Machine Learning techniques
• PCA
• Regression: linear and logistic
• Naïve Bayes
• Bayesian classification
24/45
Principal Component Analysis
• Dimensionality reduction technique for high-dimensional data (e.g., microarray data).
• Exploratory data analysis: finds hidden relationships between attributes.
• Assumptions:
  – Linearity of the data.
  – Statistical importance of mean and covariance.
  – Large variances have important dynamics.
25/45
Principal Component Analysis
• Rotation of the input space to eliminate redundancy.
• Most variance is preserved.
• Minimal correlation between attributes.
• U^T X is the new rotated space.
• Select the k most representative components of U (k < d).
• Solving PCA is equivalent to solving the SVD, defined by the eigen-problem:
  X = U E V^T
  X X^T = U E² U^T
  where U holds the left eigenvectors, E the eigenvalues, and V the right eigenvectors.
26/45
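Since X X^T = U E² U^T, the heavy part of PCA reduces to a small d x d covariance (or correlation) matrix that can be gathered in one scan with SQL aggregates; the eigen-decomposition is then solved on that summary rather than on the n x d data set. The sketch below is illustrative only and assumes a hypothetical table x(x1, x2) with d = 2.

```sql
-- Hypothetical table: x(x1 FLOAT, x2 FLOAT); d = 2 for brevity.
-- One scan returns the covariance matrix entries; the d x d
-- eigen-problem is then solved on this tiny summary, not on X.
SELECT COUNT(*)                                    AS n,
       SUM(x1 * x1) / COUNT(*) - AVG(x1) * AVG(x1) AS c11,
       SUM(x1 * x2) / COUNT(*) - AVG(x1) * AVG(x2) AS c12,
       SUM(x2 * x2) / COUNT(*) - AVG(x2) * AVG(x2) AS c22
FROM x;
```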
PCA Example
[Flattened table omitted: loadings of the first eight principal components U1–U8 for a thyroid data set; attributes include age, gender, on_thyroxine, query_thyroxine, on_antithyroid_med, sick, pregnant, surgery, I131_treatment, query_hypothyroid, query_hyperthyroid, lithium, goitre, tumor, hypopituitary, and psych. The individual loading values are garbled in this transcript.]
27/45
PCA Example
[Flattened table omitted: loadings of the first eight principal components U1–U8 for the heart data set; attributes include age, chol, claudi, diab, fhcad, gender, hta, hyplpd, pangio, pcarsur, pstroke, smoke, lad, lcx, lm, and rca. The individual loading values are garbled in this transcript.]
28/45
Linear Regression
• There are two main applications of linear regression:
  – Prediction or forecasting of the output or variable of interest Y
    • Fit a model from the observed Y and the input variables X.
    • For values of X given without an accompanying value of Y, the model can be used to predict the output of interest Y.
• Given input data X = {x1, x2, …, xn} with d dimensions Xa, and the response or variable of interest Y, linear regression finds a set of coefficients β to model:
  Y = β0 + β1X1 + … + βdXd + ɛ
29/45
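For illustration (assumed table name, single predictor), the least-squares coefficients can be obtained from sufficient statistics gathered in one scan, since β = (X^T X)^{-1} X^T Y; with d = 1 the closed form is a pair of scalars.

```sql
-- Hypothetical table: xy(x1 FLOAT, y FLOAT); simple regression, d = 1.
-- One scan gathers the entries of X^T X and X^T Y; then
-- beta1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2) and beta0 = ybar - beta1*xbar.
WITH s AS (
  SELECT COUNT(*)     AS n,
         SUM(x1)      AS sx,
         SUM(y)       AS sy,
         SUM(x1 * x1) AS sxx,
         SUM(x1 * y)  AS sxy
  FROM xy
)
SELECT (n * sxy - sx * sy) / (n * sxx - sx * sx)                   AS beta1,
       sy / n - (n * sxy - sx * sy) / (n * sxx - sx * sx) * sx / n AS beta0
FROM s;
```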
Linear Regression with SSVS
• Bayesian variable selection
  – Quantifies the strength of the relationship between Y and a number of explanatory variables Xa.
  – Assesses which Xa may have no relevant relationship with Y.
  – Identifies which subsets of the Xa contain redundant information about Y.
• The goal is to find the subset of explanatory variables Xγ that best predicts the output Y, with the regression model Y = βγXγ + ɛ.
• We use Gibbs sampling, an MCMC algorithm, to estimate the probability distribution π(γ|Y,X) of a model to fit the output variable Y.
• Other techniques, such as stepwise variable selection, perform only a partial search for the model that best explains the output variable.
• Stochastic Search Variable Selection finds the best "likely" subsets of variables based on posterior probabilities.
30/45
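The model probabilities reported on the next slides are, in effect, visit frequencies of each model γ across Gibbs iterations. Assuming the sampled γ values were logged to a hypothetical table samples(gamma), they can be tallied with one query; this is a sketch of the idea, not the authors' implementation.

```sql
-- Hypothetical table: samples(gamma VARCHAR), one row per Gibbs
-- iteration after burn-in; gamma encodes the selected variable subset.
-- The estimated posterior pi(gamma | Y, X) is the visit frequency.
SELECT gamma,
       COUNT(*) * 1.0 / (SELECT COUNT(*) FROM samples) AS prob
FROM samples
GROUP BY gamma
ORDER BY prob DESC;
```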
Linear Regression in the DBMS
• Bayesian variable selection is implemented completely inside the DBMS with SQL and UDFs for efficient use of memory and processor resources.
• Our algorithms and storage layouts for tables in the DBMS have a significant impact on execution performance.
• Compared to the statistical package R, our implementations scale to large data sets.
31/45
Linear regression:
Experimental results
Variable-to-index mapping (gamma):

Variable  Gamma    Variable  Gamma    Variable  Gamma
age       1        hyplpd    8        al        15
chol      2        pangio    9        la        16
claudi    3        pcarsur   10       as_       17
diab      4        pstroke   11       sa        18
fhcad     5        smoke     12       li        19
gender    6        il        13       si        20
hta       7        ap        14       is_       21

Parameters (both runs): variables = 21, n = 655, c = 100, it = 10000, burn = 1000.

Y = rca:
Gamma                  Prob      rSquared
0,1,3,8,12,13,16,19    0.012333  0.826227
0,1,3,8,12,13          0.011778  0.838421
0,1,3,6,8,12,13        0.011556  0.832125
0,1,3,6,8,12,13,17     0.010333  0.826885
0,1,3,8,9,12,13,16,19  0.008889  0.821647
0,1,3,6,8,9,12,13      0.008     0.826993
0,1,3,8,12,13,17       0.007222  0.833006
0,1,3,6,8,13,17        0.006889  0.833852
0,1,3,6,8,9,13         0.006778  0.838573
0,1,3,6,8,9,12,13,17   0.006556  0.821839

Y = lad:
Gamma           Prob      rSquared
0,1,14,18       0.061556  0.768594
0,1,13,14,18    0.028556  0.7652
0,1,8,14,18     0.022889  0.765396
0,1,9,14,18     0.014444  0.766478
0,1,6,14,18     0.013222  0.766782
0,1,3,14,18     0.011667  0.767118
0,1,14,16,18    0.010111  0.767645
0,1,14,17,18    0.01      0.767105
0,1,14,18,21    0.008667  0.768276
0,1,8,13,14,18  0.008333  0.762457
32/45
Linear regression:
Experimental results
Cancer microarray data, where the gamma values are gene numbers.
Parameters: d(γ0) = 1, dimensions = 4918, n = 295, iterations = 1000, c = 1, y = Cens.

Gamma                                                             Probability  rSquared
0,3,4,52,99,196,287,1833,1857,2115,2563,2601,3720,3924,4854,4879  0.761239     0.00664
0,3,4,52,99,196,287,1833,1857,2563,2601,3924,4854,4879            0.108891     0.006756
0,3,4,52,99,196,287,1833,1857,2115,2563,2601,3924,4854,4879       0.050949     0.006702
0,3,4,52,99,196,287,1833,3924,4854,4879                           0.041958     0.006771
0,3,4,52,99,196,287,1833,2563,2601,3924,4854,4879                 0.027972     0.006758
0,3,4,52,99,196,287,1833,4854                                     0.002997     0.006836
0,3,4,52,99,196,287,1833,4854,4879                                0.001998     0.006776
0,3,4,52,99,196,287,1833,2601,3924,4854,4879                      0.001998     0.006758
0,3,4,99,196,287,1833,4854                                        0.000999     0.006924
33/45
Logistic Regression
Similar to linear regression, but the data is fitted to a logistic curve. This technique is used to predict the probability of occurrence of an event.

P(Y=1|x) = π(x)
π(x) = 1 / (1 + e^(−g(x))), where g(x) = β0 + β1X1 + β2X2 + … + βdXd
34/45
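Once the coefficients β are known, scoring reduces to evaluating π(x) per record, which a single SELECT can do. The tables coef and x below are hypothetical, with d = 2 for brevity.

```sql
-- Hypothetical tables: coef(b0, b1, b2) holds a fitted model;
-- x(x1, x2) holds the records to score. The query evaluates
-- pi(x) = 1 / (1 + exp(-g(x))), with g(x) = b0 + b1*x1 + b2*x2.
SELECT x.x1, x.x2,
       1.0 / (1.0 + EXP(-(c.b0 + c.b1 * x.x1 + c.b2 * x.x2))) AS p_y1
FROM x CROSS JOIN coef c;
```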
Logistic Regression:
Experimental results
med655
• Train: n = 491, d = 15, y = LAD >= 70%
• Test: n = 164

Model:
Name       Coefficient    Name  Coefficient
Intercept  -2.191237293   LI    -0.090759713
AGE         0.035740648   LA    -0.210152957
SEX         0.40150077    AP     0.600745945
HTA         0.279865571   AS_    0.264413463
DIAB        0.060630279   SA     0.342609744
CHOL        0.001882748   SI     0.04750216
SMOKE       0.31437235    IS_   -0.159692182
AL          0.198138067   IL     0.446180853

Accuracy (med655): Global 70, Class-0 74, Class-1 67
35/45
Naïve Bayes (NB)
• Naïve Bayes is one of the most popular classifiers.
• Easy to understand.
• Produces a simple model structure.
• Robust, with a solid mathematical background.
• Can be computed incrementally.
• Classification is achieved in linear time.
• However, it assumes the attributes are independent.
36/45
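In the spirit of the deck's in-DBMS theme, the whole Gaussian NB model, a mean and a variance per class and attribute, comes out of one GROUP BY scan. This is a minimal sketch with a hypothetical schema showing only two attributes, not the authors' exact implementation.

```sql
-- Hypothetical table: patients(g INT, x1 FLOAT, x2 FLOAT); g is the
-- class label. One scan returns the per-class mean and variance of
-- each attribute, i.e., the kind of NB model tables shown later.
SELECT g,
       COUNT(*)                                    AS n_g,
       AVG(x1)                                     AS mean_x1,
       SUM(x1 * x1) / COUNT(*) - AVG(x1) * AVG(x1) AS var_x1,
       AVG(x2)                                     AS mean_x2,
       SUM(x2 * x2) / COUNT(*) - AVG(x2) * AVG(x2) AS var_x2
FROM patients
GROUP BY g;
```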
Bayesian Classifier
• Why Bayesian:
  – A Bayesian classifier based on class decomposition using EM clustering.
  – Robust models with good accuracy and low over-fitting.
  – The classifier adapts to skewed distributions and overlapping sets of data points by building local models based on clusters.
  – The EM algorithm is used to fit the mixtures per class.
  – The Bayesian classifier is composed of a mixture of k distributions (clusters) per class.
37/45
Bayesian Classifier
Based on K-Means (BKM)
• Motivation
  – Bayesian classifiers are accurate and efficient.
  – A generalization of the Naïve Bayes algorithm.
  – Model accuracy can be tuned by varying the number of clusters, setting class priors, and making a probability-based decision.
  – EM is a distance-based clustering algorithm.
  – Two phases are involved:
    • Building the predictive model.
    • Scoring a new data set based on the computed predictive model.
38/45
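As a rough sketch of one clustering iteration inside the DBMS (hypothetical schema, not the authors' implementation): joining points to the current centroids and keeping the nearest centroid per point is the assignment step; re-averaging the assigned points per cluster would complete the iteration.

```sql
-- Hypothetical tables: pts(i INT, x1 FLOAT, x2 FLOAT) and
-- centroids(j INT, x1 FLOAT, x2 FLOAT). Assignment step: each point
-- i is matched to its nearest centroid j by squared Euclidean distance.
SELECT i, j, dist
FROM (
  SELECT p.i, c.j,
         POWER(p.x1 - c.x1, 2) + POWER(p.x2 - c.x2, 2) AS dist,
         ROW_NUMBER() OVER (
           PARTITION BY p.i
           ORDER BY POWER(p.x1 - c.x1, 2) + POWER(p.x2 - c.x2, 2)
         ) AS rk
  FROM pts p CROSS JOIN centroids c
) d
WHERE rk = 1;
```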
Example
• A medical data set with n = 655 rows is used, with a varying number of clusters k.
• This data set has d = 25 dimensions, which include the diseases to be predicted, risk factors, and perfusion measurements.
• Null values in a dimension were replaced with the mean of that dimension.
• Here, we predict accuracy for LAD and RCA (2 diseases).
• Accuracy is best at k = 8.
39/45
Example: medical
med655
• n = 655
• d = 15
• g = 0, 1
• g indicates whether or not the patient developed heart disease.

wbcancer
• n = 569
• d = 7
• g = 0, 1
• g indicates whether the cancer is benign or malignant.
• Features describe characteristics of the cell nuclei obtained from an image of a breast mass.

Accuracy:
          Model  Global  Class-0  Class-1
med655    NB     67      83       53
med655    BKM    62      53       70
wbcancer  NB     93      91       95
wbcancer  BKM    93      84       97
40/45
BKM & NB Models
BKM: med655
g  j  AGE   SEX   HTA   CHOL  SMOKE
0  1  4.49  0     0.97  5.3   1.82
0  2  4.36  2.08  1.07  5.49  0.48
0  3  5.09  0.08  1.25  6.35  0.21
0  4  5.1   2.08  0.37  5.59  1.78
1  1  6.28  1.75  0.96  6.97  2.06
1  2  6.45  1.31  0.74  6.98  0
1  3  4.64  1.82  0.88  7.24  2.06
1  4  4.7   1.75  1.03  7.04  0

NB: med655
g  MEAN_VAR  AGE     SEX   HTA   CHOL     SMOKE
0  MEAN      58.6    0.64  0.4   219.47   0.57
0  VAR       147.92  0.23  0.24  1497.45  0.25
1  MEAN      63.9    0.74  0.45  218.34   0.62
1  VAR       128.5   0.19  0.25  957.69   0.24

BKM: wbcancer
g  j  x3    x5    x12   x18   x26
0  1  6.56  8.27  2.1   2.97  2.8
0  2  5.44  7.32  2.02  2.07  1.63
0  3  4.68  8.94  2.18  2.46  3.12
0  4  5.42  8.37  4.18  3.89  1.79
1  1  6.29  6.12  2.12  0.96  1.06
1  2  6.97  7.12  2.16  3.07  3.59
1  3  5.92  7.83  2.45  1.9   1.74
1  4  7.49  6.68  1.48  1.49  2.02

NB: wbcancer
g  MEAN_VAR  x3      x5    x12   x18       x26
0  MEAN      115.71  0.1   1.2   0.02      0.37
0  VAR       438.72  0     0.26  3.35E-05  0.03
1  MEAN      78.18   0.09  1.22  0.01      0.18
1  VAR       136.45  0     0.37  3.58E-05  0.01
41/45
Cluster Means and Weights
• Means are assigned around the global mean based on Gaussian initialization.
• The table below shows the means of clusters over 9 dimensions (d).
• The weight of a cluster is given by 1.0/k, where k is the number of clusters.

Class  AGE   SEX    DIAB   HYPLPD  FHCAD   SMOKE  CHOL  LA      AP      Weight
0      60    0.721  0.209  0.209   0.116   0.698  185   -0.178  -0.331  0.0754
0      76.5  0.632  0.08   0.488   0.056   0.488  223   -0.225  -0.37   0.219
0      42.2  0.754  0.029  0.667   0.261   0.58   224   -0.505  -0.715  0.121
0      65.1  0.753  0.193  0.602   0.0904  0.566  223   -0.22   -0.375  0.291
0      56.5  0.652  0.261  0.217   0.261   0.565  139   -0.379  -0.527  0.0404
0      54.2  0.729  0.132  0.583   0.104   0.66   223   -0.26   -0.519  0.253
1      51.9  0.533  0.2    0.933   0.267   0.733  269   0.0233  -0.577  0.176
1      59.7  0.333  0.333  0.889   0       0.667  318   -0.494  -0.748  0.212
1      48    0.4    0.2    0.8     0.2     0.8    201   -0.68   -0.462  0.0588
1      67.1  0.444  0.222  0.889   0.111   0.593  252   -0.474  -0.645  0.318
1      53    0.5    0      1       0.5     0.75   456   -0.512  -1      0.0471
1      72.7  0.75   0.313  0.438   0       0.625  202   -0.782  -0.229  0.188
42/45
Prediction of Accuracy Varying k
(Same Clusters k per Class)
Dimensions = 21 (perfusion measurements + risk factors):
k    Accuracy for LAD  Accuracy for RCA
2    65.8%             66.5%
4    67.90%            68.82%
6    69.89%            70.42%
8    75.11%            72.67%
10   68.35%            70.23%

Dimensions = 9 (perfusion measurements):
k    Accuracy for LAD  Accuracy for RCA
2    73.13%            67.63%
4    73.37%            67.90%
6    74.80%            69.80%
8    77.07%            72.06%
10   72.34%            68.93%
43/45
The DBMS Group
• Students:
– Zhibo Chen
– Carlos Garcia-Alvarado
– Mario Navas
– Sasi Kumar Pitchaimalai
– Ahmad Qwasmeh
– Rengan Xu
– Manish Limaye
44/45
Publications
1. Ordonez C., Chen Z.: Evaluating Statistical Tests on OLAP Cubes to Compare Degree of Disease. IEEE Transactions on Information Technology in Biomedicine 13(5): 756-765 (2009)
2. Chen Z., Ordonez C., Zhao K.: Comparing Reliability of Association Rules and OLAP Statistical Tests. ICDM Workshops 2008: 8-17
3. Ordonez C., Zhao K.: A Comparison between Association Rules and Decision Trees to Predict Multiple Target Attributes. Intelligent Data Analysis (IDA), to appear in 2011
4. Navas M., Ordonez C., Baladandayuthapani V.: On the Computation of Stochastic Search Variable Selection in Linear Regression with UDFs. IEEE ICDM Conference, 2010
5. Navas M., Ordonez C., Baladandayuthapani V.: Fast PCA and Bayesian Variable Selection for Large Data Sets Based on SQL and UDFs. Proc. ACM KDD Workshop on Large-scale Data Mining: Theory and Applications (LDMTA), 2010
6. Ordonez C., Pitchaimalai S.K.: Bayesian Classifiers Programmed in SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE) 22(1): 139-144 (2010)
7. Pitchaimalai S.K., Ordonez C., Garcia-Alvarado C.: Comparing SQL and MapReduce to Compute Naive Bayes in a Single Table Scan. Proc. ACM CIKM Workshop on Cloud Data Management (CloudDB), 2010
8. Navas M., Ordonez C.: Efficient Computation of PCA with SVD in SQL. KDD Workshop on Data Mining using Matrices and Tensors, 2009
45/45