Machine Learning
CPBS7711
Oct 2, 2014
Sonia Leach, PhD
Assistant Professor
Center for Genes, Environment, and Health
National Jewish Health
[email protected]
Center for Genes, Environment, and Health
Someone once said "Artificial Intelligence = Search"; so is Machine Learning = "induction of new knowledge from experience and the ability to improve"?
Machine Learning is a natural outgrowth of the intersection of Computer Science and Statistics.
We might say the defining question of Computer Science is “How can we build machines that
solve problems, and which problems are inherently tractable/intractable?”
The question that largely defines Statistics is “What can be inferred from data plus a set of
modeling assumptions, with what reliability?”
The defining question for Machine Learning builds on both, but it is a distinct question.
Whereas Computer Science has focused primarily on how to manually program computers,
Machine Learning focuses on the question of how to get computers to program themselves
(from experience plus some initial structure).
Whereas Statistics has focused primarily on what conclusions can be inferred from data,
Machine Learning incorporates additional questions about what computational architectures
and algorithms can be used to most effectively capture, store, index, retrieve and merge these
data, how multiple learning subtasks can be orchestrated in a larger system, and questions of
computational tractability.
We say that a machine learns with respect to a particular task T, performance metric P, and type
of experience E, if the system reliably improves its performance P at task T, following
experience E.
- Tom Mitchell http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
Also interesting discussion of differences among AI, ML, Data Mining, Stats :
http://stats.stackexchange.com/questions/5026/what-is-the-difference-between-data-mining-statistics-machine-learning-and-ai
Machine Learning
• From Wikipedia:
– 7.1 Decision tree learning
– 7.2 Association rule learning
– 7.3 Artificial neural networks
– 7.4 Inductive logic programming
– 7.5 Support vector machines
– 7.6 Clustering
– 7.7 Bayesian networks
– 7.8 Reinforcement learning
– 7.9 Representation learning
– 7.10 Similarity and metric learning
– 7.11 Sparse Dictionary Learning
• From Alpaydin, Introduction to Machine Learning:
– Supervised Learning
– Bayesian Decision Theory
– Parametric Methods
– Multivariate Methods
– Dimensionality Reduction
– Clustering
– Nonparametric Methods
– Decision Trees
– Linear Discrimination
– Multilayer Perceptrons
– Local Models
– Kernel Machines
– Bayesian Estimation
– Hidden Markov Models
– Graphical Models
– Combining Multiple Learners
– Reinforcement Learning
http://www.realtechsupport.org/UB/MRIII/papers/MachineLearning/Alppaydin_MachineLearning_2010.pdf
Machine Learning (what I will cover)
• Unsupervised
– Dimensionality Reduction
  • PCA
– Clustering
  • k-Means, SOM, Hierarchical
– Association Set Mining
– Probabilistic Graphical Models
  • HMMs, Bayes Nets
• Supervised
– k-Nearest Neighbor
– Neural Nets
– Decision Trees/Random Forests
– SVMs
– Naïve Bayes
• Issues
– Regression/Classification
– Feature selection/reduction
– Missing data
– Boosting/bagging/jackknife
– Cross validation, generalization
– Model selection
Connections to other lectures: Miller (HMM), Pollock (HMM), Leach (HMM), Lozupone (PCA, Feature Importance Scores, Clustering), Kechris (Regression), [Hunter (Knowledge-Based Analysis), Cohen (BioNLP), Phang (Expr Analysis) ….]
R: http://cran.r-project.org/web/views/MachineLearning.html
Unsupervised Learning
Dimensionality Reduction:
Principal Components Analysis (PCA)
• Motivation: Instead of considering all variables, use a small number of linear combinations of those variables, with minimal information lost
http://blog.peltarion.com/2006/06/20/the-talented-drhebb-part-2-pca/
[Figure: 2D data. If only one of the two original variables could be kept to represent the data, the y-axis would be chosen because it explains more variance; the amount of variance explained by the first principal component P1 is greater than the amount explained by Y alone.]
Principal Components Analysis (PCA)
• If X = (x1, x2, …, xp) is a random vector (mean vector μ, covariance matrix Σ), then the principal component transformation is
  X → Y = Γᵀ(X − μ)
  s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
– Linear orthogonal transform of the original data to a new coordinate system
– each component is a linear combination of the original variables
  • coefficients of the variables in the linear combination = Loadings
  • data transformed to the new coordinates = Scores
– components are ordered by the percentage of variance explained along each new axis
– number of components = minimum dimension of the input data matrix
– the set of orthogonal vectors is not unique and not scale-invariant (covariance vs correlation); computed by eigenvalue decomposition (as above & R princomp) or singular value decomposition (SVD) (R prcomp)
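To make the definition concrete, here is a minimal base-R sketch (my own toy data, not from the slides) showing that the loadings Γ are the eigenvectors of the covariance matrix and the component variances are its eigenvalues:

## PCA via eigendecomposition of the covariance matrix, compared to princomp().
set.seed(1)
X <- matrix(rnorm(50), nrow = 10, ncol = 5)   # stand-in data matrix
mu <- colMeans(X)
ev <- eigen(cov(X))                 # eigenvectors = Gamma, eigenvalues = lambda
Gamma <- ev$vectors
Y <- sweep(X, 2, mu) %*% Gamma      # scores: Y = (X - mu) %*% Gamma
pc <- princomp(X)
ev$values                           # agrees with pc$sdev^2 up to a (n-1)/n factor
unclass(pc$loadings)                # columns agree with Gamma up to sign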
Adapted from S-plus Guide to Statistics
Principal Components Analysis (PCA)
• If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = Γᵀ(X − μ) s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
X:
    diffgeom complex algebra reals stats
 1        36      58      43    36    37
 2        62      54      50    46    52
 3        31      42      41    40    29
 4        76      78      69    66    81
 5        46      56      52    56    40
 6        12      42      38    38    28
 7        39      46      51    54    41
 8        30      51      54    52    32
 9        22      32      43    28    22
10         9      40      47    30    24
What if we could only
choose two dimensions?
Adapted from S-plus Guide to Statistics
Principal Components Analysis (PCA)
• If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = Γᵀ(X − μ) s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
X:
    diffgeom complex algebra reals stats
 1        36      58      43    36    37
 2        62      54      50    46    52
 3        31      42      41    40    29
 4        76      78      69    66    81
 5        46      56      52    56    40
 6        12      42      38    38    28
 7        39      46      51    54    41
 8        30      51      54    52    32
 9        22      32      43    28    22
10         9      40      47    30    24
EXAMPLE IN R
X = read.table('pca.input', sep=" ", header=TRUE)
pc = princomp(X)
mu = pc$center                          # column means
Gamma = pc$loadings                     # loadings (eigenvectors)
Y = pc$scores                           # scores in the new coordinates
XminusMu = sweep(X, MARGIN=2, mu, FUN="-")
propOfVar = pc$sdev^2 / sum(pc$sdev^2)  # proportion of variance explained
eigenVals = pc$sdev^2                   # component variances (eigenvalues)
Component Importance:   Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
Standard deviation      30.142   7.179   5.786   4.098   3.084
Proportion of Variance   0.890   0.050   0.032   0.016   0.009
Cumulative Proportion    0.890   0.941   0.974   0.990   1.000

Γ (loadings):   Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
diffgeom         0.638   0.599  -0.407  -0.112  -0.237
complex          0.372  -0.230   0.593  -0.595  -0.320
algebra          0.240  -0.371           0.645  -0.624
reals            0.333  -0.671  -0.557  -0.234   0.271
statistics       0.535           0.414   0.404   0.615

Y (scores):       Comp.1      Comp.2     Comp.3      Comp.4      Comp.5
 [1,]          -2.292745    5.827588   8.966977  -7.1630488  -2.2195936
 [2,]          25.846460   13.457048  -3.257987   0.5344066   0.4777994
 [3,]         -14.856875    4.337867  -4.057297  -2.5308172   1.4998247
 [4,]          70.434116   -3.286077   6.423473   3.9571310   0.8815369
 [5,]          13.768664   -4.392701  -6.058773  -4.7551497  -2.2951908
 [6,]         -28.899236   -4.611347   4.338621  -2.2710490   6.7118075
 [7,]           5.216449   -4.536616  -7.625423   2.2093319   3.2618335
 [8,]          -3.432334  -11.115805  -3.553422  -0.9908949  -4.1604420
 [9,]         -31.579207    8.354892  -2.497369   5.6986938  -1.9742069
[10,]         -34.205292   -4.034848   7.321199   5.3113963  -2.1833687
Adapted from S-plus Guide to Statistics
Principal Components Analysis (PCA)
X:
    diffgeom complex algebra reals stats
 1        36      58      43    36    37
 2        62      54      50    46    52
 3        31      42      41    40    29
 4        76      78      69    66    81
 5        46      56      52    56    40
 6        12      42      38    38    28
 7        39      46      51    54    41
 8        30      51      54    52    32
 9        22      32      43    28    22
10         9      40      47    30    24
Component Importance:   Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
Standard deviation      30.142   7.179   5.786   4.098   3.084
Proportion of Variance   0.890   0.050   0.032   0.016   0.009
Cumulative Proportion    0.890   0.941   0.974   0.990   1.000

Γ (loadings):   Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
diffgeom         0.638   0.599  -0.407  -0.112  -0.237
complex          0.372  -0.230   0.593  -0.595  -0.320
algebra          0.240  -0.371           0.645  -0.624
reals            0.333  -0.671  -0.557  -0.234   0.271
statistics       0.535           0.414   0.404   0.615

Y (scores):       Comp.1      Comp.2     Comp.3      Comp.4      Comp.5
 [1,]          -2.292745    5.827588   8.966977  -7.1630488  -2.2195936
 [2,]          25.846460   13.457048  -3.257987   0.5344066   0.4777994
 [3,]         -14.856875    4.337867  -4.057297  -2.5308172   1.4998247
 [4,]          70.434116   -3.286077   6.423473   3.9571310   0.8815369
 [5,]          13.768664   -4.392701  -6.058773  -4.7551497  -2.2951908
 [6,]         -28.899236   -4.611347   4.338621  -2.2710490   6.7118075
 [7,]           5.216449   -4.536616  -7.625423   2.2093319   3.2618335
 [8,]          -3.432334  -11.115805  -3.553422  -0.9908949  -4.1604420
 [9,]         -31.579207    8.354892  -2.497369   5.6986938  -1.9742069
[10,]         -34.205292   -4.034848   7.321199   5.3113963  -2.1833687

Biplot: arrows for the original variables
  Length = proportion of variance explained within the first 2 components
  Direction = relative loadings on the first 2 components
  e.g., diffgeom largest (++, ++); algebra smallest (+, −)
X = read.table('pca.input', sep=" ", header=TRUE)
pc = princomp(X)
mu = pc$center
Gamma = pc$loadings
Y = pc$scores
XminusMu = sweep(X, MARGIN=2, mu, FUN="-")
propOfVar = pc$sdev^2 / sum(pc$sdev^2)
eigenVals = pc$sdev^2
## Verify Y = (X-mu)*Gamma
unique(Y - as.matrix(XminusMu) %*% Gamma)
## Verify X represented by Comp. i == Y[,i]
par(mfrow=c(2,1), pty="s"); biplot(pc)
plot(Y[,1], Y[,2], col="white")
text(Y[,1], Y[,2], 1:10)
Adapted from S-plus Guide to Statistics
Principal Components Analysis (PCA)
• If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = Γᵀ(X − μ) s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
X:
    diffgeom complex algebra reals stats
 1        36      58      43    36    37
 2        62      54      50    46    52
 3        31      42      41    40    29
 4        76      78      69    66    81
 5        46      56      52    56    40
 6        12      42      38    38    28
 7        39      46      51    54    41
 8        30      51      54    52    32
 9        22      32      43    28    22
10         9      40      47    30    24
What if we could only
choose two dimensions?
Adapted from S-plus Guide to Statistics
Clustering
• Partitioning
– Must specify number of clusters
– K-Means, Self-Organizing Maps (SOM/Kohonen Net)
• Hierarchical Clustering
– Do not need to specify number of clusters
– Need to specify distance metric and linkage method
• Other approaches
– Fuzzy clustering (probabilistic membership)
– Spectral Clustering (using eigen value decomposition)
Clustering
http://apandre.wordpress.com/visible-data/cluster-analysis/
R package: mlbench: Machine Learning Benchmark Problems
http://stackoverflow.com/questions/4722290/generating-synthetic-datasets
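The figure on this slide shows synthetic 2D datasets from mlbench; a minimal sketch (assuming the mlbench package is installed) that generates datasets of this kind:

## Generate and plot a few 2D benchmark datasets with mlbench.
library(mlbench)
set.seed(1)
par(mfrow = c(1, 3))
plot(mlbench.2dnormals(300, cl = 3))               # three Gaussian clusters
plot(mlbench.spirals(300, cycles = 1, sd = 0.05))  # two noisy spirals
plot(mlbench.smiley(300))                          # "smiley" clusters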
k-Means
• Initialize: select the initial k Centroids
– REPEAT
  • Form k clusters by assigning each point to the 'closest' Centroid
  • Recompute the Centroid of each cluster
– UNTIL the Centroids don't change, or all changes are below a predefined threshold
• Initial Centroids can be random vectors, randomly selected data vectors, the first k vectors, etc., or computed from a random first assignment
• 'closest' is typically defined by Euclidean distance (Voronoi diagram):
  distE(x, y) = distE(y, x) = √( Σi=1..n (xi − yi)² )
• Prone to local optima, so typically do N random restarts and take the best result (minimum sum of squared distE to the centroids)
• In practice, favors well-separated spherical clusters
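A minimal base-R sketch of this procedure on toy data of my own, using nstart for the random restarts mentioned above:

## k-means with 25 random restarts on a toy 2-cluster dataset.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 25)   # keep the best of 25 restarts
km$centers                                  # estimated centroids
km$tot.withinss                             # sum of squared distances to the centroids
plot(x, col = km$cluster); points(km$centers, pch = 8, cex = 2)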
Images from wikipedia
k-Means
[Figure: k-means converging over iterations 0–5; images from Wikipedia, http://en.wikipedia.org/wiki/K-means_clustering]
Self-Organizing Maps (SOM)
• Similar to k-Means: the goal is to assign each data vector to the map node (analogous to a Centroid in k-Means) whose weight vector is 'closest' in data space (minimize distE(x, w))
• Difference: map nodes are constrained by neighborhood relationships, whereas k-Means Centroids move freely
• Must specify an initial topology; the map 'stretches' to cover the n-D data in 2D, and similar data are assigned to neighboring map nodes
Image from wikipedia
Self-Organizing Maps (SOM)
http://www.sciencedirect.com/science/article/pii/S0014579399005244
• 1. Initialization – Choose random values for the initial weight vectors wj.
• 2. Sampling – Draw a sample training input vector x from the input space.
• 3. Matching – Find the winning neuron I(x) whose weight vector is closest to the input vector (i.e., min distE).
• 4. Updating – Apply the weight update equation
  Δwji = η(t) · Tj,I(x)(t) · (xi − wji)
  where η(t) = learning rate at time t* and Tj,I(x)(t) = neighborhood function at time t
• 5. Continuation – Keep returning to step 2 until the feature map stops changing.
* Informal intro to simulated annealing, gradient descent…
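A minimal sketch with the kohonen package (listed on the 'Examples in R' slide later); the 4x4 hexagonal grid and the use of the iris measurements are illustrative choices:

## Fit a small self-organizing map to scaled numeric data.
library(kohonen)
set.seed(1)
X  <- scale(as.matrix(iris[, 1:4]))
sm <- som(X, grid = somgrid(4, 4, "hexagonal"), rlen = 100)
plot(sm, type = "mapping")   # which samples map to which node
plot(sm, type = "codes")     # the learned weight (code) vectors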
Self-Organizing Maps (SOM)
http://www.sciencedirect.com/science/article/pii/S0014579399005244
Hierarchical Clustering
• Divisive – (top down) start with all
points in 1 cluster, successively subdivide ‘farthest’ points until full tree
• Agglomerative – (bottom up) start with
each point in its own cluster (singleton),
merge ‘closest’ pair of Clusters at each
step until root
– Requires metric to define ‘closest’ – distance
no longer between points, but between
clusters
– The linkage strategy that decides which clusters to merge is often based on pairwise point comparisons
• Dendrogram shows order of splits
Distance Metrics
• Euclidean – distance in Euclidean space
  distE(x, y) = √( Σi=1..n (xi − yi)² )
• Pearson Correlation – linear relationships
  distP(x, y) = 1 − Σi (xi − x̄)(yi − ȳ) / √( Σi (xi − x̄)² · Σi (yi − ȳ)² )
• Spearman Correlation – monotonic relationships
  distS(rx, ry) = 1 − Σi (rxi − r̄x)(ryi − r̄y) / √( Σi (rxi − r̄x)² · Σi (ryi − r̄y)² ),  where rz = rank(z)
• Mutual Information – non-linear relationships
  distMI(x, y) = H(x, y) − MI(x, y),  where MI(x, y) = H(x) + H(y) − H(x, y),
  H(x) = −Σx px log px  and  H(x, y) = −Σx,y px,y log px,y
• Polyserial Correlation
  – correlation of a continuous vs an ordinal variable (polychoric if ordinal vs ordinal)
• Hamming Distance, Jaccard, Dice (binary variables)
  distH = M01 + M10
  distJ = (M01 + M10) / (M01 + M10 + M11)   (good when a 0–0 match gives no information)
  distD = 1 − 2|X ∩ Y| / (|X| + |Y|)   (like Jaccard, but matches count double)
Distance Metrics
• Euclidean vs Pearson (linear) vs Spearman (monotonic)
[Figure: panels of example data series compared under Euclidean distance, Pearson correlation, and Spearman correlation; the numbers shown above the panels are Pearson correlations. Note that Pearson is invariant to slope and is 0 for non-linear relationships.]
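A small base-R sketch (toy vectors of my own) of the metrics being compared:

## Euclidean distance vs Pearson and Spearman correlation on toy series.
set.seed(1)
x <- 1:10
y <- 2 * x + rnorm(10, sd = 0.5)   # linear in x
z <- exp(x / 2)                    # monotonic but non-linear in x
dist(rbind(x, y))                  # Euclidean distance between the two series
cor(x, y)                          # Pearson: high for the linear relationship
cor(x, z); cor(x, z, method = "spearman")   # Spearman = 1 for monotonic z
1 - cor(x, y)                      # a correlation turned into a distance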
Linkage Methods
• Single Linkage
  argmin S,T  min s∈S, t∈T dist(s, t)
• Complete Linkage
  argmin S,T  max s∈S, t∈T dist(s, t)
• Average Linkage (a.k.a. group average)
  argmin S,T  average s∈S, t∈T dist(s, t)
• Centroid Linkage – min dist(centroid(S), centroid(T)) (people often mistakenly equate this with Average Linkage after the Eisen et al. 1998 TreeView paper!)
• Ward's Linkage (optimizes the same criterion as k-Means)
• UPGMA (Unweighted Pair Group Method with Arithmetic Mean) from the Lozupone lecture – assumes a constant rate of evolution; average linkage with Euclidean distance
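A minimal base-R sketch (toy data of my own) contrasting two of the linkage methods above:

## Hierarchical clustering of toy 2D data under different linkages.
set.seed(1)
x <- rbind(matrix(rnorm(40, 0), ncol = 2), matrix(rnorm(40, 4), ncol = 2))
d <- dist(x)                                 # Euclidean distance matrix
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
par(mfrow = c(1, 2)); plot(hc_single); plot(hc_complete)   # compare dendrograms
cutree(hc_complete, k = 2)                   # cut the tree into 2 clusters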
R package: mlbench: Machine Learning Benchmark Problems
http://stackoverflow.com/questions/4722290/generating-synthetic-datasets
Example: PCA of the R USArrests data
            Murder Assault UrbanPop Rape
Alabama       13.2     236       58 21.2
Alaska        10.0     263       48 44.5
Arizona        8.1     294       80 31.0
Arkansas       8.8     190       50 19.5
California     9.0     276       91 40.6
Colorado       7.9     204       78 38.7

Loadings:  Comp.1 Comp.2
Murder      -0.53   0.41
Assault     -0.58   0.18
UrbanPop    -0.27  -0.87
Rape        -0.54  -0.16
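These loadings come from a PCA on the correlation scale; a minimal base-R sketch that reproduces this kind of output (signs of the loadings may flip):

## PCA of the built-in USArrests data using the correlation matrix.
pc <- princomp(USArrests, cor = TRUE)   # cor = TRUE standardizes the variables
summary(pc)                             # importance of components
round(unclass(pc$loadings)[, 1:2], 2)   # first two loading vectors
biplot(pc)                              # scores plus loading arrows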
Choosing the Number of Clusters
• Rule of thumb: k ≈ √(n/2)
• Elbow or Knee method (look for the bend in a plot of the metric vs K)
• K-means likes spherical clusters, so minimize the within-cluster variation W(K) (SSE: sum of distances of all points to their cluster mean), or maximize the between-cluster variation B(K) (distance between clusters), or both:
  CH(K) = [B(K)/(K−1)] / [W(K)/(n−K)]   *Calinski & Harabasz 1974
• Gap Statistic   *Tibshirani, Walther, Hastie 2001
  – Calculate the SSE; randomize the dataset and calculate SSErand, n times; gap = log(mean SSErand / SSE)
• Hierarchical – plot the distance chosen at each merge (okay for single, complete)
See also http://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf for a long list of indices, the NbClust R package: http://cedric.cnam.fr/fichiers/art_2579.pdf and http://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf
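A minimal base-R sketch of the elbow method described above (toy data and K range of my own choosing):

## Elbow plot: within-cluster SSE W(K) as a function of K for k-means.
set.seed(1)
x <- rbind(matrix(rnorm(100, 0), ncol = 2),
           matrix(rnorm(100, 4), ncol = 2),
           matrix(rnorm(100, 8), ncol = 2))
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "K", ylab = "Within-cluster SSE W(K)")
## look for the bend ("elbow"), here at K = 3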
Association Set Mining
• Also known as Market Basket Analysis
{milk, eggs} → {butter}
• Support of itemset X
  supp(X) = fraction of transactions containing itemset X
• Confidence of rule
  conf(X → Y) = supp(X & Y) / supp(X)
• Lift of rule (performance over assuming independence)
  lift(X → Y) = supp(X & Y) / (supp(X) * supp(Y))
• Want rules with max supp, conf, lift
• Other measures found at:
http://michael.hahsler.net/research/association_rules/measures.html
Association Set Mining
• Tables of data are converted to transactions by creating binary variables for every category of every variable (continuous variables must be discretized; missing data are okay)
ID     Gender  Age  Height (inches)  Race       Diagnosis
CC245  Male      6        25         Caucasian  Depression
CC346  Male     75        60         African    COPD
CC978           30        54         Asian      Obesity
CC125  Female   15        54         African
{ {gender_M=Y, age_child=Y, height_20-29=Y, race_WH=Y, diag_depr=Y},
{gender_M=Y, age_senior=Y, height_60-69=Y, race_BL=Y, diag_copd=Y},
{age_adult=Y, height_50-59=Y,race_AS=Y, diag_obes=Y},
{gender_F=Y, age_adol=Y, height_50-59=Y, race_BL=Y} }
Association Set Mining
Example in R: arules pkg, apriori algorithm

   lhs                                    rhs             support confidence  lift
1  {Class=2nd, Age=Child}              => {Survived=Yes}    0.011      1.000 3.097
2  {Class=2nd, Sex=Female, Age=Child}  => {Survived=Yes}    0.006      1.000 3.096
3  {Class=1st, Sex=Female}             => {Survived=Yes}    0.064      0.972 3.010
4  {Class=1st, Sex=Female, Age=Adult}  => {Survived=Yes}    0.064      0.972 3.010
…
12 {Sex=Female, Survived=Yes}          => {Age=Adult}       0.143      0.918 0.966
27 {Class=2nd}                         => {Age=Adult}       0.118      0.915 0.963

Note that rule 2 is subsumed by rule 1, which has better lift (and support) – redundant rules can be removed.
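The rules above come from the classic Titanic survival data; a minimal sketch (assuming the arules package is installed; the table-to-transactions expansion below is one common recipe, not necessarily the one used for the slide):

## Mine Survived=Yes rules from the Titanic contingency table with apriori.
library(arules)
data(Titanic)
df <- as.data.frame(Titanic)
titanic <- df[rep(seq_len(nrow(df)), df$Freq), 1:4]   # one row per passenger
trans <- as(titanic, "transactions")
rules <- apriori(trans,
                 parameter  = list(supp = 0.005, conf = 0.8),
                 appearance = list(rhs = "Survived=Yes", default = "lhs"))
inspect(head(sort(rules, by = "lift"), 5))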
Probabilistic Graphical Models
[Figure: four related models arranged along two axes, observability and utility – Markov Process (MP: states Xt-1 → Xt over time), Hidden Markov Model (HMM: adds observations Ot), Markov Decision Process (MDP: adds actions At and utility Ut), and Partially Observable Markov Decision Process (POMDP: adds both observations and utility).]
Hidden Markov Model
• Finite set of N states X
• Finite set of M observations O
• Parameter set λ = (A, B, π)
  – Initial state distribution πi = Pr(X1 = i)
  – Transition probability aij = Pr(Xt = j | Xt-1 = i)
  – Emission probability bik = Pr(Ot = k | Xt = i)
Example: N = 3, M = 2, states {1, 2, 3}
  π = (0.25, 0.55, 0.2)
           st1  st2  st3              obs1  obs2
  A = st1 [  0   0.2  0.8 ]   B = st1 [ 0.1   0.9  ]
      st2 [  0   0.9  0.1 ]       st2 [ 0.75  0.25 ]
      st3 [ 1.0   0    0  ]       st3 [ 0.5   0.5  ]
• Given an observation sequence O = O1, O2, …, On, how do we compute Pr(O | λ)?
Example: N = 3, M = 2
  πi = Pr(X1 = i),  aij = Pr(Xt = j | Xt-1 = i),  bik = Pr(Ot = k | Xt = i)
  π = (0.25, 0.55, 0.2)
  A = [  0   0.2  0.8 ]    B = [ 0.1   0.9  ]
      [  0   0.9  0.1 ]        [ 0.75  0.25 ]
      [ 1.0   0    0  ]        [ 0.5   0.5  ]
• Probability of O is the sum over all state sequences
  Pr(O|λ) = ∑all X Pr(O|X, λ) Pr(X|λ)
          = ∑all X πx1 bx1o1 ax1x2 bx2o2 … axT-1xT bxToT
• At each t there are N states to reach, so there are N^T possible state sequences and 2T multiplications per sequence, i.e. O(2T·N^T) operations
• So 3 states, a length-10 sequence ≈ 1,180,980 operations, and length 20 ≈ 1e11!
• Efficient dynamic programming algorithm: the Forward algorithm (Baum & Welch), O(N²T)
Applications in Bioinformatics
• DNA – motif matching, gene matching,
multiple sequence alignment
• Amino Acids – domain matching, fold
recognition
• Microarrays/Whole Genome Sequencing –
assign copy number
• ChIP-chip/seq – distinct chromatin states
Bayesian Networks
• Given set of random variables,
the joint probability distribution
can be represented by:
– Structure: Directed Acyclic Graph
(DAG)
• variables are nodes, absence of arcs
captures conditional independencies
– Parameters: Local Conditional
Probability Distributions (CPDs)
• conditional probability of variable given
values of parents in graph
• Joint Probability factors into a product of local CPDs:
  Pr(X1, X2, …, Xn) = ∏i=1..n Pr(Xi | Parents(Xi))
Bayesian Networks
• Generally can think of directed arcs as
‘causal’ (be careful!)
– If the sprinkler is on OR it is raining, then the
grass will be wet: Pr(W|S,R)
• If observe wet grass, can determine whether
because of sprinkler or rain
– Pr(R|W) and Pr(S|W)
– Bayes rule: Pr(X|Y) = Pr(Y|X)Pr(X)/Pr(Y)
• Note S and R compete to explain W: this model says sprinkler usage is (conditionally) independent of rain, but if we know the grass is wet and it is raining, then it is less likely that the sprinkler being on is the explanation for W
– Pr(S|W,R) < Pr(S|W)
“explaining away”
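A minimal base-R sketch of this "explaining away" effect; the conditional probability tables below are illustrative assumptions, not values from the slide:

## Sprinkler network: show Pr(S | W, R) < Pr(S | W) by enumerating the joint.
pR <- 0.2                                      # assumed Pr(Rain)
pS <- 0.1                                      # assumed Pr(Sprinkler), a priori independent of Rain
pW <- function(s, r) ifelse(s | r, 0.9, 0.05)  # assumed Pr(Wet | Sprinkler, Rain)
grid <- expand.grid(S = c(TRUE, FALSE), R = c(TRUE, FALSE))
grid$p <- ifelse(grid$S, pS, 1 - pS) * ifelse(grid$R, pR, 1 - pR) * pW(grid$S, grid$R)
pS_given_W  <- sum(grid$p[grid$S]) / sum(grid$p)                   # Pr(S | W)
pS_given_WR <- sum(grid$p[grid$S & grid$R]) / sum(grid$p[grid$R])  # Pr(S | W, R)
c(pS_given_W, pS_given_WR)   # the second value is smaller: rain explains the wet grass away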
http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
Applications in Bioinformatics
Gene regulatory networks
(Friedman et al, 2000, PMID: 11108481)
Determining Regulators with PRMS
(Segal et al, 2002, RECOMB)
Predicting clinical outcomes
using expression data
(Gevaert et al, 2006, PMID: 16873470)
Gene Function Prediction
(Troyanskaya et al, 2003, PMID: 12826619 )
Hanalyzer – edge scores
(Leach et al, 2009, PMID: 19325874)
Supervised Learning
Supervised Learning
• Given examples (x,y) of input features x and
output variable y, learn function f(x)=y
– Regression (continuous response) vs Classification
(discrete response)
– Feature selection vs Feature (Dimensionality) Reduction
– Cross validation (Leave-One-Out vs N-Fold)
– Generalization (Training set error vs Test set error)
– Model Selection (AIC, BIC)
– Boosting/bagging/jackknife
– Missing data and Imputation
– Curse of dimensionality
Supervised Learning
• Boosting (weak learners on different subsets)
  – Train H1 on a random data split; sample among H1's predictions so that the next data set, used to train H2, is half wrong and half right according to H1. Train H3 on examples where both H1 and H2 are wrong. Return the majority vote of H1, H2, H3 (Adaboost weights the examples and takes a weighted vote)
• Bagging (bootstrap aggregating)
  – Train multiple models on random with-replacement (bootstrap) splits of the input data and average their predictions
• Jackknife (vs bootstrap) – disjoint subsets of the data
• Model Selection: balance goodness of fit (likelihood L) against complexity of the model (number of parameters k) for n samples
  – Bayesian information criterion (BIC): minimize k ln(n) − 2 ln(L)
  – Akaike information criterion (AIC): minimize 2k − 2 ln(L) (a weaker penalty than BIC, but with better theoretical justification)
• Curse of dimensionality – the greater the dimension D, the sparser the data samples are in covering the space, so more and more data are needed to learn properly
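A minimal base-R sketch of AIC/BIC model selection on a toy regression of my own (not from the slides):

## Compare a 1-predictor and a 5-predictor linear model by AIC and BIC.
set.seed(1)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n), x5 = rnorm(n))
d$y <- 2 * d$x1 + rnorm(n)            # only x1 truly matters
m1 <- lm(y ~ x1, data = d)
m2 <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = d)
AIC(m1, m2)   # 2k - 2 ln(L)
BIC(m1, m2)   # k ln(n) - 2 ln(L): penalizes the extra parameters more heavily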
Decision Boundaries
https://sites.google.com/a/iupr.com/dia-course/lectures/lecture08-classification-with-neural-networks
k-Nearest Neighbors
• Store a database of (x, y) pairs; classify a new example by majority vote of its k nearest neighbors (for regression, assign the (weighted) mean y in the neighborhood)
• No training needed; non-parametric; sensitive to local structure in the data; the most frequent class tends to dominate
  [Figure: the green query point is classified red if k = 3 but blue if k = 5]
• Curse of dimensionality: with many variables any query is roughly equidistant to all points – reduce the features, e.g. by PCA
• Allows complicated boundaries between classes
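A minimal sketch with the class package (listed on the 'Examples in R' slide later), using toy data of my own:

## k-nearest-neighbor classification of a single query point with k = 3 and k = 5.
library(class)
set.seed(1)
train  <- rbind(matrix(rnorm(40, 0), ncol = 2), matrix(rnorm(40, 2), ncol = 2))
labels <- factor(rep(c("A", "B"), each = 20))
query  <- matrix(c(1, 1), ncol = 2)
knn(train, query, labels, k = 3)   # majority vote among 3 nearest neighbors
knn(train, query, labels, k = 5)   # may differ with a larger neighborhood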
Neural Network: Linear Perceptron
Step activation function
• Learning (Backpropagation):
  – Initialize wt, choose learning rate η
  – 1) Calculate the prediction y*j,t = f[wt · xj]
  – 2) Update the weights: wt+1 = wt + η (yj − y*j,t) xj
  – Repeat 1 & 2 until (yj − y*j,t) < threshold
  – Can be generalized to multi-class
  – Optimal only if the data are linearly separable
Neural Network: Multi-Layer Perceptron
• Smooth activation
function instead
• Can also have
multiple hidden
layers
• Can learn when data
not linearly separable
• Learn like before but
backpropagation from
output layer
Smooth activation function (sigmoid, tanh)
[Figure: multi-layer perceptron with an input layer, a hidden layer, and an output layer]
Decision Tree
• A node is the attribute tested, a branch is an outcome, a leaf is the (majority) class (probability)
• Discrete: X = xi?  Real-valued: X < value?
• A greedy algorithm chooses the best attribute to split on:
  – pi = fraction of items labeled i in the set
  – Gini impurity: IG(p) = Σi≠j pi pj (probability an item labeled i is chosen * probability i is mistakenly assigned class j)
  – Information gain: IE(p) = −Σi pi log2 pi
  – Real-valued response: SSE
• EASY TO INTERPRET!!! Can overfit, needs a large tree for XOR, biased in favor of attributes with more levels => ensembles
[Figure: example tree splitting on BIOPSY+, Rx SIDE EFFECT, and BREATH >90% / <30%, with Died/Alive counts at the leaves]
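A minimal sketch with the rpart package (listed on the 'Examples in R' slide later), using iris as stand-in data:

## Fit, inspect, and plot a classification tree.
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
printcp(fit)                                      # complexity table with CV error
plot(fit, margin = 0.1); text(fit, use.n = TRUE)  # draw the tree with leaf counts
predict(fit, iris[c(1, 51, 101), ], type = "class")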
Random Forest
• A classifier consisting of an ensemble of decision trees {h(x, Θk)}, where Θk is an i.i.d. random vector and each tree casts a vote for the class of x (Breiman 2001)
  1. Bagging – Θk is a random selection of N samples (with replacement) used to grow the tree
  2. Dietterich 98: Θk is a random split chosen among the n best splits
  3. Ho 98: Θk is a random subset of the features used to grow the tree (√k)
  4. Adaboost-like: Θk is a set of random weights on the examples
  – 4 is better than {2, 3}, which are better than 1, in generalization error
• Out-of-bag estimates: internal estimates of generalization error, classifier strength, and correlation between trees
Random Forest
• Most popular implementation of {h(x, Θk)}: bagging (random subset of samples with replacement) + a random subset of features
  – If the set of features is small, the trees are more correlated, so new features can be made as random linear combinations of the original features
• The out-of-bag classifier for a specific {x, y} = the aggregate over the trees that did not use {x, y} as training data (removes the need to set aside test data)
• The out-of-bag estimate is the error rate of the out-of-bag classifier on the training set (OOB strength and correlation can also be estimated)
• Variable importance can be estimated from the OOB estimates
  – For the m-th variable, permute its values and compare the misclassification rate of the OOB classifiers on the 'noised-up' data with the OOB rate on the real data; a large increase implies the m-th variable is important
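A minimal sketch with the randomForest package showing the OOB error and permutation-based variable importance just described (iris as stand-in data):

## Random forest with OOB error and permutation variable importance.
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
rf                        # printout includes the OOB estimate of the error rate
importance(rf, type = 1)  # mean decrease in accuracy when each variable is permuted
varImpPlot(rf)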
Support Vector Machine (SVM)
• Support vectors are the points that lie closest to the decision surface; they define the maximum-'margin' hyperplane separating the examples (the solution changes if SVs are removed)
• Kernel function – maps data that are not linearly separable into a transformed space where the transformed data are linearly separable
• Advantages: non-probabilistic; optimization rather than greedy search; not affected by local minima; theoretical guarantees of performance; escapes the curse of dimensionality
Support Vector Machine (SVM)
• The distance between H and H1 is 1/||w||, so to maximize the margin we need to minimize ||w|| = √(Σi wi²) subject to no points lying between H1 and H2:
  xi · w + b ≥ +1 when yi = +1
  xi · w + b ≤ −1 when yi = −1
  i.e., yi(xi · w + b) ≥ 1
• Quadratic program (constrained optimization, solved via the (dual of the) Lagrangian):
  Max L = Σi αi − ½ Σi,j αi αj yi yj xi · xj   s.t.  w = Σi αi yi xi  and  Σi αi yi = 0
• If not linearly separable, use a transformation to a space where the data are linearly separable, via kernels, i.e. Φ(xi) instead of xi
• If the L1-norm is used (instead of the L2 norm above), the weights give variable importance
http://books.nips.cc/papers/files/nips16/NIPS2003_AA07.pdf
http://www.cs.ucf.edu/courses/cap6412/fall2009/papers/Berwick2003.pdf
Support Vector Machine (SVM)
• Not separable by a linear function, but can be separated by a quadratic one
• Polynomial (p = 1: linear):
  K(x, x′) = (x · x′ + 1)^p
• Radial basis function (Gaussians):
  K(x, x′) = exp( −‖x − x′‖² / (2σ²) )
• ~sigmoid (like a Neural Net):
  K(x, x′) = tanh(κ x · x′ − δ)
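A minimal sketch with the e1071 package (listed on the 'Examples in R' slide later) comparing a linear and an RBF kernel; iris and the parameter values are stand-ins:

## SVM with linear vs radial (RBF) kernels, evaluated on a held-out split.
library(e1071)
set.seed(1)
train <- sample(nrow(iris), 100)
m_lin <- svm(Species ~ ., data = iris[train, ], kernel = "linear")
m_rbf <- svm(Species ~ ., data = iris[train, ], kernel = "radial", gamma = 0.5, cost = 1)
mean(predict(m_lin, iris[-train, ]) == iris$Species[-train])   # held-out accuracy
mean(predict(m_rbf, iris[-train, ]) == iris$Species[-train])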
Other Useful Kernel Functions
• The use of kernels allows complex data types to be used in SVMs without having to translate them into real-valued, fixed-length vectors
  K: D x D → R
• String kernel: compare two sequences
• Graph kernel: compare two nodes in a graph, or two graphs
• Image kernels: compare two images
• and so on … (any symmetric, positive semi-definite matrix is a kernel)
Naïve Bayes
[Figure: class node C with feature nodes F1, F2, F3, …, Fn as its children]
• Recall Bayes rule
  Pr(X|Y) = Pr(Y|X)Pr(X) / Pr(Y)
• Classifier:
  Pr(C|F1,…,Fn) = Pr(C) Pr(F1,…,Fn|C) / Pr(F1,…,Fn)
  – Note the denominator does not depend on C (effectively a constant Z)
  – "Naïve" assumption because the Fi, Fj are assumed (conditionally) independent given C
  – Simplifies the calculation:
    Pr(C|F1,…,Fn) = 1/Z * Pr(C) ∏i Pr(Fi|C)
• Learn the parameters Pr(C) & each Pr(Fi|C) by maximum likelihood (multinomial, Gaussian, …)
  – Can learn each Pr(Fi|C) independently, escaping the curse of dimensionality; the dataset does not need to scale with the number of features Fi
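A minimal sketch with e1071's naiveBayes (Gaussian class-conditional densities for continuous features), again with iris as stand-in data:

## Naive Bayes: learn Pr(C) and Pr(Fi | C), then predict posteriors.
library(e1071)
nb <- naiveBayes(Species ~ ., data = iris)
nb$apriori                                        # the learned Pr(C)
predict(nb, iris[c(1, 51, 101), ])                # class predictions
predict(nb, iris[c(1, 51, 101), ], type = "raw")  # posterior Pr(C | F1..Fn)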
Examples in R
• Making 2D datasets
– Install libraries: mlbench
• Clustering (Hierarchical, K-Means, SOM)
– Install libraries: kohonen
• Classification (kNN, NN, DT, SVM, NB)
– Install libraries: class (if R>3.0, o/w knn), neuralnet,
rpart, e1071
R package: mlbench: Machine Learning Benchmark Problems
http://stackoverflow.com/questions/4722290/generating-synthetic-datasets
Additional References
• Logit regression example: http://www.ats.ucla.edu/stat/r/dae/logit.htm
• PCA: http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
• Statistical Pattern Recognition Toolbox for Demos: http://cmp.felk.cvut.cz/cmp/software/stprtool/examples.html
• KMeans: https://onlinecourses.science.psu.edu/stat857/node/125
• SOMs:
  – http://www.cs.bham.ac.uk/~jxb/NN/l16.pdf
  – http://www.loria.fr/~rougier/coding/article/article.html
  – http://www.sciencedirect.com/science/article/pii/S0014579399005244
• Distance metrics:
  – http://www.statmethods.net/stats/correlations.html
  – http://people.revoledu.com/kardi/tutorial/Similarity/index.html – nice discussion of differences
  – http://www.datavis.ca/papers/corrgram.pdf – make a visual panel (like a heatmap) of the correlation between variables
• Choosing the number of clusters:
  – Nice one: http://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf
  – http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
  – http://psycnet.apa.org/journals/met/16/3/285/
  – http://blog.echen.me/2011/03/19/counting-clusters/
  – http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters
  – http://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf
• Neural Networks:
  – Good one: http://www.cogsys.wiai.uni-bamberg.de/teaching/ss05/ml/slides/cogsysII-4.pdf
  – https://sites.google.com/a/iupr.com/dia-course/lectures/lecture08-classification-with-neural-networks
  – Nice for MLP: http://users.ics.aalto.fi/ahonkela/dippa/node41.html
• Boosting vs Bagging: http://people.cs.pitt.edu/~milos/courses/cs2750-Spring04/lectures/class23.pdf
• Random Forests: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
• SVMs: Idiots' guide to SVMs: http://www.cs.ucf.edu/courses/cap6412/fall2009/papers/Berwick2003.pdf
• Kernel Methods: http://www.kernel-methods.net/tutorials/KMtalk.pdf
The End
Not used
Hidden Markov Model
• Finite set of N states X
• Finite set of M observations O
• Parameter set λ = (A, B, π)
  – Initial state distribution πi = Pr(X1 = i)
  – Transition probability aij = Pr(Xt = j | Xt-1 = i)
  – Emission probability bik = Pr(Ot = k | Xt = i)
Example: N = 3, M = 2, states {1, 2, 3}
  π = (0.25, 0.55, 0.2)
           st1  st2  st3              obs1  obs2
  A = st1 [  0   0.2  0.8 ]   B = st1 [ 0.1   0.9  ]
      st2 [  0   0.9  0.1 ]       st2 [ 0.75  0.25 ]
      st3 [ 1.0   0    0  ]       st3 [ 0.5   0.5  ]
• Given an observation sequence O = O1, O2, …, On, how do we compute Pr(O | λ)?
Example: N = 3, M = 2
  πi = Pr(X1 = i),  aij = Pr(Xt = j | Xt-1 = i),  bik = Pr(Ot = k | Xt = i)
  π = (0.25, 0.55, 0.2)
  A = [  0   0.2  0.8 ]    B = [ 0.1   0.9  ]
      [  0   0.9  0.1 ]        [ 0.75  0.25 ]
      [ 1.0   0    0  ]        [ 0.5   0.5  ]
• Probability of O is the sum over all state sequences
  Pr(O|λ) = ∑all X Pr(O|X, λ) Pr(X|λ)
          = ∑all X πx1 bx1o1 ax1x2 bx2o2 … axT-1xT bxToT
• What is the computational complexity of this sum?
Example: N = 3, M = 2
  πi = Pr(X1 = i),  aij = Pr(Xt = j | Xt-1 = i),  bik = Pr(Ot = k | Xt = i)
  π = (0.25, 0.55, 0.2)
  A = [  0   0.2  0.8 ]    B = [ 0.1   0.9  ]
      [  0   0.9  0.1 ]        [ 0.75  0.25 ]
      [ 1.0   0    0  ]        [ 0.5   0.5  ]
• Probability of O is the sum over all state sequences
  Pr(O|λ) = ∑all X Pr(O|X, λ) Pr(X|λ)
          = ∑all X πx1 bx1o1 ax1x2 bx2o2 … axT-1xT bxToT
• At each t there are N states to reach, so there are N^T possible state sequences and 2T multiplications per sequence, i.e. O(2T·N^T) operations
• So 3 states, a length-10 sequence ≈ 1,180,980 operations, and length 20 ≈ 1e11!
Example: N = 3, M = 2
  πi = Pr(X1 = i),  aij = Pr(Xt = j | Xt-1 = i),  bik = Pr(Ot = k | Xt = i)
  π = (0.25, 0.55, 0.2)
  A = [  0   0.2  0.8 ]    B = [ 0.1   0.9  ]
      [  0   0.9  0.1 ]        [ 0.75  0.25 ]
      [ 1.0   0    0  ]        [ 0.5   0.5  ]
• Probability of O is the sum over all state sequences
  Pr(O|λ) = ∑all X Pr(O|X, λ) Pr(X|λ)
          = ∑all X πx1 bx1o1 ax1x2 bx2o2 … axT-1xT bxToT
• Efficient dynamic programming algorithm to do this: Forward algorithm (Baum and Welch, O(N²T))
The Forward Algorithm
Probability of a Sequence is the Sum of All Paths that Can Produce It
[Figure from David Pollock's lecture: a two-state model with a CpG state (emits G, C, A, T with probabilities .3, .3, .2, .2; stays CpG with 0.8, moves to Non-CpG with 0.2) and a Non-CpG state (emits G, C, A, T with .1, .1, .4, .4; moves to CpG with 0.1, stays Non-CpG with 0.9). Forward values for the sequence G C G A A:]
Obs:       G    C                        G                           A                             A
CpG:      .3   .3*(.3*.8+.1*.1)=.075    .3*(.075*.8+.015*.1)=.0185   .2*(.0185*.8+.0029*.1)=.003   .2*(.003*.8+.0025*.1)=.0005
Non-CpG:  .1   .1*(.3*.2+.1*.9)=.015    .1*(.075*.2+.015*.9)=.0029   .4*(.0185*.2+.0029*.9)=.0025  .4*(.003*.2+.0025*.9)=.0011
Parameter estimation by Baum-Welch
Forward Backward Algorithm
Forward variable αt(i) =Pr(O1..t,Xt=i | λ)
Backward variable βt(i) =Pr(Ot+1..N|Xt=i, λ)
DEFINITIVE tutorial: Rabiner 1989: http://compbio.ucdenver.edu/hunter/cpbs7711/2010_09_09_Rabiner1989_HMMTutorial_Leach.pdf
and erratum: http://compbio.ucdenver.edu/hunter/cpbs7711/2010_09_09_Rabiner1989_HMMTutorial_Erratum_Leach.pdf
Forward Algorithm
• Dynamic programming method to compute the forward variable αt(i) = Pr(O1..t, Xt = i | λ)
• Base Condition: for 1 ≤ i ≤ N
  α1(i) = πi biO1
• Recurrence: for 1 ≤ j ≤ N and 1 ≤ t ≤ T−1
  αt+1(j) = [ ∑i=1..N αt(i) aij ] bjOt+1
• Then the probability of the sequence is
  Pr(O | λ) = ∑i=1..N αT(i)
*The Backward algorithm for βt(i) is analogous
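A minimal base-R sketch of this recurrence, using the π, A, and B from the 3-state example on the earlier HMM slides; the observation sequence is an arbitrary choice for illustration:

## Forward algorithm: compute Pr(O | lambda) in O(N^2 T) time.
Pi <- c(0.25, 0.55, 0.20)
A  <- matrix(c(0,   0.2, 0.8,
               0,   0.9, 0.1,
               1.0, 0,   0), nrow = 3, byrow = TRUE)
B  <- matrix(c(0.10, 0.90,
               0.75, 0.25,
               0.50, 0.50), nrow = 3, byrow = TRUE)
O  <- c(1, 2, 2, 1, 2)                 # an example observation sequence
N  <- length(Pi); T_ <- length(O)
alpha <- matrix(0, nrow = T_, ncol = N)
alpha[1, ] <- Pi * B[, O[1]]           # base condition: alpha_1(i) = pi_i * b_i(O_1)
for (t in 1:(T_ - 1))                  # recurrence
  alpha[t + 1, ] <- (alpha[t, ] %*% A) * B[, O[t + 1]]
sum(alpha[T_, ])                       # Pr(O | lambda)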