Machine Learning in Computational Biology
CSC 2431
Lecture 9: Combining biological datasets
Instructor: Anna Goldenberg
What kind of data integration is there?
—  SNPs and gene expression
—  Networks and gene expression (and mutations)
—  ENCODE data. Combining different epigenetic signals and binding info
—  Ontologies and genome annotations
—  Now: integrating patient data
Data is available
E.g. The Cancer Genome Atlas (TCGA)
Total of 33 cancers.
9 cancers have over 500 samples each
All publicly available!
Why integrate patient data
—  To identify more homogeneous subsets of patients (that might respond similarly to a given drug)
—  To help better predict response to drugs
[Figure: per-patient data types measured over many genes – mRNA, mutations, CNV, clinical; p-value = {0.2, 0.6, 0.5}]
(Verhaak et al, Cancer Cell, 2010)
What about methylation data?
More recent GBM study (Sturm et al, 2012)
Methods used in Verhaak 2010
—  Factor analysis – a dimensionality reduction method – used to integrate mRNA data from 3 platforms
—  Consensus clustering (consensus average linkage clustering) (Monti et al, 2003)
—  SigClust – cluster significance (Liu et al, 2008)
—  Silhouette to identify core of clusters (Rousseeuw, 1987)
—  ClaNC – nearest centroid-based classifier to identify gene signatures (Dabney, 2006)
More recent GBM study (Sturm et al, 2012)
—  Missing values – imputed using k-NN (Troyanskaya, 2001); see the sketch after this list
—  Unsupervised consensus clustering (R: clusterCons) (Monti, 2003; Wilkerson and Hayes, 2010)
—  Consensus matrix was calculated using the k-means algorithm
—  Number of clusters is decided by visual assessment
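As a concrete illustration of the imputation step, here is a minimal R sketch of k-NN imputation of missing expression values, assuming the Bioconductor impute package (which implements the Troyanskaya et al, 2001 approach); the expression matrix is simulated and purely illustrative.

```r
# k-NN imputation sketch: each missing value is filled in from the k most similar genes.
# Assumes the Bioconductor 'impute' package is installed; 'expr' is a toy genes-by-samples matrix.
library(impute)

expr <- matrix(rnorm(200 * 20), nrow = 200, ncol = 20)
expr[sample(length(expr), 100)] <- NA      # introduce some missing values

res <- impute.knn(expr, k = 10)            # Troyanskaya et al (2001) style k-NN imputation
expr_complete <- res$data                  # completed matrix used in downstream clustering
```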
Breast Cancer Analysis (TCGA, 2012)
—  Integrated pathway analysis using PARADIGM
—  Significantly mutated genes were identified using the MuSiC package
—  NMF for unsupervised clustering of somatic and CNV data, protein expression
—  RPMM – recursively partitioned mixture model (RPMM Bioconductor package)
—  ConsensusClusterPlus (R package) to combine clustering based on single data types
—  MEMo (Mutual Exclusivity Modules) – identifies mutually exclusive alterations targeting frequently altered genes that are likely to belong to the same pathway
PARADIGM
—  Infers Integrated Pathway Levels (IPLs) for genes, complexes, and processes using pathway interactions and genomic and functional genomic data from a single patient sample.
—  Data:
◦  mRNA relative to normal samples
◦  CNVs mapped to genes
◦  Networks: BioCarta, NCIPID, Reactome – superimposed into a SuperPathway
—  Approach: belief propagation to maximize likelihood (hear more next class!)
Vaske, C. J. et al. (2010) Inference of patient-specific pathway activities from multidimensional cancer genomics data using PARADIGM. Bioinformatics 26.
Silhouette statistic
[Figure: silhouette values (−0.2 to 1) for patients in subtypes 1–3]
Silhouette statistic
a.  Three clusters in 2 dimensions
b.  Three clusters in 10 dimensions, each cluster has 50 observations
c.  4 clusters in 10 dimensions with randomly chosen centers
d.  Six clusters in 2 dimensions
Hossein Parsaei. Finding a number of clusters.
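A minimal R sketch of the silhouette statistic using the cluster package; the simulated two-cluster data and the k-means step are illustrative only, not taken from the studies above.

```r
# Silhouette sketch: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the average
# distance of sample i to its own cluster and b(i) to the nearest other cluster.
library(cluster)

X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))   # two separated toy clusters
cl <- kmeans(X, centers = 2)

sil <- silhouette(cl$cluster, dist(X))   # per-sample silhouette values
summary(sil)$avg.width                   # average silhouette width
plot(sil)                                # silhouette plot like the one sketched above
```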
NMF – non-negative matrix factorization
—  Matrix factorization: NMF(V) = W × H
—  W and H are non-negative
—  Current methods (many – gradient descent, alternating non-negative least squares, etc.)
—  Arora et al (2012) – exact NMF method runs in polynomial time under a separability condition on W
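To make the factorization concrete, here is a minimal R sketch of NMF via Lee-Seung multiplicative updates (one member of the family of current methods listed above); the matrix V is random toy data, and this is not the implementation used in the TCGA analysis.

```r
# NMF sketch: factor a non-negative matrix V (n x m) into W (n x r) and H (r x m),
# both non-negative, using multiplicative updates for the Frobenius objective.
nmf_sketch <- function(V, r, n_iter = 200, eps = 1e-9) {
  n <- nrow(V); m <- ncol(V)
  W <- matrix(runif(n * r), n, r)                       # non-negative random init
  H <- matrix(runif(r * m), r, m)
  for (it in seq_len(n_iter)) {
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)    # update H, stays non-negative
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)    # update W, stays non-negative
  }
  list(W = W, H = H)
}

V   <- matrix(abs(rnorm(100 * 30)), 100, 30)            # e.g. genes x patients, non-negative
fit <- nmf_sketch(V, r = 3)                             # columns of W act as metagenes
```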
Consensus Clustering
Resampling-based method for class discovery and visualization of gene expression microarray data
—  Goal: assessing stability
—  Method (see the sketch after this slide):
◦  For ~1000 iterations:
1.  Resample the data
2.  Cluster with your favourite clustering method (hierarchical, k-means)
◦  Compute the consensus matrix
◦  Partition D based on the consensus matrix
Monti, S., Tamayo, P., Mesirov, J., Golub, T. (2003) Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning, 52, 91-118.
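A minimal R sketch of the resampling loop described above; the data (iris) and parameter choices are illustrative, and packages such as ConsensusClusterPlus provide a full implementation.

```r
# Consensus clustering sketch (Monti et al, 2003): resample, cluster, and record how often
# each pair of samples ends up in the same cluster; the consensus matrix is that frequency.
consensus_cluster <- function(X, k = 3, n_iter = 1000, frac = 0.8) {
  n <- nrow(X)
  together <- matrix(0, n, n)      # times i and j clustered together
  sampled  <- matrix(0, n, n)      # times i and j were both in a resample
  for (b in seq_len(n_iter)) {
    idx  <- sample(n, size = floor(frac * n))                    # 1. resample the data
    cl   <- kmeans(X[idx, , drop = FALSE], centers = k)$cluster  # 2. favourite clustering method
    same <- outer(cl, cl, "==") * 1
    together[idx, idx] <- together[idx, idx] + same
    sampled[idx, idx]  <- sampled[idx, idx] + 1
  }
  together / pmax(sampled, 1)                                    # consensus matrix
}

M <- consensus_cluster(as.matrix(scale(iris[, 1:4])), k = 3, n_iter = 200)
# Partition D based on the consensus matrix, e.g. average-linkage clustering of 1 - M:
final <- cutree(hclust(as.dist(1 - M), method = "average"), k = 3)
```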
SigClust
Goal: assess the statistical significance of a clustering
—  H0: data comes from a single Gaussian
—  H1: data does not come from a single Gaussian
—  Statistic: Cluster Index (CI) – sum of within-class sums of squares about the mean of each cluster, divided by the total sum of squares about the overall mean (mean-shift and scale invariant); see the sketch below
Liu, Yufeng, Hayes, David Neil, Nobel, Andrew and Marron, J. S. (2008) Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data. Journal of the American Statistical Association 103(483), 1281-1293.
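A minimal R sketch of the Cluster Index statistic defined above; SigClust itself additionally simulates the null distribution of this index under a single fitted Gaussian, which is not reproduced here.

```r
# Cluster Index (CI): within-cluster sums of squares about each cluster mean,
# divided by the total sum of squares about the overall mean.
cluster_index <- function(X, labels) {
  total_ss  <- sum(sweep(X, 2, colMeans(X))^2)
  within_ss <- sum(sapply(unique(labels), function(g) {
    Xg <- X[labels == g, , drop = FALSE]
    sum(sweep(Xg, 2, colMeans(Xg))^2)
  }))
  within_ss / total_ss          # small CI suggests tight, well-separated clusters
}

X  <- as.matrix(scale(iris[, 1:4]))
ci <- cluster_index(X, kmeans(X, centers = 2)$cluster)
```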
Patient Specific Data Fusion (Yuan et al, 2011)
—  Nonparametric Bayesian model (gene expression and CNV)
◦  Feature selection (each feature is drawn from a multinomial distribution with unknown class probabilities)
◦  MCMC inference
Multiple Kernel Learning
—  Mostly used in supervised cases, but exists in the unsupervised scenario (Chuang, CVPR, 2012)
—  Linear combination of kernels: given kernels K_v, v = 1, …, m, the objective is to learn a combined kernel
K_combined = Σ_{v=1}^{m} α_v K_v
that is more suited to the task (e.g. improves the performance of the classifier).
MKL is mostly used in the supervised setting because labels are needed to learn the optimal α; recently, an unsupervised MKL was proposed within a spectral clustering framework. In the comparison, the resulting patient-by-feature matrix was used as input to the hierarchical clustering algorithm available as part of the MATLAB distribution, with correlation as the distance metric and 'average' as the linkage function; the number of clusters was chosen to be the same as in the clustering of the SNF fused matrix.
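A minimal R sketch of the linear kernel combination K_combined = Σ_v α_v K_v; the RBF kernels, the two toy views and the fixed weights are illustrative (in MKL the weights α would be learned, in a supervised or unsupervised way).

```r
# Combine per-view kernels into a single kernel with non-negative weights alpha.
rbf_kernel <- function(X, sigma = 1) {
  D2 <- as.matrix(dist(X))^2           # squared Euclidean distances
  exp(-D2 / (2 * sigma^2))
}

X1 <- matrix(rnorm(50 * 10), 50, 10)   # view 1 (e.g. expression features), toy data
X2 <- matrix(rnorm(50 * 5),  50, 5)    # view 2 (e.g. methylation features), toy data

K1 <- rbf_kernel(scale(X1))
K2 <- rbf_kernel(scale(X2))

alpha <- c(0.5, 0.5)                   # fixed here; learned by an MKL solver in practice
K_combined <- alpha[1] * K1 + alpha[2] * K2
# K_combined can then be passed to a kernel method, e.g. spectral clustering of patients.
```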
iCluster (Shen et al, 2009)
—  Gaussian latent variable model
—  Sparsity regularization (Lasso-type)
—  Latent variables (embedding is shared across data types)
Briefly, the main assumption behind this approach is that the m genomic data types share a common set of latent variables z_i via the following linear model:
x_ik = W_k z_i + ε_ik,   i = 1, …, n,   k = 1, …, m,
where W_k denotes the loading matrix associated with the k-th genomic data type and n is the number of patients. The common variables z_i represent the underlying driving factors of patient i that determine disease subtype assignment. iCluster uses the Expectation-Maximization (EM) algorithm to estimate the parameters, under the assumption that the error in the model follows a Gaussian distribution. Sparsity in the estimated W_k is enforced by adding an ℓ1-norm regularization term, as suggested in the method's manual.
Drawbacks of existing methods
—  A lot of manual processing
—  Many steps in the pipeline
—  Integration mostly done in the feature space – if there is signal in a combination of features, it'll be lost
—  Focusing on consensus – what if there is complementary information?
Similarity Network Fusion (Wang et al, 2014)
—  Integrate data in the patient space
1.  Construct patient similarity matrix
2.  Fuse multiple matrices
1. Construct similarity networks
Patient similarity: patients of the same subtype should be more similar to each other than patients that have different subtypes. We denote by ρ(x_i, x_j) the distance between patients x_i and x_j, and by mean(ρ(x_i, N_i)) the average value of ρ between x_i and each of its neighbours N_i. A scaled exponential is then used to determine the weight of the edge e_ij of the patient network:

W(i, j) = exp( − ρ²(x_i, x_j) / (η ξ_ij) ),   where   ξ_ij = [ mean(ρ(x_i, N_i)) + mean(ρ(x_j, N_j)) ] / 2

and η is a hyperparameter that can be empirically set to eliminate the scale problem. The advantage of this construction of the patient network is twofold: 1) it augments the correlation between patients, which facilitates the clustering afterwards; 2) it reduces the effect of scale and noise in the data.

Adjacency matrix: a natural kernel acting on functions on V can be defined by normalization of the weight matrix as follows:

P(i, j) = W(i, j) / Σ_{k∈V} W(i, k),   so that Σ_{j∈V} P(i, j) = 1.

Given a graph G, we construct another graph in which the vertices are the same as in G and the similarities between non-neighboring points (the pairwise similarity values) are set to zero. Essentially we make the assumption that local similarities (high values) are more reliable than remote ones, and we thus assign similarities to non-neighbors through graph diffusion on the network. This is a mild assumption widely adopted by other manifold learning algorithms.

[Figure: patient-by-patient similarity matrix constructed from a patients-by-genes mRNA expression matrix]
1. Construct similarity networks – sparsification
Using K nearest neighbors (KNN) to measure local affinity, we construct a sparsified similarity matrix W̃ as:

W̃(i, j) = W(i, j) if x_j ∈ KNN(x_i), and 0 otherwise.

Then the corresponding kernel becomes:

P̃(i, j) = W̃(i, j) / Σ_{x_k ∈ KNN(x_i)} W̃(i, k).

Note that P carries the full information about the similarity of each data point to all others, whereas P̃ only encodes the similarity to nearby data points. For clarity, we call P the status matrix and P̃ the kernel matrix. Our algorithm always starts from P as the initial status and uses P̃ as the kernel matrix in the diffusion process, for computational efficiency.

[Figure: patients-by-genes mRNA expression matrix and the resulting KNN-sparsified patient network (nodes p1, …, p9)]
2. Combine networks
Cross Diffusion Process (CrDP) with m = 2 Similarity Matrices (Views)
Given m views from different domains, we can construct a similarity matrix W^(j) for the j-th view, j = 1, …, m, using the weight construction above; the status matrix P^(j) and kernel matrix P̃^(j) are obtained as defined above.
Fusion iterations: first, we calculate the status matrices P^(1) and P^(2) from the two similarity matrices, and then the kernel matrices P̃^(1) and P̃^(2). Let P_0^(1) = P^(1) and P_0^(2) = P^(2). The cross-diffusion process is defined as:

P_{t+1}^(1) = P̃^(1) × P_t^(2) × (P̃^(1))ᵀ
P_{t+1}^(2) = P̃^(2) × P_t^(1) × (P̃^(2))ᵀ
[Figure: fusion iterations – patient similarity edges that are mRNA-based, DNA methylation-based, or supported by all data, converging to a fused similarity network]

A note on PSDF (Yuan et al, 2011): multiple MCMC chains are run to infer the statistical uncertainties in PSDF, with 100 MCMC iterations in each step. While PSDF appears to be a powerful framework, two disadvantages precluded its use in this comparison: 1) the large number of unknown parameters makes it computationally expensive; 2) it could only partially be applied to the METABRIC cohort, since the approach is not scalable to the full size of that dataset.
2. Combine networks
[Figure: fused patient similarity network – edges mRNA-based, DNA methylation-based, or supported by all data]
Stopping criteria (Supplementary Methods): SNF is proved to converge, and empirically it converges within about 20 iterations. The relative change between consecutive rounds is E_t = ||W_{t+1} − W_t|| / ||W_t||; iterations are stopped once this relative change falls below ε = 10⁻⁶.
Network Fusion
Fusing 2 networks:
P_{t+1}^(1) = P̃^(1) × P_t^(2) × (P̃^(1))ᵀ
P_{t+1}^(2) = P̃^(2) × P_t^(1) × (P̃^(2))ᵀ
Fusing m networks: the CrDP above is extended to multiple (m > 2) similarity matrices by adjusting the update as follows:
P_{t+1}^(i) = P̃^(i) × ( 1/(m−1) Σ_{j≠i} P_t^(j) ) × (P̃^(i))ᵀ + ηI,   i = 1, …, m
The corresponding final status matrix is computed as P = (1/m) Σ_{i=1}^{m} P_t^(i). (A minimal sketch of this update follows below.)
The regularization term ηI ensures that during the diffusion process a patient is always more similar to himself than to other patients, and that the final network is full rank, which is important for the classification and clustering applications of the final network. This kind of regularization also leads to quicker convergence of CrDP.
The input to the algorithm can be feature vectors, pairwise distances, or pairwise similarities. The learned status matrix P^(c) can then be used for retrieval, clustering, and classification; in this paper we focus on clustering, and refer readers to [3] for more details.
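A minimal R sketch of the two-view cross-diffusion update above, using toy random similarity matrices; the real method constructs W from patient data as described earlier, and the SNFtool package (referenced later) provides the full implementation.

```r
# Cross-diffusion sketch: each view's status matrix is diffused through the other view
# using that view's own KNN-sparsified kernel matrix.
row_normalize <- function(W) W / rowSums(W)

knn_kernel <- function(W, K = 5) {          # keep only the K largest entries per row
  S <- W
  for (i in seq_len(nrow(W))) {
    keep <- order(W[i, ], decreasing = TRUE)[seq_len(K)]
    S[i, -keep] <- 0
  }
  row_normalize(S)
}

n  <- 30
W1 <- matrix(runif(n * n), n, n); W1 <- (W1 + t(W1)) / 2   # toy similarity, view 1
W2 <- matrix(runif(n * n), n, n); W2 <- (W2 + t(W2)) / 2   # toy similarity, view 2

P1 <- row_normalize(W1); P2 <- row_normalize(W2)           # status matrices
S1 <- knn_kernel(W1);    S2 <- knn_kernel(W2)              # sparsified kernel matrices

for (iter in 1:20) {                                       # fusion iterations
  P1_new <- S1 %*% P2 %*% t(S1)
  P2_new <- S2 %*% P1 %*% t(S2)
  P1 <- P1_new; P2 <- P2_new
}
P_fused <- (P1 + P2) / 2                                   # final fused status matrix
```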
Experiments
Data:
—  2 simulations
—  5 TCGA cancers
—  METABRIC (large breast cancer database)
Comparative methods:
—  Concatenation
—  iCluster
—  PSDF
—  Multiple kernel learning
Criteria:
—  -log10(log-rank p-value) – see the sketch below
—  Silhouette score (cluster homogeneity)
—  Running time
Simulation 1 – complementarity
Simulation 2 – removing noise
TCGA Data
Gene pre-selection across cancers (Bo Wang)
Clustering of the network (Bo Wang)
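A minimal R sketch of the -log10(log-rank p-value) criterion listed under Experiments, using the survival package; the data frame of survival times and subtype labels is simulated, not TCGA data.

```r
# Log-rank test across discovered subtypes, reported as -log10(p).
library(survival)

d <- data.frame(
  time    = rexp(120, rate = 0.1),             # follow-up time
  status  = rbinom(120, 1, 0.7),               # 1 = event observed, 0 = censored
  subtype = sample(1:3, 120, replace = TRUE)   # cluster labels from the integration method
)

fit <- survdiff(Surv(time, status) ~ subtype, data = d)     # log-rank test
p   <- pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)
-log10(p)                                                    # larger = better separation
```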
Patient networks: advantages and disadvantages
-  No integrative feature selection
-  Growing the network requires extra work
-  Unsupervised – hard to turn into a supervised problem
✓  Creates a unified view of patients based on multiple heterogeneous sources
✓  Integrates gene and non-gene based data
✓  No need to do gene pre-selection
✓  Robust to different types of noise
✓  Scalable
Package on CRAN: SNFtool
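A minimal usage sketch of the SNFtool package mentioned above; the two matrices stand in for two data types measured on the same patients, and the parameter values (number of neighbours, sigma, iterations, number of clusters) are illustrative, not recommendations.

```r
# SNF with two toy data views: build per-view affinity matrices, fuse them, then cluster.
library(SNFtool)

mRNA <- matrix(rnorm(100 * 50), 100, 50)   # 100 patients x 50 genes (toy)
meth <- matrix(rnorm(100 * 30), 100, 30)   # 100 patients x 30 probes (toy)

K <- 20; sigma <- 0.5; iters <- 20

W1 <- affinityMatrix(dist2(mRNA, mRNA), K, sigma)   # patient similarity from view 1
W2 <- affinityMatrix(dist2(meth, meth), K, sigma)   # patient similarity from view 2

W_fused <- SNF(list(W1, W2), K, iters)              # fuse the similarity networks
groups  <- spectralClustering(W_fused, 3)           # cluster patients on the fused network
```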
Data integration - future
—  Simultaneous feature selection and data integration
—  Supervised vs unsupervised approaches – do we really need unsupervised methods?
—  Priors on contributions of different types of data
—  Automate feature pre-selection if necessary
Next class
—  iCluster – joint latent variable model (Shen et al, 2009) – Ladislav
—  PARADIGM – Andrew
—  Next topic: pharmacogenomics (guest lecture by Dr Benjamin Haibe-Kains)