4. Gene Expression Data Analysis
EECS 600: Systems Biology & Bioinformatics
Instructor: Mehmet Koyuturk
Analyzing Gene Expression Data
Clustering


How are genes related in terms of their expression under
different conditions?
Differential gene expression


Which genes are affected by change in condition, tissue,
disease?
Classification (supervised analysis)



Given expression profile for a gene, can we assign a function?
Given the expression levels of several genes in a sample, can
we characterize the type of sample (e.g., cancerous or normal)?
Regulatory network inference


How do genes regulate each other's expression to orchestrate
cellular function?
Clustering
Group similar items together
Clustering genes based on their expression profiles




We can measure the expression of multiple genes in multiple
samples
Genes that are functionally related should have similar
expression profiles
Gene expression profile



A vector (or a point) in multi-dimensional space, where each
dimension corresponds to a sample
Clustering of multi-dimensional real-valued data is a well-studied problem
Motivating Example
Expression levels of 2,000 genes in 22 normal and 40 tumor
colon tissues (Alon et al., PNAS, 1999)
Applications of Clustering
Functional annotation


If a gene with unknown function is clustered together with
genes that perform a particular function, then that gene is likely to
be associated with that function
Identification of regulatory motifs


If a group of genes are co-regulated, then it is likely that their
regulation is modulated by similar transcription factors. By
looking for common elements in the neighborhood of the
coding sequences of genes in a cluster, we can identify
regulatory motifs and their locations (promoters)
Modular analysis

Gene Expression Matrix
m genes (rows), n samples (columns)
Generally, m >> n
m = O(10^3)
n = O(10^1)
Each row is an n-dimensional vector: the expression profile of a gene
$$E = [e_{ij}], \quad 1 \le i \le m, \quad 1 \le j \le n$$
$$\vec{e}_i = [e_{i1}, e_{i2}, \ldots, e_{in}]^T$$
Proximity Measures
How do we decide which genes are similar to each
other?
Euclidean distance
$$\mathrm{Euclidean}(\vec{e}_i, \vec{e}_j) = \|\vec{e}_i - \vec{e}_j\|_2 = \sqrt{\sum_{k=1}^{n} (e_{ik} - e_{jk})^2}$$
Manhattan distance
$$\mathrm{Manhattan}(\vec{e}_i, \vec{e}_j) = \|\vec{e}_i - \vec{e}_j\|_1 = \sum_{k=1}^{n} |e_{ik} - e_{jk}|$$
Distance
Minkowski distance
General version of Euclidean, Manhattan, etc.
$$\mathrm{Minkowski}(\vec{e}_i, \vec{e}_j) = \|\vec{e}_i - \vec{e}_j\|_p = \left( \sum_{k=1}^{n} |e_{ik} - e_{jk}|^p \right)^{1/p}$$
p is a parameter
$$\|\vec{e}_i - \vec{e}_j\|_\infty = \max_{1 \le k \le n} |e_{ik} - e_{jk}|$$
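The three distances can be computed with a single function. This is a minimal sketch using NumPy; the function names and the two toy profiles are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def minkowski(ei, ej, p):
    """Minkowski distance of order p between two expression profiles."""
    diff = np.abs(np.asarray(ei, dtype=float) - np.asarray(ej, dtype=float))
    if np.isinf(p):
        return float(diff.max())  # limiting case: largest coordinate difference
    return float((diff ** p).sum() ** (1.0 / p))

def manhattan(ei, ej):
    return minkowski(ei, ej, 1)  # p = 1

def euclidean(ei, ej):
    return minkowski(ei, ej, 2)  # p = 2

# Two hypothetical expression profiles over n = 3 samples.
gene_a = [1.0, 2.0, 3.0]
gene_b = [4.0, 6.0, 3.0]
```

Here the coordinate differences are 3, 4, and 0, so the Manhattan distance is 7, the Euclidean distance is 5, and the p = infinity distance is 4.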
Normalization
If we want to measure the distance between directions
rather than absolute magnitude, it may be necessary to
standardize mean and variation of expression levels for
each gene


$$\mu_i = \mu(\vec{e}_i) = \frac{1}{n} \sum_{k=1}^{n} e_{ik}$$
$$\sigma_i = \sigma(\vec{e}_i) = \sqrt{\frac{1}{n} \sum_{k=1}^{n} (e_{ik} - \mu_i)^2}$$
$$\vec{e}'_i = [e'_{i1}, e'_{i2}, \ldots, e'_{in}]^T, \quad e'_{ik} = \frac{e_{ik} - \mu_i}{\sigma_i}$$
Correlation



The similarity between the variation of two random
variables
A vector is treated as sampling of a random variable
Covariance
$$\mathrm{Cov}[\vec{e}_i, \vec{e}_j] = \frac{1}{n} \sum_{k=1}^{n} (e_{ik} - \mu_i)(e_{jk} - \mu_j)$$
$$\mathrm{Var}[\vec{e}_i] = \mathrm{Cov}[\vec{e}_i, \vec{e}_i] = \sigma_i^2$$
Pearson Correlation Coefficient

Pearson correlation coefficient
$$\mathrm{Pearson}(\vec{e}_i, \vec{e}_j) = \frac{\mathrm{Cov}[\vec{e}_i, \vec{e}_j]}{\sqrt{\mathrm{Var}[\vec{e}_i]\,\mathrm{Var}[\vec{e}_j]}} = \frac{\sum_{k=1}^{n} (e_{ik} - \mu_i)(e_{jk} - \mu_j)}{n\,\sigma_i \sigma_j}$$
Pearson correlation is equal to the cosine of the angle between the
mean-centered expression profiles (equivalently, the inner product of
the normalized profiles divided by n)
$$-1 \le \mathrm{Pearson}(\vec{e}_i, \vec{e}_j) \le 1$$
Pearson correlation is normalized
$$\mathrm{Pearson}(\vec{e}'_i, \vec{e}'_j) = \mathrm{Pearson}(\vec{e}_i, \vec{e}_j)$$
Euclidean Distance & Correlation

Euclidean distance (normalized) and Pearson correlation
coefficient are closely related
$$\mathrm{Euclidean}(\vec{e}'_i, \vec{e}'_j) = \sqrt{2n\,(1 - \mathrm{Pearson}(\vec{e}_i, \vec{e}_j))}$$
These are the two most commonly used proximity
measures in gene expression data analysis
Without loss of generality, we will use $\delta_{ij} = \delta(\vec{e}_i, \vec{e}_j)$ to
denote the distance between two expression profiles
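The relationship between normalized Euclidean distance and Pearson correlation can be checked numerically. The sketch below assumes the 1/n (population) form of the standard deviation; the function names and the random profiles are illustrative assumptions.

```python
import numpy as np

def standardize(e):
    """Shift a profile to zero mean and scale it to unit standard deviation."""
    e = np.asarray(e, dtype=float)
    return (e - e.mean()) / e.std()  # np.std defaults to the 1/n form

def pearson(ei, ej):
    """Pearson correlation as Cov / (sigma_i * sigma_j), with 1/n normalization."""
    ei = np.asarray(ei, dtype=float)
    ej = np.asarray(ej, dtype=float)
    cov = ((ei - ei.mean()) * (ej - ej.mean())).mean()
    return float(cov / (ei.std() * ej.std()))

# Two hypothetical, partially correlated profiles over n = 20 samples.
rng = np.random.default_rng(0)
n = 20
gene_a = rng.normal(size=n)
gene_b = 0.5 * gene_a + rng.normal(size=n)

r = pearson(gene_a, gene_b)
dist = float(np.linalg.norm(standardize(gene_a) - standardize(gene_b)))
expected = float(np.sqrt(2 * n * (1 - r)))  # agrees with dist up to rounding
```

Because each standardized profile has squared norm n, the squared distance expands to 2n minus twice the inner product, which is 2n(1 - r).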
Other Measures of correlation

Pearson is vulnerable to outliers



If two genes have very high expression in a single sample, that
sample might dominate and make the two expression profiles appear
highly correlated
Jackknife correlation: Estimate n correlations by taking each
dimension (sample) out, take the minimum among them
Pearson is not robust for non-Gaussian distributions



Spearman’s rank order correlation coefficient: Rank expression
levels, replace each expression level with its rank
More robust against outliers
A lot of loss of information
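Both alternatives are short to write down. The sketch below is an illustration under simplifying assumptions: the function names are hypothetical, and the rank-based version does not handle tied values.

```python
import numpy as np

def pearson(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def jackknife_correlation(x, y):
    """Minimum Pearson correlation over all leave-one-sample-out profiles."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~np.eye(len(x), dtype=bool)  # row k of the mask drops sample k
    return min(pearson(x[mask], y[mask]) for mask in keep)

def spearman(x, y):
    """Pearson correlation of the ranks (no tie handling in this sketch)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)

# Hypothetical profiles: unrelated except for one shared outlier sample.
gene_a = [0.1, 0.4, 0.2, 0.3, 9.0]
gene_b = [0.3, 0.1, 0.4, 0.2, 8.0]
```

On these profiles the plain Pearson correlation is close to 1 because of the shared outlier, while the jackknife value is -0.8 and the rank correlation is 0.1.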
Clustering Methods

Hierarchical clustering



Group genes into a tree (a.k.a.
dendrogram), so that each branch
of the tree corresponds to a
cluster
Higher branches correspond to
coarser clusters
Partitioning

Partition genes into several
groups so that similar genes will
be in the same partition
Hierarchical clustering

Direction of clustering



Bottom-up (agglomerative): Start from individual genes, join
them into groups until only one group is left
Top-down (divisive): Start with one group consisting of all
genes, keep partitioning groups until each group contains
exactly one gene
Agglomerative clustering is computationally less expensive


Why?
Hierarchical clustering methods are greedy

Once a decision is made, it cannot be undone
Agglomerative clustering



Start with m clusters: Each cluster contains one gene
At each step, choose two clusters that are closest (or
most correlated), merge them
How do we evaluate the distance between two clusters?

Single-linkage: If clusters contain two very close genes, then the
clusters are close to each other
$$\delta(C_k, C_l) = \min_{i \in C_k,\, j \in C_l} \delta_{ij}$$
Agglomerative Clustering

Complete linkage: Two clusters are close
to each other only if all genes inside them
are close to each other
$$\delta(C_k, C_l) = \max_{i \in C_k,\, j \in C_l} \delta_{ij}$$
Group average: Two clusters are close to
each other if their genes are, on average, close to
each other
$$\delta(C_k, C_l) = \frac{1}{|C_k|\,|C_l|} \sum_{i \in C_k} \sum_{j \in C_l} \delta_{ij}$$
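The agglomerative procedure with the three linkage rules can be sketched as follows. This is a naive illustration that recomputes all cluster distances at every step; the function names and the one-dimensional toy data are assumptions chosen so the result can be checked by hand.

```python
import numpy as np

def cluster_distance(D, Ck, Cl, linkage):
    """Distance between clusters Ck and Cl (lists of gene indices), given pairwise distances D."""
    block = D[np.ix_(Ck, Cl)]
    if linkage == "single":
        return block.min()
    if linkage == "complete":
        return block.max()
    if linkage == "average":
        return block.mean()
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerative(D, linkage="single"):
    """Greedy bottom-up clustering; returns the list of merges in order."""
    clusters = [[i] for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        # Evaluate every pair of current clusters and pick the closest one.
        pairs = [(cluster_distance(D, clusters[a], clusters[b], linkage), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        d, a, b = min(pairs)
        merges.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return merges

# Hypothetical 1-D "profiles" so that distances are easy to verify.
points = np.array([0.0, 1.0, 5.0, 7.0])
D = np.abs(points[:, None] - points[None, :])
```

All three rules first merge genes 0 and 1, then genes 2 and 3; they differ only in the distance assigned to the final merge (4 for single, 7 for complete, 5.5 for group average).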
Divisive Clustering

Recursive bipartitioning
Find an “optimal” partitioning of the genes into two clusters
Recursively work on each partition
Since the number of clusters is an issue for partition-based
clustering algorithms, the magic number 2 solves a lot of
problems
May be computationally expensive
The problem is “global”
At every level of the tree, we have to work on all of the genes
If the tree is imbalanced, there might be as many as m levels
With a reasonable stopping criterion, may be considered a
partition-based clustering method as well
Partition Based Clustering


Find groups of genes such that genes in each group are
similar to each other, while being somewhat less similar
to those in other clusters
Easily interpretable
Especially for large datasets (as compared to hierarchical)
Number of Clusters





Clustering is “unsupervised”, so generally we do not have
prior knowledge on how many clusters underlie the data
It is very difficult to partition data into an “unknown”
number of clusters
Most algorithms assume that K (number of clusters) is
known
Try different values of K, find the one that results in best
clustering
Very expensive
Overlapping vs. Disjoint Clusters

Genes do not have a single function



Most genes might be involved in different
processes, so their expression profiles might
demonstrate similarities with different genes
in different contexts
Can we allow a gene to be included in more
than one cluster?
Allowing overlaps between clusters poses
additional challenges

To what extent do we allow overlaps? (We
definitely don’t want to identify two identical
clusters)
Fuzzy Clustering

Assign weights to each gene-cluster pair, showing the
extent (or likelihood) of the gene belonging to the cluster




Difficult interpretation
Partitioning is a special case of fuzzy clustering, where the
weights are restricted to binary values
Hierarchical clustering is also “fuzzy” in some sense
Continuous relaxation might alleviate computational
complexity as well
K-Means Clustering


The most famous clustering algorithm
Given K, find K disjoint clusters such that the total
intracluster variation is minimized

Cluster mean: $\vec{\mu}_k = \frac{1}{|C_k|} \sum_{i \in C_k} \vec{e}_i$
Intracluster variation: $\Delta_k = \sum_{i \in C_k} \delta(\vec{e}_i, \vec{\mu}_k)$
Total intracluster variation: $\Delta = \sum_{k=1}^{K} \Delta_k$
K-Means Algorithm
K-Means is an iterative algorithm that alters parameters
based on each other’s values until no improvement is
possible
1. Choose K expression profiles randomly, designate each of
them as the center of one of the K clusters
2. Assign each gene to a cluster
2.1. Each gene is assigned to the cluster with closest center to its
profile
3. Redetermine cluster centers
4. If any gene was moved, go back to Step 2, else stop
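The four steps above can be sketched in NumPy as follows. This is a minimal illustration, assuming Euclidean distance as the proximity measure; the function name, the `seed` and `max_iter` parameters, and the toy matrix are assumptions.

```python
import numpy as np

def kmeans(E, K, seed=0, max_iter=100):
    """E: m x n expression matrix (genes x samples). Returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct expression profiles as the initial centers.
    centers = E[rng.choice(len(E), size=K, replace=False)].astype(float)
    labels = np.full(len(E), -1)
    for _ in range(max_iter):
        # Step 2: assign each gene to the cluster with the closest center.
        dists = np.linalg.norm(E[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop as soon as no gene changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its assigned genes.
        for k in range(K):
            if np.any(labels == k):  # keep the old center if a cluster empties
                centers[k] = E[labels == k].mean(axis=0)
    return labels, centers

# Hypothetical data: two well-separated groups of three genes each.
E = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels, centers = kmeans(E, K=2)
```

The result depends on the random initial centers in general, which is why K-means is usually run several times with different seeds.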
Sample Run of K-Means
Self Organizing Maps

Just like K-means, we have K clusters, but this time they
are organized into a map
Often a 2D grid
We want to organize clusters so that similar clusters will be in
proximity in the map
A way of visualizing in low-dimensional (2D) space
Just like K-means, each cluster is associated with a weight
vector
It was the cluster center in K-means
Each weight vector is first initialized randomly to some
gene’s expression profile
SOM Algorithm



At each step, a gene is selected at random
The distance between the gene’s expression profile and
each cluster’s weight vector is calculated, and the cluster
with closest weight vector becomes the winner
The winner’s and its neighbors’ (according to the 2D
mapping) weight vectors are adjusted to represent the
gene’s expression profile better




$$\vec{w}_k(t+1) = \vec{w}_k(t) - \alpha(t)\,\theta(C_k, C_j)\,(\vec{w}_k(t) - \vec{e}_i)$$
C_j is the winner cluster for gene i at time t
α is a decreasing function of time, θ is the neighborhood
function
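A compact sketch of the SOM training loop is given below. The linearly decreasing learning rate and the Gaussian neighborhood function are assumptions made for illustration (the slides only require a decreasing α and some neighborhood function θ); all names and default values are hypothetical.

```python
import numpy as np

def train_som(E, grid_shape=(2, 2), steps=500, alpha0=0.5, radius=1.0, seed=0):
    """Returns weight vectors of shape (rows * cols, n) for an m x n matrix E."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Grid coordinates of each cluster, used by the neighborhood function.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialize each weight vector to a randomly chosen gene's profile.
    W = E[rng.integers(len(E), size=rows * cols)].astype(float)
    for t in range(steps):
        e = E[rng.integers(len(E))]                      # select a gene at random
        winner = np.linalg.norm(W - e, axis=1).argmin()  # closest weight vector wins
        alpha = alpha0 * (1.0 - t / steps)               # decreasing function of time
        grid_dist = np.linalg.norm(coords - coords[winner], axis=1)
        theta = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))  # Gaussian neighborhood
        W -= alpha * theta[:, None] * (W - e)            # move winner and neighbors toward e
    return W

# Hypothetical data: four genes measured in two samples.
E = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
W = train_som(E)
```

Since every update is a convex combination of the old weight vector and a gene profile, the weights always stay within the range spanned by the data.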
Sample SOM Output
Gene Co-expression Network



Nodes represent genes
Weighted edges between nodes represent proximity
(correlation) between genes’ expression profiles
This is indeed a way of predicting interactions between
genes
Graph Theoretical Clustering

Partition the graph into heavy subgraphs
Maximize total weight (number of edges) inside a cluster
Minimize total weight (number of edges) between clusters
Heuristic algorithms
CLICK: Recursive min-cut
CAST: Iterative improvement one by one for each cluster
Loss of information?
Model Based Clustering

Generating model



Each cluster is associated with a distribution (that generates
expression profiles for associated genes) specified by model
parameters
The probability that a gene belongs to a cluster is specified by hidden
parameters
Expectation Maximization (EM) algorithm




Start with a guess of model parameters
E-step: Compute expected values of hidden parameters based on
model parameters
M-step: Based on hidden parameters, estimate model parameters to
maximize the likelihood of observing the data at hand, iterate
K-means is a special case
Evaluation of Clusters


In general, we want to maximize intra-cluster similarity,
while minimizing inter-cluster similarity
Homogeneity, separation
Based on the proximity metric
Reference partition
Information on “true clusters” that comes from a different
source (apart from expression data)
Molecular annotation (e.g., Gene Ontology)
Jaccard coefficient, sensitivity, specificity
Cluster annotation
Processes that are significantly enriched in a cluster
Homogeneity & Separation

Heterogeneity (or homogeneity in reverse direction)
How similar are the genes in one cluster?
$$H(C_k) = \frac{2}{|C_k|\,(|C_k| - 1)} \sum_{i, j \in C_k,\; i < j} \delta_{ij}$$
Separation
How dissimilar are different clusters?
$$S(C_k, C_l) = \frac{1}{|C_k|\,|C_l|} \sum_{i \in C_k} \sum_{j \in C_l} \delta_{ij}$$
Good clustering: low heterogeneity, high separation (with $\delta_{ij}$ a distance)
Overall Quality

Overall heterogeneity
$$H = \frac{1}{m} \sum_{C_k} |C_k|\, H(C_k)$$
Overall separation
$$S = \frac{1}{\sum_{C_k \neq C_l} |C_k|\,|C_l|} \sum_{C_k \neq C_l} |C_k|\,|C_l|\, S(C_k, C_l)$$
How do these change with respect to number of clusters?
Can we optimize these values to choose the best number of
clusters?
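Overall heterogeneity and separation can be computed directly from a pairwise distance matrix. The sketch below assumes every cluster contains at least two genes (otherwise the within-cluster average is undefined); function names and the toy data are illustrative.

```python
import numpy as np
from itertools import combinations

def heterogeneity(D, C):
    """Average pairwise distance inside cluster C (a list of gene indices)."""
    pairs = list(combinations(C, 2))
    return sum(D[i, j] for i, j in pairs) / len(pairs)

def separation(D, Ck, Cl):
    """Average distance between genes of Ck and genes of Cl."""
    return float(D[np.ix_(Ck, Cl)].mean())

def overall_heterogeneity(D, clusters):
    m = sum(len(C) for C in clusters)
    return sum(len(C) * heterogeneity(D, C) for C in clusters) / m

def overall_separation(D, clusters):
    num = den = 0.0
    for Ck, Cl in combinations(clusters, 2):
        w = len(Ck) * len(Cl)  # weight each cluster pair by its number of gene pairs
        num += w * separation(D, Ck, Cl)
        den += w
    return num / den

# Hypothetical 1-D "profiles" and a two-cluster partition.
points = np.array([0.0, 1.0, 5.0, 7.0])
D = np.abs(points[:, None] - points[None, :])
clusters = [[0, 1], [2, 3]]
H = overall_heterogeneity(D, clusters)
S = overall_separation(D, clusters)
```

For this partition the within-cluster distances are 1 and 2 (so H = 1.5), and the four between-cluster distances average to 5.5.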
Bayesian Information Criterion

A statistical criterion for evaluating a model

Penalizes model complexity (number of free parameters to be
estimated)

k is the number of free parameters in the model, which
increases with the number of clusters
RSS is the “total error” in the model
Trade off the number of clusters against the optimization function to
choose the best number of clusters
Reference Partitioning


If there is information about “ground truth” from an
independent source, we can compare our clustering to
such reference partitioning
Pairwise assessment


Let C_ij = 1 if gene i and gene j are assigned to the same cluster
by the clustering algorithm, 0 otherwise
Let R_ij = 1 if gene i and gene j are in the same cluster according
to the reference partition, 0 otherwise
$$n_{11} = \sum_{i,j} C_{ij} \wedge R_{ij}, \qquad n_{00} = \sum_{i,j} \neg (C_{ij} \vee R_{ij})$$
$$n_{01} = \sum_{i,j} \neg C_{ij} \wedge R_{ij}, \qquad n_{10} = \sum_{i,j} C_{ij} \wedge \neg R_{ij}$$
Comparing Partitions

Rand index (symmetric)
$$\mathrm{Rand} = \frac{n_{11} + n_{00}}{n_{11} + n_{00} + n_{10} + n_{01}}$$
Jaccard coefficient (sparse)
$$\mathrm{Jaccard} = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}$$
Minkowski measure (sparse)
$$\mathrm{Minkowski} = \sqrt{\frac{n_{10} + n_{01}}{n_{11} + n_{01}}}$$
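The pairwise counts and the three comparison measures can be sketched as follows. The sketch counts unordered gene pairs, represents each partition as a list of cluster labels, and uses the square-root form of the Minkowski measure; the function names and the example labelings are assumptions.

```python
from itertools import combinations
from math import sqrt

def pair_counts(clustering, reference):
    """Count gene pairs by co-membership in the clustering (first index) and the reference (second)."""
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(clustering)), 2):
        c = clustering[i] == clustering[j]
        r = reference[i] == reference[j]
        if c and r:
            n11 += 1
        elif c:
            n10 += 1
        elif r:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00

def rand_index(n11, n10, n01, n00):
    return (n11 + n00) / (n11 + n00 + n10 + n01)

def jaccard(n11, n10, n01, n00):
    return n11 / (n11 + n10 + n01)

def minkowski_measure(n11, n10, n01, n00):
    return sqrt((n10 + n01) / (n11 + n01))

# Hypothetical cluster labels for five genes.
clustering = [0, 0, 0, 1, 1]
reference = [0, 0, 1, 1, 1]
counts = pair_counts(clustering, reference)
```

Of the ten gene pairs here, two are together in both partitions, two only in the clustering, two only in the reference, and four in neither. The Rand index counts the last group as agreement; the Jaccard coefficient and the Minkowski measure ignore it, which matters when most pairs are apart in both partitions.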
Cluster Annotation

Clustering results in groups of genes that are co-expressed (or co-regulated)
For each group, can we tell something about the biological
phenomena that underlie our observation (their co-expression)?
We have partial knowledge on the function of many
individual genes
Gene Ontology, COG (Clusters of Orthologous Groups), PFAM
(Protein Domain Families)
Taking a statistical approach, we can assign function to
each group of genes
A function popular in a cluster is associated with that cluster
Gene Ontology


Ontology: Study of being (e.g., conceptualization)
Gene Ontology is an attempt to develop a standardized
library of cellular function


Unified view of life: Processes, structures, and functions recur
in diverse organisms
Three concepts of Gene Ontology



Biological process: A recognized series of events or molecular
functions (e.g., cell cycle, development, metabolism)
Molecular function: What does a gene’s product do? (e.g.,
binding, enzyme activity, receptor activity)
Cellular component: Localization within the cell (e.g.,
membrane, nucleus, ubiquitin ligase complex)
Hierarchy in Gene Ontology

Gene Ontology is hierarchical
A process might have subprocesses
Seed maturation is part of seed development
A process might be described at different levels of detail
Seed dormation is a(n example of) seed maturation
Same for function and component
Gene Ontology terms are related to each other via “is
a” and “part of” relationships
If process A is part of process B, then A is B’s child (B is A’s
parent); B involves A
If function C is a function D, then C is D’s child; C is a more
detailed specification of D
GO Hierarchy is a DAG

Gene Ontology is
hierarchical, but the
hierarchy is not represented
by a tree, it is represented
by a directed acyclic graph
(DAG)
A GO term can have
multiple parents (and
obviously a GO term might
(should?) have multiple
children)
Annotation

GO-based annotation assigns GO terms to a gene
A gene might have multiple functions, can be involved in
multiple processes
Multiple genes might be associated with the same function,
multiple genes take part in a process
True-path rule
If a gene is annotated with a term, then it is also annotated with
its parents (consequently, all ancestors)
How does the number of genes associated with each
term change as we go down the GO DAG?
GO Annotation of Gene Clusters




There are |C| genes in a cluster C
|T| genes are associated with GO term t
|C ∩ T| genes are in C and are associated with t
What is the association between cluster C and term t?



If we chose random clusters, would we be able to observe that
at least this many (|C ∩ T|) of the |C| genes in C are associated
with t?
What is the probability of this observation?
Statistical significance based on hypergeometric
distribution
Hypergeometric Distribution


We have n items, m of which are good
If we choose r items from the entire set of items at
random, what is the probability that at least k of them will
be good?
$$p = P[K \ge k] = \sum_{i=k}^{\min(m, r)} \frac{\binom{m}{i} \binom{n-m}{r-i}}{\binom{n}{r}}$$
n is the number of genes in the organism
m = |T|, r = |C|, k = |C ∩ T|
The lower p is, the more likely that there is an underlying
association between the term and the cluster (the term is
significantly enriched in the cluster)
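The hypergeometric tail probability can be evaluated exactly with integer binomial coefficients. This is a minimal sketch using only the Python standard library; the function name and the small example numbers are illustrative assumptions.

```python
from math import comb

def enrichment_p_value(n, m, r, k):
    """P[K >= k] when r of n items are drawn at random and m of the n items are 'good'."""
    total = comb(n, r)
    tail = sum(comb(m, i) * comb(n - m, r - i) for i in range(k, min(m, r) + 1))
    return tail / total

# Hypothetical example: 10 genes in total, 5 annotated with the term,
# a cluster of 3 genes, all 3 of which carry the term.
p = enrichment_p_value(n=10, m=5, r=3, k=3)
```

In this example only one term of the sum contributes: 10 of the 120 possible clusters of size 3 consist entirely of annotated genes, so p = 1/12.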
GO Hierarchy & Cluster Annotation

How specific (general) is the annotation we attach to a
cluster?




If a cluster is larger, then it might correspond to a more
general process
Some processes might be over-represented in the study set
How do we find the best location of a cluster in GO hierarchy?
Parent-child annotation


Condition the probability of enrichment of a term in a cluster on
the enrichment of its parent terms in the cluster
The gene space is defined as the set of genes that are
associated with t’s parents
Parent-Child Annotation
Multiple Hypotheses Testing

The p-value for a single term provides an estimate of the
probability of having the observed number of genes
attached to that particular term




We have many terms, even if the likelihood of enrichment is
small for a particular term, it might be very probable that one
term will be enriched as much as observed in the cluster
We have to account for all hypotheses being tested
simultaneously
Bonferroni correction: Apply union rule, add all p-values
Which terms should we consider while correcting for
multiple hypotheses for a single term?
Representativity of Terms

How well does a significantly enriched term represent a
cluster?



How many of the genes in the cluster are attached to the
term?
How many of the genes attached to the term are in the
cluster?
For term t that is significantly enriched in cluster C


Specificity: |C ∩ T|/|C|, a.k.a. precision
Sensitivity: |C ∩ T|/|T|, a.k.a. recall
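Both ratios follow directly from set intersection. A minimal sketch, with hypothetical gene identifiers and a hypothetical function name:

```python
def representativity(cluster_genes, term_genes):
    """Returns (|C ∩ T| / |C|, |C ∩ T| / |T|), i.e. (precision, recall)."""
    C, T = set(cluster_genes), set(term_genes)
    overlap = len(C & T)
    return overlap / len(C), overlap / len(T)

# Hypothetical cluster of 4 genes and a term annotating 6 genes, sharing 2.
cluster = ["g1", "g2", "g3", "g4"]
term = ["g3", "g4", "g5", "g6", "g7", "g8"]
precision, recall = representativity(cluster, term)
```

Here half of the cluster carries the term, while the cluster covers one third of the genes attached to the term.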
Biclustering

A particular process might
be active in certain
conditions


A group of genes might be
expressed (or up-regulated,
suppressed, co-regulated,
etc.) in only a subset of
samples
They might behave almost
independently under other
conditions
Clustering vs. Biclustering

Clustering is a global approach



Each gene is a point in the space defined by all samples
How about points that are clustered in a subspace?
Biclustering: While clustering genes, also choose a set of
dimensions (samples) that provides best clustering



and vice versa
a.k.a. co-clustering, subspace clustering…
This is a much harder problem, because you are not only trying
to find groups of points that are close to each other in multi-dimensional space, but also trying to identify a subspace in
which groups are more evident
Biclustering Applications

Sample/tissue classification for diagnosis
The samples with leukemia show specific characteristics for a
subset of genes
Identification of co-regulated genes
Certain sets of genes exhibit coherent activations under
specific conditions (while behaving more or less arbitrarily with
respect to each other under other conditions)
Functional annotation
Biological processes, functional classes are overlapping
Different sets of samples reveal different functional
relationships
Biclustering Principles


A cluster of genes is defined with respect to a cluster of
samples and vice versa
The clusters are not necessarily exclusive or exhaustive



A gene/condition may belong to more than one cluster
A gene/condition may not belong to any cluster at all
Biclusters are not “perfect”


Noise
Statistical inference becomes particularly important
Biclustering Formulation



Given a gene expression matrix A with gene set G and
sample set S, a bicluster is defined by a subset of genes I
and a subset of samples J
General idea: A bicluster is a “good” one if $A_{IJ}$, the
submatrix defined by I and J, has some coherence (low
variance, low rank, similar ordering of rows, etc.)
The biclustering problem can be defined as one of finding
a single bicluster in the entire gene expression matrix, or
as one of extracting all biclusters (with some restriction
on the relationship between biclusters)
Coherence of a Submatrix
Distribution of Biclusters
Bipartite Graph Model

Just like symmetric matrices, which can be modeled as
arbitrary graphs, rectangular matrices can be modeled
using bipartite graphs

With proper definition of edge weights, biclustering can
be posed as the problem of finding “heavy” subgraphs
Row, Column, Matrix Means
Objective Function

Low-variance (constant) bicluster



Ideal bicluster:
Minimize bicluster variance
Low-rank (constant row, constant column, coherent
values) bicluster





Ideal constant row:
Ideal constant column:
General rank-one bicluster:
Define residue for each value:
Minimize mean squared residue
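The formulas for this slide did not survive in the transcript. The sketch below therefore assumes the commonly used definition of the residue of an entry (the entry, minus its row mean, minus its column mean, plus the bicluster mean); this definition, the function name, and the toy matrix are assumptions rather than the lecture's own notation.

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """Mean squared residue of the submatrix of A defined by rows and cols.

    Assumed residue: a_ij - (row mean) - (column mean) + (bicluster mean).
    """
    sub = A[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    residue = sub - row_means - col_means + sub.mean()
    return float((residue ** 2).mean())

# Hypothetical additive matrix: entry (i, j) = row effect + column effect.
A = np.add.outer(np.array([0.0, 1.0, 5.0]), np.array([0.0, 2.0, 3.0, 10.0]))
perfect = mean_squared_residue(A, [0, 1, 2], [0, 1, 2, 3])

# Perturbing a single entry breaks the coherence and raises the score.
B = A.copy()
B[0, 0] += 1.0
noisy = mean_squared_residue(B, [0, 1, 2], [0, 1, 2, 3])
```

Under this definition, a bicluster whose values are a sum of a row effect and a column effect scores zero, and any deviation from that pattern increases the score.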
Missing Values

Not all expression levels are available for each
gene/sample pair


A solution is to replace missing values (random values, gene
mean, sample mean, regression)
Generalize the definitions of row, column, and bicluster means to
handle missing values implicitly
Occupancy threshold:
A bicluster is one with
an adequate number of
(non-missing) values in
each row and column
Overlapping Biclusters


The expression of a gene in one sample may be thought
of as a superposition of contribution for multiple
biclusters
Plaid model:




: contribution of bicluster k on the expression value of the
ith gene in the jth sample
and (generally binary) specify the membership of row i
and column j in the kth bicluster, respectively
Minimize
is defined to reflect “bicluster type”
Discrete Coherence


A bicluster is defined to be one with coherent ordering of
the values on rows and/or columns (as compared to
values themselves)
Order-preserving submatrix (OPSM)


A submatrix is order preserving if there is an ordering of its
columns such that the sequence of values in every row is
increasing
Gene expression motifs (xMOTIFs)


The expression level of a gene is conserved across a subset of
conditions if the gene is in the same “state” in each of the
conditions
An xMOTIF is a subset of genes that are simultaneously
conserved across a subset of samples
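The order-preserving property can be tested by sorting the columns according to one row and checking all the others. This is a minimal sketch that assumes strictly increasing values (no ties); the function name and the example matrix are hypothetical.

```python
import numpy as np

def is_order_preserving(A, rows, cols):
    """True if some column ordering makes every selected row strictly increasing."""
    sub = A[np.ix_(rows, cols)]
    order = np.argsort(sub[0])  # the only candidate ordering is the one sorting the first row
    return bool(np.all(np.diff(sub[:, order], axis=1) > 0))

# Hypothetical matrix: rows 0-2 share the column order (0, 2, 1); row 3 does not.
A = np.array([[1.0, 5.0, 3.0],
              [10.0, 40.0, 20.0],
              [0.0, 2.0, 1.0],
              [3.0, 1.0, 2.0]])
```

Note that rows 0 to 2 have very different magnitudes but still form an order-preserving submatrix, which is exactly the kind of coherence that value-based measures would miss.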
Binary Biclusters

Quantize gene expression matrix to binary values



SAMBA: A 1 corresponds to a significant change in the
expression value
PROXIMUS: A 1 means that the gene is “expressed” in the
corresponding sample
A bicluster is a “dense submatrix”, i.e. one with
significantly more 1’s than one would expect


Bipartite graph model: Bicliques, heavy subgraphs
It is possible to statistically quantify the density of a submatrix

Log-likelihood:

p-value:
Biclustering Algorithms

Enumeration
Go for it!
Greedy algorithms
Make a locally optimal choice at every step
Divide and conquer
Solve problem recursively
Alternating iterative heuristics
Fix one dimension, solve for other, alternate iteratively
Model-based parameter estimation
e.g., EM algorithm
Enumerating Biclusters

m rows, n columns in the matrix




$2^m \times 2^n$ possible biclusters in total
Not doable in realistic amounts of time
Is it really necessary?
Put some restriction on size of biclusters


SAMBA models the problem as one of finding heavy subgraphs
in a bipartite graph
Key assumption is sparsity: Nodes of the bipartite graph have
bounded degree


Find K heavy bipartite subgraphs (biclusters) with bounded degree
enumeration
Refine them to optimize overlap and add/remove nodes that improve
bicluster quality
Greedy Algorithms

Basic idea: Refine existing biclusters by adding/removing
genes/samples to improve the objective function




Generally, quite fast
How to choose initial biclusters?
How to jump over bad local optima? (Global awareness, hill-climbing)
Optimization function: mean-squared residue



Node deletion: Start with a large bicluster, keep removing
genes/samples that contribute most to total residue
Node addition: Start with a small bicluster, keep adding
genes/samples that contribute least to total residue
Repeat these alternately to improve global awareness
Finding All Biclusters

If biclusters are identified one by one, we should make
sure that we do not identify the same bicluster again and
again



Masking discovered biclusters: Fill bicluster with random values
First identify disjoint biclusters, then grow them to capture
overlaps
Flexible Overlapped Biclustering (FLOC)


Generate K initial biclusters
Make decision from the gene/sample perspective (as compared
to bicluster perspective): Choose the best (maximum gain)
action for each gene
Generalizing K-Means to Biclustering

Assume K gene clusters, L sample clusters
Notice that this is a little counter-intuitive, we do not have
well-defined biclusters, we rather have clusters of genes and
samples, and each pair of gene and sample clusters defines a
bicluster
R: m×K gene clustering matrix, C: n×L sample clustering
matrix
R(i,k)=1 if gene i belongs to cluster k (actually, columns are
normalized to have unit norm)
Minimize total residue:
KL-Means Algorithm


We can show that
Batch iteration





Given R, compute
(mxl matrix) serves as a prototype for column clusters
For each column, find the column of
that is closest to that
column, update the corresponding entry of C accordingly
Once C is fixed, repeat the same for rows to compute R from
Converges to a local minimum of the objective function
OPSM Algorithm


Recall that an order preserving submatrix (OPSM) is one
such that all rows have their entries in the same order
Growing partial models




Fix the extremes first
The idea: Columns with very high or low values are more
informative for identifying rows that support the assumed
linear order
Start with all (1,1) partial models, i.e., only consider the
preservation of the first and last elements, keep the best ones
Expand these to obtain (2,1) models, then (2,2) until we have
(s/2, s/2) models, s being the number of columns in target
bicluster
Divide and Conquer Algorithms

Block clustering (a.k.a., Direct clustering)






Recursive bipartitioning
Sort rows according to their mean, choose a row such that the
total variance above and below the row is minimized
Do the same for columns
Pick the row or column that results in minimum intra-cluster
variances, split matrix into two based on that row or column
Continue splitting recursively
One problem is that once two rows/columns go to
different biclusters, they can never come together

Gap Statistics: Find a large number of biclusters, then
recombine
Binormalization


Normalize matrix on both dimensions
Independent scaling of rows and columns
Here, R and C are diagonal matrices that contain row and
column means, respectively
Bistochastization
Goal: Rows will add up to a constant (or will have constant
norm), columns will add up to a separate constant
Repeat independent scaling of rows and columns until stability
is reached
The residual of entire matrix is also normalized in the
sense that both rows and columns have zero mean
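Bistochastization by repeated scaling can be sketched as below. The sketch assumes a strictly positive matrix and targets row sums of 1 and column sums of m/n; these target constants, the function name, and its parameters are illustrative assumptions.

```python
import numpy as np

def bistochastize(A, iters=200, tol=1e-10):
    """Alternately rescale rows and columns of a positive matrix until the sums stabilize."""
    B = np.asarray(A, dtype=float).copy()
    m, n = B.shape
    for _ in range(iters):
        B /= B.sum(axis=1, keepdims=True)          # every row now sums to 1
        B /= B.sum(axis=0, keepdims=True) * n / m  # every column now sums to m/n
        if np.abs(B.sum(axis=1) - 1.0).max() < tol:
            break  # row sums survived the column scaling: stable
    return B

# Hypothetical positive 3 x 4 matrix.
rng = np.random.default_rng(0)
A = rng.uniform(0.5, 1.5, size=(3, 4))
B = bistochastize(A)
```

The two constants differ (1 for rows, m/n for columns) because the total of all entries must be the same whether it is summed by rows or by columns.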
Spectral Biclustering

Singular value decomposition




The nonzero eigenvalues of the matrices $A^T A$ and $A A^T$ (say, $\sigma^2$) are the
same
Each $\sigma$ is called a singular value of A and the corresponding left
and right eigenvectors are called singular vectors
If $\sigma_1$ is the largest singular value of A such that $A^T A v_1 = \sigma_1^2 v_1$
and $A A^T u_1 = \sigma_1^2 u_1$, then $\sigma_1 u_1 v_1^T$ is the best rank-one
approximation to A, i.e., $\|A - \sigma u v^T\|_2$ is minimized by $\sigma_1$, $u_1$,
and $v_1$ (over all pairs of unit-norm vectors u and v)
Consequently, the entries of u and v are ordered in such a
way that similar rows have similar values on u, similar
columns have similar values on v
Split matrix based on u and v
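The rank-one approximation and the resulting row and column orderings can be obtained with NumPy's SVD routine. This is a minimal sketch on a random matrix; the variable names are illustrative, and sorting by the singular vectors is only the first step of a spectral biclustering method, not a complete algorithm.

```python
import numpy as np

# Hypothetical 6 x 4 expression matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

# Thin SVD: columns of U are left singular vectors, rows of Vt are right singular vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
sigma1, u1, v1 = s[0], U[:, 0], Vt[0]

# Best rank-one approximation of A.
rank_one = sigma1 * np.outer(u1, v1)

# Sorting by the entries of u1 and v1 places similar rows and similar columns
# next to each other; the matrix can then be split along these orderings.
row_order = np.argsort(u1)
col_order = np.argsort(v1)
reordered = A[np.ix_(row_order, col_order)]
```

The spectral norm of the residual `A - rank_one` equals the second singular value `s[1]`, which quantifies how much structure the rank-one model leaves unexplained.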