Aspects of microarray gene expression analysis
Project 786 - 102
Spring 2002
Antoaneta Vladimirova
Parts of the talk:
1. Why microarray expression experiments?
2. Basic steps of the microarray experiment
3. Data collection and normalization
4. Analysis of expression data - Clustering algorithms
5. “Extraction of correlated gene clusters by multiple graph
comparison” - an algorithm that allows integration of the
expression data with other existing biological knowledge
Reductionism in Biology
Studies in biology until now:
whole---> parts
Assumption: knowledge derived from the parts will enable us
to understand the whole
organism-->organs-->cells-->molecules
Consequences of this approach:
- incomplete knowledge
- isolated studies: individual genes or gene products
- databases - inconsistent annotations, lack of integration
Biological information flow:
In general:
DNA----> RNA ---->protein
copy the genetic information
genome : collection of all genes(DNA) of an organism
Transcription (gene expression)
RNA - a messenger molecule from genetic information to
functional unit
Protein: gene product
carries the function of the gene
Why study gene expression?
* every cell in an organism has the same set of genes. So what makes liver cells
and brain cells different?
* cells in different tissues or at different stages of development express different sets of
genes and consequently have different characteristics
* gene expression is the intermediary process between the maintenance of the
genetic information in the form of DNA in the chromosomes and the production of
proteins, which carry out most of the functions in a cell
* our interest is in understanding the functions, properties and inter-relations among
proteins; however, studying gene expression is technologically easier and cheaper,
and is assumed to give a good approximation of the quantity of the corresponding
protein product
Why design and analyze microarray experiments?
* allow simultaneous assessment of multiple genes
* generate expression levels of thousands of genes in parallel
* expression level of the gene is approximated to the protein level,
and, respectively, to the function a gene product carries
* expression level changes due to environmental conditions, developmental stage,
diseased state
* genes that share expression patterns are assumed to be co-regulated
and to be functionally related
* gene expression data might eventually allow us to reverse-engineer
(i.e. reconstruct) gene regulation networks and biological processes
Synthetic approach
* advances in technology allow us to move on to a synthetic approach:
* take all pieces together: integrate rather than disassemble the biological
system
* move towards reconstruction of the whole cell/organism
* need to study each gene/gene product not in isolation, but in relation to all
other genes/products and the environment: networks of components and the interactions between them
Microarray System
Adapted from “Ratio-based decisions and the quantitative analysis of cDNA microarray images” - Chen, Dougherty and Bittner (1997), J Biomed Opt 2(4)
Microarray images
a. Oligonucleotide array synthesized in situ with photochemical technology by Affymetrix.
b. Oligonucleotide array synthesized in situ with
ink-jet technology (Rosetta Inpharmatics).
c. DNA microarray printed on a glass slide (Corning, Inc).
Adapted from “Biomedical Discovery Review with DNA Arrays” - Richard A. Young
Color-coded expression
*Each dot on the microarray is read through two independent channels (green and red)
*Green color - means query expression is lower than the control expression
Query signal of gene x / Control signal of gene x < 1
*Red color - means query expression is higher than the control expression
Query signal of gene x / Control signal of gene x > 1
*Yellow color - means query expression and control expression are equal
Query signal of gene x / Control signal of gene x = 1
*Black color - neither control nor query bound to the slide
Why is signal normalization necessary?
Assumptions:
*the quantity of initial RNA from both samples is equal
*some genes are up-regulated, others are down-regulated, but overall
these changes should balance out so that the total quantity from each sample that
hybridizes to the array is equal, therefore the total intensity read through the red and
green channels should be the same
The relative fluorescence intensities need to be normalized because:
* we need to adjust for differences in labeling and detection
efficiencies of the different fluorescent labels
* we need to adjust for differences in the quantity of initial RNA
isolated from the query and control samples
*need to compensate for experimental variability
* a normalization factor is computed and applied for each gene
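The balancing assumption above can be turned into a small calculation. The following is a minimal sketch of one simple scheme, total-intensity normalization, assuming strictly positive spot intensities; the function name and the sample intensities are hypothetical and are not taken from the talk.

```python
def total_intensity_normalize(query, control):
    """Rescale the control channel so both channels have equal total intensity.

    Returns the normalization factor and the per-gene Q/C ratios computed
    after rescaling. Assumes every intensity is strictly positive.
    """
    if len(query) != len(control):
        raise ValueError("both channels must cover the same genes")
    if min(query) <= 0 or min(control) <= 0:
        raise ValueError("intensities must be strictly positive")
    factor = sum(query) / sum(control)
    scaled_control = [c * factor for c in control]
    ratios = [q / c for q, c in zip(query, scaled_control)]
    return factor, ratios


# Hypothetical raw intensities for four genes (query = red, control = green).
query = [200.0, 400.0, 1000.0, 400.0]    # total 2000
control = [100.0, 400.0, 250.0, 250.0]   # total 1000
factor, ratios = total_intensity_normalize(query, control)
print(factor, ratios)
```

Here the query channel is twice as bright overall, so the control channel is doubled before the ratios are taken. Other normalization schemes exist; this one only illustrates the equal-total-intensity assumption.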
How to interpret the raw microarray data?
[Figure: n microarrays, each measuring gene 1 to gene m, one per experimental condition: conditions 1 (e.g. nutrient withdrawal), ..., conditions i (e.g. gene disruption), ..., conditions n (e.g. drug treatment).]
Fluorescence intensities are translated to a ratio Q/C
Data is organized into an Expression matrix
[Figure: expression matrix with one row per gene (gene 1 to gene m) and one column per experimental condition; the first row of values reads 0.55, 0.40, 2.34.]
Gene 1 is differentially expressed in exp. condition 3
Expression signal is represented as a ratio
*Red color -> Q/C > 1
A gene in the query cell is usually considered up-regulated once Q/C reaches a cutoff in the range of 1.5 - 2.0. In theory, Q/C can go to infinity.
*Green color -> Q/C < 1
The particular gene in the query cell is considered down-regulated. In theory, Q/C will
range from zero to one.
[Figure: Q/C axis; green occupies 0 to 1, red occupies 1 to +infinity.]
The two ranges are therefore unequal: down-regulation is squeezed between 0 and 1, while up-regulation is unbounded.
To correct for that, log2(Q/C) is used.
If Q/C = 2 ==> log2(Q/C) = 1
If Q/C = 1 ==> log2(Q/C) = 0
If Q/C = 1/2 ==> log2(Q/C) = -1
Over-expressed and inhibited gene values are then distributed symmetrically around zero
[Figure: log2(Q/C) axis; green occupies -infinity to 0, red occupies 0 to +infinity.]
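The log transformation can be checked with a few lines of code. This is a minimal sketch using the three ratios from the slide; the helper name `log2_ratio` is hypothetical.

```python
import math


def log2_ratio(query, control):
    """Return log2(Q/C) for one spot; both signals must be positive."""
    if query <= 0 or control <= 0:
        raise ValueError("signals must be strictly positive")
    return math.log2(query / control)


# Two-fold up, unchanged, two-fold down.
for q, c in [(2.0, 1.0), (1.0, 1.0), (1.0, 2.0)]:
    print(f"Q/C = {q / c:4.2f}  ->  log2(Q/C) = {log2_ratio(q, c):+.1f}")
```

A two-fold increase and a two-fold decrease now sit at the same distance from zero, which is the symmetry the raw ratio lacks.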
Expression vectors and expression space
Expression matrix:
           Exp. 1   Exp. 2
Gene 1     0.55     0.33
Gene 2     0.45     0.55
Gene 3     0.23     0.76
Gene 4     0.24     0.34
Gene 5     0.11     0.77
Gene 6     0.67     0.45
Gene 7     0.9      0.33
Gene 8     0.4      0.12
Gene 9     0.77     0.37
Gene 10    0.02     0.33
2D expression space
[Figure: two scatter plots of the ten genes in 2D expression space, one axis per experiment, both axes running from 0 to about 1.]
We want to group (cluster) the
expression vectors based on
their “similarity”
Assumption:
Genes in the same group are
functionally related
Experiment vectors and experiment space
Expression matrix:
           Exp. 1   Exp. 2
Gene 1     0.55     0.44
Gene 2     0.45     0.33
Gene 3     0.23     0.37
Gene 4     0.24     0.29
Gene 5     0.11     0.88
5D experiment space
Exp. 1 = (.55, .45, .23, .24, .11)
Exp. 2 = (.44, .33, .37, .29, .88)
Origin = (0, 0, 0, 0, 0)
We want to group (cluster) the
expression vectors based on
their “similarity”
[Figure: vectors grouped into Cluster 1 and Cluster 2.]
Assumption:
Genes in the same group are
functionally related
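The two views of the same matrix differ only in whether rows or columns are read as vectors. The sketch below uses the five-gene, two-experiment values from this slide; the variable names are hypothetical.

```python
# Rows are genes, columns are experiments (values from the slide).
gene_names = ["Gene 1", "Gene 2", "Gene 3", "Gene 4", "Gene 5"]
expression_matrix = [
    [0.55, 0.44],
    [0.45, 0.33],
    [0.23, 0.37],
    [0.24, 0.29],
    [0.11, 0.88],
]

# Expression vectors: one per gene, one coordinate per experiment (2D here).
expression_vectors = {
    name: tuple(row) for name, row in zip(gene_names, expression_matrix)
}

# Experiment vectors: one per experiment, one coordinate per gene (5D here).
# zip(*matrix) transposes the matrix, turning columns into tuples.
experiment_vectors = [tuple(column) for column in zip(*expression_matrix)]

print(expression_vectors["Gene 1"])
print(experiment_vectors[0])
```

Clustering the rows groups genes; clustering the columns groups experiments.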
How to define similarity between expression vectors?
Distance measure:
the distance dij between two objects i and j (e.g. expression vectors)
[Figure: three objects i, j and k joined by the distances dij, dik and dkj.]
Metric:
1. dij must be positive or zero (dij >= 0)
2. Must be symmetric (dij = dji)
3. An object is zero distance from itself (dii = 0)
4. When considering three objects i, j and k, the distance from i to k
is always less than or equal to the sum of the distance from i to j
and the distance from j to k (dik <= dij + djk) (the triangle rule)
Example: the Euclidean distance between two 3D points X(x1, x2, x3) and Y(y1, y2, y3) is
$d_{12} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}$
For an n-dimensional space:
$d_{12} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Semi-metric:
Obeys the first three rules but not the triangle rule
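The four rules can be tested numerically on a handful of points. The sketch below is an illustration rather than a proof: it checks the rules only on the points supplied. The squared Euclidean distance is used here as an example of a semi-metric; that choice and all names are assumptions added for illustration.

```python
import math
from itertools import product


def euclidean(x, y):
    """Euclidean distance between two points of equal dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    """Squared Euclidean distance: obeys rules 1-3 but not the triangle rule."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def obeys_metric_rules(points, dist, tol=1e-12):
    """Check the four metric rules on every pair and triple of the given points."""
    for i, j in product(points, repeat=2):
        if dist(i, j) < 0:                       # rule 1: non-negative
            return False
        if abs(dist(i, j) - dist(j, i)) > tol:   # rule 2: symmetric
            return False
    if any(dist(p, p) != 0 for p in points):     # rule 3: zero self-distance
        return False
    for i, j, k in product(points, repeat=3):    # rule 4: triangle rule
        if dist(i, k) > dist(i, j) + dist(j, k) + tol:
            return False
    return True


points_3d = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
points_on_line = [(0.0,), (1.0,), (2.0,)]

print(euclidean(points_3d[0], points_3d[1]))
print(obeys_metric_rules(points_3d, euclidean))
print(obeys_metric_rules(points_on_line, squared_euclidean))
```

For the three collinear points, the squared distance from the first to the last is 4, which exceeds 1 + 1, so the triangle rule fails.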
Clustering analysis of gene expression
Idea: cluster together genes with similar expression patterns
Underlying assumptions:
* Genes that share expression patterns are co-regulated and participate
in functionally related processes
* Unknown genes that are clustered together with known genes
might have similar or related functions
Categories of clustering methods:
I. Unsupervised
II. Supervised
A. Agglomerative
B. Divisive
Two major clustering algorithm categories
Agglomerative:
Start with each gene as its own cluster and gradually merge clusters, so that each cluster accommodates more genes;
eventually all clusters are joined into one large cluster.
Usually represented by a tree structure resembling a phylogenetic tree
Divisive:
Start with all points in one cluster and gradually form new clusters, distributing the
data points among them
Unsupervised:
No prior knowledge is assumed when forming the clusters
Supervised:
Existing biological knowledge is used to guide the clustering process
Clustering methods to be discussed:
* Hierarchical clustering
* k-means clustering
* Principal Component Analysis
* Supervised clustering (classifiers)
Hierarchical Clustering
* One of the most frequently used techniques
* simple and can be easily visualized as a tree similar to the phylogenetic trees
* an agglomerative approach: single expression profiles are joined to form groups, the
process is repeated until all expression profiles have been joined in one cluster
*first, the pair-wise distances are calculated for all the genes to be clustered; initially
each gene is a cluster itself
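[Figure: five genes plotted as the initial clusters Cl1 to Cl5.]
Distance matrix:
          Gene 1   Gene 2   Gene 3   Gene 4   Gene 5
Gene 1    0        0.45     0.23     0.24     0.11
Gene 2             0        0.76     0.34     0.77
Gene 3                      0        0.44     0.36
Gene 4                               0        0.77
Gene 5                                        0
Minimum distance: 0.11 (Gene 1 and Gene 5)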
* The distance matrix is searched for the clusters with the minimum distance between
them
* The two selected clusters are joined to form a new cluster containing 2 objects now
Hierarchical Clustering
* The distances are recalculated from this new cluster to the rest of the clusters; the
distance matrix now has one dimension (cluster) fewer
            Gene 2   Gene 3   Gene 4   Cluster A
Gene 2      0        0.76     0.34     0.56
Gene 3               0        0.44     0.56
Gene 4                        0        0.45
Cluster A                              0
Building the hierarchical tree:
[Figure: Cl1 and Cl5 are joined into Cl A (Cluster A); Cl2, Cl3 and Cl4 are still separate.]
* The process is repeated and Cl2 and Cl4 are joined into Cluster B
* The distances are recalculated and clusters formed until only one cluster is left
that accommodates all the objects
            Gene 3   Cluster A   Cluster B
Gene 3      0        0.56        0.44
Cluster A            0           0.67
Cluster B                        0
Building the hierarchical tree:
[Figure: Cl 1 and Cl 5 form Cl A (Cluster A); Cl 2 and Cl 4 form Cl B (Cluster B); Cl3 is still separate.]
Hierarchical Clustering
            Cluster A   Cluster C
Cluster A   0           0.56
Cluster C               0
Building the hierarchical tree:
[Figure: Cl 3 joins Cl B to form Cl C (Cluster C); Cl A is still separate.]
            Cluster D
Cluster D   0
Building the hierarchical tree:
[Figure: complete tree; Cl A and Cl C are joined into Cl D (Cluster D), which contains all five genes.]
Hierarchical Clustering - Tree Representation Example (partial tree)
Hierarchical clustering of gene expression matrices. The image shows an average linkage (UPGMA) clustering of 505 yeast genes during
three different cell cycle studies with a total of 60 different time points analyzed. The color image on the left shows the numerical values
by color according to the method introduced by Mike Eisen. Red is used to represent the positive values and green the negative values.
Blue shows the missing values in the respective experiments. The clustering and the image are produced using WWW-based tools in Expression
Profiler (http://www.ebi.ac.uk/microarray/). The interface is interactive and further information about the genes in each subtree is available by
clicking on the respective nodes in the tree.
Adapted from “Gene expression data analysis” - A. Brazma and J. Vilo (2000), FEBS Letters
Hierarchical Clustering Algorithms
* Single-linkage clustering:
the distance between two clusters i and j is calculated as the minimum
distance between a member of cluster i and a member of cluster j
* Complete-linkage clustering:
the distance between two clusters i and j is calculated as the maximum
distance between a member of cluster i and a member of cluster j
* Average-linkage clustering:
the distance between two clusters i and j is calculated as the average of the
distances between each member of cluster i and each member of cluster j
[Figure: for each linkage rule, two clusters i and j with the inter-cluster distance drawn between them.]
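The agglomerative procedure and the three linkage rules can be sketched together. The following is a deliberately naive implementation that recomputes all cluster distances at every step, assuming Euclidean distance; the points are hypothetical values chosen so that the merges are easy to follow by hand, and they are not the distances shown on the slides.

```python
import math

# Each linkage rule reduces the list of member-to-member distances to one number.
LINKAGE = {
    "single": min,
    "complete": max,
    "average": lambda distances: sum(distances) / len(distances),
}


def cluster_distance(a, b, points, linkage):
    """Distance between clusters a and b (tuples of point indices)."""
    distances = [math.dist(points[i], points[j]) for i in a for j in b]
    return LINKAGE[linkage](distances)


def agglomerate(points, linkage="average"):
    """Merge the two closest clusters until one is left; return the merge log."""
    clusters = [(i,) for i in range(len(points))]   # every object starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q], points, linkage)
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merged = tuple(sorted(clusters[p] + clusters[q]))
        merges.append((clusters[p], clusters[q], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return merges


# Five hypothetical objects on a line, so distances are easy to read off.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (6.0, 0.0), (20.0, 0.0)]
for rule in ("single", "complete", "average"):
    heights = [d for _, _, d in agglomerate(points, rule)]
    print(rule, heights)
```

All three rules merge the same pairs in the same order on this data, but the heights at which the larger clusters join differ, which is what changes the shape of the tree.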
K-means Clustering
* Partitions data into groups with similar expression
* the number of clusters k should be known in advance or
chosen arbitrarily; objects are partitioned into a fixed number of clusters,
such that clusters are internally similar but externally dissimilar
* because the initial partition is random, repeated runs with the same k might produce slightly different clustering results
* the process is conceptually simple, but can be computationally intensive
* first, the objects are randomly partitioned into k user-specified clusters
* an average expression vector is computed to represent each cluster and is used to
compute the distances between each point and each average cluster vector
* if a given object is closer to a different cluster than to the one it is assigned to, it is re-assigned to the closest cluster and the average expression vectors for the clusters are
recalculated
* the process is repeated until no re-assignments are necessary
K-means Clustering
* Let’s have m objects in n-dimensional space
(e.g. five genes in 2D expression space)
[Figure: five objects, numbered 1 to 5, plotted in 2D expression space; the figure is repeated after each step with the current cluster boundaries.]
* Let k = 2
* Let’s partition arbitrarily into 2 clusters
* Then calculate the average expression vector for each cluster
* Calculate distances from each object to each average vector
* Re-assign object 3 to Cluster A
* Recalculate average expression vectors for the clusters
* Re-calculate the distances from all objects to all average expression vectors
* No further re-assignments are necessary
* These are the final 2 clusters
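The walk-through above can be reproduced in code. This is a minimal sketch that starts from a given partition instead of a random one, so the run is repeatable; the coordinates are hypothetical and chosen so that object 3 starts in the wrong cluster, as in the example.

```python
import math


def mean_vector(vectors):
    """Average expression vector of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[d] for v in vectors) / n for d in range(len(vectors[0])))


def kmeans(points, assignment, k, max_iter=100):
    """Re-assign objects to the nearest cluster mean until nothing changes."""
    assignment = list(assignment)
    for _ in range(max_iter):
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if not members:
                raise ValueError("empty cluster; choose a different start")
            centroids.append(mean_vector(members))
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:     # no re-assignments: converged
            return assignment, centroids
        assignment = new_assignment
    raise RuntimeError("did not converge within max_iter iterations")


# Objects 1-5; cluster A is label 0, cluster B is label 1.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
initial = [0, 0, 1, 1, 1]                    # object 3 starts in cluster B
labels, centroids = kmeans(points, initial, k=2)
print(labels)
print(centroids)
```

After the first pass object 3 moves to cluster A; the second pass changes nothing, so the loop stops.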
Principal Component Analysis (PCA)
* Principal Component Analysis (closely related to Singular Value Decomposition) is a
mathematical technique that picks up patterns in data while reducing
dimensionality
* reduction of dimensionality might be necessary when some of the data
contain redundant information, e.g. if a group of experiments is more closely
related than initially expected
* “projects” complex data onto a reduced, easily visualized space
* analogy: a 3D cloud of data points that is rotated so that one can see it from
different perspectives; some views might allow a better separation of the data into
groups than other views; PCA finds the best views to separate the data
* in most implementations of PCA it is difficult to define the precise boundaries of
distinct clusters in the data, or to define the genes (or experiments) belonging to each
cluster
* however, when combined with another clustering technique such as k-means, it
becomes a very powerful technique
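The idea of finding the "best view" can be illustrated with the first principal component only. The sketch below centers the data, builds the covariance matrix and finds its dominant eigenvector by power iteration. Power iteration is a swapped-in teaching device, not how production PCA or SVD routines work, and the data and function names are hypothetical.

```python
import math


def center(rows):
    """Subtract the per-column mean from every row."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    return [[r[d] - means[d] for d in range(dims)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n = len(centered)
    dims = len(centered[0])
    return [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dims)]
        for a in range(dims)
    ]


def first_component(cov, iterations=200):
    """Dominant eigenvector of cov by power iteration (unit length).

    Assumes the start vector is not orthogonal to that eigenvector.
    """
    dims = len(cov)
    v = [1.0 / (i + 1) for i in range(dims)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("data have no variance")
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Coordinate of every row along the given component."""
    return [sum(a * b for a, b in zip(row, component)) for row in centered]


# Four hypothetical genes whose two measurements are perfectly redundant.
data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
centered = center(data)
pc1 = first_component(covariance(centered))
scores = project(centered, pc1)
print(pc1)
print(scores)
```

Because the two experiments carry the same information, one component along the diagonal describes the data completely, and the 2D points reduce to a single score each.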
Analysis of a demonstration data set
* the performance of the various algorithms is compared
* the analysis can help to provide an understanding of how the data are handled and
interpreted by the different methods
A synthetic gene-expression data set.
This data set provides an opportunity to evaluate how various clustering algorithms reveal different features
of the data.
A. Nine distinct gene-expression patterns were created with log2(ratio) expression measures defined for ten
experiments.
B. For each expression pattern, 50 additional genes were generated,
representing variations on the basic patterns.
Adapted from “Computational analysis of microarray data” - J. Quackenbush (2001), Nature Reviews Genetics, vol 2
Hierarchical Clustering Algorithms
Genes in the demonstration data set were subjected to
a. average-linkage
b. complete-linkage
c. single-linkage hierarchical clustering using a
Euclidean distance metric and gene-expression
families (A–J) that were color coded for comparison.
Genes that are up-regulated appear in red, and those
that are down-regulated appear in green, with the
relative log2(ratio) reflected by the intensity of the
color. This method of clustering groups genes by
reordering the expression matrix allows patterns to be
easily visualized.
Adapted from “Computational analysis of microarray data” - J. Quackenbush (2001), Nature Reviews Genetics, vol 2
Hierarchical Clustering and PCA
Principal component analysis. The same
demonstration data set was analyzed using
a. hierarchical (average-linkage) clustering and b.
principal component analysis using
Euclidean distance, to show how each treats the
data, with genes color coded on the basis
of hierarchical clustering results for comparison.
Adapted from “Computational analysis of microarray data” - J. Quackenbush (2001), Nature Reviews Genetics, vol 2
Data Filtering by Mean Centering
*Why filtering? To enhance certain features of the patterns we are looking for
*Mean Centering removes “constant” expression by subtracting the average across all
experiments from each data point
* genes with similar changes relative to their baseline expression pattern are grouped
*A, B and C have “constant” expression - grouped together (originally B were
up-regulated and C were down-regulated)
* D and G are grouped together - expression changes in the same fashion (up and down)
* E and F are grouped together - expression changes in the same fashion (down and up)
Adapted from “Computational analysis of microarray data” - J. Quackenbush (2001), Nature Reviews Genetics, vol 2
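Mean centering is a one-line operation per gene. The sketch below applies it to four hypothetical log2(ratio) profiles, labelled after the families mentioned above; the numbers are invented for illustration and are not the values in the cited figure.

```python
def mean_center(profile):
    """Subtract a gene's average across all experiments from each data point."""
    mean = sum(profile) / len(profile)
    return [value - mean for value in profile]


# Hypothetical log2(ratio) profiles over four experiments.
profiles = {
    "B (constant, up-regulated)": [2.0, 2.0, 2.0, 2.0],
    "C (constant, down-regulated)": [-2.0, -2.0, -2.0, -2.0],
    "D (up then down)": [0.0, 1.0, 0.0, -1.0],
    "G (same shape, higher baseline)": [1.0, 2.0, 1.0, 0.0],
}

centered = {name: mean_center(p) for name, p in profiles.items()}
for name, p in centered.items():
    print(name, p)
```

After centering, B and C become identical flat profiles and D and G become identical, so a clustering algorithm that sees only the centered values groups them together.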
The effect of Data Filtering by Mean Centering
The effect of data filtering.
Application of various data filters or changes in
the distance metric can change the results
derived from any clustering algorithm.
a. Mean centering of the data removes ‘constant’
expression, which reveals changes in expression
patterns for the nine gene families across the ten
experiments. The changes can be seen in the results of
b. principal component analysis
c. average-linkage hierarchical clustering.
Adapted from “Computational analysis of microarray data” - J. Quackenbush (2001), Nature Reviews Genetics, vol 2
Supervised clustering (classifiers)
* Supervised methods can be applied if one has some previous knowledge of which
genes are expected to cluster together
* Support Vector Machine (SVM) - a widely used technique
* SVM uses a training set of genes known to be related e.g. functionally; the training
set is provided as positive members and genes known not to be related are used as
negative examples
* this training of the SVM allows it to distinguish between members and non-members
of the group based on the expression data; SVM uses existing biological relationships to
determine expression features that are characteristic for a group
Supervised clustering (classifiers)
* the SVM is then used to recognize and classify the genes in the data set into the
established groups on the basis of their expression
* the SVM can also identify genes in the training set that are outliers or that have
been previously assigned to the incorrect class
* an application of potentially great impact is the classification of samples from
patients affected by a disease; if there is information on expression patterns
that is already correlated with survival data, disease stage or disease type, it
can be used to train the SVM to classify samples, for example for cancer
diagnostics. In many cases samples look the same histologically, but their
“expression fingerprint” is different. A certain “expression fingerprint” might be
correlated with a different rate of progression of the disease or with its response to
treatment with various drugs
Clustering/Classification of expression data
Problems:
* how to normalize expression values?
* what distance metric to use?
* results are very much dependent on the approach taken for the
analysis
Algorithm limitations:
* take into account only expression profiles
* do not incorporate all the available biological information
* clustering unrelated data will still produce clusters!
* there is no “best” or “correct” clustering technique - the results have
to be evaluated in the context of the existing biological knowledge
“Extraction of Correlated Gene Clusters by Multiple Graph
Comparison”
Akihiro Nakaya
Susumu Goto
Minoru Kanehisa
Bioinformatics Center, Kyoto University, Japan
Genome Informatics 12: 44-53 (2001)
Graphs
[Figure: example graph with vertices (nodes) A to F connected by edges.]
G = (V, E)
G - graph
V - vertices
E - edges
Comparisons of Graphs (common sub-graph)
[Figure: a linear genome graph and a pathway graph over the nodes A to G, compared to find their common sub-graph.]
Comparisons of Graphs
[Figure: the linear genome graph and the pathway graph over the nodes A to G, with a cluster of nodes marked.]
The KEGG Databases
Database | Data object (graph) | Node | Edge | Content
GENES | Genome | Gene | Adjacency | Gene catalogs of completely sequenced genomes and some partial genomes
SSDB | Protein universe | Protein | Sequence similarity | Ortholog/paralog relations of all protein-coding genes in complete genomes
PATHWAY | Network | Gene product or subnetwork | Generalized protein interaction | Generalized protein interaction network (pathways and complexes) involving various cellular processes
LIGAND | Chemical universe | Compound | Reaction | Chemical compounds and chemical reactions that are relevant to cellular processes
EXPRESSION | Transcriptome | Gene | Expression similarity | Microarray gene expression profiles
BRITE | Proteome | Protein | Direct interaction | Protein-protein interactions and relations
Pathway Database
* network of gene products (nodes) with three types of interactions
or relations (edges)
- enzyme-enzyme relations (catalyzing successive reaction steps
in the metabolic pathway)
- protein-protein interactions (e.g. binding, phosphorylation)
- gene expression relations (transcription factors and target gene
products)
* 5761 entries ( as of Sept 2001)
- 201 reference pathway diagrams
- 83 ortholog group tables
- 960 enzyme-enzyme relations
One of the really fundamental problems in biology:
* there is a fraction of genes with known functions; however, the majority of genes
have not been assigned a function, even in genomes that have already been
sequenced
* how to find gene functions, or genes/gene products with related functions, from
all the information obtained from sequencing, expression profiling or protein-protein interaction assays?
* Techniques:
*Clustering of expression microarray data
*Classification of expression microarray data
*Multiple graph comparison
Goal:
*extract a set of correlated genes with respect to multiple biological
features
Method:
Relationships among genes on a specific feature are encoded as a graph structure
where nodes correspond to genes (or gene products). This might suggest a functional
link between genes.
Genome
(gene cluster)
Pathway
(enzyme cluster)
Expression
(co-expressed genes)
Correlated Gene Clusters
* if all or most of the genes preserve their mutual relationships in
multiple graphs, the biological relevance among these genes is considered to be
supported with high probability
* can be used to characterize, classify or predict activities of genes
* finding clusters in different graphs is actually finding common sub-graphs among
them
* this belongs to the category of NP-complete problems (non-deterministic polynomial time
complete), a class of extremely hard problems with enormous
computational complexity
* real problems are solved by heuristic algorithms
Algorithm
Heuristics:
* given the correspondences of nodes (vertices) in two graphs,
we want to identify whether the two graphs contain locally related
regions
* when two graphs are viewed as being linked by correspondences
(additional edges), then the problem becomes finding clusters of
those correspondences
[Figure: a clustering algorithm groups the correspondences between graphs G1 and G2 into clusters C1 and C2.]
Algorithm
* if the set contains n correspondences (virtual edges), the problem
is to cluster these n data points according to a certain distance measure
* each data point represents a correspondence between a node in G1 and
a node in G2
* the distance between two data points i and j may be
defined by two distances
d1(i, j) - the length of the shortest path between nodes v1i and v1j in graph G1
d2(i, j) - the length of the shortest path between nodes v2i and v2j in graph G2
[Figure: nodes v1i and v1j in G1 = (V1, E1) are linked to nodes v2i and v2j in G2 = (V2, E2); each link is a correspondence (binary relationship, virtual edge).]
Algorithm
*first, each correspondence is considered as an individual cluster
* initially there are n initial clusters
[Figure: graphs G1 = (V1, E1) and G2 = (V2, E2); each correspondence between a node of G1 and a node of G2 starts as its own cluster (C1, C2).]
Algorithm
*then single-linkage clustering is performed, using the following
criterion to decide whether to merge two clusters Ci and Cj:
$$d(i, j) = \begin{cases} 1 & \text{if } \min\{d_1(r, s) \mid r \in C_i,\ s \in C_j\} \le 1 + \mathrm{Gap}_1 \text{ and } \min\{d_2(r', s') \mid r' \in C_i,\ s' \in C_j\} \le 1 + \mathrm{Gap}_2 \\ 0 & \text{otherwise} \end{cases}$$
where Gap1 and Gap2 are non-negative gap parameters
* if d(i, j) = 1, the clusters Ci and Cj are merged
[Figure: clusters C1 and C2 of correspondences between G1 = (V1, E1) and G2 = (V2, E2).]
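The merge criterion can be sketched for two small unweighted graphs. In this sketch a correspondence is a pair (node in G1, node in G2), a cluster is a list of correspondences, and path lengths are found by breadth-first search, which suffices when every edge has length 1. The graphs, node names and function names are hypothetical and are not taken from the paper.

```python
from collections import deque


def undirected(edges):
    """Build an adjacency dictionary from a list of undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_length(graph, source, target):
    """Number of edges on the shortest path; infinity if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return float("inf")


def merge_indicator(ci, cj, g1, g2, gap1=0, gap2=0):
    """Return 1 if clusters ci and cj satisfy the merge criterion, else 0."""
    d1 = min(shortest_path_length(g1, r[0], s[0]) for r in ci for s in cj)
    d2 = min(shortest_path_length(g2, r[1], s[1]) for r in ci for s in cj)
    return 1 if d1 <= 1 + gap1 and d2 <= 1 + gap2 else 0


# G1: genes in genome order; G2: gene products along a linear pathway.
g1 = undirected([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])
g2 = undirected([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")])

adjacent = merge_indicator([("a", "A")], [("b", "B")], g1, g2)
one_apart = merge_indicator([("c", "C")], [("e", "E")], g1, g2)
one_apart_with_gap = merge_indicator([("c", "C")], [("e", "E")], g1, g2, gap1=1, gap2=1)
print(adjacent, one_apart, one_apart_with_gap)
```

With both gap parameters at zero only directly adjacent correspondences merge; raising the gaps to 1 tolerates one intervening node in each graph.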
Algorithm
* extend the problem to finding a correlation of sub-graphs in more than
two graphs (additional graphs provide information about gene-gene
relations that cannot be found in the two graphs)
* correlated gene clusters are connected by links (hyperedges) that link
genes from the corresponding clusters
* the distance between hyperedges reflects the shortest path length
between the nodes in the graphs
* correlated gene clusters: we can find sets of tightly coupled nodes
in the graphs by gathering hyperedges based on their distance
Algorithm
[Figure: three graphs, G1 (Genome), G2 (Pathway) and G3 (Similarity); clusters C1 and C2 of hyperedges select a set of nodes in each graph.]
Input datasets:
n graphs: G = {G1, …, Gn}
m hyperedges: H = {h1, …, hm}
A hyperedge is denoted by an n-tuple:
$h_i = (x_{1,i_1}, \ldots, x_{n,i_n})$
The kth element $h_i^k = x_{k,i_k}$ is Gk's node that constitutes the hyperedge
(1 <= k <= n) (assume a hyperedge has exactly n nodes)
Algorithm
[Figure: clusters C1 and C2 of hyperedges and their node sets in the graphs G1, G2, …, Gk.]
set of hyperedges C1:
$C_1 = \{h_{s_1}, \ldots, h_{s_p}\}$
set of hyperedges C2:
$C_2 = \{h_{t_1}, \ldots, h_{t_q}\}$
set of kth elements of hyperedges in C1:
$C_1^k = \{h_{s_1}^k, \ldots, h_{s_p}^k\}$
set of kth elements of hyperedges in C2:
$C_2^k = \{h_{t_1}^k, \ldots, h_{t_q}^k\}$
[Figure: node x in the node set of C1 and node y in the node set of C2 within one graph; the same clusters select node sets in the other graphs.]
* d(x, y) is the length of the shortest path between nodes x and y in graph Gs (can be
calculated by Dijkstra’s algorithm)
* distance $dis(C_1^s, C_2^s) = \max\{d(x, y) \mid x \in C_1^s,\ y \in C_2^s\}$ for complete-linkage
clustering
* distance between two sets of hyperedges C1 and C2:
$D(C_1, C_2) = \sum_{s=1}^{n} dis(C_1^s, C_2^s)$
[Figure: hyperedges h1 to h6 across three graphs; h1, h2 and h3 form C1 and h4, h5 and h6 form C2.]
In our case:
H = {h1, h2, h3, h4, h5, h6}
C1 = {h1, h2, h3}
C2 = {h4, h5, h6}
$D(C_1, C_2) = \sum_{s=1}^{3} dis(C_1^s, C_2^s) = dis(C_1^1, C_2^1) + dis(C_1^2, C_2^2) + dis(C_1^3, C_2^3) = 8 + 8 + 8 = 24$
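The distance D between two sets of hyperedges can be computed directly from its definition. The sketch below uses Dijkstra's algorithm for the shortest paths, the complete-linkage maximum within each graph, and a sum over the graphs. The two path graphs, the unit edge weights and all names are hypothetical; the example is smaller than the one on the slide and does not reproduce its value of 24.

```python
import heapq


def dijkstra(graph, source):
    """Shortest path lengths from source; graph maps node -> {neighbor: weight}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                      # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def dis(nodes_a, nodes_b, graph):
    """Complete linkage: the largest shortest-path length between the two node sets."""
    worst = 0
    for x in nodes_a:
        reachable = dijkstra(graph, x)
        for y in nodes_b:
            worst = max(worst, reachable.get(y, float("inf")))
    return worst


def hyperedge_set_distance(c1, c2, graphs):
    """D(C1, C2): sum over graphs of dis between the s-th elements of the hyperedges."""
    total = 0
    for s, graph in enumerate(graphs):
        total += dis({h[s] for h in c1}, {h[s] for h in c2}, graph)
    return total


def path_graph(nodes):
    """Chain the nodes in order with edges of weight 1."""
    graph = {n: {} for n in nodes}
    for a, b in zip(nodes, nodes[1:]):
        graph[a][b] = 1
        graph[b][a] = 1
    return graph


g1 = path_graph(["g1", "g2", "g3", "g4"])
g2 = path_graph(["p1", "p2", "p3", "p4"])
c1 = [("g1", "p1"), ("g2", "p2")]         # hyperedges h1, h2
c2 = [("g3", "p3"), ("g4", "p4")]         # hyperedges h3, h4
total = hyperedge_set_distance(c1, c2, [g1, g2])
print(total)
```

In each graph the farthest pair of nodes across the two clusters is three edges apart, so D is 3 + 3.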
Clustering of hyperedges
* using the distance D we cluster the hyperedges, starting from an initial set of
clusters, each of which consists of a single hyperedge
* we iterate the procedure: pick the two clusters between which the distance
is the smallest
* merge them into a new cluster (hierarchical clustering using distance D)
* in order to merge, the distance in each graph Gi must be under a given threshold pi
* if the path length between two nodes x and y is larger than pi, set d(x, y) to
infinity to eliminate that path; clusters at infinite distance are not
merged
* when no two clusters remain at a finite distance from each other,
the clustering is done
Visualizing the clusters
* if the clusters were visualized in 2D, then the distance limit pi
will correspond to a radius pi within which nodes of one graph can be
clustered (to avoid merging distant genes in the same graph)
* only nodes within circles that intersect can potentially form a bigger
cluster
[Figure: clusters C1, C2 and C3 around nodes x, y and z, each with a circle of radius pi.]
Initial clusters: C1, C2, C3
Merge C1 and C2
Set distance to infinity between C1 and C3 and C1 and C2
The final clusters: C1, C2
There are no more clusters to join since
all the distances between
clusters are now infinity
Homologous gene clusters in the genomes of E. coli and H. influenzae
Applications of the algorithm
* recent high-throughput technologies provide vast amounts of biological
data, which contain unknown, hypothetical or erroneous relationships among
genes
* standard approaches cluster data according to only one biological
parameter (e.g. microarray data are clustered by expression patterns only);
they may uncover links between known and unknown genes
* advantage of the correlated gene clusters: multiple biological criteria (graphs)
are incorporated in the analysis; if relationships among genes/gene
products cannot be explained or do not make sense in a single dataset,
multiple datasets will increase the likelihood of deducing the potentially
biologically significant relationships. Alternatively, the algorithm can emphasize
a relationship that might have been uncovered by clustering techniques
* next step - find relations among genes in the correlated gene clusters
Summary:
* Microarray system basics
* Data collection, normalization, similarity measures
* Expression matrix and expression vectors
* Analysis of expression data - Clustering algorithms
- Hierarchical clustering
- k-means clustering
- Principal Component Analysis
- Supervised clustering
* “Extraction of correlated gene clusters by multiple graph
comparison”