Comparisons and validation of statistical clustering techniques for microarray gene expression data
Susmita Datta and Somnath Datta
Presented by: Jenni Dietrich
Assisted by: Jeffrey Kidd and Kristin Wheeler
Mentor: Dr. Takis Benos
26 June 2003
Outline
• Brief microarray overview
• Purpose of the paper
• Discuss clustering algorithms
• Experiment and Results
• Conclusions
Microarrays
• Allow for monitoring of gene expression at the transcript level
• Slide with single-stranded DNA molecules attached at fixed positions (probes)
• Exploit the complementary binding of single-stranded DNA sequences
Microarrays
• Results in a large data set containing expression levels of thousands of genes
• Microarray experiments are often used to track the changes in gene expression
  – Over time
  – In the presence of various agents
Microarrays
[Figure: microarray slide]
Microarrays
[Figure: gene expression matrix; numerical values are encoded by color]
Microarrays
• Gene expression profiles characterize the dynamic functioning of each gene in the genome
• Expression data can be represented as a matrix where the rows are genes and the columns are samples
• The values in the cells of the matrix represent the expression levels
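As a small illustration of the matrix layout described above, the sketch below stores one row per gene and one column per sample. The gene names, sample labels, and values are invented for illustration and do not come from the talk.

```python
# Hypothetical expression matrix: rows are genes, columns are samples
# (here, time points). All names and values are made up.
genes = ["geneA", "geneB", "geneC"]
samples = ["0h", "2h", "4h", "6h"]
expression = [
    [0.1, 0.8, 1.5, 2.0],  # geneA: rising over time
    [0.2, 0.9, 1.4, 2.1],  # geneB: similar to geneA
    [1.9, 1.2, 0.4, 0.1],  # geneC: falling over time
]


def profile(gene):
    """Return the expression profile (one matrix row) of a gene."""
    return expression[genes.index(gene)]


def level(gene, sample):
    """Return the expression level stored in one cell of the matrix."""
    return profile(gene)[samples.index(sample)]
```

Each row is the expression profile that the clustering algorithms later compare between genes.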
Experiment
• Experiment Design
  – Decide on probes and genes
  – Type of microarray
• Data Normalization
• Data analysis
  – Identify differentially expressed genes
  – Cluster genes based on expression patterns
Clustering
• Goal of microarray data analysis
  – identify changing levels of gene expression
  – correlate the changes to identify sets of genes with similar profiles
• Clustering
  – group objects into subsets
• Clustering algorithms can be used to group genes that have similar expression patterns
Clustering Algorithms
• Hierarchical clustering with correlation (UPGMA; most commonly used algorithm)
• Clustering by K-means
• Diana
• Fanny
• Model-based clustering
• Hierarchical clustering with partial least squares
Purpose of the Paper
• Currently, there are no clear guidelines for choosing a clustering algorithm to group genes based on their expression profiles
• This paper evaluated the performance of six different algorithms using a microarray data set on sporulation of budding yeast
Clustering Algorithms
• Algorithms differ in the measure of similarity used when grouping the objects and in the grouping technique
• Some need and use previous knowledge about the suspected number of clusters
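K-means is one of the listed algorithms that needs the number of clusters in advance. The sketch below is a minimal version of that idea: objects start with a random cluster assignment and are then moved to the cluster whose mean is closest, until nothing changes. The Euclidean distance, the reseeding of empty clusters, and the example profiles are assumptions made for this sketch, not details from the paper.

```python
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def centroid(points):
    """Coordinate-wise mean of a non-empty list of profiles."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(profiles, k, seed=0, max_iter=100):
    """Return one cluster label (0..k-1) per profile."""
    rng = random.Random(seed)
    # Random initial assignment of every object to one of k clusters.
    labels = [rng.randrange(k) for _ in profiles]
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            # Assumption: an empty cluster is reseeded with a random profile.
            centers.append(centroid(members) if members else list(rng.choice(profiles)))
        # Move each object to the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centers[c])) for p in profiles
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Hypothetical profiles forming two well-separated groups.
example_profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
example_labels = kmeans(example_profiles, k=2)
```

The label numbers themselves are arbitrary; only the grouping matters.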
Hierarchical clustering
• Produces a hierarchy of clusters rather than a predefined number of clusters (agglomerative approach)
• Initially, each observation is in its own cluster
• Subsequently, the two 'closest' clusters are combined into a single cluster
• The similarity (distance) measure used is the average method
  – Distance between clusters is the average of the distances between the points in one cluster and those in the other cluster
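The agglomerative steps above can be sketched in a few lines. This is an illustrative version only: it uses Euclidean distance between points as a stand-in (the talk pairs hierarchical clustering with a correlation-based measure), and the function names are invented for this sketch.

```python
def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def average_linkage(cluster_a, cluster_b, points, dist=euclidean):
    """Average of all distances between points of one cluster and the other."""
    total = sum(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, n_clusters=1, dist=euclidean):
    """Merge the two closest clusters until n_clusters remain.

    Returns the remaining clusters (lists of point indices) and the
    merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points, dist)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters, merges
```

Running the loop down to a single cluster yields the full hierarchy in `merges`; stopping earlier gives a flat partition.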
Diana
• Divisive clustering method
  – All objects start in one cluster and are broken into smaller groups
• Genes with larger dissimilarity are put in different clusters
• Uses the standard Euclidean distance measure
K-means Clustering
• Uses advance knowledge about the number of clusters to be formed
  – k clusters
• Initially, all objects are randomly assigned to one of k clusters
• Objects are moved between clusters in an attempt to minimize the distance between that object and its cluster
Fanny
• Uses fuzzy logic and produces a probability vector for each observation
• Hard cluster is formed by assigning an observation to a group with the highest probability
• Uses the Manhattan distance measurement
  – d=|x-u|+|y-v|
    • where (x,y) and (u,v) are two points
• Needs a predefined number of clusters (k)
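Two pieces of the Fanny description are easy to show in code: the Manhattan distance and the step from a probability vector to a hard cluster. The sketch below does not compute the fuzzy memberships themselves; the membership vectors are hypothetical values chosen for illustration.

```python
def manhattan(p, q):
    """Manhattan distance: d = |x-u| + |y-v|, for any number of coordinates."""
    return sum(abs(a - b) for a, b in zip(p, q))


def hard_cluster(membership):
    """Index of the group with the highest membership probability."""
    return max(range(len(membership)), key=lambda g: membership[g])


# Hypothetical membership vectors for three observations and k = 3 groups.
memberships = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.25, 0.5, 0.25],
]
hard_labels = [hard_cluster(m) for m in memberships]
```

For the points (1, 2) and (4, 6), the distance is |1-4| + |2-6| = 7.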
Model-based Clustering
• Treats data as a mixture distribution
  – Often based on a Gaussian distribution
• Describes each cluster using a probabilistic model
• No predefined number of clusters
Experiment
• Run each of the six clustering algorithms with sporulation of yeast data set
• Used three validation measurements to compare the results
  – Average proportion of non-overlap measure
  – Average distance between means measure
  – Average distance measure
Results
• Based on average proportion of non-overlap and average distance between means measures:
  – Model-based clustering appears to be worst
  – Hierarchical clustering with correlation and Fanny seem to be the best
• Based on average distance measure:
  – Hierarchical clustering performed worst
• Overall, Diana performed consistently well for all three measurements
Average proportion of non-overlap & average distance between means measures
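The slides name the average proportion of non-overlap measure without giving its formula. The sketch below follows one reading of the measure in Datta and Datta (2003), and that reading is an assumption here: cluster the genes on the full data and again with one time point removed, then, for each gene, take the fraction of its full-data cluster that does not share its cluster in the reduced data, and average. Function names and example labels are invented for illustration.

```python
def proportion_non_overlap(full_labels, reduced_labels):
    """Average, over genes, of 1 - |C_full(g) & C_reduced(g)| / |C_full(g)|.

    full_labels[g] is the cluster of gene g using all time points;
    reduced_labels[g] is its cluster after removing one time point.
    """
    n = len(full_labels)
    total = 0.0
    for g in range(n):
        full_cluster = {h for h in range(n) if full_labels[h] == full_labels[g]}
        reduced_cluster = {h for h in range(n) if reduced_labels[h] == reduced_labels[g]}
        total += 1 - len(full_cluster & reduced_cluster) / len(full_cluster)
    return total / n


def average_non_overlap(full_labels, reduced_labelings):
    """Average the measure over several one-time-point-removed clusterings."""
    values = [proportion_non_overlap(full_labels, r) for r in reduced_labelings]
    return sum(values) / len(values)
```

Under this reading, a value of 0 means removing a time point never changed which genes cluster together, so smaller values indicate a more stable algorithm.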
Average distance between mean measures
Comparison of model profiles
The smaller the distance from the model profile, the closer the results of that algorithm are to the results of the model.
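The comparison with a model profile can be sketched as follows: average the temporal profiles of the genes an algorithm put in a group, then measure how far that average is from the model profile. Using Euclidean distance is an assumption of this sketch, since the slides do not say which distance was used, and all names are illustrative.

```python
def mean_profile(profiles):
    """Coordinate-wise average of a non-empty list of temporal profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]


def euclidean(p, q):
    """Euclidean distance between two equal-length profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def distance_to_model(cluster_profiles, model_profile):
    """Distance between a cluster's average profile and the model profile."""
    return euclidean(mean_profile(cluster_profiles), model_profile)


def closest_algorithm(results, model_profile):
    """Name of the algorithm whose cluster average is nearest the model.

    results maps an algorithm name to the profiles of the genes it placed
    in the group being compared with the model profile.
    """
    return min(results, key=lambda name: distance_to_model(results[name], model_profile))
```

With a model profile of [0, 1, 2], a cluster averaging [0, 1, 3] lies at distance 1 and would rank ahead of a cluster averaging [4, 4, 4].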
Average Temporal Profiles
[Figure: average temporal profiles for each group category; x-axis: Time (hours), 0 to 12 in steps of 2]
Group category: I (52 genes), II (62 genes), III (47 genes), IV (95 genes), V (158 genes), VI (61 genes), VII (5 genes)
Chu et al. (1998) Science 282: 699-705
Guidelines for choosing a clustering algorithm
1. Produce a visual plot of the first two principal components to determine the method that gives the most separation between groups
2. Check for consistency of the method with temporal observations
3. Compare the average group temporal profiles with the model profiles produced from a known, hand-picked set of genes (training set)
Other aspects
• It may be important to inspect the computational stability as well as the computational time of an algorithm before making a choice
In conclusion…
• The clustering algorithm directly affects the interpretation and analysis of the data
• Therefore, careful consideration of the algorithms is necessary before a choice is made
• For this data set, Diana performed consistently well when compared with the model profile and when looking at the three validation measures
References
• Datta, S, and S Datta. (2003) Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19: 459-466
• Brazma, A, and J Vilo. (2000) Gene expression data analysis. FEBS Letters 480: 17-24
• Quackenbush, J. (2001) Computational Analysis of Microarray Data. Nature Reviews Genetics 2: 418-427
• Hastie, T, R Tibshirani, and J Friedman. (2001) The Elements of Statistical Learning 453-480
• http://www.austinlinks.com/Fuzzy/overview.html
• http://www.maths.lth.se/help/R/.R/library/cluster/html/fanny.html