1/42
BicAT_Plus: An Automatic Bi/Clustering Comparative Tool
of Gene Expression Data Obtained Using Microarrays
Fadhl M. Al-Akwaa
Biomedical Eng. Dept., Univ. of Science & Technology, Sana’a, Yemen
Biomedical Engineering Department, Cairo University, Giza, Egypt
Mohamed H. Ali
Computer Science School, Nottingham University, Nottingham, United Kingdom
Yasser M. Kadah
Center for Informatics Sciences, Nile University, Egypt
Biomedical Engineering Department, Cairo University, Giza, Egypt
2/42
What is Bioinformatics?
Bioinformatics is defined as the creation and advancement of databases,
algorithms, computational and statistical techniques, and theory to
advance the understanding of biological processes.
3/42
The Central Dogma
[Figure: DNA (nucleus) -> Transcription -> RNA (cytoplasm), labelled "Gene Expression Level" -> Translation -> PROTEIN (cytoplasm), drawn as the amino-acid chain A F G N S T D K G S A.]
4/42
Biological Balance Feedback System
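[Figure: feedback loop for GENE A. External or internal stimuli, disease, and drugs switch the gene on or off; the loop links Transcription Rate, Gene Expression Level, Translation Rate, and Protein Level through positive (+) and negative (-) feedback.]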
5/42
Transcriptome data: Microarray Technology
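[Figure: a microarray yields a Gene Expression Data matrix with genes G1 ... Gn as rows and conditions C1 ... Cm as columns; example rows: G1 = 0.5, 1, 2 and Gn = 3, 2, 1.]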
6/42
Biological Balance Feedback System
[Figure: two coupled feedback loops, one for GENE A and one for GENE B. Each loop links Transcription Rate, Gene Expression Level, Translation Rate, and Protein Level through positive (+) and negative (-) feedback, driven by external or internal stimuli.]
7/42
Biological Balance Feedback System
[Figure: genes g1 to g4, grouped under Gene A and Gene B, connected by activating (+) and inhibiting (-) links. A system of such balance feedback loops forms a Gene Regulatory Network (GRN).]
8/42
Gene Regulatory Network GRN
9/42
Biological Data Base
[Figure: DNA -> Transcription -> RNA -> Translation -> PROTEIN (amino-acid chain A F G N S T D K G S A).]
10/42
Drug Discovery
• One of the main objectives of bioinformatics is to integrate these
databases to advance human health.
[Figure labels: Drug Discovery, Disease Ontology]
11/42
Drug Discovery & GRN
• The cost of bringing a new drug to market ranges from around 500 million to 2,000 million
dollars.
• Drug design requires a sophisticated understanding of how genes interact
with each other -> construct a GRN.
[Figure: example GRN linking genes g1 to g4 with activating (+) and inhibiting (-) edges.]
12/42
Drug Discovery: GRN steps
1. Experimental Design: prepare microarray chip; sampling rate; error; experimental condition
2. Data Extraction: microarray image; segmentation
3. Gene Expression Matrix: genes g1, g2, ..., gn (rows) by conditions c1, c2, ..., cm (columns)
4. Preprocessing: normalization; discretization; filtration (missing value, low entropy, low variance)
5. Gene Clustering: traditional clustering methods; bicluster methods
6. Network Generation: dynamic Bayesian network; probabilistic Boolean network; fuzzy network; ………
7. Drug Testing
13/42
Gene Expression Data Analysis: Clustering
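[Figure: the expression matrix (n genes x m assays) is converted into an n x n gene-by-gene similarity matrix, and genes are then clustered based on similarity.]
Similarity measures:
• Euclidean distance
• Pearson correlation coefficient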
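The two measures named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from BicAT-plus: the function names are hypothetical, and treating a constant profile as having correlation 0 is an assumption made here to avoid division by zero.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: a flat profile correlates with nothing
    return cov / (sd_x * sd_y)

def similarity_matrix(profiles):
    """n x n matrix of pairwise Pearson correlations for n gene profiles."""
    return [[pearson(p, q) for q in profiles] for p in profiles]
```

Calling `similarity_matrix` on the rows of an n x m expression matrix gives the n x n matrix that the clustering step works on.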
14/42
Hierarchical Clustering
Similarity matrix:
       g2      g3      g4      g5
g1     0.23    0.00    0.95   -0.63
g2             0.91    0.56    0.56
g3                     0.32    0.77
g4                            -0.36
• Find the largest value in the similarity matrix.
• Join those genes together.
• Recompute the matrix and iterate.
[Figure: dendrogram joining g1 and g4.]
15/42
Hierarchical Clustering
Similarity matrix after joining g1 and g4:
          g2      g3      g5
g1,g4     0.37    0.16    0.52
g2                0.91    0.56
g3                        0.77
• Find the largest value in the similarity matrix.
• Join clusters together.
• Recompute the matrix and iterate.
[Figure: dendrogram with leaves g1, g4, g2, g3.]
16/42
Hierarchical Clustering
Similarity matrix after joining g2 and g3:
          g2,g3   g5
g1,g4     0.27    0.52
g2,g3             0.68
• Find the largest value in the similarity matrix.
• Join clusters together.
• Recompute the similarity matrix and iterate.
[Figure: dendrogram with leaves g1, g4, g5, g2, g3.]
17/42
Hierarchical Clustering: dendrogram
Eisen et al. (1998), PNAS, 95(25): 14863-14868
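The find-largest, join, recompute loop from the hierarchical clustering slides can be sketched as follows. The slides do not say how similarities are recomputed after a join, so average linkage is assumed here; the function name is hypothetical, and the intermediate similarities it computes need not match the recomputed values shown on the slides. The starting similarities are the ones from the first similarity matrix.

```python
def average_linkage_clustering(genes, sim):
    """Agglomerative clustering on a similarity table.

    sim maps frozenset({a, b}) -> similarity of genes a and b.
    Returns the list of merges as (cluster_1, cluster_2, similarity).
    """
    clusters = [(g,) for g in genes]
    merges = []

    def link(c1, c2):
        # Assumption: cluster similarity is the mean pairwise gene similarity.
        pairs = [sim[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the largest value in the current similarity matrix.
        candidates = [
            (link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        score, i, j = max(candidates, key=lambda c: c[0])
        merges.append((clusters[i], clusters[j], score))
        # Join the two clusters; similarities are recomputed on the next pass.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

genes = ["g1", "g2", "g3", "g4", "g5"]
values = {
    ("g1", "g2"): 0.23, ("g1", "g3"): 0.00, ("g1", "g4"): 0.95, ("g1", "g5"): -0.63,
    ("g2", "g3"): 0.91, ("g2", "g4"): 0.56, ("g2", "g5"): 0.56,
    ("g3", "g4"): 0.32, ("g3", "g5"): 0.77,
    ("g4", "g5"): -0.36,
}
sim = {frozenset(pair): s for pair, s in values.items()}
merges = average_linkage_clustering(genes, sim)
```

The merge list is what a dendrogram draws: each entry is one join, and its similarity gives the height of that join.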
18/42
Gene Expression Data Analysis: Clustering
• A cluster is a group of
genes that show similar
expression profiles
across the experiments.
• Examples
– K-means
– Hierarchical
– Self-Organizing Map
– CLICK
– Model-based clustering
Eisen et al. (1998), PNAS, 95(25): 14863-14868
19/42
Gene Expression Data Analysis: Clustering Limitations
       c1    c2    c3    c4    c5    c7    c8    c9    c10
g1     3     4     1     1     7     10    11    1     1
g2     5     6     1     1     0.5   0.1   1     1     1
g3     2     2     2     2     2     2     2     2     2
g4     1     1     1     1     2     2     2     1     1
g5     3     4     4     2     5     4     7     9     8
g6     6     7     1     9     0     6     4     2     1
g7     0.5   0.1   1     2     2     2     2     2     5
20/42
Gene Expression Data Analysis: biClustering
The mean squared residue score (MSRS)
George M. Church, Professor of Genetics, Harvard Medical School
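The mean squared residue named on this slide is the score introduced by Cheng and Church (2000): for a submatrix with rows I and columns J, it averages the squared residue a_ij - a_iJ - a_Ij + a_IJ over all cells, where a_iJ, a_Ij, and a_IJ are the row mean, column mean, and overall mean of the submatrix. A minimal sketch follows; the function name and the list-of-lists matrix layout are choices made for this illustration.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the bicluster defined by rows x cols.

    matrix is a list of rows; rows and cols are lists of indices.
    A score of 0 means the submatrix is perfectly additive
    (every row is a shifted copy of every other row).
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)
```

A low score indicates that the selected genes rise and fall together across the selected conditions, even if their absolute levels differ.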
21/42
Biclustering Algorithms
Algorithm            Author
BiVisu / pClusters   Kin-On Cheng et al., 2008 / Haixun Wang, 2002
RMSBE                Xiaowen Liu and Lusheng Wang, 2006
Bimax                Prelic et al., 2006
ROBA                 Alain B. Tchagang and Ahmed H. Tewfik, 2005
x-motif              Murali and Kasif, 2003
SAMBA                Tanay et al., 2002
OPSM                 Ben-Dor et al., 2002
Plaid                Laura Lazzeroni and Art Owen, 2000
ISA                  Ihmels et al., 2002
CC / δ-biclusters    Cheng and Church, 2000
22/42
Paper IDEA
Which algorithm is
suitable for my dataset?
Which algorithm is better? And
do some algorithms have
advantages over others?
Generally, comparing different biclustering
algorithms is not straightforward as they differ
in strategy, approach, computational
complexity, number of parameters, and
prediction ability.
Moreover, such methods are strongly influenced
by user selected parameter values.
23/42
BicAT-plus
• To the best of our knowledge, no bicluster comparison
toolbox has been available in the literature.
• We have developed a comparative tool, which
we call “BicAT-plus”, that includes a
biological comparative methodology to enable
researchers and biologists to compare
the different bi/clustering methods based on a set
of biological values and draw conclusions on the
biological meaning of the results.
24/42
BicAT
BicAT-plus is an extension of the BicAT toolbox, a popular gene expression
analysis toolbox that contains 5 biclustering and 2 traditional clustering
algorithms.
•OPSM
•CC
•ISA
•x-motif
•Bimax
•K-means
•Hierarchical
25/42
BicAT-plus Comparison Methodology
[Figure: each algorithm produces a set of biclusters (n for one algorithm, m for another), each a gene list such as g1, g2, g3, g4, g5, ... Every bicluster is tested against four biological sources and labelled Enriched or not Enriched.]
Function: GO
Pathway: KEGG
PPI: BioGRID
Promoter: GenBank
Enriched bicluster = has biological meaning
26/42
BicAT-plus Comparison Methodology
• Percentage of enriched bi/clusters
\[ \text{Percentage of enriched biclusters}_{\text{significance level}} = \frac{\text{Number of enriched biclusters at this level}}{\text{total number of biclusters}} \]
• Percentage of annotated genes per
bi/cluster
\[ \text{Study fraction of a GO term} = \frac{\text{No. of genes sharing the GO term in a bicluster}}{\text{total number of genes in this bicluster}} \times 100 \]
• The predictive power of an algorithm to
recover patterns of interest selected by the user.
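The first two measures on this slide can be sketched as small Python functions. This is an illustration under stated assumptions, not the BicAT-plus implementation: the function names are hypothetical, each bicluster is assumed to be summarised by a single enrichment p-value, "enriched at a level" is taken to mean p-value below that level, and the first measure is scaled by 100 so that both functions return percentages.

```python
def percentage_enriched(p_values, level):
    """Percentage of biclusters enriched at the given significance level.

    p_values holds one enrichment p-value per bicluster (assumption).
    """
    enriched = sum(1 for p in p_values if p < level)
    return 100.0 * enriched / len(p_values)

def study_fraction(bicluster_genes, term_genes):
    """Percentage of a bicluster's genes that share a given GO term."""
    bicluster = set(bicluster_genes)
    shared = bicluster & set(term_genes)
    return 100.0 * len(shared) / len(bicluster)
```

For example, four biclusters with p-values 0.001, 0.2, 0.04, and 0.5 give `percentage_enriched(..., 0.05)` of 50.0, since two of the four fall below the level.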
27/42
BicAT-plus Features
1. Adding more
algorithms to the
BicAT-plus tool in
order to have one
software package
that employs most of
the commonly used
biclustering
algorithms.
28/42
BicAT-plus Features
2. Performing functional
analysis (Gene
Ontology) of
bicluster genes
using different GO
categories:
– Biological Process
– Molecular Function
– Cellular Component
29/42
BicAT-plus Features
3. Displaying the
analysis and
comparison results
using graphical and
statistical chart
visualizations in
multiple modes (2D
and 3D).
30/42
BicAT-plus Features
4. Comparing
the different
biclustering
algorithms based
on different
respective
methodologies.
31/42
BicAT Comparison Steps
Manual
file
http://home.k-space.org/FADL/Downloads/BicAT_plus.zip
32/42
Results
We used the Gasch gene expression data:
http://genome-www.stanford.edu/yeast_stress/
We used the default parameters that the authors
recommend in their publications.
Bi/clustering algorithm   Parameter settings
ISA                       tg = 2.0, tc = 2.0, seeds = 500
CC                        δ = 0.5, α = 1.2, M = 100
OPSM                      l = 100
BiVisu                    Ε = 60, Nr = 10, Nc = 5, = 25
K-means                   K = 100
33/42
Percentage of enriched bi/clusters
34/42
Percentage of annotated genes per each bi/cluster
35/42
The predictive power of an algorithm
to recover patterns of interest
• The conditions applied in the Gasch experiments
include temperature shocks, hydrogen
peroxide, the superoxide-generating drug
menadione, the sulfhydryl-oxidizing agent
diamide, the disulfide-reducing agent
dithiothreitol, ……
• The user can compare bi/clustering algorithms
by their ability to recover a defined
pattern, for example which of them
recover biclusters that respond to the conditions
applied in the Gasch experiments.
36/42
GO Term / (number of annotated genes)                        K-means  CC  ISA  Bivisu  OPSM
GO:0006970 response to osmotic stress / (83)                  3        5   6    3       0
GO:0006979 response to oxidative stress / (79)                2        7   11   0       0
GO:0006995 cellular response to nitrogen starvation / (5)     4        4   4    0       0
GO:0042149 cellular response to glucose starvation / (5)      0        2   0    0       0
GO:0009651 response to salt stress / (15)                     2        7   0    0       0
GO:0042542 response to hydrogen peroxide / (5)                0        0   0    2       0
GO:0000304 response to singlet oxygen / (4)                   2        0   0    0       0

GO terms whose counts are not aligned with their labels on the slide:
GO:0046686 response to cadmium ion / (102)
GO:0043330 response to exogenous dsRNA / (7)
GO:0046685 response to arsenic / (77)
GO:0009408 response to heat / (24)
GO:0009409 response to cold / (7)
GO:0009267 cellular response to starvation / (44)
Counts for these six terms, in slide order:
2, 3, 2, 2, 0; 2, 3, 2, 2, 0; 2, 0, 2, 2, 0; 3, 0, 2, 2, 0; 0, 0, 2, 0, 0; 0, 2, 0, 0, 0
37/42
Conclusion
http://home.k-space.org/FADL/Downloads/BicAT_plus.zip
• BicAT-plus is a flexible, open-source
software tool written in Java Swing. It
has a well-structured design that can
easily be extended with more
comparative methodologies to help
biologists extract the best results of
each algorithm and interpret these results
in terms of useful biological meaning.
38/42
BicAT-plus: this figure is for people who want to extend
BicAT-plus by adding new features (or fixing bugs).
39/42
Conclusion
• The comparison methodology used in this
study confirms that bicluster and cluster
algorithms can be considered
integrated modules: no single
algorithm can recover all the
interesting patterns. What algorithm A
succeeds in recovering in certain data sets,
algorithm B might fail to recover, and vice versa.
40/42
Conclusion
• Using BicAT-plus, we can identify the
highly enriched bi/clusters across all
compared algorithms and integrate them to
address the dimensionality reduction problem
in constructing gene regulatory networks
from gene expression
data, where the number of samples is smaller
than the number of genes in the microarray
dataset.
41/42
Thanks
42/42
BicAT-Plus
http://home.k-space.org/FADL/Downloads/BicAT_plus.zip
43/42
Availability and Requirements
• Availability: free to download
• System requirements
1. Java Runtime Environment (JRE); version 6 is recommended.
2. ActivePerl version 5.10
Note
BicAT-plus has been tested on a PC with the following
configuration: CPU: Pentium 4, 1.5 GHz; RAM: 2.0 GB; Platform:
Windows XP Professional with SP2.
44/42
Algorithms comparison
• Generally, comparing different biclustering
algorithms is not straightforward as they differ in
strategy, approach, computational complexity,
number of parameters, in addition to prediction
ability.
• Moreover, such methods are strongly influenced
by user selected parameter values. As a result,
the quality of biclustering results is often
considered more important than the required
computation time.
45/42
Algorithms comparison
• Although there are some analytical
comparative studies to evaluate the
traditional clustering algorithms (Azuaje,
2002; Datta and Datta, 2003; Yeung, et
al.), no such comprehensive comparison
of biclustering methods can be found in
the literature so far (Prelic, et al., 2006).
46/42
Cluster/bi-cluster algorithm performance
comparison: Cluster Evaluation
[Figure: n clusters (Cluster 1, Cluster 2, ..., Cluster n), each a gene list such as g1, g2, g3, g4, g5, ...]
• Homogeneity between cluster genes
• Separation between clusters
“it is not clear how to extend notions such as homogeneity and
separation (Gat-Viks et al., 2003) to the biclustering context
(to our best knowledge, no general internal indices have been
suggested so far for biclustering)” (Prelic, et al., 2006)
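To make the two internal indices concrete, here is one simple formulation, assumed for illustration and not taken from Gat-Viks et al. or from BicAT-plus: homogeneity as the mean Pearson correlation over gene pairs inside the same cluster, and separation as the mean Pearson correlation over gene pairs in different clusters. All function names are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def homogeneity(profiles, clusters):
    """Mean correlation over gene pairs that share a cluster (assumed definition)."""
    scores = [
        pearson(profiles[a], profiles[b])
        for genes in clusters
        for a, b in combinations(genes, 2)
    ]
    return sum(scores) / len(scores)

def separation(profiles, clusters):
    """Mean correlation over gene pairs from different clusters (assumed definition)."""
    scores = [
        pearson(profiles[a], profiles[b])
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return sum(scores) / len(scores)
```

Under this formulation a good clustering has high homogeneity and a low separation value. Both indices compare whole expression profiles, which is why they do not carry over directly to biclusters that use only a subset of conditions.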
47/42
Cluster/bi-cluster algorithm performance
comparison: Bicluster Evaluation
[Figure: n biclusters (bicluster 1, bicluster 2, ..., bicluster n), each a gene list such as g1, g2, g3, g4, g5, ...]
Function: hypergeometric test with the Gene Ontology database
Pathway: KEGG
PPI: BioGRID database
Promoter: Scan motif program
48/42
Hypergeometric Test
[Figure: a reference set of N genes (g1, g2, ..., gN) and a test set, Cluster1, of X genes (g1, g2, ..., gX) sampled from it.]
“when sampling X genes (test set) out of N genes (reference set), what is the
probability that x or more of these genes belong to a functional category C
shared by n of the N genes in the reference set?” (Maere, et al., 2005)
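The probability asked for in this question is the upper tail of the hypergeometric distribution, P(K >= x) = sum over k from x to min(n, X) of C(n, k) C(N - n, X - k) / C(N, X). A minimal sketch using only the Python standard library follows; the function name is hypothetical and the argument names follow the symbols in the quotation.

```python
from math import comb

def hypergeometric_tail(x, X, n, N):
    """Probability that x or more of X sampled genes belong to a category
    shared by n of the N genes in the reference set."""
    upper = min(n, X)
    total = comb(N, X)
    favourable = sum(
        comb(n, k) * comb(N - n, X - k)  # k genes in the category, X - k outside
        for k in range(x, upper + 1)
    )
    return favourable / total
```

For a cluster of 10 genes and a category of 6 genes, a call such as `hypergeometric_tail(6, 10, 6, 6000)` gives the p-value for all six category genes landing in the cluster; the reference-set size 6000 is a hypothetical figure chosen only for the example.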
49/42
The Gene Ontology
• The Gene Ontology (GO) is a project
that puts annotated genes (genes of known
function) into groups.
• Example in S. cerevisiae
• Function name = cellular response to glucose starvation
• Function ID = GO:0042149
• Genes: g1, g2, g3, g4, g5, g6
50/42
Hypergeometric Test: Example
Cellular response to glucose starvation
Cluster1 (10): g1, g2, g3, g4, g5, g6, g7, g8, g9, g10
GO:0042149 (6): g1, g2, g3, g4, g5, g6
2,3,4,5,6
51/42
GO enrichment programs with
Hypergeometric Test
• FuncAssociate
• GeneMerge
• GoMiner
• FatiGO
• GOstat
• GO::TermFinder
• http://www.geneontology.org/GO.tools.shtml
• We used the GeneMerge program, which was developed
at the University of Maryland (C. I. Castillo-Davis, 2003).
52/42
GO Analysis programs
Limitations
Reference set
Test set
53/42
BicAT
Swiss Federal Institute of Technology Zurich, ETH Zentrum, 8092 Zurich,
Switzerland
OPSM
CC
ISA
x-motif
Bimax
K-means
Hierarchical
54/42
BicAT-plus
• To the best of our knowledge, no such automatic gene
ontology comparison tool has been
available in the literature.
• We have developed a comparative tool,
which we call “BicAT-plus”, that includes
the biological comparative methodology
and serves as an extension to the BicAT
program.
55/42
BicAT-plus
• Moreover, BicAT-plus helps researchers
compare and evaluate the algorithms'
results multiple times according to the
user-selected parameter values as well as the
required biological perspective on various
datasets.
56/42
Gene Expression Data Analysis: biClustering
• Recent understanding of
cellular processes leads us to
expect subsets of genes to
be coregulated and
coexpressed under certain
experimental conditions,
but to behave almost
independently under other
conditions.
A. Prelic, 2006, Bioinformatics
• A bicluster is a group of
genes that show similar
expression profiles under
certain conditions.
57/42