Gene Clustering
Haleh Ashki
School of Informatics, Indiana University, Aug 2008
Advisor: Professor Sun Kim
Goal of the project
Gene cluster prediction algorithms are useful for discovering sets of
genes “conserved” in a pair of genomes.
However, the prediction results depend heavily on the phylogenetic
distance between the two genomes.
In particular, when two genomes are close, the predicted gene
clusters are large and contain several functional gene sets in one
cluster.
E. coli - Salmonella
E. coli - Shigella

Thus a new computational tool is needed to predict “functionally
related gene sets”.

In this study, we developed a novel computational method to predict
functionally related gene sets from gene clusters, using
gene-ontology-based clustering of genes and one-dimensional
dynamic programming techniques.
The input to this algorithm is the output of the EGGS clustering algorithm:
EGGS: Extraction of Gene clusters by iteratively using Genome context
based Sequence matching techniques.
Genes are matched between two genomes using two concepts,
pairs of close bidirectional best hits (PCBBHs) and pairs of close
homologs (PCHs), where the term close means physical
proximity, say within 300 bp.
This cluster contains 54 genes that differ in operon, pathway, and strand information.
Gene ID   Strand  Operon ID  Pathway        Description
16128413  -       84         path:eco00190  protoheme IX farnesyltransferase (haeme O biosynthesis)
16128414  -       84         path:eco00190  cytochrome o ubiquinol oxidase subunit IV
16128415  -       84         path:eco00190  cytochrome o ubiquinol oxidase subunit III
16128423  +       85                        ATP-dependent specificity component of clpP serine protease, chaperone
16128424  +       85                        DNA-binding, ATP-dependent protease La; heat shock K-protein
16128425  +       85                        DNA-binding protein HU-beta, NS1 (HU-1)
16128426  +       85                        peptidyl-prolyl cis-trans isomerase D
16128433  +       86         path:eco02010  ATP-binding component of a transport system
16128434  +       86         path:eco02010  putative ATP-binding component of a transport system
16128435  +       87                        nitrogen regulatory protein P-II 2
16128436  +       87                        probable ammonium transporter
16128437  -
16128450  -       90                        orf, hypothetical protein
16128451  -       90                        primosomal replication protein N''
16128454  +       91
16128455  +       91                        orf, hypothetical protein
16128456  +       91                        recombination and repair
16128474  +       94
16128477  -       95
16128478  -       95         path:eco00632  acyl-CoA thioesterase I; also functions as protease I
16128479  +       96         path:eco02010  putative ATP-binding component of a transport system
...
...                          path:eco00632  acyl-CoA thioesterase II
...                          path:eco00230  DNA polymerase III, tau and gamma subunits; DNA elongation factor III
...                          path:eco02010  putative ATP-binding component of a transport system
                                            putative oxidoreductase
Predicted clusters are often too long and need to be dissected. But how?
Predicting biologically meaningful gene clusters from conserved gene
clusters:

A conserved gene cluster depends strongly on the phylogenetic distance
between two genomes, and it often contains “multiple” biologically
meaningful clusters.

Our method uses a clustering technique based on gene ontology
information.

Results from our method are shown to be biologically meaningful in terms
of operons (a set of genes in a single transcription unit) and biological
pathways.
GO: Gene Ontology

The GO project has developed three structured controlled vocabularies
(ontologies) that describe gene products, in a species-independent manner,
in terms of their associated:
1. biological processes
2. cellular components
3. molecular functions
The ontologies are structured as directed acyclic graphs.
GO terms can be linked by different types of relationships: is_a, part_of.
Each gene can have more than one GO term, in any of the three ontologies
and at any level of the hierarchy.
Here the UniProt IDs are used as keys to retrieve the GO terms of each
gene.
Semantic Similarity Value (SS):
Different methods to calculate the semantic similarity value:
Resnik: based solely on the information content of the parents shared by the two
terms. If there is more than one shared parent, the one with the minimum
probability p(t), that is, the highest information content, is taken. The
similarity score is the information content, -log p(t), of that parent,
where S(t1, t2) is the set of parent terms shared by t1 and t2.
Lin and Jiang:
Both methods use not only the information content of the shared parents,
but also that of the query terms,
where p(t1), p(t2) and p(t) are the probabilities of t1, t2 and their
shared parent, from which the information content values are computed.
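As an illustration of the information-content measures above, here is a minimal sketch of the Resnik and Lin scores in their commonly stated forms. The toy ontology, the term names, and the probabilities are hypothetical, and each term is assumed to count as its own ancestor.

```python
import math

# Hypothetical toy ontology: each term maps to the set of its ancestors
# (including itself), and p[t] is the assumed annotation probability of t.
ancestors = {
    "A": {"A", "B", "root"},
    "D": {"D", "B", "root"},
}
p = {"root": 1.0, "B": 0.5, "A": 0.1, "D": 0.2}


def resnik(t1, t2):
    """Information content (-log p) of the most informative shared ancestor."""
    shared = ancestors[t1] & ancestors[t2]
    return max(-math.log(p[t]) for t in shared)


def lin(t1, t2):
    """Shared information content normalized by that of the two query terms."""
    shared = ancestors[t1] & ancestors[t2]
    denom = math.log(p[t1]) + math.log(p[t2])
    return max(2 * math.log(p[t]) / denom for t in shared)


resnik_ad = resnik("A", "D")  # most informative shared ancestor is B
lin_ad = lin("A", "D")
```

Resnik's score is unbounded and ignores how specific the query terms are, while Lin's score is normalized to the range 0 to 1.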
Method used here: Wang et al. (James Z. Wang, Zhidian Du, et al.)
The semantics of a GO term is determined by its location in the entire GO graph
and by its semantic relations with all of its ancestor terms.
So we use the subgraph that starts at the specific GO term and ends at the root
(biological process, cellular component, or molecular function).
In this study I worked with molecular function GO terms.
DAG_A = (A, T_A, E_A)
T_A: the set of GO terms in the subgraph, including A and all
its ancestors.
E_A: the set of edges in the subgraph.
SV(A) = 4.52
Sim(ADh4, Ldb3) = 0.693
[Figure from the paper: GO term similarity values, with maxima marked]
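A minimal sketch of the graph-based measure, following the formulation in the cited Wang et al. (2007) paper as I understand it: each ancestor t of a term A receives an S-value (1 for A itself, otherwise the best edge-weighted value propagated from its children), SV(A) is the sum of those S-values, and two terms are compared through their shared ancestors. The edge weights 0.8 (is_a) and 0.6 (part_of) are the values suggested in that paper; the toy DAG and its term names are hypothetical.

```python
# Semantic contribution factors per relation type (values from Wang et al.).
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}

# Hypothetical toy DAG: term -> list of (parent, relation).
PARENTS = {
    "A": [("B", "is_a"), ("C", "part_of")],
    "D": [("B", "is_a")],
    "B": [("root", "is_a")],
    "C": [("root", "is_a")],
}


def s_values(term):
    """S-values of `term` and all its ancestors (the terms T_A of DAG_A)."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in PARENTS.get(child, []):
            contribution = WEIGHTS[relation] * s[child]
            if contribution > s.get(parent, 0.0):
                s[parent] = contribution  # keep the best path to this ancestor
                stack.append(parent)
    return s


def term_similarity(a, b):
    """Shared S-values of a and b divided by SV(a) + SV(b)."""
    sa, sb = s_values(a), s_values(b)
    shared = sa.keys() & sb.keys()
    return sum(sa[t] + sb[t] for t in shared) / (sum(sa.values()) + sum(sb.values()))


def gene_similarity(terms1, terms2):
    """Average, over all terms of both genes, of the best match in the other gene."""
    best1 = [max(term_similarity(t, u) for u in terms2) for t in terms1]
    best2 = [max(term_similarity(u, t) for t in terms1) for u in terms2]
    return (sum(best1) + sum(best2)) / (len(terms1) + len(terms2))


sv_a = sum(s_values("A").values())  # analogous to SV(A) on the slide
sim_ad = term_similarity("A", "D")
```

The per-term maxima in `gene_similarity` correspond to the "max" entries in the figure from the paper.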
Here I used the online tool (G-SESAME, listed under Online resources) to measure
the semantic similarity value for each pair of genes based on their GO terms.
I built a matrix of semantic similarity values for each group of genes. The values
are normalized between 0 and 1.
Build the clusters from the semantic similarity matrix:
Gene  1      2      3      4      5      6      7      8      9      10
1     1.000  0.250  0.000  0.000  0.000  0.000  0.313  0.571  0.433  0.250
2     0.250  1.000  0.000  0.000  0.000  0.000  0.000  0.250  0.278  0.188
3     0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
4     0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
5     0.000  0.000  0.000  0.000  1.000  0.500  0.000  0.000  0.000  0.000
6     0.000  0.000  0.000  0.000  0.500  1.000  0.000  0.000  0.000  0.000
7     0.313  0.000  0.000  0.000  0.000  0.000  1.000  0.313  0.222  0.438
8     0.571  0.250  0.000  0.000  0.000  0.000  0.313  1.000  0.900  0.286
9     0.433  0.278  0.000  0.000  0.000  0.000  0.222  0.900  1.000  0.233
10    0.250  0.188  0.000  0.000  0.000  0.000  0.438  0.286  0.233  1.000
Clustering Result:
Value  Genes
0.9    8, 9
0.2    1, 2
0.4    7, 10
0.5    6, 5
This method groups the genes by their SS value, in descending order (0.9 to 0.1),
so each gene is grouped according to its highest SS value.
Genes with an SS value of 0 are omitted at this step.
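The grouping step can be sketched as follows. This is one possible reading of the procedure, not necessarily the exact implementation: thresholds are visited from 0.9 down to 0.1, and at each threshold the still-unassigned genes linked by a similarity at or above the threshold form a group. The pair values are the non-zero upper-triangle entries of the matrix above; the function and variable names are hypothetical.

```python
# Non-zero pairwise SS values (upper triangle of the 10-gene matrix).
SIM = {
    (1, 2): 0.250, (1, 7): 0.313, (1, 8): 0.571, (1, 9): 0.433, (1, 10): 0.250,
    (2, 8): 0.250, (2, 9): 0.278, (2, 10): 0.188, (5, 6): 0.500,
    (7, 8): 0.313, (7, 9): 0.222, (7, 10): 0.438,
    (8, 9): 0.900, (8, 10): 0.286, (9, 10): 0.233,
}


def pair_sim(g, h):
    """Symmetric lookup; missing pairs have similarity 0."""
    return SIM.get((min(g, h), max(g, h)), 0.0)


def group_by_threshold(genes):
    """Return (threshold, genes) groups, visiting thresholds 0.9 down to 0.1."""
    assigned, groups = set(), []
    for k in range(9, 0, -1):
        threshold = k / 10
        free = [g for g in genes if g not in assigned]
        seen = set()
        for start in free:
            if start in seen:
                continue
            # Collect the connected component of `start` among free genes.
            component, stack = [], [start]
            seen.add(start)
            while stack:
                g = stack.pop()
                component.append(g)
                for h in free:
                    if h not in seen and pair_sim(g, h) >= threshold:
                        seen.add(h)
                        stack.append(h)
            if len(component) > 1:  # genes with no partner stay unassigned
                groups.append((threshold, sorted(component)))
                assigned.update(component)
    return groups


groups = group_by_threshold(range(1, 11))
```

Under this reading the sketch yields the same four groups as the clustering result table, and genes 3 and 4, whose similarities are all 0, are never assigned.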
HCluster

Hierarchical clustering is one of the features of R; it builds clusters from the
dissimilarity values of a group of elements. I used it to visualize the
clustering based on my semantic similarity matrix.
Hcluster visualization:
Each EGGS cluster is now grouped by semantic similarity value. I built a
key of the form:
FirstGenome.SecondGenome.EggClusterNumber.SSvalue
Example: ESC12S0.8 = Ecoli, Salmonella, Cluster 12, Subcluster 0.8
In this study I used clusters from four pairs of genomes:
E. coli - Salmonella
E. coli - Yersinia
E. coli - Shigella
E. coli - Shewanella
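A small sketch of how such keys could be split back into their parts. The regular expression is an assumption based only on the example keys that appear in this document (for instance ESC12S0.8 or EShc106s0.6); the names `ClusterKey` and `parse_key` are hypothetical.

```python
import re
from typing import NamedTuple


class ClusterKey(NamedTuple):
    genome_pair: str  # letters before the cluster marker, e.g. "ES"
    cluster: int      # EGGS cluster number
    ss_value: float   # semantic similarity level of the subcluster


# Assumed layout: <genome pair letters> c <cluster number> s <SS value>.
KEY_PATTERN = re.compile(r"^([A-Za-z]+?)[cC](\d+)[sS](\d+(?:\.\d+)?)$")


def parse_key(key):
    """Split a key such as 'ESc125s0.8' into its three components."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"unrecognized key: {key!r}")
    pair, cluster, ss = match.groups()
    return ClusterKey(pair, int(cluster), float(ss))


example = parse_key("ESc125s0.8")
```

Parsing the keys makes it easy to compare two genes by cluster number and by subcluster value separately.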
I gathered all existing keys for each gene in the E. coli genome. Naturally, more
conserved genes have more keys across the four groups:
 Break point

16131330
ESGc102s0.8
ESc125s0.8

16131335
ESGc102s0.8

16131350

16131351
ESGc102s0.8

ESGc102s0.9
ESc125s0.8

ESc126s0.8

16131352
ESGc102s0.9
EYc25s0.8
EShc106s0.6

EShc107s0.8
EYc25s0.8

EYc99s0.3
EYc99s0.3

EYc99s0.5
Break Point and Cluster Score

Break points are defined in the target genome (E. coli). A break point is a
gene at which the keys change, in either the “cluster number” or the “sub
cluster value”.

All break points are collected and duplicates are removed.
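A minimal sketch of break point detection under one interpretation of the definition above: walking along the target genome, a gene is a break point when its set of keys differs from that of the previous gene. The gene names and keys below are hypothetical, and treating the first gene as a break point is an assumption.

```python
def find_breakpoints(genes):
    """genes: list of (gene_id, keys) in genome order; returns break point gene IDs."""
    breakpoints = []
    previous = None
    for gene_id, keys in genes:
        current = frozenset(keys)
        if previous is None or current != previous:
            breakpoints.append(gene_id)  # keys changed at this gene
        previous = current
    return breakpoints


# Hypothetical ordered genes with their keys.
ordered_genes = [
    ("g1", {"ESc1s0.8"}),
    ("g2", {"ESc1s0.8"}),
    ("g3", {"ESc1s0.9"}),               # subcluster value changes
    ("g4", {"ESc2s0.8", "EYc5s0.3"}),   # cluster number changes
    ("g5", {"ESc2s0.8", "EYc5s0.3"}),
]
breakpoints = find_breakpoints(ordered_genes)
```

Because each gene is visited once, the resulting list contains no duplicate break points.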

Formula for “gene set score”:
gene set score = ((# of same keys inside the cluster) / (# of same keys outside the cluster))^2 / (size of cluster, in number of genes)

Breakpoint1-breakpoint2  Gene set (key)  # inner genes  # outer genes  Size  Gene set score
16127996-16128002        EYc174s0.6      2              2              5     1
16127996-16128002        ESc3s0.6        2              4              5     0.36
16127996-16128002        EYc174s0.3      2              2              5     1
16127996-16128002        ESc3s0.3        2              2              5     1
16127996-16128002        EShc3s0.4       3              3              5     1
Break point interval score = sum of gene set scores / number of genes
4.36 / 5 = 0.872
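The interval score arithmetic can be checked with a few lines. The sketch below only reproduces the final averaging step, using the five gene set scores listed in the table; the function name is hypothetical.

```python
def interval_score(gene_set_scores, n_genes):
    """Sum of the gene set scores of an interval divided by its number of genes."""
    return sum(gene_set_scores) / n_genes


# Gene set scores for the interval 16127996-16128002 (size 5).
scores = [1, 0.36, 1, 1, 1]
score = interval_score(scores, n_genes=5)
```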
*****************************************
Break point interval  Interval score
16127996-16127997     0.583
16127996-16127998     0.830
16127996-16128000     0.901
16127996-16128002     0.872
16127996-16128008     0.815
16127996-16128014     0.782
16127996-16128019     0.840
16127996-16128020     0.889
16127996-16128021     0.939
16127996-16128025     0.94
16127996-16128026     0.920
16127996-16128029     0.870
16127996-16128030     0.846
16127996-16128035     0.760
16127996-16128042     0.709
*****************************************

Each group is defined as the genes between a break point and the 5th,
10th, or 15th break point ahead of it.

Here: 15 break points per group.
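Within a group, the four highest-scoring intervals are later used as candidate blocks for the dynamic programming step. A short sketch of that selection, applied to the 15 interval scores listed above (the variable names are hypothetical):

```python
import heapq

# Interval scores for the group starting at break point 16127996.
interval_scores = {
    "16127996-16127997": 0.583, "16127996-16127998": 0.830,
    "16127996-16128000": 0.901, "16127996-16128002": 0.872,
    "16127996-16128008": 0.815, "16127996-16128014": 0.782,
    "16127996-16128019": 0.840, "16127996-16128020": 0.889,
    "16127996-16128021": 0.939, "16127996-16128025": 0.94,
    "16127996-16128026": 0.920, "16127996-16128029": 0.870,
    "16127996-16128030": 0.846, "16127996-16128035": 0.760,
    "16127996-16128042": 0.709,
}

# Keep the four best intervals of the group as candidate blocks.
top_blocks = heapq.nlargest(4, interval_scores.items(), key=lambda item: item[1])
```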
Problem definition

Any pair of break points can define a functionally related gene set, but there
are too many candidates: O(n^2) for n break points.

We formulate functional gene set prediction as the problem of generating a
maximal cover of genes based on the break point interval score.

This problem is similar to the exon chaining problem, which predicts exons from a
number of intron-exon boundaries.

Thus we used a one-dimensional dynamic programming technique to
solve the functional gene set prediction problem:
Select non-overlapping break point intervals that maximize the sum of break
point interval scores.
One dimensional dynamic programming
For each group (each break point paired with the next 5th, ... break point), the
four highest-scoring intervals are chosen as blocks for dynamic programming.
The dynamic program takes each block as a potential cluster, with its start and
stop positions and its weight (the “break point interval score”), and
finally generates the set of clusters with the highest total score.
The algorithm is modified to fit our data, for example to handle overlaps at end
points.
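The chaining step can be sketched with the textbook one-dimensional dynamic program for weighted intervals. This is a generic version, not the modified algorithm used in the study: the intervals and scores are hypothetical, and letting two blocks share an end point is an assumption made here because adjacent intervals meet at a break point.

```python
from bisect import bisect_right


def chain_intervals(intervals):
    """intervals: list of (start, end, score). Returns (best total, chosen intervals)."""
    ordered = sorted(intervals, key=lambda iv: iv[1])  # sort by end position
    ends = [iv[1] for iv in ordered]
    best = [0.0] * (len(ordered) + 1)  # best[i]: optimum over the first i intervals
    taken = [False] * len(ordered)
    compatible = [0] * len(ordered)
    for i, (start, end, score) in enumerate(ordered):
        # Number of earlier intervals that end at or before this start.
        j = bisect_right(ends, start, 0, i)
        compatible[i] = j
        if best[j] + score > best[i]:
            best[i + 1] = best[j] + score
            taken[i] = True
        else:
            best[i + 1] = best[i]
    # Trace back the chosen intervals.
    chosen, i = [], len(ordered)
    while i > 0:
        if taken[i - 1]:
            chosen.append(ordered[i - 1])
            i = compatible[i - 1]
        else:
            i -= 1
    chosen.reverse()
    return best[-1], chosen


# Hypothetical blocks: (start position, stop position, interval score).
blocks = [(0, 5, 0.9), (0, 10, 0.8), (5, 10, 0.7), (10, 15, 0.6), (5, 15, 1.0)]
total, chosen = chain_intervals(blocks)
```

Sorting by end position and looking up the last compatible block with a binary search keeps the run time at O(n log n) for n candidate blocks.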
One more step to refine predicted clusters

Strand Information:

Connected gene neighborhoods in prokaryotic genomes, Nucleic Acids
Research, 2002, Vol. 30, No. 10, 2212-2223:
genes that have the same function tend to lie in the same
direction.

So the strand information of the E. coli genome, as the target, is used to
dissect each cluster.
New clusters containing only one gene are removed.
Gene Id   Start Position  End Position  Strand  Operon ID  Pathway
************************************
16132180  4595173         4597425       -
16132182  4598261         4598998       -       787
16132183  4599001         4599540       -       787
16132188  4602898         4603686       -
16132189  4604692         4605723       -
------------------------------------------
16132190  4605826         4606239       +       789
16132191  4606208         4606654       +       789
16132192  4606669         4607346       +       789
16132193  4607437         4609026       +
16132195  4610434         4611507       +
------------------------------------------
16132196  4612703         4613566       -       790
------------------------------------------
16132198  4615346         4616125       +       791        eco00030
16132199  4616252         4617574       +       791        eco00230
16132200  4617626         4618849       +       791        eco00030
16132201  4618906         4619625       +       791        eco00230
------------------------------------------
16132203  4621124         4622140       -
************************************
Pathway entries whose rows are not recoverable: eco00230, eco00230, eco00785
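The strand-based refinement can be sketched as splitting each cluster into runs of consecutive genes on the same strand and dropping runs of a single gene. The gene names below are hypothetical.

```python
from itertools import groupby


def split_by_strand(genes):
    """genes: list of (gene_id, strand) in genome order.

    Returns the runs of consecutive same-strand genes with more than one gene.
    """
    runs = (list(run) for _, run in groupby(genes, key=lambda gene: gene[1]))
    return [run for run in runs if len(run) > 1]


# Hypothetical cluster: strand changes after b, c, and d.
cluster = [("a", "-"), ("b", "-"), ("c", "+"), ("d", "-"),
           ("e", "+"), ("f", "+"), ("g", "+")]
subclusters = split_by_strand(cluster)
```

Here the single-gene runs for c and d are removed, leaving two subclusters.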
Predicted gene clusters are verified in terms of:

Definition of each gene: NCBI

Operon information
Detecting uber-operons in prokaryotic genomes, Dongsheng Che, Guojun
Li, Nucleic Acids Research, 2006
Database: http://csbl.bmb.uga.edu/uber/
This database groups genes by the operons they belong to. Each
uber-operon group represents a rich set of footprints of operon evolution.

KEGG Pathway:
A metabolic pathway is a series of chemical reactions occurring within a cell.
In each pathway, a principal chemical is modified by chemical reactions.
Enzymes catalyze these reactions.
Database: http://www.genome.jp/kegg/
The absence of information for non-enzyme genes makes this less useful.
Summary
EGGS (E. coli - Salmonella):
Number of clusters: 167
Gene range: 2-130 (2-50)
Operon ID range: 0-42
Our method:
Number of clusters: 483
Gene range: 2-25 (2-10)
Operon ID range: 0-6
Conclusion
By dissecting large conserved clusters, we obtain
functionally meaningful clusters of related genes
without having to worry about the phylogenetic
distance between the genomes.
Literature

Resnik P: Semantic similarity in a taxonomy: an information-based measure
and its application to problems of ambiguity in natural language. J Artif Intell
Res 1999, 11:95-130.
Lin D: An information-theoretic definition of similarity. In: International
Conference on Machine Learning: 1998; San Francisco: Morgan Kaufmann; 1998:
296-304.
Jiang JJ, Conrath DW: Semantic similarity based on corpus statistics and lexical
taxonomy. In: Proceedings of 10th International Conference on Research in
Computational Linguistics. Taiwan; 1997: 19-33.
Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F: A new method to measure the
semantic similarity of GO terms. Bioinformatics 2007, 23(10):1274-1281.
EGGS: Extraction of Gene clusters using Genome context based Sequence
matching techniques. Kwangmin Choi, Bharath Kumar Maryada, Sun Kim.
Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for
deciphering the genome. Nucl Acids Res 2004, 32(90001):D277-280.
Database: http://www.genome.jp/kegg/
Connected gene neighborhoods in prokaryotic genomes. Nucleic Acids
Research, 2002, Vol. 30, No. 10, 2212-2223.
Genome Alignment, Evolution of Prokaryotic Genome Organization, and
Prediction of Gene Function Using Genomic Context. Yuri I. Wolf, Igor B. Rogozin,
Alexey S. Kondrashov, and Eugene V. Koonin. Genome Research 11(3):356-372 (2001).
Detecting uber-operons in prokaryotic genomes. Dongsheng Che, Guojun Li,
Nucleic Acids Research, 2006.
Online resources:
http://bioinformatics.clemson.edu/G-SESAME
http://csbl.bmb.uga.edu/uber/
http://www.geneontology.org/
http://bioconductor.org
http://www.r-project.org
http://platcom.org/EGGS
http://www.genome.jp/kegg/
http://www.ncbi.nlm.nih.gov/
Thanks

Professor Sun Kim
Professor Dalkilic
Kwangmin Choi, Youngik Yang
Professor Tang, Professor Radivojac, and all other Informatics
faculty.
Informatics staff, Ms. Linda Hostetter
All graduate students (my friends)
Professor Kehoe
School of Informatics.