Comparative genomics, genome context and genome annotation
Nothing in (computational) biology makes
sense except in the light of evolution
after Theodosius Dobzhansky (1970)
Genome context analysis and genome annotation
Using information other than homologous relationships
between individual genes/proteins for functional prediction
(guilt by association)
Types of context analysis:
•phyletic patterns
•domain fusion (“Rosetta Stone” proteins)
•gene order conservation
•co-expression
•….
Goals:
• Using gene sets from complete
genomes, delineate families of
orthologs and paralogs - Clusters of
Orthologous Groups (of genes) (COGs)
• Using COGs, develop an engine
for functional annotation of new
genomes
• Apply COGs for analysis of
phylogenetic patterns
COG:
- group of homologous proteins such that all
proteins from different species are orthologs (all
proteins from the same species in a COG are
paralogs)
CONSTRUCTION OF COGs FOR 8 COMPLETE GENOMES
Complete set of proteins from the analyzed genomes
1. FULL SELF-COMPARISON (BLASTPGP, no cut-off)
2. Collapse obvious paralogs
3. Detect all interspecies Best Hits (BeTs) between individual proteins or groups of paralogs
4. Detect all triangles of consistent BeTs
5. Merge triangles with common edges
6. Detect groups with multidomain proteins and isolate domains
REPEAT STEPS 3-5
COGs
A TRIANGLE OF BeTs IS A MINIMAL, ELEMENTARY
COG
A RELATIVELY SIMPLE COG PRODUCED BY
MERGING ADJACENT TRIANGLES
A COMPLEX COG WITH MULTIPLE PARALOGS
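The triangle-and-merge idea can be illustrated with a minimal Python sketch. It is not the COG pipeline itself: the protein labels ("genome:gene"), the `bets` table and the helper names are hypothetical, and an edge is assumed here to be a symmetric best hit, which is a stricter reading of "consistent BeTs" than the slides state. Paralog collapsing and domain isolation are left out.

```python
from itertools import combinations

def genome_of(protein):
    # Hypothetical label convention: "genome:gene".
    return protein.split(":")[0]

def find_triangles(bets):
    """bets maps (protein, target_genome) -> best hit of protein in that genome."""
    def is_edge(a, b):
        # Assumption: an edge requires the best hits to be reciprocal.
        return (bets.get((a, genome_of(b))) == b
                and bets.get((b, genome_of(a))) == a)

    proteins = sorted({p for p, _ in bets})
    triangles = []
    for a, b, c in combinations(proteins, 3):
        three_genomes = len({genome_of(a), genome_of(b), genome_of(c)}) == 3
        if three_genomes and is_edge(a, b) and is_edge(b, c) and is_edge(a, c):
            triangles.append(frozenset((a, b, c)))
    return triangles

def merge_triangles(triangles):
    """Merge triangles that share an edge (two proteins), using union-find."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) == 2:
            parent[find(i)] = find(j)

    groups = {}
    for i, triangle in enumerate(triangles):
        groups.setdefault(find(i), set()).update(triangle)
    return sorted(sorted(g) for g in groups.values())

# Toy data: two families with reciprocal best hits, plus one one-way hit.
bets = {}
def add_pair(a, b):
    bets[(a, genome_of(b))] = b
    bets[(b, genome_of(a))] = a

for a, b in combinations(["A:1", "B:1", "C:1", "D:1"], 2):
    add_pair(a, b)
for a, b in combinations(["A:2", "B:2", "C:2"], 2):
    add_pair(a, b)
bets[("D:9", "A")] = "A:1"   # not reciprocated, so D:9 joins no triangle

triangles = find_triangles(bets)
cogs = merge_triangles(triangles)
print(cogs)
```

In this toy input the four triangles of the first family share edges and merge into one four-member group, while the second family stays a single elementary triangle.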
Current status of the COGs
Prokaryotes
11 Archaea + 1 unicellular eukaryote + 46 bacteria = 58 complete genomes
149,321 proteins
105,861 proteins in 4075 COGs (71%)
Eukaryotes
4 animals + 1 plant + 2 fungi + 1 microsporidium = 8 complete genomes
142,498 proteins
74,093 proteins in 4822 COGs (52%)
COGnitor...
…IN ACTION
The Universal COGs
Search for genomic
determinants of
hyperthermophily
Search for unique
archaeo-eukaryotic
genes
A complementary pattern:
search for unique
bacterial genes
Essential function…
but holes in the phyletic
pattern
Strict complementary pattern
Relaxed complementary
pattern
Relaxed complementary
pattern with extra restrictions
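A phyletic pattern can be treated as a presence/absence vector over genomes, and a complementary pattern as a pair of such vectors that cover the genomes between them. The sketch below is one possible reading, not the search tool shown on the slides: the genome names, COG labels and the `max_overlap` / `max_gaps` thresholds are hypothetical, and "strict" is assumed to mean no genome has both COGs and no genome lacks both.

```python
def phyletic_pattern(member_genomes, genomes):
    """Presence/absence of a COG across an ordered list of genomes."""
    present = set(member_genomes)
    return tuple(g in present for g in genomes)

def is_complementary(p, q, max_overlap=0, max_gaps=0):
    """True if patterns p and q complement each other.

    max_overlap: genomes allowed to carry both COGs (assumed relaxation)
    max_gaps:    genomes allowed to carry neither COG (assumed relaxation)
    """
    if not any(p) or not any(q):
        return False
    overlap = sum(a and b for a, b in zip(p, q))
    gaps = sum((not a) and (not b) for a, b in zip(p, q))
    return overlap <= max_overlap and gaps <= max_gaps

genomes = ["G1", "G2", "G3", "G4", "G5"]           # hypothetical genomes
pattern_x = phyletic_pattern(["G1", "G2", "G5"], genomes)
pattern_y = phyletic_pattern(["G3", "G4"], genomes)
pattern_z = phyletic_pattern(["G2", "G3", "G4"], genomes)

strict_xy = is_complementary(pattern_x, pattern_y)
strict_xz = is_complementary(pattern_x, pattern_z)
relaxed_xz = is_complementary(pattern_x, pattern_z, max_overlap=1)
print(strict_xy, strict_xz, relaxed_xz)
```

Extra restrictions, such as requiring each COG to be present in a minimum number of genomes, could be added as further conditions in `is_complementary`.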
Conservation of gene order in bacterial species of the same genus
[Figure: gene order comparison plot, M. genitalium vs M. pneumoniae; axes give gene numbers in the two genomes]
Conservation of gene order in closely related bacterial genera
[Figure: gene order comparison plot, C. trachomatis vs C. pneumoniae; axes give gene numbers in the two genomes]
Lack of gene order conservation - even in “closely related” bacteria of the same Proteobacterial subdivision
[Figure: gene order comparison plot, P. aeruginosa (paer) vs E. coli (ecoli); axes give gene numbers in the two genomes; legend bins: <0.3, 0.3-0.8, 0.8-1.3, >1.3]
Genome Alignments - Method
Protein sets from complete genomes
BLAST cross-comparison
Table of Hits
Pairwise Genome Alignment
Local alignment algorithm Lamarck (gap opening penalty, gap extension penalty);
statistics with Monte Carlo simulations
Template-Anchored Genome Alignment
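The pairwise step can be pictured as a local alignment in which the "letters" are genes and a match is any pair found in the table of hits. The sketch below is a generic Smith-Waterman-style illustration, not the Lamarck program: the scoring values and gene names are hypothetical, a single linear gap penalty is used for brevity instead of separate gap opening and gap extension penalties, and the Monte Carlo significance estimate is omitted.

```python
def local_gene_alignment(genome1, genome2, hits, match=2, mismatch=-1, gap=-1):
    """Best local alignment of two gene orders.

    genome1, genome2: lists of gene names in chromosomal order
    hits: set of (gene1, gene2) pairs taken from the table of hits
    Returns (score, list of aligned hit pairs).
    """
    n, m = len(genome1), len(genome2)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def score(i, j):
        return match if (genome1[i - 1], genome2[j - 1]) in hits else mismatch

    # Fill the dynamic-programming matrix.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + score(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    pairs = []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = score(i, j)
        if H[i][j] == H[i - 1][j - 1] + s:
            if s == match:
                pairs.append((genome1[i - 1], genome2[j - 1]))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs

# Hypothetical gene orders and hit table.
genome1 = ["a1", "a2", "a3", "a4", "a5"]
genome2 = ["b9", "b1", "b2", "bX", "b3", "b7"]
hits = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a5", "b9")}

best_score, aligned = local_gene_alignment(genome1, genome2, hits)
print(best_score, aligned)
```

Here the inserted gene bX is absorbed as a gap, so the three hit pairs on either side still form one aligned gene string.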
Genome Alignments - Statistics
Distribution of conserved gene string lengths
[Figure: distribution plot for the genome pairs cpneu-ctra, mjan-mthe, bsub-ecoli and drad-aero; x axis: string length from 2 to >20; y axis: 0.0 to 0.5]
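A length distribution of this kind could be computed roughly as follows. This is a simplified sketch under stated assumptions, not the method behind the plot: each gene is represented by a hypothetical ortholog label that occurs at most once per genome, a conserved string is taken to be a run of genes that are adjacent and in the same order in both genomes (inverted strings are ignored), and the distribution is reported as the fraction of strings at each length.

```python
from collections import Counter

def conserved_strings(order1, order2, min_len=2):
    """Maximal runs of genes that are consecutive, in the same order, in both genomes."""
    pos2 = {gene: k for k, gene in enumerate(order2)}
    strings, run = [], []
    for gene in order1:
        if run and gene in pos2 and pos2[gene] == pos2[run[-1]] + 1:
            run.append(gene)            # extends the current string
        else:
            if len(run) >= min_len:
                strings.append(run)     # close the previous string
            run = [gene] if gene in pos2 else []
    if len(run) >= min_len:
        strings.append(run)
    return strings

def length_distribution(strings):
    """Fraction of strings at each length."""
    counts = Counter(len(s) for s in strings)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}

# Hypothetical gene orders given as ortholog labels.
order1 = ["c1", "c2", "c3", "x", "c7", "c8", "c5"]
order2 = ["c7", "c8", "q", "c1", "c2", "c3", "c5"]

strings = conserved_strings(order1, order2)
distribution = length_distribution(strings)
print(strings, distribution)
```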
Genome Alignments - Statistics
Pairwise alignments    No. strings    No. genes    % in Gen1    % in Gen2
all homologs
ecoli-hinf             138            566          13%          33%
ecoli-bsub             89             322          8%           8%
ecoli-mjan             10             30           1%           2%
probable orthologs
ecoli-hinf             105            482          11%          28%
ecoli-bsub             34             168          4%           4%
ecoli-mjan             12             33           1%           2%
Genome Alignments - Statistics
Breakdown of genes in the genome
[Figure: bar chart, y axis from 0 to 5000; categories: Not in gene strings; In non-conserved gene strings (directons); In conserved gene strings]
Genome Alignments - Statistics
Fraction of the genome in conserved gene strings - from template-anchored alignments
Minimum    Synechocystis sp.         5%
           Aquifex aeolicus          10%
           Archaeoglobus fulgidus    13%
           Escherichia coli          14%
           Treponema pallidum        17%
Maximum    Thermotoga maritima       23%
           Mycoplasma genitalium     24%
Context-Based Prediction of Protein Functions
A Novel Translation Factor (COG0536)
[Figure labels: L21; L27; GTPase?; GTP-binding translation factor]
Context-Based Prediction of Protein Functions
A Novel Translation Factor (COG0012)
[Figure labels: TGS domain containing GTPase?; GTP-binding translation factor; Peptidyl-tRNA hydrolase]