Identification of Ortholog Groups by OrthoMCL
• Protein sequences from organisms of interest
• All-against-all BLASTP
– Similarity cutoff: P-value, % overlap
• Between species: reciprocal best similarity pairs (putative orthologs)
• Within species: reciprocal better similarity pairs ((recent) paralogs)
• Similarity matrix
• Markov clustering
– Cluster tightness: inflation values (I)
• Ortholog groups with (recent) paralogs
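The pair-finding step can be sketched in a few lines of Python. This is a minimal illustration on toy scores, not the OrthoMCL implementation. The function names, the convention that the first character of a protein ID names its species, and the reading of "reciprocal better" as "each protein scores the other higher than any of its cross-species hits" are all assumptions made for the example.

```python
from collections import defaultdict

def species_of(protein_id):
    # Assumption for this sketch: the first character is the species label.
    return protein_id[0]

def find_pairs(scores):
    """scores: {(query, subject): similarity} for directed hits."""
    # Best cross-species hit of every query, per target species.
    best_cross = defaultdict(dict)
    for (query, subject), score in scores.items():
        target = species_of(subject)
        if target == species_of(query):
            continue
        current = best_cross[query].get(target)
        if current is None or score > current[0]:
            best_cross[query][target] = (score, subject)

    # Putative orthologs: reciprocal best hits between species.
    orthologs = set()
    for query, per_species in best_cross.items():
        for score, subject in per_species.values():
            back = best_cross.get(subject, {}).get(species_of(query))
            if back is not None and back[1] == query:
                orthologs.add(frozenset((query, subject)))

    def best_cross_score(protein_id):
        hits = best_cross.get(protein_id, {})
        return max((score for score, _ in hits.values()), default=0)

    # (Recent) paralogs: within-species pairs that beat every cross-species hit
    # in both directions (assumed meaning of "reciprocal better").
    paralogs = set()
    for (query, subject), score in scores.items():
        if query == subject or species_of(query) != species_of(subject):
            continue
        reverse = scores.get((subject, query))
        if (reverse is not None and score > best_cross_score(query)
                and reverse > best_cross_score(subject)):
            paralogs.add(frozenset((query, subject)))
    return orthologs, paralogs

# Toy scores for two species, A and B, entered once and mirrored.
symmetric = {("A1", "A2"): 200, ("A1", "B1"): 150, ("B1", "B2"): 220}
scores = {}
for (p, q), s in symmetric.items():
    scores[(p, q)] = s
    scores[(q, p)] = s

orthologs, paralogs = find_pairs(scores)
print(sorted(sorted(pair) for pair in orthologs))  # [['A1', 'B1']]
print(sorted(sorted(pair) for pair in paralogs))   # [['A1', 'A2'], ['B1', 'B2']]
```

The ortholog and paralog pairs found this way are the edges that would populate the similarity matrix passed to Markov clustering.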
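Example: two species, four proteins
• Species A: A1 and A2 are paralogs (similarity score 200)
• Species B: B1 and B2 are paralogs (similarity score 220)
• A1 and B1 are orthologs (similarity score 150)

Similarity Matrix (entries are similarity scores)
       A1    A2    B1    B2
A1      ─   200   150     0
A2    200     ─     0     0
B1    150     0     ─   220
B2      0     0   220     ─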
Markov Clustering (MCL) Algorithm
• Convert the similarity matrix into a Markov matrix (transition probability matrix)
• Alternate matrix expansion (matrix powering) and matrix inflation (entry powering)
• Terminate when there is no further change
• Read the final matrix as the clustering
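The expansion/inflation loop is short enough to sketch in plain Python. The code below is an illustrative toy, not the MCL program itself. The helper names, the self-loop rule (each diagonal entry set to the largest score in its column), the convergence tolerance and the way clusters are read from the final matrix are assumptions chosen for the example.

```python
def normalize_columns(matrix):
    # Turn each column into transition probabilities that sum to 1.
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def add_self_loops(matrix):
    # Assumption: fill the undefined diagonal with the column maximum.
    out = [row[:] for row in matrix]
    for j in range(len(matrix)):
        out[j][j] = max(row[j] for row in matrix)
    return out

def expand(matrix):
    # Matrix expansion: ordinary matrix squaring (matrix powering).
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def inflate(matrix, inflation):
    # Matrix inflation: raise every entry to the power I, then renormalize.
    return normalize_columns([[x ** inflation for x in row] for row in matrix])

def mcl(similarity, inflation=2.0, max_iter=200, tol=1e-9):
    n = len(similarity)
    current = normalize_columns(similarity)
    for _ in range(max_iter):
        updated = inflate(expand(current), inflation)
        change = max(abs(updated[i][j] - current[i][j])
                     for i in range(n) for j in range(n))
        current = updated
        if change < tol:  # terminate when there is no further change
            break
    # Each row that still carries probability mass defines one cluster.
    clusters = set()
    for row in current:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return clusters

names = ["A1", "A2", "B1", "B2"]
similarity = [
    [0, 200, 150, 0],
    [200, 0, 0, 0],
    [150, 0, 0, 220],
    [0, 0, 220, 0],
]
clusters = mcl(add_self_loops(similarity), inflation=1.5)
for cluster in clusters:
    print(sorted(names[j] for j in cluster))
```

Rerunning with a different `inflation` value shows how the parameter controls cluster tightness: larger values penalize weak transitions more strongly and tend to split groups.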
Application of OrthoMCL to Plasmodium, human and other model organisms
Species: Plasmodium falciparum, Human, Arabidopsis, Worm, Fly, Yeast, E. coli, …
6241 ortholog groups, of which:
• 160: all included
• 114: Plasmodium, not human
• 551: only Eukaryotes
• 1182: only Metazoa
• 24: only Plasmodium & Arabidopsis
An Example of Gamma-tubulin Ortholog Group
Comparing OrthoMCL with INPARANOID (two species)
• INPARANOID clusters both orthologs and in-paralogs from two species by pairwise similarity
– Find two-way best hits from the pairwise similarity scores; each is a main ortholog pair
– Add further orthologs (in-paralogs) from the same species to each main ortholog by comparing its similarity to each putative in-paralog against the score of the main ortholog pair
– Resolve overlapping groups by merging, deleting or dividing them according to a set of rules
• OrthoMCL can cluster orthologs and in-paralogs from multiple species
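The first two INPARANOID steps listed above can be illustrated with a small sketch. It is not the INPARANOID program: the function names and toy scores are hypothetical, the admission rule (an in-paralog joins when its score to the main ortholog is at least the main pair's score) is an assumed reading of the comparison described above, and the rule-based resolution of overlapping groups is left out entirely.

```python
def best_partner(cross_scores, index):
    # For every protein on one side, keep its highest-scoring partner.
    best = {}
    for pair, score in cross_scores.items():
        key = pair[index]
        if key not in best or score > best[key][0]:
            best[key] = (score, pair[1 - index])
    return {key: partner for key, (_, partner) in best.items()}

def inparanoid_style_groups(cross_scores, within_scores):
    """cross_scores: {(a, b): score} between species A and B.
    within_scores: {(main, candidate): score} inside one species."""
    best_for_a = best_partner(cross_scores, 0)
    best_for_b = best_partner(cross_scores, 1)
    groups = []
    for a, b in best_for_a.items():
        if best_for_b.get(b) != a:
            continue  # not a two-way best hit
        main_score = cross_scores[(a, b)]
        members = {a, b}
        for (main, candidate), score in within_scores.items():
            # Assumed rule: admit candidates at least as similar as the main pair.
            if main in (a, b) and score >= main_score:
                members.add(candidate)
        groups.append(members)
    return groups

cross = {("A1", "B1"): 150, ("A2", "B1"): 90}
within = {("A1", "A2"): 200, ("B1", "B2"): 220, ("A1", "A3"): 60}
groups = inparanoid_style_groups(cross, within)
print([sorted(g) for g in groups])  # [['A1', 'A2', 'B1', 'B2']]
```

Here A3 stays out because its score to A1 (60) is below the main pair's score (150), while A2 and B2 are admitted as in-paralogs.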
I. Yeast – Worm dataset (estimation)
Input: Yeast: 6358 proteins; Worm: 19774 proteins
OrthoMCL (I = ?): 4428 proteins (Yeast: 2158, Worm: 2270) in ? groups (paralog groups?)
INPARANOID: 4985 proteins (Yeast: 2283, Worm: 2702) in 1805 groups
3931 proteins are the same from both methods
Is the grouping coherent?
Coherent groups = same groups + contained groups
Contained groups: INPARANOID group ⊂ OrthoMCL group, or OrthoMCL group ⊂ INPARANOID group
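The "same / contained / coherent" bookkeeping can be expressed with Python sets. This is a hedged sketch of one way to do the comparison, assuming both group lists have already been restricted to the sequences found by both methods. The function name, the label "incoherent" and the toy yeast (y) and worm (w) identifiers are invented for the illustration.

```python
def classify_group(group, other_groups):
    # "same": an identical group exists in the other method's output.
    if any(group == other for other in other_groups):
        return "same"
    # "contained": one group is a proper subset of the other.
    if any(group < other or other < group for other in other_groups):
        return "contained"
    return "incoherent"

orthomcl_groups = [
    {"y1", "w1"},
    {"y2", "w2", "y3", "w3"},
    {"y4", "w4"},
    {"y5", "w5"},
]
inparanoid_groups = [
    {"y1", "w1"},
    {"y2", "w2"},
    {"y3", "w3"},
    {"y4", "w5"},
    {"y5", "w4"},
]

labels = [classify_group(g, inparanoid_groups) for g in orthomcl_groups]
coherent = sum(label in ("same", "contained") for label in labels)
coherent_fraction = coherent / len(labels)
print(labels)             # ['same', 'contained', 'incoherent', 'incoherent']
print(coherent_fraction)  # 0.5
```

Counting sequences rather than groups only requires summing `len(group)` over the groups carrying each label.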
Inflation value (I) regulates cluster tightness
Inflation (I)   # groups   # groups of   % seqs with       % seqs with            % seqs with
                           paralogs      same grouping*    contained grouping*    coherent grouping*
2 (tight)       1892       159           80.2              16.9                   97.1
1.5             1857       89            82.4              14.8                   97.2
1.2             1814       7             85.4              11.7                   97.1
1.1 (loose)     1811       2             85.4              11.9                   97.3
* Percentage of the 3931 sequences identified by both OrthoMCL and INPARANOID
So, choose I = 1.1 as the optimal inflation value
Possible reasons for including different sequences
                          OrthoMCL                          INPARANOID
BLAST version             WU-BLAST                          NCBI-BLAST
BLAST search              All-against-all, SEG filtered,    Pairwise
                          fixed database size
Similarity cutoff         P < 1e-5                          Score >= 50 bits, overlap > 50%
Reciprocal "best" hits    P-value, percent identity         Score
Recent paralogs           Bi-directional better             One-way better within-species
                          within-species similarity         similarity from orthologs
Default parameters:
Similarity cutoff: P-value < 1e-5, overlap > 50%
Cluster tightness: inflation value I = 1.1
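As an illustration of the similarity cutoff, the sketch below filters a handful of made-up BLAST hits. The `Hit` record, its field names and the definition of overlap as aligned length divided by query length are assumptions; the poster does not say how percent overlap is computed.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    # Hypothetical record for one BLASTP hit.
    query: str
    subject: str
    p_value: float
    aligned_length: int
    query_length: int

def passes_cutoff(hit, max_p=1e-5, min_overlap=0.5):
    # Assumed overlap definition: fraction of the query covered by the alignment.
    overlap = hit.aligned_length / hit.query_length
    return hit.p_value < max_p and overlap > min_overlap

hits = [
    Hit("A1", "B1", 1e-40, 300, 400),  # strong hit, 75% overlap
    Hit("A1", "B9", 1e-3, 350, 400),   # P-value too high
    Hit("A2", "B1", 1e-20, 100, 400),  # only 25% overlap
]
kept = [(h.query, h.subject) for h in hits if passes_cutoff(h)]
print(kept)  # [('A1', 'B1')]
```

Only hits that survive this filter would contribute entries to the similarity matrix.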
Input: Yeast: 6358 proteins; Worm: 19774 proteins
OrthoMCL (I = 1.1): 3949 proteins (Yeast: 1927, Worm: 2022) in 1614 groups
INPARANOID: 4985 proteins (Yeast: 2283, Worm: 2702) in 1805 groups
3765 proteins are the same from both methods
86.3% same groups
98.1% coherent groups
II. Worm – Fly dataset (test)
Input: Worm: 19774 proteins; Fly: 13288 proteins
OrthoMCL (I = 1.1): 9623 proteins (Worm: 4997, Fly: 4626) in 3764 groups
INPARANOID: 10100 proteins (Worm: 5399, Fly: 4761) in 3988 groups
8856 proteins are the same from both methods
86% same groups
98% coherent groups
In conclusion: OrthoMCL and INPARANOID have similar clustering behavior when comparing two species
Comparison of OrthoMCL with EGO (multiple species)
III. Yeast – Worm – Fly dataset
EGO: TC/NP, 10260 seqs; BLASTP; protein sequences, 4776 proteins; remove redundancy
EGO: 4776 unique proteins formed 3125 unique groups
OrthoMCL: 12459 proteins formed 4033 groups
4392 proteins are the same from both
44.2% same groups
2.3% OrthoMCL contained in EGO
62% EGO contained in OrthoMCL
93.8% coherent groups
An Example: EGO Groups contained by OrthoMCL Groups
Worm: Hsp-1
Fly: Hsc70-1, Hsc70-4
Yeast: SSA1, SSA2, SSA3, SSA4
EGO: Hsp-1, Hsc70-4, SSA2
OrthoMCL: Hsp-1, Hsc70-1, Hsc70-4, SSA1, SSA2, SSA3, SSA4
Back to Apicomplexa …
5333 Proteins
• 1421 orthologous to yeast
• 1693 orthologous to Arabidopsis
• 1846 orthologous to the other 6 organisms
• 1771 orthologous to fly, worm or human
• 483 orthologous to E. coli
• 1824 nonorthologous to human
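Counts like these come from asking which ortholog groups contain proteins from some species and lack proteins from others. The sketch below shows such a species-distribution query over an in-memory dictionary. The group IDs, protein IDs and function name are hypothetical, and the poster's own system stores groups in a relational database rather than in Python objects.

```python
def select_groups(groups, required=(), excluded=()):
    """Return IDs of groups that contain every required species
    and none of the excluded ones."""
    selected = []
    for group_id, members in groups.items():
        species = {sp for sp, _ in members}
        if set(required) <= species and not (set(excluded) & species):
            selected.append(group_id)
    return sorted(selected)

# Hypothetical ortholog groups: lists of (species, protein ID) pairs.
groups = {
    "G1": [("Plasmodium", "p1"), ("Human", "h1"), ("Yeast", "y1")],
    "G2": [("Plasmodium", "p2"), ("Arabidopsis", "a1")],
    "G3": [("Human", "h2"), ("Fly", "f1"), ("Worm", "w1")],
}

not_human = select_groups(groups, required=["Plasmodium"], excluded=["Human"])
with_yeast = select_groups(groups, required=["Plasmodium", "Yeast"])
print(not_human)   # ['G2']
print(with_yeast)  # ['G1']
```

The same predicate translates directly into a query with one membership condition per required or excluded species.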
Summary
• OrthoMCL automatically delineates many-to-many orthologous relationships across multiple eukaryotic genomes
• When applied to a pairwise comparison of two species, OrthoMCL performs comparably to INPARANOID, which was designed for comparing two species
• When applied to multiple species and compared with the EGO database, OrthoMCL tends to identify more orthologous genes
• The underlying object-based relational storage model permits integration with organismal data, and queries based on a user-defined species distribution provide a snapshot of shared/diversified biological processes across species
Related Posters and References
• 114A. Web-Based Biological Discovery using an Integrated Database.
• 146A. The Genomics Unified Schema (GUS).
• 170A. TESS-II: Describing and Finding Gene Regulatory Sequences with Grammars.
• Remm et al. Automatic Clustering of Orthologs and In-paralogs from Pairwise Species Comparisons. J. Mol. Biol. (2001) 314
• Lee et al. Cross-Referencing Eukaryotic Genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res. (2002) 12
• Enright et al. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. (2002) 30