Large-scale genome comparisons
Bioinformatics aspects
(Introduction)
Fredj Tekaia
Institut Pasteur
EMBO Bioinformatic and Comparative Genome Analysis Course
Stazione Zoologica Anton Dohrn, Naples, Italy
May 7 - 19, 2012
Large-scale genome comparisons:
Comparing a genome (in terms of whole sequence, whole set of predicted genes or whole set of predicted proteins) to itself (intra-species comparisons) or to another genome (inter-species comparisons).
Large-scale genome comparisons
- Duplication;
- Conservation;
- Specificity (species-specific genes, proteins);
- Paralogues, orthologues;
- Families (clusters) of paralogues, of orthologues;
- Genome organisation (duplicated, conserved genes);
- Search for shared motifs in proteins of the same cluster;
- Protein conservation profiles;
- Selection pressure analyses (synonymous, non-synonymous substitutions, ...).
Evolution
Speciation - Duplication
[Figure: gene tree along a time axis. Labels: ancestral genome G; events: duplication, speciation, loss of genes; genomes G1 and G2; genes A-G1, A-G2, B-G1, B-G21, B-G22; relationships: orthologs, inparalogs, outparalogs.]
Predict these events by comparing genomes?
Orthologs / Paralogs
• How to detect orthologous genes?
- easy way: reciprocal best hit (RBH)
[Figure: genes 1a, 1b, 2.1a, 2.1b, 2.2a, 2.2b, 3a, 3b in Organism A and Organism B.]
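The RBH criterion can be sketched in a few lines of Python. This is only an illustration: the function name `reciprocal_best_hits` and the dictionary layout are assumptions, and the best hits are a small subset copied from the yeast (SC) versus C. elegans (CE) working example of this course.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where b is the best hit of a AND a is the best hit of b."""
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Best hit of each yeast protein in C. elegans (subset of the SC/CE example).
best_sc_ce = {
    "YAL002w": "C42C1.4",
    "YAL003w": "F54H12.6",
    "YAL005c": "F26D10.3",
    "YAL007c": "F57B10.5",
}
# Best hit of each C. elegans protein in yeast (subset of the CE/SC example).
best_ce_sc = {
    "C42C1.4": "YAL002w",
    "F54H12.6": "YAL003w",
    "F26D10.3": "YER103w",
    "F57B10.5": "YAL007c",
}

rbh = reciprocal_best_hits(best_sc_ce, best_ce_sc)
for sc_gene, ce_gene in rbh:
    print(sc_gene, ce_gene)
```

In this subset YAL005c is not part of an RBH pair: its best hit F26D10.3 points back to YER103w, a paralog of YAL005c, which is why duplications complicate ortholog detection.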
• Large-scale comparative analysis of predicted proteomes revealed significant evolutionary processes: Expansion, Exchange and Deletion.
Evolutionary processes include
[Figure: from an ancestor to a species genome. Labels: Expansion* (genesis, duplication, HGT); Exchange* (HGT); Deletion* (loss); Phylogeny*; selection*.]
S. cerevisiae genome: colours reveal duplications (Kellis et al. Nature, 2004).
[Figure: duplication, speciation, deletion; current content of the 2 copies; reconstruction of the ancestral organization (Kellis et al. Nature, 2004).]
[Figure: original version, current version.]
Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.
Genome duplication.
a, Distribution of Ks values of duplicated genes in Tetraodon (left) and Takifugu (right) genomes. Duplicated genes broadly belong to two categories, depending on their Ks value being below or higher than 0.35 substitutions per site since the divergence between the two puffer fish (arrows).
b, Global distribution of ancient duplicated genes (Ks > 0.35) in the Tetraodon genome. The 21 Tetraodon chromosomes are represented in a circle in numerical order and each line joins duplicated genes at their respective position on a given pair of chromosomes.
Jaillon et al. Nature 431, 946-957. 2004.
Search for similarity
Methods:
• It is important to know how the algorithms that perform sequence comparisons work;
• There are many comparison methods;
• Among the most used:
- BLAST
- FASTA
- Smith-Waterman algorithm (dynamic programming method)
- HMM (Hidden Markov Model)
Sequence comparisons
V I T K L G T C V G S
V I S . . . T Q V G S
• Identity
• Similarity
• Homology
V I T K L G T C V G S
V . S K . G T Q V . S
Comparison of 2 sequences
• Aims at finding the optimal alignment: the one that best brings out the most similar regions and the regions that are less similar.
• In describing sequence comparisons, three different terms are commonly used: Identity, Similarity and Homology.
Need for a score that evaluates:
- matches
- mismatches
- gaps
and a method that evaluates the numerous possible alignments.
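A minimal sketch of such a score, applied to the two example alignments of the previous slide. The function name and the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration, and the dots of the slide are read here as gaps and written "-".

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2, gap_char="-"):
    """Count matches, mismatches and gap columns of a pairwise alignment and score it."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == gap_char or b == gap_char:
            gaps += 1          # one gap column (linear gap cost, an assumption)
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    score = matches * match + mismatches * mismatch + gaps * gap
    return matches, mismatches, gaps, score


first = score_alignment("VITKLGTCVGS", "VIS---TQVGS")
second = score_alignment("VITKLGTCVGS", "V-SK-GTQV-S")
print(first, second)
```

With this simple scheme both alignments obtain the same score, although one has a single 3-residue gap and the other three separate gaps. Distinguishing them requires a finer scheme, for example a substitution matrix together with separate gap-opening and gap-extension penalties.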
Homology
• Sequence homology reflects common ancestry and underlies sequence conservation;
• Homology can be inferred, under suitable conditions, from sequence similarity;
• The main objective of sequence similarity searches is to infer homology between sequences;
• Homology is not a measure. It is an all-or-none relationship (i.e. homology exists or does not exist; expressions like "significant homology" or "weak homology" are meaningless!).
Sequence similarity is a measure of the matching characters in an alignment, whereas homology is a statement of common evolutionary origin.
Local alignment
[Figure: local alignment of sequences A and B.]
Global alignment
[Figure: global alignment of sequences A and B.]
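The difference between the two modes can be sketched with one dynamic programming routine. This is a simplified illustration, not the exact procedure of any particular program: the function name, the linear gap penalty and the scoring values are assumptions. In local mode (Smith-Waterman style) cell values are floored at 0 and the best cell anywhere is reported; in global mode (Needleman-Wunsch style) both sequences must be aligned end to end.

```python
def align_score(a, b, match=2, mismatch=-1, gap=-2, local=True):
    """Best alignment score of a and b by dynamic programming (linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps are penalised.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)
            H[i][j] = cell
    return best if local else H[-1][-1]


local_score = align_score("KLGTCVGS", "GTCVG", local=True)
global_score = align_score("KLGTCVGS", "GTCVG", local=False)
print(local_score, global_score)
```

The local score only reflects the shared segment GTCVG, whereas the global score also pays for the three unmatched residues of the longer sequence.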
Compare one query sequence to a BLAST-formatted database
Amino acid scoring schemes (substitution matrices)
• All algorithms comparing protein sequences rely on some scheme to score the equivalence of each of the 210 possible pairs of amino acids. As a result, what a local alignment program produces depends strongly upon the scores it uses;
• implicitly, a scheme may represent a particular theory of evolution;
• the choice of a matrix can strongly influence the outcome of an analysis;
• the scores in the matrix are integer values: positive for identical or similar character pairs, negative for dissimilar character pairs.
$S_{ij} = \ln\left(q_{ij} / (p_i p_j)\right) / u$, where the $q_{ij}$ are target frequencies for aligned pairs of amino acids, $p_i$ and $p_j$ are background frequencies, and $u$ is a statistical (scaling) parameter.
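A small numerical sketch of this log-odds formula. The frequencies below are invented for illustration and do not come from any real matrix; the only grounded choice is u = ln(2)/2, which expresses scores in the half-bit units used by the BLOSUM62 table of this course.

```python
import math

HALF_BIT = math.log(2) / 2  # value of u giving scores in 1/2 bit units


def log_odds_score(q_ij, p_i, p_j, u=HALF_BIT):
    """S_ij = ln(q_ij / (p_i * p_j)) / u, before rounding to an integer."""
    return math.log(q_ij / (p_i * p_j)) / u


# Hypothetical pair seen 8 times more often in alignments than expected by chance.
favoured = log_odds_score(q_ij=0.02, p_i=0.05, p_j=0.05)
# Hypothetical pair seen exactly as often as expected by chance.
neutral = log_odds_score(q_ij=0.0025, p_i=0.05, p_j=0.05)
# Hypothetical pair seen 4 times less often than expected by chance.
disfavoured = log_odds_score(q_ij=0.000625, p_i=0.05, p_j=0.05)
print(round(favoured), round(neutral), round(disfavoured))
```

Pairs observed more often than chance get positive scores, pairs observed less often get negative scores, which is the sign pattern described above.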
BLOSUM62 Clustered Scoring Matrix in 1/2 Bit Units
# Cluster Percentage: >= 62
# Lowest score = -4, Highest score = 11
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
• BLOSUM matrices (Henikoff, S. and Henikoff, J. G. (1992))
BlosumX denotes a matrix obtained from alignments of sequence segments clustered at X% identity or more.
Examples:
- Blosum62 is obtained from sequences clustered at 62% identity or more.
- Blosum80 is obtained from sequences clustered at 80% identity or more.
Which substitution matrix to choose?
Less divergent     <------ searching ------>     More divergent
Blosum80                  Blosum62                  Blosum45
PAM10                     PAM120                    PAM250
BLAST
(Basic Local Alignment Search Tool)
Nucleotide BLAST
• Nucleotide query - nucleotide database [blastn]
Protein BLAST
• Protein query - protein database [blastp]
• PSI-BLAST Position Specific Iterative BLAST
Translated BLAST Searches
• Nucleotide query - Protein db [blastx]
• Protein query - Translated db [tblastn]
• Nucleotide query - Translated db [tblastx]
Search for conserved domains
• Search the Conserved Domain Database [RPS-BLAST]
Pairwise BLAST
• BLAST 2 Sequences
BLAST algorithm:
(1) Query sequence: build the list of high-scoring words of length w.
A query sequence of length L yields at most L-w+1 words (w = 3 for proteins, w = 11 for nucleotides).
List the words that score at least T using a substitution matrix (Blosum62 or PAM250, ...).
(2) Compare the word list to the database (DB) sequences and identify exact matches.
Extract matches of words from the word list.
(3) For each word match, extend the alignment in both directions to find alignments with scores > S.
Maximal Segment Pairs (MSPs): high-scoring segment pairs (HSPs)
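The three steps can be sketched on a toy example. This is a didactic simplification and not the BLAST implementation: all function names are hypothetical, the scoring (identity +2, anything else -1) stands in for a real substitution matrix, the alphabet is reduced, and the extension stops with a simple drop-off rule (`x_drop`), which is an assumption of this sketch.

```python
from itertools import product


def pair_score(a, b):
    # Toy scoring (assumption); a real search would look up Blosum62 or PAM250.
    return 2 if a == b else -1


def neighborhood_words(query, w, T, alphabet):
    """Step 1: words scoring at least T against a query word, with query offsets."""
    words = {}
    for i in range(len(query) - w + 1):          # at most L - w + 1 query words
        q_word = query[i:i + w]
        for cand in product(alphabet, repeat=w):
            if sum(pair_score(a, b) for a, b in zip(q_word, cand)) >= T:
                words.setdefault("".join(cand), []).append(i)
    return words


def find_seeds(words, subject, w):
    """Step 2: exact matches between the word list and one database sequence."""
    return [(qi, sj)
            for sj in range(len(subject) - w + 1)
            for qi in words.get(subject[sj:sj + w], [])]


def extend_seed(query, subject, qi, sj, w, x_drop=3):
    """Step 3: ungapped extension in both directions; returns the segment score."""
    score = sum(pair_score(query[qi + k], subject[sj + k]) for k in range(w))
    for step in (1, -1):
        q, s = (qi + w, sj + w) if step == 1 else (qi - 1, sj - 1)
        run = best_gain = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            run += pair_score(query[q], subject[s])
            best_gain = max(best_gain, run)
            if best_gain - run > x_drop:         # score fell too far below its best
                break
            q += step
            s += step
        score += best_gain
    return score


query, subject, w = "GTCVGS", "AAGTCVGSAA", 3
words = neighborhood_words(query, w, T=6, alphabet="ACGSTV")
seeds = find_seeds(words, subject, w)
hsp_scores = [extend_seed(query, subject, qi, sj, w) for qi, sj in seeds]
print(seeds, hsp_scores)
```

Lowering T enlarges the word list (here T=3 would also accept words with one mismatch), which increases sensitivity at the cost of more seeds to extend.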
BLASTP 2.2.1 [Apr-13-2001]
............................
Query= YAL005c SSA1 heat shock protein of HSP70 family, cytosolic
       (642 letters)
Database: S. cerevisiae proteome version 22/05/2002
          5829 sequences; 2,798,770 total letters
................................................
                                                          Score     E
Sequences producing significant alignments:               (bits)  Value
YAL005c  SSA1 heat shock protein of HSP70 family, cyt...    674    0.0
YLL024c  SSA2 heat shock protein of HSP70 family, cyt...    663    0.0
YER103w  SSA4 heat shock protein of HSP70 family, cyt...    589    e-169
YBL075c  SSA3 heat shock protein of HSP70 family, cyt...    588    e-169
YJL034w  KAR2 nuclear fusion protein                        480    e-136
YDL229w  SSB1 heat shock protein of HSP70 family            428    e-120
YNL209w  SSB2 heat shock protein of HSP70 family, cyt...    427    e-120
YJR045c  SSC1 mitochondrial heat shock protein 70-rel...    336    5e-93
YEL030w  heat shock protein of HSP70 family                 324    2e-89
YLR369w  SSQ1 mitochondrial heat shock protein 70           296    4e-81
YBR169c  SSE2 heat shock protein of the HSP70 family        173    7e-44
YPL106c  SSE1 heat shock protein of HSP70 family            172    1e-43
YHR064c  regulator protein involved in pleiotro...          143    6e-35
YKL073w  LHS1 chaperone of the ER lumen                     100    4e-22
YLR135w  subunit of SLX1P/Ybr228p-SLX4P complex...           33    0.13
...................
>YLL024c SSA2 P14.1.f13.1 heat shock protein of HSP70 family, cytosolic
Length = 639
Score = 663 bits (2508), Expect = 0.0
Identities = 558/607 (91%), Positives = 570/607 (92%)
Query: 1   MSKAVGIDLGTTYSCVAHFANDRVDIIANDQGNRTTPSFVAFTDTERLIGDAAKNQAAMN 60
           MSKAVGIDLGTTYSCVAHF+NDRVDIIANDQGNRTTPSFV+FTDTERLIGDAAKNQAAMN
Sbjct: 1   MSKAVGIDLGTTYSCVAHFSNDRVDIIANDQGNRTTPSFVGFTDTERLIGDAAKNQAAMN 60
...........................................
Query: 601 IMSKLYQ 607
           IMSKLYQ
Sbjct: 601 IMSKLYQ 607
>YER103w SSA4 P14.1.f13.1 heat shock protein of HSP70 family, cytosolic
Length = 642
Score = 589 bits (2224), Expect = e-169
Identities = 473/609 (77%), Positives = 539/609 (87%), Gaps = 3/609 (0%)
Query: 1   MSKAVGIDLGTTYSCVAHFANDRVDIIANDQGNRTTPSFVAFTDTERLIGDAAKNQAAMN 60
           MSKAVGIDLGTTYSCVAHFANDRV+IIANDQGNRTTPS+VAFTDTERLIGDAAKNQAAMN
Sbjct: 1   MSKAVGIDLGTTYSCVAHFANDRVEIIANDQGNRTTPSYVAFTDTERLIGDAAKNQAAMN 60
...........................................
Query: 598 ANPIMSKLY 606
           ANPIMSK+Y
Sbjct: 601 ANPIMSKFY 609
>YBL075c SSA3 P14.1.f13.1 heat shock protein of HSP70 family, cytosolic
Length = 649
Score = 588 bits (2220), Expect = e-169
Identities = 467/609 (76%), Positives = 539/609 (87%), Gaps = 3/609 (0%)
Query: 1   MSKAVGIDLGTTYSCVAHFANDRVDIIANDQGNRTTPSFVAFTDTERLIGDAAKNQAAMN 60
           MS+AVGIDLGTTYSCVAHF+NDRV+IIANDQGNRTTPS+VAFTDTERLIGDAAKNQAA+N
Sbjct: 1   MSRAVGIDLGTTYSCVAHFSNDRVEIIANDQGNRTTPSYVAFTDTERLIGDAAKNQAAIN 60
...........................................
Query: 598 ANPIMSKLY 606
           ANPIM+K+Y
Sbjct: 601 ANPIMTKFY 609
>YJL034w KAR2 P14.1.f13.1 nuclear fusion protein
Length = 682
...........................................
Large-scale proteome comparisons
Systematic comparisons
[Figure: a new proteome (NG) is compared to each of the proteomes GS 1, GS 2, ..., GS n (proteome1, proteome2, ..., proteomen), and each of them to the new proteome (Comparenewg2eachg ng list; Compareeachg2newg ng list), using blastp, blosum62 and the SEG filter. Output tables: bestgs1ng/allgs1ng, bestgs2ng/allgs2ng, ..., bestgsnng/allgsnng and bestnggs1/allnggs1, bestnggs2/allnggs2, ..., bestnggsn/allnggsn. Matches are labelled HS/IS/NS.]
bestnggsi (record layout: NG1 size GSij blast p):
- fast determination of significant matches;
allnggsi (record layout: NG1 size GSij blast p; NG2 size GSik blast p):
- multiple matches;
- orthologs determination.
The expected number of HSPs with score at least S is given by: $E = K m n e^{-\lambda S}$, where m and n are the sequence and database lengths, and K and $\lambda$ are statistical parameters of the scoring system.
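A short numerical sketch of this relation in its usual Karlin-Altschul form, E = K m n exp(-lambda S). The values of K and lambda below are placeholders chosen for illustration, not the parameters of any particular matrix; m and n are the query and database sizes shown in the BLASTP output of this course. The bit score S' = (lambda S - ln K) / ln 2 is the normalised score reported by BLAST, for which E = m n 2^(-S').

```python
import math


def expected_hsps(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lam * S): expected number of HSPs scoring at least S."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalised score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


K, lam = 0.1, 0.3            # illustrative placeholder values (assumption)
m, n = 642, 2_798_770        # query length and database size from the example output

e_raw = expected_hsps(100, m, n, K, lam)
e_bits = m * n * 2 ** (-bit_score(100, K, lam))
e_double_db = expected_hsps(100, m, 2 * n, K, lam)
print(e_raw, e_bits, e_double_db)
```

The same raw score becomes less significant as the database grows: doubling n doubles E.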
Systematic Analysis of Completely Sequenced Organisms
• In silico species-specific comparisons;
• Degree of ancestral duplication and of ancestral conservation between pairs of species;
• Families of paralogs (Partition-mcl);
• Families of orthologs (Partition-mcl);
• Determination of the protein dictionary (orthologs);
• Determination of protein conservation profiles.
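As a rough illustration of how significant matches can be partitioned into families, the sketch below groups proteins by single linkage (connected components of the match graph, via union-find). This is a deliberately simpler substitute and not the MCL algorithm named above, which simulates flow on a weighted graph and usually splits loosely connected groups. The function name is hypothetical; the pairs are a few highly significant (HS) matches from the SC/SC example of this course.

```python
def cluster_families(pairs):
    """Single-linkage families: connected components of the match graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = {}
    for protein in list(parent):
        families.setdefault(find(protein), set()).add(protein)
    # Largest families first, members in alphabetical order.
    return sorted((sorted(f) for f in families.values()),
                  key=lambda f: (-len(f), f))


hs_matches = [
    ("YAL005c", "YLL024c"),
    ("YAL005c", "YER103w"),
    ("YAL005c", "YBL075c"),
    ("YAL007c", "YOR016c"),
]
families = cluster_families(hs_matches)
print(families)
```

Single linkage tends to merge families that share only one multi-domain protein, which is one reason graph-clustering methods such as MCL are preferred on real data.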
Working Examples
Comparing S. cerevisiae (SC) genome with C. elegans (CE) genome
SC vs SC
bestscsc (SC / SC)
Query    Size  Hit      Sig  E-value
YAL002w  1176  -        NS
YAL003w   206  -        NS
YAL004w   215  -        NS
YAL005c   642  YLL024c  HS   0.0
YAL007c   215  YOR016c  HS   1e-44
allscsc (SC / SC)
Query    Size  Hit      Sig  E-value
YAL002w  1176  -        NS
YAL003w   206  -        NS
YAL004w   215  -        NS
YAL005c   642  YLL024c  HS   0.0
YAL005c   642  YER103w  HS   0.0
YAL005c   642  YBL075c  HS   0.0
YAL005c   642  YJL034w  HS   e-147
YAL005c   642  YDL229w  HS   e-130
YAL005c   642  YNL209w  HS   e-130
YAL005c   642  YJR045c  HS   e-100
YAL005c   642  YEL030w  HS   2e-96
YAL005c   642  YLR369w  HS   1e-87
YAL005c   642  YBR169c  HS   2e-47
YAL005c   642  YPL106c  HS   4e-47
YAL005c   642  YHR064c  HS   7e-38
YAL005c   642  YKL073w  HS   5e-24
YAL007c   215  YOR016c  HS   1e-44
YAL007c   215  YGL200c  IS   5e-05
YAL007c   215  YHR110w  IS   0.017
YAL007c   215  YDL018c  IS   0.021
- Paralogs - multiple matches
- Partitions/clustering
Multiple matches of sc in sc
ORF      matches in sc
YAL005c  13
YAL007c   1
YDR214w   1
YDR216w   2
YDR399w   1
YDR406w   9
YDR409w   1
YCR040w   1
YKL218c   1
YKL219w  14
YKL220c   6
YKL221w   2
YKL222c   3
YKL223w   5
YKL224c  22
YKR001c   2
YKR003w   5
YBR104w   6
YBR105c   1
YKR013w   2
YKR014c  13
....................................
Max: YDR477w  77
SC/CE
bestscce (SC / CE)
Query    Size  Hit       Sig  E-value
YAL002w  1176  C42C1.4   HS   2e-15
YAL003w   206  F54H12.6  HS   4e-22
YAL004w   215  -         NS
YAL005c   642  F26D10.3  HS   e-172
YAL007c   215  F57B10.5  HS   9e-08
YAL009w   259  F16D3.7   IS   0.013
YAL019w  1131  M03C11.8  HS   7e-92
YAL020c   333  F07C3.4   IS   7e-04
YAL021c   837  ZC518.3   HS   5e-47
allscce (SC / CE)
Query    Size  Hit       Sig  E-value
YAL002w  1176  C42C1.4   HS   2e-15
YAL003w   206  F54H12.6  HS   4e-22
YAL003w   206  Y41E3.10  HS   2e-17
YAL004w   215  -         NS
YAL005c   642  F26D10.3  HS   e-172
YAL005c   642  F44E5.4   HS   e-153
YAL005c   642  F44E5.5   HS   e-153
YAL005c   642  C12C8.1   HS   e-152
YAL005c   642  C15H9.6   HS   e-148
YAL005c   642  F43E2.8   HS   e-144
YAL005c   642  C37H5.8   HS   e-104
YAL005c   642  F11F1.1   HS   1e-77
YAL005c   642  F54C9.2   HS   4e-51
YAL005c   642  K09C4.3   HS   4e-47
YAL005c   642  T28F3.2   HS   2e-45
YAL005c   642  C30C11.4  HS   7e-43
YAL005c   642  T24H7.2   HS   2e-34
YAL005c   642  T14G8.3   HS   8e-33
CE/SC
bestcesc (CE / SC)
Query     Size  Hit      Sig  E-value
C42C1.4   1259  YAL002w  HS   8e-16
F54H12.6   213  YAL003w  HS   4e-20
F26D10.3   640  YER103w  HS   e-174
F26D10.3   640  YER103w  HS   e-174
F57B10.5   203  YAL007c  HS   7e-13
F16D3.7    516  YHL003c  IS   9e-04
M03C11.8  1038  YAL019w  HS   2e-87
AC3.1      356  -        NS
AC3.2      949  YLR189c  IS   0.038
AC3.3      425  -        NS
AC3.4      600  YNL326c  HS   1e-12
allcesc (CE / SC)
Query     Size  Hit      Sig  E-value
C42C1.4   1259  YAL002w  HS   8e-16
F54H12.6   213  YAL003w  HS   4e-20
F26D10.3   640  YER103w  HS   e-174
F26D10.3   640  YBL075c  HS   e-174
F26D10.3   640  YLL024c  HS   e-172
F26D10.3   640  YAL005c  HS   e-171
F26D10.3   640  YJL034w  HS   e-141
F26D10.3   640  YDL229w  HS   e-129
F26D10.3   640  YNL209w  HS   e-129
F26D10.3   640  YJR045c  HS   e-100
F26D10.3   640  YEL030w  HS   2e-97
F26D10.3   640  YLR369w  HS   1e-83
F26D10.3   640  YPL106c  HS   2e-45
F26D10.3   640  YBR169c  HS   5e-45
F26D10.3   640  YHR064c  HS   8e-36
F26D10.3   640  YKL073w  HS   3e-22
Reciprocal Best Hits (RBH)
segmatchSCCE
Test     siz   Hit       siz   e-val   %id  %sim  gap  Ssiz  dT   eT    dH   eH
YAL002w  1176  C42C1.4   1259  5e-14   16   44    7    674   438  1111  547  1196
YAL005c   642  F26D10.3   640  1e-159  73   84    ?    605     3   607    5   613
YAL005c   642  F44E5.5    645  1e-142  63   79    ?    605     3   607    5   611
YAL005c   642  F44E5.4    645  1e-142  63   79    ?    605     3   607    5   611
YAL005c   642  C12C8.1    643  1e-141  62   79    ?    605     3   607    5   611
YAL005c   642  C15H9.6    661  1e-137  60   78    ?    603     5   607   36   641
YAL005c   642  F43E2.8    657  1e-134  58   76    ?    606     1   606   29   637
YAL005c   642  C37H5.8    657  1e-96   46   67    ?    606     2   607   31   632
YAL005c   642  F11F1.1b   607  1e-73   36   60    ?    599     4   602    2   600
YAL005c   642  F11F1.1a   614  8e-72   36   60    ?    599     4   602    2   607
YAL005c   642  F54C9.2    469  3e-47   38   66    ?    379     2   380   52   433
YAL005c   642  K09C4.3    310  2e-43   71   88    ?    186     4   189    6   192
YAL005c   642  K09C4.3    310  1e-04   54   70    ?     61   327   387  189   249
YAL005c   642  C30C11.4   776  1e-39   26   50    8    600     5   604    4   647
YAL005c   642  T24H7.2    925  1e-31   24   50    3    506     4   509   26   548
YAL005c   642  T14G8.3    926  3e-30   24   51    6    510     4   513   28   560
gap values of the rows marked "?", in their original order (one value is missing, so they are not assigned to rows): 0 0 0 0 1 1 2 0 2 2 0
Conclusion
Large-scale analyses of completely sequenced genomes allow a systematic view of genes and genome organization, and of their macro- as well as their micro-evolution.
They are the starting step for the sophisticated evolutionary analyses that will be dealt with during this course.
Practical sessions
(see text)