The Pfam protein family database: an introduction
Marco Punta
TUM, December 2013
Evolution
Homology
[Figure: tree with species A, B and C and their proteins P1A, P1B, P1C and P1C'; relationships labelled homologs, orthologs and paralogs]
Gene duplication image: http://upload.wikimedia.org/wikipedia/commons/thumb/7/72/Gene-duplication.png/220px-Gene-duplication.png
Xenologs -> horizontal gene transfer
http://textbookofbacteriology.net/resantimicrobial_3.html
Point I: Evolutionary relationships between proteins
Definition: we call groups of evolutionarily related proteins "families"
Why do we care?
An example
Point II: Proteins in the same family can retain common functional attributes
Myoglobins
Myoglobin:
Serves as a reserve supply of oxygen and facilitates the movement of oxygen within muscles.
Human:   1 MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE 60
           MGLSDGEWQLVLNVWGKVEAD  GHGQEVLI LFK HPETL KFDKFK LKSE  MK SE
Mouse:   1 MGLSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKNLKSEEDMKGSE 60

H:  61 DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH 120
           DLKKHG TVLTALG ILKKKG H AEI PLAQSHATKHKIPVKYLEFISE II VL   H
Mouse:  61 DLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIPVKYLEFISEIIIEVLKKRH 120

H: 121 PGDFGADAQGAMNKALELFRKDMASNYKELGFQG 154
            GDFGADAQGAM KALELFR D A  YKELGFQG
Mouse: 121 SGDFGADAQGAMSKALELFRNDIAAKYKELGFQG 154
Why do we care? (II)
Gap between sequenced and annotated proteins
Number of sequences in UniProt 2013_11: 48,180,424
Number of sequences in Swiss-Prot 2013_11: 541,762
Number of sequences in PDB 2013_12_03: 53,888 (100% RR)
Point III: The vast majority of proteins have not yet been experimentally functionally characterised.
Detecting Homology
Sequence similarity
Human:   1 MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE 60
           MGLSDGEWQLVLNVWGKVEAD  GHGQEVLI LFK HPETL KFDKFK LKSE  MK SE
Mouse:   1 MGLSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKNLKSEEDMKGSE 60

H:  61 DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH 120
           DLKKHG TVLTALG ILKKKG H AEI PLAQSHATKHKIPVKYLEFISE II VL   H
Mouse:  61 DLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIPVKYLEFISEIIIEVLKKRH 120

H: 121 PGDFGADAQGAMNKALELFRKDMASNYKELGFQG 154
            GDFGADAQGAM KALELFR D A  YKELGFQG
Mouse: 121 SGDFGADAQGAMSKALELFRNDIAAKYKELGFQG 154
20^154 possibilities!!
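A small sketch of how the sequence similarity on this slide could be quantified. It counts identical positions between the two myoglobin sequences (copied from the slide, which aligns them without gaps) and computes the size of the sequence space for a 154-residue protein. The function name `percent_identity` is made up for illustration.

```python
# Human and mouse myoglobin as shown on the slide (154 residues each, no gaps).
human = ("MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE"
         "DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH"
         "PGDFGADAQGAMNKALELFRKDMASNYKELGFQG")
mouse = ("MGLSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKNLKSEEDMKGSE"
         "DLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIPVKYLEFISEIIIEVLKKRH"
         "SGDFGADAQGAMSKALELFRNDIAAKYKELGFQG")


def percent_identity(a, b):
    """Return (number of identical positions, percent identity) for two
    equal-length, already aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same aligned length")
    identical = sum(x == y for x, y in zip(a, b))
    return identical, 100.0 * identical / len(a)


identical, pid = percent_identity(human, mouse)
print(f"{identical}/{len(human)} identical positions ({pid:.1f}%)")

# Number of distinct sequences of this length over a 20-letter alphabet.
n_possible = 20 ** len(human)
print(f"20^{len(human)} has {len(str(n_possible))} decimal digits")
```

Against such a large sequence space, this level of identity is very unlikely to arise by chance, which is the point the slide makes.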
TUM, December 2013
Detecting Homology
Structural similarity
2G2X:  1 MAYWLMKSEPDELSIEALARLGEARWDGVRNYQARNFLRAMSVGDEFFFYH-----SSCP 55
         MAYWL     D              W     Y   N      VGD    Y
2P5D:  4 MAYWLCITNEDNWKVIKEKKI----WGVAERY--KNTINKVKVGDKLIIYEIQRSGKDYK 57

2G2X: 56 QPGIAGIARITRAAYPD------PTALDPESHY 82
          P I G        Y D      PT   P
2P5D: 58 PPYIRGVYEVVSEVYKDSSKIFKPTPRNPNEKF 90
Detecting Homology
Structural similarity
2g2x
Detecting Homology
Genome context
http://www.microbesonline.org
TUM, December 2013
Detecting Homology
Protein context
http://www.microbesonline.org
TUM, December 2013
Detecting Homology
Point IV: we can detect homology in different ways: sequence similarity, structural similarity, etc.
Pfam in 4 moves
What
Point I: Evolutionary relationships (homology) between some proteins (homologous groups = families)
Why
Point II: Proteins in the same family can retain common functional attributes
Point III: The vast majority of proteins have not yet been experimentally functionally characterised
How
Point IV: We can detect homology via sequence similarity
The Pfam Database of protein families
Pfam families are:
•  Groups of sequence-conserved protein regions
Punta et al. NAR 2012
Why ‘regions’?
Query= P29973.3 CNGA1_HUMAN cGMP-gated cation channel alpha-1
(690 letters)
[Domain diagram: cNMP_binding]

>Q13237.1 KGP2_HUMAN cGMP-dependent protein kinase 2 (EC 2.7.11.12)
Length = 762
[Domain diagram: cNMP_binding cNMP_binding]

Score = 41.6 bits (96), Expect = 1e-07
Identities = 30/110 (27%), Positives = 54/110 (49%), Gaps = 11/110 (10%)

Query: 474 LKKVRIFADCEAGLLVELVLKLQPQVYSPGDYICKKGDIGREMYIIKEGKLAVV----AD 529
           L+ V +  +     L +++  L+ + Y  GDYI ++G+ G   +I+ +GK+ V
Sbjct: 281 LRSVSLLKNLPEDKLTKIIDCLEVEYYDKGDYIIREGEEGSTFFILAKGKVKVTQSTEGH 340

Query: 530 DGVTQFVVLSDGSYFGEISILNIKGSKAGNRRTANIKSIGYSDLFCLSKD 579
           D       L  G YFGE +++      + + R+ANI +   +D+ CL  D
Sbjct: 341 DQPQLIKTLQKGEYFGEKALI------SDDVRSANIIA-EENDVACLVID 383
Why ‘regions’?
cNMP_binding
TUM, December 2013
Pfam ‘types’
Domain
Repeat
Motif
Family
TUM, December 2013
Domains and repeats
[Figure panels A, B, C, D]
•  A - Domain
•  B - Metal-stabilised domain
•  C - 7 repeats form a domain
•  D - 9 repeats form a domain; the number of repeats could be unlimited
Not always that easy
Enoyl-CoA hydratase/isomerase family
3bpt
TUM, December 2013
Not always that easy
Enoyl-CoA hydratase/isomerase family
1ef8
TUM, December 2013
Not always that easy
3bpt
TUM, December 2013
Motifs
Example: Lipoprotein attachment site, LPAM_1
Alignment coloured by Residue-type
TUM, December 2013
Motifs
Example: GoLoco G-protein regulatory motif
TUM, December 2013
Family
All that is left!
TUM, December 2013
Less than half of Pfam families have known structure
[Pie chart] with PDB: 43%; no PDB: 57% (100% = all Pfam families)
Disordered Families
PDBid: 2JGC
TUM, December 2013
Beginnings
Sonnhammer et al. Proteins 1997
Building a Pfam family
SEED alignment (representative members)
SEED sequences aligned with MAFFT (mafft.cbrc.jp/alignment/software/) or with MUSCLE (http://www.ebi.ac.uk/Tools/msa/muscle/)
Pfam (http://pfam.sanger.ac.uk/)
Legend: Manually curated / Automatically made
Building a Pfam family
SEED alignment (representative members) -> Profile-HMM (HMMER 3.0) -> Search UniProt -> Scored sequences -> QC and re-iteration
HMMER3 (http://hmmer.janelia.org/)
Legend: Manually curated / Automatically made
Building a Pfam family
Family 1
Family 2
TUM, December 2013
Building a Pfam family
Family 1
Family 1
TUM, December 2013
Building a Pfam family
Clan
Family 1
Family 2
TUM, December 2013
Building a Pfam family
Family 1
Family 2
TUM, December 2013
Building a Pfam family
SEED alignment -> Profile-HMM (HMMER 3.0) -> Search UniProt -> FULL alignment
Literature annotation
Legend: Manually curated / Automatically made
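The build-and-search steps of this workflow can be sketched with the HMMER3 command-line programs `hmmbuild` and `hmmsearch`. The file names below are placeholders, and the E-value cutoff is only an illustrative assumption (Pfam itself relies on curated per-family thresholds). The helper functions `hmmer_commands` and `run_pipeline` are hypothetical names, not part of HMMER or Pfam.

```python
import shutil
import subprocess


def hmmer_commands(seed_alignment, hmm_file, sequence_db, hits_table, evalue=1e-3):
    """Build the argument lists for the two HMMER3 steps:
    1) hmmbuild: seed alignment -> profile-HMM
    2) hmmsearch: profile-HMM against a sequence database."""
    build = ["hmmbuild", hmm_file, seed_alignment]
    search = ["hmmsearch", "--tblout", hits_table, "-E", str(evalue),
              hmm_file, sequence_db]
    return build, search


def run_pipeline(seed_alignment, hmm_file, sequence_db, hits_table):
    """Run both steps if HMMER is installed; return False otherwise."""
    if shutil.which("hmmbuild") is None or shutil.which("hmmsearch") is None:
        return False
    for command in hmmer_commands(seed_alignment, hmm_file, sequence_db, hits_table):
        subprocess.run(command, check=True)
    return True


# Placeholder file names, for illustration only.
build_cmd, search_cmd = hmmer_commands("seed.sto", "family.hmm",
                                       "uniprot.fasta", "hits.tbl")
print(" ".join(build_cmd))
print(" ".join(search_cmd))
```

The scored hits in the table would then go through the manual QC and re-iteration step before the FULL alignment is accepted.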
Pfam Classification – level 1
Protein
TUM, December 2013
Pfam Classification – level 2
Family
Protein
TUM, December 2013
Pfam Classification – level 3
Clan
Family
Protein
TUM, December 2013
How far are we in protein classification?
Pfam 27.0: ~15,000 families
UniProtKB (snapshot 2012_6):
~23,000,000 sequences
TUM, December 2013
Pfam coverage of UniProtKB
[Bar chart: coverage (%) by sequence and by amino acid, for UniProtKB and Swiss-Prot (reviewed)]
Pfam coverage of Swiss-Prot
[Bar chart: coverage (%) by sequence and by amino acid, for UniProtKB and Swiss-Prot (reviewed)]
Pfam coverage considering disorder
[Bar chart: coverage (%) by sequence and by amino acid, for UniProtKB and Swiss-Prot (reviewed), with disordered residues shown]
IUPred: Dosztányi et al. JMB 2005
What is next?
15,000 families = 80% UniProtKB
With 3750 more families we fulfil the Pfam dream of covering 100% of the protein universe!!!
Family size in Pfam 27.0
[Histogram: % of all families vs. family size]
Targeting human, model organisms and pathogens
Pfam coverage of the human proteome (Swiss-Prot):
Residues in Pfam-A or Pfam-B families: 55%
Regions <50 residues (unlikely to contain a domain): 6%
Residues predicted to be in signal peptide regions: 1%
Uncovered by Pfam: 38%
http://xfam.wordpress.com/2013/05/07/pfam-targets-conserved-human-regions/ and Mistry et al. Database 2013
Many disordered regions: are they conserved? can we align them?
Mistry et al. Database 2013
Tomas di Domenico
Proteins do team work (e.g. pathways, organelles)
HDL-mediated lipid transport
Ruth Eberhardt
Acknowledgements
Sanger Institute Pfam Team
Jaina Mistry
Janelia Farm Pfam Team
Rob Finn
Sean Eddy
Tomas di Domenico
Alex Bateman
Stockholm Pfam Team
Erik Sonnhammer
Created from titles of 483 papers citing Punta et al. NAR 2012 (source: Scopus)
FAMILY INTERACTIONS PROVIDE DEVELOPMENT WITHIN COMPLEX CHALLENGES
HRF, Oct 10th 2013
Using Pfam for large scale sequence analysis
Mistry et al. Acta Cryst. D 2013
Is Pfam useful?
We do a lot of preliminary work for you.
You may take a sequence and run it against a database, then look at the annotations of significant matches and try to get an idea of what the function of your protein might be.
If your protein is a close homolog of the best annotated matches (e.g. the two myoglobins we saw before), this may well be the best approach (look out for alternatively spliced isoforms!).
If homology is more remote, you are likely to hit a number of stumbling blocks:
1 - the closest annotated sequence matches my sequence only partially (e.g. my sequence is 250 amino acids long, the other is 100 amino acids long)
1a - my sequence matches sequences with radically different annotations
Digression: how to align protein sequences
Imagine now that you want to find all homologs of human myoglobin and that you already know a number of homologs (e.g. from structural knowledge): it would be cool to align not pairwise but against the whole set of sequences. There is extra information -> the scoring matrix becomes position-specific rather than fixed.
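To make the idea of a position-specific matrix concrete, here is a minimal sketch that turns a tiny gap-free alignment into log-odds scores per column. The toy alignment, the reduced alphabet, the uniform background and the pseudocount are all assumptions chosen for illustration; real tools use the full 20-letter alphabet, sequence weighting and better priors.

```python
import math
from collections import Counter


def build_pssm(alignment, background, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2 odds score.
    Assumes equal-length sequences without gaps."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for residue in alphabet:
            frequency = (counts[residue] + pseudocount) / total
            scores[residue] = math.log2(frequency / background[residue])
        pssm.append(scores)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores of a sequence of the same length."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))


# Toy data (hypothetical): four aligned 4-residue fragments, 7-letter alphabet.
toy_alignment = ["MGLS", "MGLT", "MALS", "MGIS"]
toy_alphabet = "AGILMST"
background = {residue: 1.0 / len(toy_alphabet) for residue in toy_alphabet}

pssm = build_pssm(toy_alignment, background)
print(round(score_sequence(pssm, "MGLS"), 2))
print(round(score_sequence(pssm, "AAAA"), 2))
```

The same residue now scores differently depending on the column, which is exactly the extra information a fixed substitution matrix cannot capture.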
Digression: how to align protein sequences
Recipe for aligning sequences:
1) A scoring method
2) A way to find the best score(s) between two seqs
3) A way to estimate significance
Digression: how to align protein sequences
Recipe for aligning sequences:
1)  Substitution matrix
2)  Dynamic programming
3)  Extreme value distribution statistics
Substitution matrices
BLOSUM (BLOcks SUbstitution Matrices): Henikoff and Henikoff PNAS (1992)
PAM (Point Accepted Mutation matrices): Dayhoff et al. Nat. Biomed. Res. Found. (1978)
Substitution matrices
Human:   1 MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE 60
           MGLSDGEWQLVLNVWGKVEAD  GHGQEVLI LFK HPETL KFDKFK LKSE  MK SE
Mouse:   1 MGLSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKNLKSEEDMKGSE 60

H:  61 DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH 120
           DLKKHG TVLTALG ILKKKG H AEI PLAQSHATKHKIPVKYLEFISE II VL   H
Mouse:  61 DLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIPVKYLEFISEIIIEVLKKRH 120

H: 121 PGDFGADAQGAMNKALELFRKDMASNYKELGFQG 154
            GDFGADAQGAM KALELFR D A  YKELGFQG
Mouse: 121 SGDFGADAQGAMSKALELFRNDIAAKYKELGFQG 154
Substitution matrices
2G2X:  1 MAYWLMKSEPDELSIEALARLGEARWDGVRNYQARNFLRAMSVGDEFFFYH-----SSCP 55
         MAYWL     D              W     Y   N      VGD    Y
2P5D:  4 MAYWLCITNEDNWKVIKEKKI----WGVAERY--KNTINKVKVGDKLIIYEIQRSGKDYK 57

2G2X: 56 QPGIAGIARITRAAYPD------PTALDPESHY 82
          P I G        Y D      PT   P
2P5D: 58 PPYIRGVYEVVSEVYKDSSKIFKPTPRNPNEKF 90
if L1=300, L2=300: 10^179 possible alignments!
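The order of magnitude quoted on this slide can be checked with a short calculation. One common textbook way of counting global alignments of two sequences of length n is the central binomial coefficient C(2n, n), roughly 2^(2n) / sqrt(pi * n); whether the slide used exactly this counting is an assumption. The function name is made up for illustration.

```python
import math


def count_global_alignments(n):
    """Central binomial coefficient C(2n, n), used here as an estimate of the
    number of global alignments of two sequences of length n."""
    return math.comb(2 * n, n)


n = 300
exact = count_global_alignments(n)
approx_log10 = 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)

print(f"C({2 * n}, {n}) has {len(str(exact))} decimal digits")
print(f"Stirling estimate: 10^{approx_log10:.2f}")
```

Enumerating that many alignments is out of the question, which motivates the dynamic programming approach.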
Dynamic programming
Given a suitable scoring matrix, we can use dynamic programming to find the optimal alignment, e.g. between two sequences.
Why is it called “dynamic programming”?
Because it is “something not even a Congressman could object to” and “it’s impossible to use the word dynamic in a pejorative sense”
Bellman R
Eddy SR Nature Biotech 2004
Dynamic programming
Eddy SR Nature Biotech 2004
Global and local alignments
Needleman-Wunsch -> global
Smith-Waterman -> local (BLAST, FASTA)
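The difference between the two algorithms is small in code. The sketch below computes only the optimal score (no traceback) with a simple match/mismatch scheme and a linear gap penalty; these scoring values are assumptions for illustration, whereas real tools use substitution matrices and affine gap costs. Setting `local=True` gives the Smith-Waterman variant, otherwise Needleman-Wunsch.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score by dynamic programming.
    local=False: Needleman-Wunsch (global); local=True: Smith-Waterman."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = table[i - 1][j] + gap      # gap in b
            left = table[i][j - 1] + gap    # gap in a
            cell = max(diagonal, up, left)
            if local:
                cell = max(cell, 0)         # a local alignment may restart anywhere
            table[i][j] = cell
            best = max(best, cell)
    return best if local else table[-1][-1]


print(alignment_score("ACGT", "AGT"))                       # global
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))  # local
```

The table has (L1+1) x (L2+1) cells, so the cost grows with the product of the sequence lengths rather than with the number of possible alignments.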
Evaluating significance
Extreme value distribution:
Ungapped alignments -> proved
Gapped alignments -> overwhelming evidence
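For reference, the extreme value (Gumbel) form used for local alignment scores can be written as below. This is the standard Karlin-Altschul formulation rather than something stated on the slide; m and n denote the query and database lengths, and K and λ are parameters that depend on the scoring system.

```latex
P(S \ge x) \approx 1 - \exp\!\left(-K\,m\,n\,e^{-\lambda x}\right),
\qquad
E = K\,m\,n\,e^{-\lambda S},
\qquad
S' = \frac{\lambda S - \ln K}{\ln 2}
\;\Rightarrow\;
E = m\,n\,2^{-S'}
```

Here S is the raw score, S' the bit score and E the expected number of chance hits with a score of at least S.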
Profile-HMMs: a brief introduction
Example 2: Football
FC Augsburg 2013/14 Bundesliga Fixtures
Date    Status  Home              Score  Away            Attendance  Competition  Result
Aug 10  FT      FC Augsburg       0-4    Borussia Dort.  30,660      Bundesliga   L
Aug 17  FT      Werder Bremen     1-0    FC Augsburg     40,112      Bundesliga   L
Aug 25  FT      FC Augsburg       2-1    VfB Stuttgart   30,030      Bundesliga   W
Aug 31  FT      Nurnberg          0-1    FC Augsburg     37,239      Bundesliga   W
Sep 14  FT      FC Augsburg       2-1    SC Freiburg     28,453      Bundesliga   W
Sep 21  FT      Hannover 96       2-1    FC Augsburg     39,200      Bundesliga   L
Sep 27  FT      FC Augsburg       2-2    Borussia Mon.   30,352      Bundesliga   D
Oct 5   FT      Schalke 04        4-1    FC Augsburg     60,731      Bundesliga   L
Oct 20  FT      FC Augsburg       1-2    VfL Wolfsburg   27,554      Bundesliga   L
Oct 26  FT      Bayer Leverkusen  2-1    FC Augsburg     27,811      Bundesliga   L
Nov 3   FT      FC Augsburg       2-1    Mainz           28,007      Bundesliga   W
Nov 9   FT      Bayern Munich     3-0    FC Augsburg     71,000      Bundesliga   L
Profile-HMMs: a brief introduction
Example 2: Football
[State diagram: W, D, L; probabilistic transitions]
Profile-HMMs: a brief introduction
Example 2: Football
Our model for Augsburg’s Bundesliga results:
3 states: W, D, L
S(t) = F(S(t-1))
States S connected by probabilities p_ij ≥ 0; Σ_j p_ij = 1
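As a rough illustration of this three-state model, the transition probabilities p_ij can be estimated by counting consecutive result pairs. The result string below is read off the fixture list in this talk (Aug 10 to Nov 9); the function name is made up, and with only 12 matches the estimates are of course very noisy.

```python
from collections import Counter, defaultdict

# Augsburg's results in chronological order, taken from the fixture list.
results = "LLWWWLDLLLWL"


def estimate_transitions(sequence, states="WDL"):
    """Maximum-likelihood estimate of p_ij from counts of i -> j transitions."""
    counts = defaultdict(Counter)
    for previous, current in zip(sequence, sequence[1:]):
        counts[previous][current] += 1
    matrix = {}
    for i in states:
        total = sum(counts[i].values())
        matrix[i] = {j: (counts[i][j] / total if total else 0.0) for j in states}
    return matrix


p = estimate_transitions(results)
for state, row in p.items():
    print(state, {j: round(value, 2) for j, value in row.items()})
```

Each row of the estimated matrix sums to 1, matching the constraint Σ_j p_ij = 1.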
Profile-HMMs: a brief introduction
Example 2: Football
FC Augsburg 2013/14 Bundesliga Fixtures
Date    Status  Home              Score  Away            Attendance  Competition  H/A  Result
Aug 10  FT      FC Augsburg       0-4    Borussia Dort.  30,660      Bundesliga   H    L
Aug 17  FT      Werder Bremen     1-0    FC Augsburg     40,112      Bundesliga   A    L
Aug 25  FT      FC Augsburg       2-1    VfB Stuttgart   30,030      Bundesliga   H    W
Aug 31  FT      Nurnberg          0-1    FC Augsburg     37,239      Bundesliga   A    W
Sep 14  FT      FC Augsburg       2-1    SC Freiburg     28,453      Bundesliga   H    W
Sep 21  FT      Hannover 96       2-1    FC Augsburg     39,200      Bundesliga   A    L
Sep 27  FT      FC Augsburg       2-2    Borussia Mon.   30,352      Bundesliga   H    D
Oct 5   FT      Schalke 04        4-1    FC Augsburg     60,731      Bundesliga   A    L
Oct 20  FT      FC Augsburg       1-2    VfL Wolfsburg   27,554      Bundesliga   H    L
Oct 26  FT      Bayer Leverkusen  2-1    FC Augsburg     27,811      Bundesliga   A    L
Nov 3   FT      FC Augsburg       2-1    Mainz           28,007      Bundesliga   H    W
Nov 9   FT      Bayern Munich     3-0    FC Augsburg     71,000      Bundesliga   A    L
Profile-HMMs: a brief introduction
Example 2: Football
[Diagram: W, D, L together with H and A]
Profile-HMMs: a brief introduction
Symbols
States
Example 2: The weather
H/L pressure
Profile-HMMs: a brief introduction
Symbols
States
Example 2: The weather
HHHHLLLLHH
Profile-HMMs: a brief introduction
HMMs are probabilistic models defined by:
§ A finite set I of states
§ A discrete alphabet X of symbols (observed objects)
§ A probability transition matrix T = (t_ij), with i, j states
§ A probability emission matrix E = (e_ix), with i a state and x a symbol
Profile-HMMs: a brief introduction
HMMs are probabilistic models defined by:
§ A finite set S of states
§ A discrete alphabet A of symbols (observed objects)
§ A probability transition matrix T = (t_ij), with i, j states
§ A probability emission matrix E = (e_ix), with i a state and x a symbol
[Diagram: states linked by transition probabilities t_ij; each state emits symbols with emission probabilities e_ix]
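The four ingredients above are enough to compute the probability of an observed symbol sequence with the forward algorithm. The sketch below uses a two-state weather-style model (hidden states H and L for high and low pressure, observed symbols S and R for sun and rain); all numbers, the start distribution and the names are invented for illustration only.

```python
# Hypothetical model: hidden pressure states, observed weather symbols.
STATES = ("H", "L")
START = {"H": 0.5, "L": 0.5}                    # initial state probabilities
TRANS = {"H": {"H": 0.8, "L": 0.2},             # t_ij
         "L": {"H": 0.3, "L": 0.7}}
EMIT = {"H": {"S": 0.9, "R": 0.1},              # e_ix
        "L": {"S": 0.2, "R": 0.8}}


def forward(observations, states, start, trans, emit):
    """Probability of the observed symbols, summed over all hidden state paths."""
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        previous = alpha
        alpha = {j: sum(previous[i] * trans[i][j] for i in states) * emit[j][symbol]
                 for j in states}
    return sum(alpha.values())


print(forward("SSR", STATES, START, TRANS, EMIT))
```

Profile-HMM software applies the same kind of recursion, with match, insert and delete states in place of the two weather states.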
Profile-HMMs: a brief introduction
States = 1,2 ;
Symbols=A,B,C,D
Profile-HMMs: a brief introduction
States = M,S ;
1
Symbols=A,B,C,D
Profile-HMMs: a brief introduction
HMMs are probabilistic models, i.e. models that produce different outcomes with different probabilities
Example: probability model for a protein sequence
M V R G K T Q M K R I E N A T S R Q V T F
p_M · p_V · p_R · p_G · p_K · p_T · p_Q · p_M · p_K · p_R · p_I · p_E · p_N · p_A · p_T · p_S · p_R · p_Q · p_V · p_T · p_F
p_a, a = A, M, S, W, …; 20 probabilities p_i, i = 1, …, 20, with p_i > 0
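In this simplest model each residue is drawn independently, so the probability of a sequence is the product of its per-residue probabilities. The sketch below evaluates that product in log space for the example sequence; the uniform probabilities (1/20 each) are a placeholder assumption, since the slide does not give values for p_a.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder model: every amino acid equally likely (not real frequencies).
uniform = {residue: 1.0 / len(AMINO_ACIDS) for residue in AMINO_ACIDS}


def sequence_log_prob(sequence, probabilities):
    """Natural-log probability of a sequence under an independent-residue model."""
    return sum(math.log(probabilities[residue]) for residue in sequence)


sequence = "MVRGKTQMKRIENATSRQVTF"   # the example sequence from the slide
log_p = sequence_log_prob(sequence, uniform)
print(f"log P = {log_p:.2f}")
```

Working with logarithms avoids numerical underflow, which matters once sequences are hundreds of residues long.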
Profile-HMMs: a brief introduction
Example 2: The weather
Probabilistic, 1st order Markov model
p_ij ≥ 0; Σ_j p_ij = 1
Aligning to a family
[Profile-HMM state diagram: states M, I and D, then with S and E added]
Profile-HMMs: a brief introduction
Profile-HMMs are probabilistic models…
Model
-> simulates a system
Probabilistic -> produces outcomes based on probabilities
Profile-HMMs: a brief introduction
Example 1: The traffic light
Deterministic
R
Y
G
Profile-HMMs: a brief introduction
Example 2: The weather
Probabilistic*
[State diagram: S, C, R]
p_ij: transition probability from symbol i to symbol j
*In fact, chaotic, deterministic
Profile-HMMs: a brief introduction
Example 2: The weather
Probabilistic, 1st order Markov model
P(X_{n+1} = x | X_1 = x_1, …, X_n = x_n) = P(X_{n+1} = x | X_n = x_n)
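A direct consequence of the first-order Markov property is that the probability of a whole state sequence factorises into one-step transitions. Written with the transition probabilities p_ij used in this talk (the initial term P(X_1 = x_1) is an extra ingredient not shown on the slides):

```latex
P(X_1 = x_1, \dots, X_n = x_n)
  = P(X_1 = x_1) \prod_{k=2}^{n} P(X_k = x_k \mid X_{k-1} = x_{k-1})
  = P(X_1 = x_1) \prod_{k=2}^{n} p_{x_{k-1} x_k}
```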