Lecture (VL) Algorithmische BioInformatik (19710)
Winter semester 2015/2016 (WS2015/2016)
Week 8 - Monday
Tim Conrad
AG Medical Bioinformatics
Institut für Mathematik & Informatik, Freie Universität Berlin
Lecture topics
Part 1: Background Basics (4)
1. The Nucleic Acid World
2. Protein Structure
3. Dealing with Databases
Part 2: Sequence Alignments (3)
4. Producing and Analyzing Sequence Alignments
5. Pairwise Sequence Alignment and Database Searching
6. Patterns, Profiles, and Multiple Alignments
Part 3: Evolutionary Processes (3)
7. Recovering Evolutionary History
8. Building Phylogenetic Trees
Part 4: Genome Characteristics (4)
9. Revealing Genome Features
10. Gene Detection and Genome Annotation
Part 5: Secondary Structures (4)
11. Obtaining Secondary Structure from Sequence
12. Predicting Secondary Structures
Part 6: Tertiary Structures (4)
13. Modeling Protein Structure
14. Analyzing Structure-Function Relationships
Part 7: Cells and Organisms (8)
15. Proteome and Gene Expression Analysis
16. Clustering Methods and Statistics
17. Systems Biology
Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016
MUMmer: Algorithm
1. Read two genomes
2. Find the Maximal Unique Matches (MUMs) of the genomes using a suffix tree
3. Sort and order the MUMs using LIS
4. Close the gaps in the alignment (SNPs, mutation regions, repeats, tandem repeats)
5. Output the alignment:
• MUMs
• regions that do not match exactly
Suffix tree
• Purpose: find long matching substrings of a string quickly
• Definition: a compact representation of all possible suffixes of an input string S
• Can be built in O(m) time and space, where m = |S|
• Searching for a substring X takes O(n) time, where n = |X|
Suffix Trees
Example: TORONTO$
‘$’ is the terminating character
[Figure: suffix tree of TORONTO$, leaves labeled with suffix start positions 0 to 6]
Searching for ‘ONT’: follow the path spelling O, N, T from the root
‘ONT’ at position 3 in S
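The TORONTO$ search can be sketched in a few lines. This is a minimal illustration only: it builds an uncompressed suffix trie in O(m^2) time, not the linear-time compact suffix tree described on the slides, and the function names `build_suffix_trie` and `find_occurrences` are made up for this example.

```python
def build_suffix_trie(s):
    """Naive suffix trie as nested dicts (assumption: small inputs only)."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        # Multi-character key, so it cannot clash with a single-letter edge.
        node["#leaf"] = start
    return root


def find_occurrences(root, pattern):
    """Walk down the pattern, then collect all leaf positions below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "#leaf":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("TORONTO$")
print(find_occurrences(trie, "ONT"))  # [3]
print(find_occurrences(trie, "TO"))   # [0, 5]
```

Walking down the pattern costs time proportional to its length; collecting the leaves adds time proportional to the size of the subtree below the match.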
Maximal Unique Match
Sequences in genomes A and B that:
• occur exactly once in A and in B
• are not contained in any larger such sequence
Genome A: tcgatcGACGATCGC…AGCATAAcgact
Genome B: gcattaGACGATCGC…AGCATAAtcca
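The MUM definition can be checked directly on small strings. The following brute-force sketch is an assumption-laden illustration (quadratic or worse, no suffix tree, hypothetical function names); it only mirrors the two conditions above: a match must be unextendable on both sides and must occur exactly once in each sequence.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def naive_mums(a, b, min_len=1):
    """Return sorted (pos_in_a, pos_in_b, length) triples of all MUMs."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # extendable to the left, so not maximal
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1  # extend to the right as far as possible
            word = a[i:i + k]
            if (k >= min_len
                    and count_occurrences(a, word) == 1
                    and count_occurrences(b, word) == 1):
                mums.append((i, j, k))
    return sorted(mums)


# Shortened version of the slide example (the elided middle part is left out).
print(naive_mums("tcgatcGACGATCGCcgact", "gcattaGACGATCGCtcca", min_len=5))
# [(6, 6, 9)]
```

The `min_len` parameter plays the role of a minimum MUM length; its value here is arbitrary.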
Finding, sorting MUMs
• MUM: internal node with a leaf from each genome in its subtree
• With a single scan of the suffix tree, find all MUMs
• Sort MUMs based on their position in genome A
Finding MUMs from a suffix tree
Matching MUMs
A: 1 2 3 4 5 6 7
B: 1 3 2 6 4 5 7
Select the longest consistent set of MUMs occurring in the same order in A and B
A: 1 2 4 5 7
B: 1 2 4 5 7
Choosing MUMs
Configuration can be uniquely represented by a permutation:
P = {1, 2, 3, 4, 6, 7, 5};
LIS(P) = {1, 2, 3, 4, 6, 7}
Determining the optimal sequence of MUMs reduces to finding the longest increasing subsequence (LIS) of P
IS Definition
Increasing Subsequence: values (strictly) increase from left to right
Sequence P = {4, 2, 1, 5, 8, 6, 9, 10}
Examples of two increasing subsequences: {4, 5, 9} or {2, 5, 6, 9, 10}
The longest increasing subsequence can be found by a greedy algorithm (find a minimum cover)
Cover of P: set of decreasing subsequences of P that together contain all numbers of P
Matching MUMs
• Sort and LIS take O(K log K), which is O(N)
– K: the number of MUMs
– K << N / log N
– Actually two steps: greedily finding a minimum cover in O(K log K), then finding the LIS from the cover in O(K)
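The O(K log K) bound can be illustrated with a binary-search LIS routine. This is a sketch in the spirit of the greedy cover idea, not MUMmer's own code; the function name is hypothetical. Each element is placed by binary search, and back-pointers allow the subsequence itself to be recovered.

```python
from bisect import bisect_left


def longest_increasing_subsequence(p):
    """Return one longest strictly increasing subsequence of p."""
    tail_values = []   # tail_values[k]: smallest tail of any IS of length k+1
    tail_indices = []  # index in p of that tail
    previous = [None] * len(p)
    for idx, value in enumerate(p):
        k = bisect_left(tail_values, value)  # binary search: O(log K)
        if k == len(tail_values):
            tail_values.append(value)
            tail_indices.append(idx)
        else:
            tail_values[k] = value
            tail_indices[k] = idx
        previous[idx] = tail_indices[k - 1] if k > 0 else None
    # Follow the back-pointers from the tail of the longest subsequence.
    result = []
    idx = tail_indices[-1] if tail_indices else None
    while idx is not None:
        result.append(p[idx])
        idx = previous[idx]
    return result[::-1]


print(longest_increasing_subsequence([1, 2, 3, 4, 6, 7, 5]))  # [1, 2, 3, 4, 6, 7]
print(longest_increasing_subsequence([1, 3, 2, 6, 4, 5, 7]))  # [1, 2, 4, 5, 7]
```

The two calls reproduce the permutation from the "Choosing MUMs" slide and the MUM order in genome B from the "Matching MUMs" figure.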
Closing the Gaps
After the global MUM alignment is found, the local gaps need to be closed
Gap: interruption in the MUM alignment
Types of gaps:
• SNP (Single Nucleotide Polymorphism)
• Insertion
• Highly polymorphic region
• Repeat
How?
• Long gaps: repeat the procedure using a shorter minimum length for MUMs
• Short gaps: Smith-Waterman alignments
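For the short gaps, a Smith-Waterman local alignment score can be computed with a small dynamic program. The sketch below returns only the best score (no traceback), and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration, not the parameters used by MUMmer.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b, using two DP rows."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            value = max(0,                  # start a new local alignment
                        diagonal,           # align a[i-1] with b[j-1]
                        previous[j] + gap,  # gap in b
                        current[j - 1] + gap)  # gap in a
            current.append(value)
            best = max(best, value)
        previous = current
    return best


# The SNP example: 20 matching bases and one mismatch.
print(smith_waterman_score("cgtcatgggcgttcgtcgttg", "cgtcatgggcattcgtcgttg"))  # 19
```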
Closing the Gaps
SNP (Single Nucleotide Polymorphism):
Genome A: cgtcatgggcgttcgtcgttg
Genome B: cgtcatgggcattcgtcgttg
Insertion:
Genome A: cggggtaaccgc..................cctggtcggg
Genome B: cggggtaaccgcgttgctcggggtaaccgccctggtcggg
Highly polymorphic regions:
Genome A: ccgcctcgcctgg.gctggcgcccgctc
Genome B: ccgcctcgccagttgaccgcgcccgctc
Repeat sequence (imperfect repeat):
Genome A: cTGGGTGGGACAACGTaaaaaaaaaTGGGTGGGACAACGTc
Genome B: aTGGGTGGGACgACGTgggggggggTGGGTGGGACAACGTa
Some results from the original MUMmer paper
“Alignment of whole genomes”, Delcher et al.
Figure 7. Alignment of M. genitalium and M. pneumoniae using FASTA (top), 25mers (middle) and MUMs (bottom). In all three plots, a point indicates a ‘match’ between the genomes. In the FASTA plot a point corresponds to similar genes. In the 25mer plot, each point indicates a 25-base sequence that occurs exactly once in each genome. In the MUM plot, points correspond to MUMs as defined in the main text.
FASTA panel: 1000 bp segments. Pairs of sequences that were at least 50% identical over 80% of the match appear as points in the plot.
Some results
• Align two related bacteria, M. genitalium (580 kbp) and M. pneumoniae (816 kbp)
– Time: 6.5 s suffix tree construction; 0.02 s finding LIS; 116 s alignments
– Longest MUM 281 bp; 16 MUMs > 100 bp; <50% identical
• Align two highly homologous strains of M. tuberculosis (4.4 million bp)
– Time: 5 s suffix tree construction; 45 s sorting MUMs; 5 s Smith-Waterman alignments
– Longest MUM 24,563 bp; 249 MUMs > 5000 bp; >90% identical
Some results
• Alignment of two syntenic sequences from human chromosome 12 and mouse chromosome 6 (225 kbp)
• Time: 29 s in total, 1.6 s for the suffix tree
• Longest MUM 117 bp; 10 MUMs > 50 bp
MUMmer 2
Problems with MUMmer 1
• Aligns only DNA sequences
• Needs lots of memory
• Cannot align incomplete genomes
Solution: MUMmer 2
• 3x faster than MUMmer 1
• Requires 1/3 of the space
• Can align protein sequences and incomplete genomes
• Parallel alignment
Delcher et al., Nucleic Acids Research (2002)
http://www.tigr.org/software/mummer/MUMmer2.pdf
MUMmer 2
Alternative way to find the initial exact matches
• Identify where the query sequence would branch off from the tree, to find all matches
Unique match
• Wherever a branch occurs at a tree position with just a single leaf beneath it
Maximal match
• Using suffix links to find the next match (extended match)
• By checking the character immediately preceding the start of this match, we can determine whether it is a maximal match
• Find all maximal matches: time proportional to the length of the query
Suffix Trees
MUMmer wants to find all maximal unique matches for all suffixes:
E.g., for query ACCGTGCGTC, we want:
ACCGTGCGTC
CCGTGCGTC
CGTGCGTC
GTGCGTC
…
Up to some reasonable limit…
Idea: don’t go back to the root of the tree each time…
Suffix Trees
Suffix Links
• All internal, non-root nodes have a suffix link to another node
• If x is a single character and a is a (possibly empty) string (substring), then a node v whose path from the root spells xa (path-label xa) has a suffix link to the node v’ whose path-label is a
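The benefit of suffix links can be shown without building a tree. If the query matches the reference for k characters starting at position i, then dropping the first character leaves k-1 characters that are already known to match from position i+1; a suffix link encodes exactly this jump. The sketch below uses plain substring tests instead of a suffix tree, so it is not linear time, and the function name is hypothetical.

```python
def matching_statistics(reference, query):
    """ms[i] = length of the longest prefix of query[i:] found in reference."""
    ms = []
    length = 0
    for i in range(len(query)):
        # "Follow the suffix link": keep the part of the previous match that
        # is still valid instead of restarting from length 0 (the root).
        length = max(length - 1, 0)
        while i + length < len(query) and query[i:i + length + 1] in reference:
            length += 1
        ms.append(length)
    return ms


print(matching_statistics("TORONTO", "ONTARIO"))  # [3, 2, 1, 0, 1, 0, 1]
```

In a real suffix tree the `in` test is replaced by walking edges from the node reached via the suffix link, which is what makes the total work proportional to the query length.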
Suffix Links
The dotted lines indicate the suffix links. If you start at the blue node and follow the suffix links from there (from blue, to green, to first gray, to second gray), and look at the strings leading from the root to each node, you will see that each string is the previous one with its first character removed.
http://stackoverflow.com/questions/10168097/how-and-when-to-create-a-suffix-link-in-suffix-tree
Streaming algorithm - unique match
The match is unique because there is a single leaf below this position in the tree.
Streaming algorithm - maximal match
Suffix links are used to find the extended match
MUMmer 2
Improvements
• Uses only 20 bytes per bp (MUMmer 1: 38 bytes), Kurtz (1999)
• Build the suffix tree for the shorter sequence
• Find MUMs by streaming the second sequence against the suffix tree, Chang-Lawler (1994)
• Cluster the matches
Aligning the 4.7 Mb genome of E. coli and the 3.0 Mb large chromosome of V. cholerae:
         MUMmer 1        MUMmer 2
Time     74 s (1 GHz)    27 s (1 GHz)
Memory   293 MB          100 MB
New in MUMmer 2: Clustering step
• Needed to align an unfinished assembly that requires rearrangement
• Cluster MUMs
– After matches are identified, the interval lengths between matches are checked
– If the interval length between matches is less than a user-defined gap length, the matches are joined into a cluster
• Find the Longest Increasing Subsequence
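The clustering rule can be sketched as a single pass over the sorted matches. This is a simplified illustration under stated assumptions: matches are (start in A, start in B, length) triples, the gap is measured from the end of the previous match in both sequences, and `max_gap` stands in for the user-defined gap length. The real MUMmer clustering applies further criteria that are not modeled here.

```python
def cluster_matches(matches, max_gap):
    """Group matches whose distance to the previous match is below max_gap."""
    clusters = []
    for match in sorted(matches):
        start_a, start_b, _length = match
        if clusters:
            prev_a, prev_b, prev_len = clusters[-1][-1]
            gap_a = start_a - (prev_a + prev_len)
            gap_b = abs(start_b - (prev_b + prev_len))
            if gap_a < max_gap and gap_b < max_gap:
                clusters[-1].append(match)  # close enough: same cluster
                continue
        clusters.append([match])  # otherwise start a new cluster
    return clusters


example = [(0, 0, 10), (15, 14, 10), (100, 200, 10), (112, 215, 5)]
for cluster in cluster_matches(example, max_gap=10):
    print(cluster)
```

With `max_gap=10` the four example matches fall into two clusters of two matches each.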
NUCmer (NUCleotide MUMmer)
• For the finishing phase of assembly
• Multiple-contig alignment program
• Uses MUMmer 2
• Can be used to:
– Compare assemblies at different stages of a project
– Compare unfinished genomes to a closely related genome (speeds up the finishing step)
– Compare the outputs of two different assembly programs
NUCmer
Inputs: two multi-FASTA files
Output: alignment of every contig in the first file to every sequence in the second file
Algorithm
1. Create a map of all contig positions within each file
2. Concatenate the contigs in each file
3. Run MUMmer to find MUMs
4. Map the matches back to the separate contigs
5. Cluster MUMs
6. (Modified) Smith-Waterman DP alignment to align the sequence between MUMs
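Steps 1, 2 and 4 amount to keeping an offset table while concatenating and using binary search to map a position back. The sketch below is an illustration, not NUCmer's code: the function names are hypothetical, and the single "N" separator between contigs is an assumption made so that the example stays readable.

```python
from bisect import bisect_right


def concatenate_contigs(contigs, separator="N"):
    """contigs: list of (name, sequence). Returns (text, offsets, names)."""
    offsets, names, parts = [], [], []
    position = 0
    for name, sequence in contigs:
        offsets.append(position)  # where this contig starts in the text
        names.append(name)
        parts.append(sequence)
        position += len(sequence) + len(separator)
    return separator.join(parts), offsets, names


def map_back(position, offsets, names):
    """Translate a position in the concatenated text to (contig, offset)."""
    k = bisect_right(offsets, position) - 1
    return names[k], position - offsets[k]


text, offsets, names = concatenate_contigs(
    [("c1", "ACGT"), ("c2", "GG"), ("c3", "TTTA")])
print(text)                          # ACGTNGGNTTTA
print(map_back(6, offsets, names))   # ('c2', 1)
```

A match that spans a separator would have to be rejected or split; that check is left out of this sketch.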
PROmer
Protein-based alignment program
Input: two multi-FASTA files
Technique:
1. Translate DNA into amino acids (AA) in all 6 reading frames
2. Map each protein to the DNA sequence
3. Concatenate all potential proteins
4. Run MUMmer, cluster MUMs based on DNA coordinates
5. Examine a series of consecutive, consistent matches
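The first step, translation in all six reading frames, can be sketched with the standard genetic code. This is a minimal illustration and not PROmer's implementation; the function names and the frame labels "+1" to "-3" are choices made for this example, and incomplete or ambiguous codons are written as "X".

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code; codons enumerated in TCAG order (TTT, TTC, TTA, ...).
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame label to protein string ('*' = stop)."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


print(six_frame_translation("ATGGCCTAA"))
```

Keeping the frame label and offset with each translation is what later allows protein matches to be mapped back to DNA coordinates.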
Campylobacter PROmer analysis
Fouts et al. (PLoS Biol. 2005)
Major structural differences and novel potential virulence mechanisms from the genomes of multiple campylobacter species.
• One genome is used as the x-axis for all four pair-wise comparisons
• X-shape characteristic of collinearity interrupted by inversions around the origin or terminus of replication
• Loss of collinearity in more distant comparisons
Some results
• Align P. yoelii (5x coverage) and P. falciparum (8x coverage), size 25 Mb
– PROmer: time < 1 h
– BLAST: time ~ weeks
• >70% of human chromosome 14 is a duplication of part of chromosome 2
• Align E. coli (4.7 Mb) and V. cholerae (3 Mb) on a 1 GHz desktop computer
– MUMmer 1: 74 s, 293 MB
– MUMmer 2: 27 s, 100 MB
MUMmer 3
Improvements
• Optimized suffix-tree library
• Faster and requires 25% less memory (see Kurtz et al.)
• Non-unique maximal matches
• GUI
• Now open source
Align human vs. human genome
• Computer: Sun SPARC, Solaris OS, 64 GB, 950 MHz
• Size: 2,839 Mbp
• Time: suffix tree, 4.7 h; 4 GB memory; query, 101.5 h; total 4.5 days
Benchmarks MUMmer 2.1 vs. 3.0
MUMmer 3.0, page 4
Human Gut metagenome
Percent Identity Plot (PIP) of random shotgun reads to a complete Bifidobacterium genome and a good quality draft Methanobrevibacter genome
Gill et al. (Science, 2006): Metagenomic analysis of the human distal gut microbiome.
Bifidobacteria are anaerobic bacteria. They are ubiquitous, endosymbiotic inhabitants of the gastrointestinal tract, vagina and mouth (B. dentium) of mammals, including humans. Some bifidobacteria are used as probiotics.
Mauve Multiple Genome Aligner
• Able to identify and align collinear regions of multiple genomes even in the presence of rearrangements
• Find and extend seed matches
• Group into locally collinear blocks
• Align intervening regions
Darling et al. Genome Res. 2004 Jul;14(7):1394-403.
Progressive Mauve alignment of 12 E. coli genomes
The next sessions
Today
Book 11.1-11.3
Proteins 101
Protein Functions
• How do proteins do so much?
• Proteins FOLD spontaneously
• Assume a characteristic 3D SHAPE
• Shape depends on the particular Amino Acid Sequence
• Shape gives SPECIFIC function
What is protein structure?
Proteins are linear polymers that fold up by themselves…mostly.
Secondary Structure
http://www.abcte.org/files/previews/biology/
http://bioweb.wku.edu/courses/biol22000/3AAprotein/images/
What are proteins made of?
The parts of a protein
“Backbone”: N, C, C, N, C, C…
R: “side chain”
Two or more Amino Acids: Polypeptide
Peptide Bond
The amino acids
The amino acids can be grouped in many ways according to the chemical and physical properties (e.g. size) of the side chain.
Here is one grouping based on chemical properties:
• Basic: proton acceptors
• Acidic: proton donors
• Uncharged polar: have polar groups like CONH2 or CH2OH
• Nonpolar: tend to be hydrophobic
• Weird: proline links to the N in the main chain
• Strong: cysteine can make “disulphide bridges”
What forces determine protein structure?
Minimum free energy
• Proteins tend to fold naturally to the state of minimum free energy (Christian Anfinsen).
• This state is determined by forces due to interactions among the residues.
• Proteins usually fold in an aqueous environment, so interactions with water molecules are key.
• Some proteins fold in membranes, so interactions with lipids are important.
Atomic Bonds
• Covalent bonds – strong!
– Single bonds can usually rotate freely
– Double bonds are rigid
• Hydrogen bonds – weak
– Oxygen and Nitrogen share a proton (Hydrogen)
• Van der Waals forces – weaker still
Planar Peptide bond
• Resonance makes peptide bonds planar
Flexible C-alpha bonds
• Single bonds rotate
• The C-alpha bonds have two free rotation angles: phi and psi
Peptide Bonds
• Backbone can swivel: DIHEDRAL ANGLES
• 2 per Amino Acid
• Proteins can be 100’s of Amino Acids in length!
• Lots of freedom of movement
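A dihedral angle is defined by four consecutive atoms. For phi these are C(i-1), N(i), C-alpha(i), C(i); for psi they are N(i), C-alpha(i), C(i), N(i+1). The sketch below computes such an angle from 3D coordinates with the usual projection and atan2 formula; the helper names and the toy coordinates are assumptions for illustration, not taken from a real structure.

```python
import math


def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle (degrees) around the bond p1-p2."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(c / norm for c in b1)  # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Toy geometry: outer atoms on the same side of the bond (cis).
print(dihedral_degrees((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0)))  # 0.0
```

Applying this to every residue of a structure and plotting the (phi, psi) pairs gives a Ramachandran plot.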
Ramachandran Plots
If you plot phi vs. psi, you see that some combinations are preferred
[Figure: two Ramachandran plots, “Ideal” and “Real (a kinase)”]
What is secondary structure?
Certain repetitive structures are energetically favorable
• These make lots of hydrogen bonds among residues.
• They don’t encounter lots of steric hindrances.
• They occur over and over again in natural proteins.
• Some combinations of secondary structures are so common they are called “folds” (e.g., the SCOP database of protein folds).
What are the main secondary structures?
Alpha Helix
• 3.6 amino acids (residues) per turn
• O(i) hydrogen bonds to N(i+4)
Beta Sheet
A. Three strands shown
B. Anti-parallel sheet
C. Parallel sheet
• Sheets are usually curved and can even form barrels.
Beta Turns: getting around tight corners
• Steric hindrance determines whether a tight turn is possible
• R3’s side chain is usually hydrogen (R3 is glycine)
Supersecondary Structure
A: beta-alpha-beta
B: beta-meander
C: Greek-key
D: Greek-key
Tertiary Structure
Folds
Folds are a way to classify proteins by tertiary structure
SCOP: Structural Classification of Proteins
How is protein structure determined experimentally?
X-ray crystallography
• Needs crystallized proteins
• Hard to get crystals
• Very tough for hydrophobic (e.g. transmembrane) proteins
• Better accuracy than NMR
• Expensive: $100,000/protein
NMR spectroscopy
• Protons resonate at a frequency that depends on their chemical environment.
• This can be used to determine structure.
• Does not require crystallization; the protein may be in solution.
• Lower resolution than X-ray crystallography
Protein DataBank (PDB)
X-ray: 84,739
NMR: 10,223
How can protein structure be predicted in silico?
Tertiary structure prediction is still too hard
• Ab initio modeling
– Uses primary sequence only
– E.g., Rosetta
• Comparative modeling
– Uses sequence alignment to a protein of known structure
– E.g., Modeller
[Figure: Rosetta prediction]
Protein Structure
The Prediction Problem
Can we predict the final 3D protein structure knowing only its amino acid sequence?
• Studied for 4 decades
• “Holy Grail” in biological sciences
• Primary motivation for bioinformatics
• Based on this 1-to-1 mapping of sequence to structure
• Still very much an OPEN PROBLEM
PSP (Protein Structure Prediction): Goals
• Accurate 3D structures. But not there yet.
• Good “guesses”
• Working models for researchers
• Understand the FOLDING PROCESS
• Get into the Black Box
• Only hope for some proteins
• 25% won’t crystallize, too big for NMR
• Best hope for novel protein engineering
• Drug design, etc.
PSP: Major Hurdles
• Energetics
– We don’t know all the forces involved in detail
– Too computationally expensive BY FAR!
• Conformational search impossibly large
– 100 a.a. protein, 2 moving dihedrals, 2 possible positions for each dihedral: 2^200 conformations!
• Levinthal’s Paradox
– Longer than the age of the universe to search
– Proteins fold in a couple of seconds??
• Multiple-minima problem
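The size of the search space is easy to check numerically. The sketch below reproduces the 2^200 count and compares an exhaustive search with the age of the universe. The sampling rate of 10^13 conformations per second and the age of about 4.35e17 seconds (roughly 13.8 billion years) are assumptions added for illustration; neither number appears on the slides.

```python
AGE_OF_UNIVERSE_SECONDS = 4.35e17  # assumption: about 13.8 billion years


def conformation_count(n_residues, dihedrals_per_residue=2, states_per_dihedral=2):
    """Number of conformations if every dihedral has a fixed set of states."""
    return states_per_dihedral ** (dihedrals_per_residue * n_residues)


def exhaustive_search_seconds(n_residues, conformations_per_second=1e13):
    """Time to enumerate all conformations at an assumed sampling rate."""
    return conformation_count(n_residues) / conformations_per_second


count = conformation_count(100)
seconds = exhaustive_search_seconds(100)
print(f"{count:.3e} conformations")
print(f"{seconds / AGE_OF_UNIVERSE_SECONDS:.3e} times the age of the universe")
```

Even with only two states per dihedral, the enumeration time exceeds the age of the universe by many orders of magnitude, which is the content of Levinthal's paradox.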
Protein Folding
• What we DO know...
• Protein folding is FAST!!
• Typically a couple of seconds
• Folding is CONSISTENT!!
• Involves weak forces – Non-Covalent
• Hydrogen Bonding, van der Waals, Salt Bridges
• Mostly, 2-STATE systems
• VERY FEW INTERMEDIATES
• Makes it hard to study – BLACK BOX
Protein Folding
• What we DON’T know...
• Mechanism...?
• Forces...?
• Relative contributions?
• Hydrophobic Force thought to be critical
Secondary Structure Prediction
• Much simpler to predict a small set of classes than to predict 3-D coordinates of atoms.
• Amino acids have different propensities for
– (a) alpha helices,
– (b) beta sheets and
– (c) turns.
• Homology can also be used since fold is more conserved than sequence.
Book 11.2
Problem Statement
• Predicting Secondary Protein Structure from amino acid sequences
• Secondary Protein Structure: the local shape of the polypeptide chain, dependent on the hydrogen bonds between different amino acids
• In the big picture, the goal is to solve tertiary/quaternary structure in 3D. The secondary structure problem is more easily tackled, so we solve this easier problem first.
Protein Structure
Goals, Challenges, Techniques
Secondary Structure Prediction
• Given a protein sequence a1a2…aN, secondary structure prediction aims at defining the state of each amino acid ai as being either H (helix), E (extended = strand), or O (other). (Some methods have 4 states: H, E, T for turns, and O for other.)
• The quality of secondary structure prediction is measured with a “3-state accuracy” score, or Q3. Q3 is the percentage of residues whose predicted state matches “reality” (X-ray structure).
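Q3 is a per-residue accuracy, so it can be computed by comparing two state strings position by position. The sketch below assumes both strings use the H/E/O alphabet defined above and have equal length; the function name and the example strings are made up.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not observed:
        raise ValueError("state strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 8-residue example with one wrong state at position 6.
print(q3_accuracy("HHHEEOOO", "HHHEEEOO"))  # 87.5
```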
Creating a Primary-to-Secondary Structure Predictor
The Task
Given the sequence (primary structure) of a protein, predict its secondary structure.
Predict what?
• There are many types of secondary structure.
• Which do we want to predict?
– Alpha helix
– Beta strand
– Beta turn
– Random coil
– Pi-helices
– 3₁₀-helices
– Type I turns
– …
Why do it?
• Is secondary structure prediction useful?
• Short answer: yes
• Long answer:
– The original hope was to “bootstrap” from secondary to tertiary prediction; this goal remains elusive…
– Secondary structure can give clues to function, since many enzymes, DNA-binding proteins and membrane proteins have characteristic secondary structures.
How can we do it?
• How would you predict the secondary structure state of each residue (amino acid) in a protein?
• Besides the sequence itself, what else would you want to use?
• What kind of computer algorithms would help?
• ???
Types of Prediction Methods
More information online at
medicalbioinformatics.de/teaching
Thank you!
Tim Conrad
AG Medical Bioinformatics
www.medicalbioinformatics.de
Further questions