VL Algorithmische Bioinformatik (19710), WS2015/2016
Week 8 - Monday
Tim Conrad, AG Medical Bioinformatics
Institut für Mathematik & Informatik, Freie Universität Berlin

Lecture topics
Part 1: Background Basics (4)
  1. The Nucleic Acid World
  2. Protein Structure
  3. Dealing with Databases
Part 2: Sequence Alignments (3)
  4. Producing and Analyzing Sequence Alignments
  5. Pairwise Sequence Alignment and Database Searching
  6. Patterns, Profiles, and Multiple Alignments
Part 3: Evolutionary Processes (3)
  7. Recovering Evolutionary History
  8. Building Phylogenetic Trees
Part 4: Genome Characteristics (4)
  9. Revealing Genome Features
  10. Gene Detection and Genome Annotation
Part 5: Secondary Structures (4)
  11. Obtaining Secondary Structure from Sequence
  12. Predicting Secondary Structures
Part 6: Tertiary Structures (4)
  13. Modeling Protein Structure
  14. Analyzing Structure-Function Relationships
Part 7: Cells and Organisms (8)
  15. Proteome and Gene Expression Analysis
  16. Clustering Methods and Statistics
  17. Systems Biology

Tim Conrad, VL Algorithmische Bioinformatik, WS2015/2016

MUMmer: Algorithm
• Read two genomes
• Find the Maximal Unique Matches (MUMs) of the genomes using a suffix tree
• Sort and order the MUMs using LIS (longest increasing subsequence)
• Close the gaps in the alignment (SNPs, mutation regions, repeats, tandem repeats)
• Output the alignment:
  – MUMs
  – regions that do not match exactly

Suffix tree
• Used to find substrings of a string quickly
• Definition: a compact representation of all suffixes of an input string S
• Can be built in O(m) time and space, where m = |S|
• Searching for a substring X takes O(n) time, where n = |X|

Suffix Trees
Example: TORONTO$ (‘$’ is the terminating character)
[Figure: suffix tree of TORONTO$ with leaf labels 2, 0, 5, 6, 3, 1, 4]
Searching for ‘ONT’
[Figure: animation frames tracing ‘ONT’ down from the root of the suffix tree]
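The substring search on TORONTO$ can be sketched in a few lines of Python. This is a minimal illustration only: it builds a naive, uncompressed suffix trie in O(m^2) time and space, not the linear-time compact suffix tree the slides refer to, and the names `build_suffix_trie` and `find_occurrences` are made up for this example. The search itself still walks one edge per pattern character, so it takes O(n) steps for a pattern of length n.

```python
def build_suffix_trie(s):
    """Build a naive (uncompressed) suffix trie of s.

    Every node stores the start positions of all suffixes that pass
    through it. Construction is O(m^2), unlike a real suffix tree.
    """
    root = {"children": {}, "starts": []}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root


def find_occurrences(trie, pattern):
    """Return the sorted 0-based start positions of pattern, or [] if absent."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []  # the pattern leaves the trie: no occurrence
    return sorted(node["starts"])


trie = build_suffix_trie("TORONTO$")
print(find_occurrences(trie, "ONT"))  # [3]
print(find_occurrences(trie, "TO"))   # [0, 5]
```

A pattern that ends at a node with a single start position occurs exactly once in the text, which is the uniqueness condition that MUM detection relies on.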
Result of the search in TORONTO$: ‘ONT’ occurs at position 3 in S.

Maximal Unique Match
Sequences in genomes A and B that:
• occur exactly once in A and in B
• are not contained in any larger such sequence
Genome A: tcgatcGACGATCGC…AGCATAAcgact
Genome B: gcattaGACGATCGC…AGCATAAtcca

Finding, sorting MUMs
• MUM: internal node with a leaf from each genome in its subtree
• With a single scan of the suffix tree, find all MUMs
• Sort MUMs based on their position in genome A

Finding MUMs from a suffix tree
[Figure: suffix tree with MUM nodes marked]

Matching MUMs
MUM order in A: 1 2 3 4 5 6 7
MUM order in B: 1 3 2 6 4 5 7
Select the longest consistent set of MUMs occurring in the same order in A and B:
A: 1 2 4 5 7
B: 1 2 4 5 7

Choosing MUMs
• A configuration can be uniquely represented as a permutation: P = {1, 2, 3, 4, 6, 7, 5}; LIS(P) = {1, 2, 3, 4, 6, 7}
• Determining the optimal sequence of MUMs reduces to finding the LIS of P

IS Definition
• Increasing subsequence: values (strictly) increase from left to right
• Sequence P = {4, 2, 1, 5, 8, 6, 9, 10}
• Examples of two increasing subsequences: {4, 5, 9} or {2, 5, 6, 9, 10}
• The longest increasing subsequence can be found with a greedy algorithm (find a minimum cover)
• Cover of P: a set of decreasing subsequences of P that together contain all numbers of P; the size of a minimum cover equals the length of the LIS

Matching MUMs
• Sort, LIS => O(K log K) => O(N)
  – K: the number of MUMs
  – K << N / log N
  – Actually two steps: greedily finding a minimum cover in O(K log K), then finding the LIS from the cover in O(K)

Closing the Gaps
• After the global MUM alignment is found, the local gaps need to be closed
• Gap: interruption in the MUM alignment
• Types of gaps:
  – SNP (Single Nucleotide Polymorphism)
  – Insertion
  – Highly polymorphic region
  – Repeat
• How?
  – Long gaps: repeat the procedure using a shorter minimum length for MUMs
  – Short gaps: Smith-Waterman alignments

Closing the Gaps
SNP (Single Nucleotide Polymorphism):
  Genome A: cgtcatgggcgttcgtcgttg
  Genome B: cgtcatgggcattcgtcgttg
Insertion:
  Genome A: cggggtaaccgc..................cctggtcggg
  Genome B: cggggtaaccgcgttgctcggggtaaccgccctggtcggg
Highly polymorphic regions:
  Genome A: ccgcctcgcctgg.gctggcgcccgctc
  Genome B: ccgcctcgccagttgaccgcgcccgctc
Repeat sequence (imperfect repeat):
  Genome A: cTGGGTGGGACAACGTaaaaaaaaaTGGGTGGGACAACGTc
  Genome B: aTGGGTGGGACgACGTgggggggggTGGGTGGGACAACGTa

Some results from the original MUMmer paper
“Alignment of whole genomes”, Delcher et al.
[Figure panels: FASTA (top), 25mers (middle), MUMmer (bottom)]
FASTA panel: 1000 bp segments. Pairs of sequences that were at least 50% identical over 80% of the match appear as points in the plot.
Figure 7. Alignment of M. genitalium and M. pneumoniae using FASTA (top), 25mers (middle) and MUMs (bottom). In all three plots, a point indicates a ‘match’ between the genomes. In the FASTA plot a point corresponds to similar genes. In the 25mer plot, each point indicates a 25-base sequence that occurs exactly once in each genome. In the MUM plot, points correspond to MUMs as defined in the main text.

Some results
• Align two related bacteria, M. genitalium (580 kbp) and M. pneumoniae (816 kbp)
  – Time: 6.5 s suffix tree; 0.02 s finding LIS; 116 s alignments
  – Longest MUM 281 bp; 16 MUMs > 100 bp; <50% identical
• Align two highly homologous strains of M. tuberculosis, 4.4 million bp
  – Time: 5 s suffix tree construction; 45 s sorting MUMs; 5 s Smith-Waterman alignments
  – Longest MUM 24,563 bp; 249 MUMs > 5000 bp; >90% identical

Some results
Alignment of two syntenic sequences from human chromosome 12 and mouse chromosome 6 (225 kbp).
Time: 29 s in total, 1.6 s for the suffix tree. Longest MUM 117 bp; 10 MUMs > 50 bp.

MUMmer 2
Problems with MUMmer 1:
• Aligns only DNA sequences
• Needs lots of memory
• Cannot align incomplete genomes
Solution: MUMmer 2
• 3x faster than MUMmer 1
• Requires 1/3 of the space
• Can align protein sequences and incomplete genomes
• Parallel alignment
Delcher et al., Nucleic Acids Research (2002)
http://www.tigr.org/software/mummer/MUMmer2.pdf

MUMmer 2
Alternative way to find the initial exact matches:
• Identify where the query sequence would branch off from the tree, to find all matches
• Unique match: wherever a branch occurs at a tree position with just a single leaf beneath it
• Maximal match:
  – Use suffix links to find the next match (extended match)
  – By checking the character immediately preceding the start of this match, we can determine whether it is a maximal match
• Finding all maximal matches takes time proportional to the length of the query

Suffix Trees
MUMmer wants to find all maximal unique matches for all suffixes of the query. E.g., for the query ACCGTGCGTC, we want matches for:
  ACCGTGCGTC
  CCGTGCGTC
  CGTGCGTC
  GTGCGTC
  …
Up to some reasonable limit…
Idea: don’t go back to the root of the tree each time…

Suffix Trees: Suffix Links
• All internal, non-root nodes have a suffix link to another node
• If x is a single character and α is a (possibly empty) string, then the node v whose path-label (the string spelled on the path from the root to v) is xα has a suffix link to the node v’ whose path-label is α

Suffix Links
The dotted lines indicate the suffix links.
If you start at the blue node and follow the suffix links from there (from blue, to green, to first gray, to second gray), and look at the strings leading from the root to each node, you will see this:
[Figure not reproduced; source: http://stackoverflow.com/questions/10168097/how-and-when-to-create-a-suffix-link-in-suffix-tree]

Streaming algorithm - unique match
The match is unique because there is a single leaf below this position in the tree.

Streaming algorithm - maximal match
Suffix links are used to find the extended match.

MUMmer 2 Improvements
• Uses only 20 bytes per bp (MUMmer 1: 38 bytes), Kurtz (1999)
• Builds the suffix tree for the shorter sequence only
• Finds MUMs by streaming the second sequence against the suffix tree, Chang-Lawler (1994)
• Clusters the matches
To align the 4.7 Mb genome of E. coli and the 3.0 Mb large chromosome of V. cholerae (1 GHz):
            MUMmer 1    MUMmer 2
  Time      74 s        27 s
  Memory    293 MB      100 MB

New in MUMmer 2: Clustering step
• Purpose: to align an unfinished assembly which needs rearrangement
• Cluster MUMs:
  – After the matches are identified, the interval lengths between matches are checked
  – If the interval length between matches is less than a user-defined gap length, the matches are joined into a cluster
• Find the Longest Increasing Subsequence

NUCmer (NUCleotide MUMmer)
• For the finishing phase of assembly
• Multiple-contig alignment program
• Uses MUMmer 2
• Can:
  – Compare assemblies at different stages of a project
  – Compare unfinished genomes to a closely related genome (speeds up the finishing step)
  – Compare the outputs of two different assembly programs

NUCmer
Inputs: two multi-FASTA files
Output: alignment of every contig in the first file to every sequence in the second file
Algorithm:
• Create a map of
all contig positions within each file
• Concatenate the contigs in each file
• Run MUMmer to find MUMs
• Map the matches back to the separate contigs
• Cluster MUMs
• Use (modified) Smith-Waterman DP alignment to align the sequence between MUMs

PROmer
Protein-based alignment program
Input: two multi-FASTA files
Technique:
• Translate DNA into amino acids in all 6 reading frames
• Map each protein to its DNA sequence
• Concatenate all potential proteins
• Run MUMmer, cluster MUMs based on DNA coordinates
• Examine a series of consecutive, consistent matches

Campylobacter PROmer analysis
Fouts et al. (PLoS Biol. 2005) Major structural differences and novel potential virulence mechanisms from the genomes of multiple campylobacter species.
• One genome is used as the x-axis for all four pair-wise comparisons
• X-shape characteristic of collinearity interrupted by inversions around the origin or terminus of replication
• Loss of collinearity in more distant comparisons

Some results
• Align P. yoelii (5x coverage) and P. falciparum (8x coverage), size 25 Mb
  – PROmer: time < 1 h
  – BLAST: time ~ weeks
• >70% of human chromosome 14 is duplication of part of chromosome 2
• Align E. coli (4.7 Mb) and V. cholerae (3 Mb) on a 1 GHz desktop computer
  – MUMmer 1: 74 s, 293 MB
  – MUMmer 2: 27 s, 100 MB

Improvements in MUMmer 3
• Optimized suffix-tree library
• Faster and requires 25% less memory (see Kurtz et al.)
• Non-unique maximal matches
• GUI
• Now open source
• Align human vs. human genome
  – Computer: Sun SPARC, Solaris OS, 64 GB, 950 MHz
  – Size: 2,839 Mbp
  – Time: suffix tree 4.7 h (4 GB memory); query 101.5 h; total 4.5 days

Benchmarks MUMmer 2.1 vs.
3.0 (MUMmer 3.0, page 4)
[Figure: benchmark comparison, not reproduced]

Human Gut metagenome
Percent Identity Plot (PIP) of random shotgun reads to a complete Bifidobacterium genome and a good-quality draft Methanobrevibacter genome
Gill et al. (Science, 2006) Metagenomic analysis of the human distal gut microbiome.
Bifidobacteria: anaerobic bacteria. They are ubiquitous, endosymbiotic inhabitants of the gastrointestinal tract, vagina and mouth (B. dentium) of mammals, including humans. Some bifidobacteria are used as probiotics.

Mauve Multiple Genome Aligner
• Able to identify and align collinear regions of multiple genomes even in the presence of rearrangements
• Find and extend seed matches
• Group into locally collinear blocks
• Align intervening regions
Darling et al. Genome Res. 2004 Jul;14(7):1394-403.

Progressive Mauve alignment of 12 E. coli genomes
[Figure not reproduced]

The next sessions

Today
Book, sections 11.1-11.3

Proteins 101

Protein Functions
• How do proteins do so much?
• Proteins FOLD spontaneously
• Assume a characteristic 3D SHAPE
• Shape depends on the particular amino acid sequence
• Shape gives SPECIFIC function

What is protein structure?
Proteins are linear polymers that fold up by themselves…mostly.

Secondary Structure
[Figures not reproduced; sources: http://www.abcte.org/files/previews/biology/ and http://bioweb.wku.edu/courses/biol22000/3AAprotein/images/]

What are proteins made of?

The parts of a protein
[Figure: amino acid structure with the terminal H and OH labeled]
• “Backbone”: N, C, C, N, C, C…
• R: “side chain”

Two or more Amino Acids: Polypeptide

Peptide Bond

The amino acids
They can be grouped in many ways according to the chemical and physical properties (e.g. size) of the side chain. Here is one grouping based on chemical properties:
• Basic: proton acceptors
• Acidic: proton donors
• Uncharged polar: have polar groups like CONH2 or CH2OH
• Nonpolar: tend to be hydrophobic
• Weird: proline links to the N in the main chain
• Strong: cysteine can make “disulphide bridges”

What forces determine protein structure?
Minimum free energy
• Proteins tend to fold naturally to the state of minimum free energy (Christian Anfinsen).
• This state is determined by forces due to interactions among the residues.
• Proteins usually fold in an aqueous environment, so interactions with water molecules are key.
• Some proteins fold in membranes, so interactions with lipids are important.

Atomic Bonds
• Covalent bonds: strong!
  – Single bonds can usually rotate freely
  – Double bonds are rigid
• Hydrogen bonds: weak
  – Oxygen and nitrogen share a proton (hydrogen)
• Van der Waals forces: weaker still

Planar peptide bond, flexible C-alpha bonds
• Single bonds rotate
• Resonance makes peptide bonds planar
• The C-alpha bonds have two free rotation angles: phi and psi

Peptide Bonds
• Backbone can swivel: DIHEDRAL ANGLES
  – 2 per amino acid
  – Proteins can be hundreds of amino acids in length!
  – Lots of freedom of movement

Ramachandran Plots
If you plot phi vs. psi, you see that some combinations are preferred.
[Figure panels: Ideal; Real (a kinase)]

What is secondary structure?

Certain repetitive structures are energetically favorable
• These make lots of hydrogen bonds among residues.
• They don’t encounter lots of steric hindrances.
• They occur over and over again in natural proteins.
• Some combinations of secondary structures are so common they are called “folds” (e.g., the SCOP database of protein folds).

What are the primary secondary structures?
Alpha Helix
• 3.6 amino acids (residues) per turn
• O(i) hydrogen bonds to N(i+4)

Beta Sheet
[Figure panels: A. Three strands shown; B. Anti-parallel sheet; C. Parallel sheet]
• Sheets are usually curved and can even form barrels.

Beta Turns: getting around tight corners
• Steric hindrance determines whether a tight turn is possible
• R3’s side chain is usually hydrogen (R3 is glycine)

Supersecondary Structure
[Figure panels: A: beta-alpha-beta; B: beta-meander; C: Greek-key; D: Greek-key]

Tertiary Structure

Folds
• Folds are a way to classify proteins by tertiary structure
• SCOP: Structural Classification of Proteins

How is protein structure determined experimentally?

X-ray crystallography
• Needs crystallized proteins
• Hard to get crystals
• Very tough for hydrophobic (e.g. transmembrane) proteins
• Better accuracy than NMR
• Expensive: $100,000/protein

NMR spectroscopy
• Protons resonate at a frequency that depends on their chemical environment.
• This can be used to predict structure.
• Does not require crystallization; the protein may be in solution.
• Lower resolution than X-ray crystallography

Protein DataBank (PDB)
• X-ray: 84,739
• NMR: 10,223

How can protein structure be predicted in silico?
Tertiary structure prediction is still too hard
• Ab initio modeling
  – Uses primary sequence only
  – E.g., Rosetta
• Comparative modeling
  – Uses sequence alignment to a protein of known structure
  – E.g., Modeller
[Figure: Rosetta prediction]

Protein Structure

The Prediction Problem
Can we predict the final 3D protein structure knowing only its amino acid sequence?
• Studied for 4 decades
• “Holy Grail” in biological sciences
• Primary motivation for bioinformatics
• Based on this 1-to-1 mapping of sequence to structure
• Still very much an OPEN PROBLEM

PSP: Goals
• Accurate 3D structures. But not there yet.
• Good “guesses”
  – Working models for researchers
• Understand the FOLDING PROCESS
  – Get into the black box
• Only hope for some proteins
  – 25% won’t crystallize, too big for NMR
• Best hope for novel protein engineering
  – Drug design, etc.

PSP: Major Hurdles
• Energetics
  – We don’t know all the forces involved in detail
  – Too computationally expensive BY FAR!
• Conformational search impossibly large
  – 100 a.a. protein, 2 moving dihedrals per residue, 2 possible positions for each dihedral: 200 two-state angles, i.e. 2^200 (about 1.6 x 10^60) conformations!
• Levinthal’s Paradox
  – Longer than the age of the universe to search
  – Proteins fold in a couple of seconds??
• Multiple-minima problem

Protein Folding
• What we DO know...
  – Protein folding is FAST!!
    Typically a couple of seconds
  – Folding is CONSISTENT!!
  – Involves weak, non-covalent forces
    Hydrogen bonding, van der Waals, salt bridges
  – Mostly 2-STATE systems
    VERY FEW INTERMEDIATES
    Makes it hard to study: BLACK BOX

Protein Folding
• What we DON’T know...
  – Mechanism...?
  – Forces...?
    Relative contributions?
    Hydrophobic force thought to be critical

Secondary Structure Prediction (Book, section 11.2)
• Much simpler to predict a small set of classes than to predict 3-D coordinates of atoms.
• Amino acids have different propensities for
  (a) alpha helices,
  (b) beta sheets and
  (c) turns.
• Homology can also be used, since fold is more conserved than sequence.

Problem Statement
• Predicting secondary protein structure from amino acid sequences
• Secondary protein structure: the local shape of the polypeptide chain, which depends on the hydrogen bonds between different amino acids
• In the big picture, the goal is to solve tertiary/quaternary structure in 3D. The secondary structure problem is easier to tackle and serves as a first step toward that goal.

Protein Structure

Goals, Challenges, Techniques

Secondary Structure Prediction
• Given a protein sequence a1a2…aN, secondary structure prediction aims at defining the state of each amino acid ai as being either H (helix), E (extended = strand), or O (other). Some methods have 4 states: H, E, T for turns, and O for other.
• The quality of secondary structure prediction is measured with a “3-state accuracy” score, or Q3. Q3 is the percentage of residues whose predicted state matches “reality” (the X-ray structure).

Creating a Primary-to-Secondary Structure Predictor

The Task
Given the sequence (primary structure) of a protein, predict its secondary structure.

Predict what?
• There are many types of secondary structure.
• Which do we want to predict?
  – Alpha helix
  – Beta strand
  – Beta turn
  – Random coil
  – Pi-helices
  – 3₁₀-helices
  – Type I turns
  – …

Why do it?
• Is secondary structure prediction useful?
• Short answer: yes
• Long answer:
  – The original hope was to “bootstrap” from secondary to tertiary prediction; this goal remains elusive…
  – Secondary structure can give clues to function, since many enzymes, DNA-binding proteins and membrane proteins have characteristic secondary structures.

How can we do it?
• How would you predict the secondary structure state of each residue (amino acid) in a protein?
• Besides the sequence itself, what else would you want to use?
• What kind of computer algorithms would help?
• ???

Types of Prediction Methods

More information online at medicalbioinformatics.de/teaching

Thank you!
Tim Conrad
AG Medical Bioinformatics
www.medicalbioinformatics.de

Further questions
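
Worked example: the Q3 score
As a small illustration of the Q3 score defined in the Secondary Structure Prediction slide, the sketch below computes the percentage of residues whose predicted state equals the observed one. The function name `q3_accuracy` and the two state strings are invented for this example; they are not taken from the lecture or from any real protein.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose predicted state (H, E or O) matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 13-residue example: 11 of the 13 states agree.
observed = "OHHHHHOOEEEEO"
predicted = "OHHHHOOOEEEOO"
print(f"Q3 = {q3_accuracy(predicted, observed):.1f}%")
```

Q3 treats all residues equally, so a predictor that shortens a helix by one residue at each end is penalized only for those end residues.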