Introduction
based on Chapter 1 of Lesk, Introduction to Bioinformatics
Michael Schroeder, BioTechnological Center, TU Dresden (Biotec)

Contents
- Molecular biology primer
- The role of computer science
- Phylogeny
- Sequence searching
- Protein structure
- Clinical implications
- Read chapter 1
By Michael Schroeder, Biotec

26 June 2000: Draft of human genome sequenced!
- 1953: Watson and Crick discover the structure of DNA
- 2000: Draft of human genome is published
- “The most wondrous map ever produced by human kind”
- “One of the most significant scientific landmarks of all time, comparable with the invention of the wheel or the splitting of the atom”

High-throughput biomedicine
- Microarrays
  - Measure activity of thousands of genes at the same time
  - Example: Cancer
    - Compare activity with and without drug treatment
  - Result: Hundreds of candidate drug targets
- RNAi (Nobel Prize 2006, Fire and Mello)
  - Knock down genes and observe effect
  - Example: Infectious diseases
    - Which proteins orchestrate entry into cell?
  - Result: Hundreds of candidate proteins
- Atomic force microscopes (Nobel Prize winner Binnig)
  - Pull protein out of membrane and measure force
  - Example: Eye diseases resulting from misfolding
  - Result: Hundreds of candidate residues

Drug Discovery
- Challenge: Longer time to market, fewer drugs, exploding costs
- Approach: Use of compound libraries and high-throughput screening

HTS and Bioinformatics
- High-throughput technologies have completely changed the work of biomedical researchers
- Challenge: Interpret (often large) results of screens
- Approach: Before running secondary assays use bioinformatics and IT to assemble all possible information

Good News
- >16,000,000 articles (chart: number of PubMed abstracts by year)
- >1,000,000 sequences
- >30,000 3D structures
- >700 DBs/tools (chart: number of data sources by year, Molecular Biology Database List at Nucleic Acids Research)

Bad News: Data != Knowledge
- How to analyse data, how to integrate data?
- Computer science to the rescue…

Example: Computer science is key for sequencing
- Human genome is a string of length 3,200,000,000
- Shotgun sequencing: Break multiple copies of string into shorter substrings
- Example:
  - shotgunsequencing shotgunsequencing shotgunsequencing
  - cing en encing equ gun ing ns otgu seq sequ sh sho shot tg uenc un
- Computing problem: Assemble strings

Computer science key for sequencing
- sh, sho, shot, otgu, tg, gun, un, ns, seq, sequ, equ, uenc, encing, en, cing, ing
- QUESTION: How can you handle long repetitive sequences? Heeeeelllllllllllooooooo
- QUESTION: Why was a draft announced? When was the final version ready?
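The "assemble strings" problem above can be sketched as a greedy overlap merge. This is only a minimal illustration, not how production assemblers work: the function names `overlap` and `greedy_assemble` are made up for this sketch, and the four longer fragments in the usage example are hypothetical (they are not the fragments on the slide).

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str]) -> str:
    """Greedy heuristic: repeatedly merge the pair with the longest overlap."""
    # Remove duplicates, then drop fragments fully contained in a longer one.
    unique = sorted(dict.fromkeys(fragments), key=len, reverse=True)
    reads: list[str] = []
    for frag in unique:
        if not any(frag in kept for kept in reads):
            reads.append(frag)

    while len(reads) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:  # ties keep the first pair found
                        best_k, best_i, best_j = k, i, j
        # Glue the two reads together, writing the shared part only once.
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads[0] if reads else ""


# Hypothetical fragments with long, unambiguous overlaps.
print(greedy_assemble(["shotgun", "tgunseq", "sequenc", "uencing"]))
# prints "shotgunsequencing"
```

With the very short fragments from the slide, several overlaps are only one letter long and tie with each other, so the greedy choice becomes arbitrary and the result may not be the original string. That is a small-scale version of the repeat problem raised in the first QUESTION.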
Yersinia pestis, Arabidopsis thaliana, Buchnera sp. APS, Caenorhabditis elegans, Campylobacter jejuni, Helicobacter pylori, rat, Chlamydia pneumoniae, Mycobacterium leprae, Rickettsia prowazekii, mouse, Aquifex aeolicus, Vibrio cholerae, Drosophila melanogaster, Neisseria meningitidis Z2491, Plasmodium falciparum, Saccharomyces cerevisiae, Salmonella enterica, Archaeoglobus fulgidus, Borrelia burgdorferi, Bacillus subtilis, Mycobacterium tuberculosis, Escherichia coli, Thermoplasma acidophilum, Pseudomonas aeruginosa, Ureaplasma urealyticum, Thermotoga maritima, Xylella fastidiosa

Breakthrough of the year 2000
- Next quest: Sequencing a genome for $1000

Quantity and quality of data lead to ambitious goals
- Understand integrative aspects of the biology of organisms
- Interrelate sequence, three-dimensional structure, interactions, function of proteins, nucleic acids and protein-nucleic acid complexes
- Travel in time
  - backward (deduce events in evolutionary history) and
  - forward (deliberate modification of biological systems)
- Applications in medicine, agriculture, and other scientific fields

Scenario
- New virus (e.g.
SARS) and goal to develop treatment
- Scientists isolate genetic material of virus
- Screen genome for relationships with previously studied viruses [10]
- From virus’ DNA they compute the proteins it produces [1]
- Compute proteins’ three-dimensional structure and thereby obtain clues about their functions
  - Screen for similar protein sequences with known structure [15]
  - If any are found
    - Then interpret difference (homology modelling) [25]
    - Else predict structure from sequence [55]
- Identify or design small molecule blocking relevant active sites of the protein [50]
- Design antibodies to neutralize the virus [50]
- Index of problem difficulty:
  - <30: solution exists already,
  - >30: we cannot solve this (yet)

Life in Time and Space
- Life
  - A biological organism is a naturally-occurring, self-reproducing device that effects controlled manipulations of matter, energy and information
- Time
  - Species evolve through
    - natural mutation,
    - recombination of genes in sexual reproduction, or
    - direct gene transfer
  - Read the past in contemporary genomes
- Space
  - Species occupy local ecosystems
  - Species are composed of organisms
  - Organisms are composed of cells
  - Cells are composed of molecules

DNA – the molecule of life
Source: http://www.ornl.gov/hgmis

Proteins
- 20 naturally occurring amino acids in proteins
- Non-polar
  - G glycine, A alanine, P proline, V valine
  - I isoleucine, L leucine, F phenylalanine, M methionine
- Polar
  - S serine, C cysteine, T threonine, N asparagine
  - Q glutamine, H histidine, Y tyrosine, W tryptophan
- Charged
  - D aspartic acid, E glutamic acid, K lysine, R arginine
- Other classification
  - H, F, Y, W are aromatic and play a role in membrane proteins
- Distinguish
  - atg = adenine-thymine-guanine and
  - ATG = Alanine-Threonine-Glycine

The genetic code
First position                Second position                Third position
(5' end)        T          C          A          G           (3' end)
T               TTT Phe    TCT Ser    TAT Tyr    TGT Cys     T
                TTC Phe    TCC Ser    TAC Tyr    TGC Cys     C
                TTA Leu    TCA Ser    TAA Stop   TGA Stop    A
                TTG Leu    TCG Ser    TAG Stop   TGG Trp     G
C               CTT Leu    CCT Pro    CAT His    CGT Arg     T
                CTC Leu    CCC Pro    CAC His    CGC Arg     C
                CTA Leu    CCA Pro    CAA Gln    CGA Arg     A
                CTG Leu    CCG Pro    CAG Gln    CGG Arg     G
A               ATT Ile    ACT Thr    AAT Asn    AGT Ser     T
                ATC Ile    ACC Thr    AAC Asn    AGC Ser     C
                ATA Ile    ACA Thr    AAA Lys    AGA Arg     A
                ATG Met*   ACG Thr    AAG Lys    AGG Arg     G
G               GTT Val    GCT Ala    GAT Asp    GGT Gly     T
                GTC Val    GCC Ala    GAC Asp    GGC Gly     C
                GTA Val    GCA Ala    GAA Glu    GGA Gly     A
                GTG Val    GCG Ala    GAG Glu    GGG Gly     G

Protein Structure
- DNA:
  - Nucleotides are very similar and hence the structure of DNA is very uniform
- Proteins:
  - Great variety in three-dimensional conformation to support diverse structure and functions
  - If heated, protein “unfolds” to biologically-inactive structure; in normal conditions protein folds

Paradox
- Translation from DNA sequence to amino acid sequence
  - is very simple to describe,
  - but requires immensely complicated machinery (ribosome, tRNA)
- The folding of the protein sequence into its three-dimensional structure
  - is very difficult to describe,
  - but occurs spontaneously

Central Dogma
- DNA sequence determines protein sequence
- Protein sequence determines protein structure
- Protein structure determines protein function

Observables and Data Archives
- Databases in molecular biology cover
  - Nucleic acid and protein sequences,
  - Macromolecular structures and functions
- Archival databanks of biological information
  - DNA and protein sequences including annotations
  - Nucleic acid and protein structures including annotations
  - Protein expression patterns
- Derived databases
  - Sequence motifs (“signatures” of protein families)
  - Mutations and variants in DNA and protein sequences
  - Classification or relationships (e.g.
hierarchy of structures)
- Bibliographic databases (PubMed with 17M abstracts)
- Collections
  - of links to web sites
  - of databases

What is Bioinformatics
- Bioinformatics is the marriage of biology and information technology
- Bioinformatics is an integrated multidisciplinary field
- Covers computational tools and methods for managing, analysing and manipulating sets of biological data
- Disciplines include:
  - biochemistry, genetics, structural biology, artificial intelligence, machine learning, software engineering, statistics, database theory, information visualisation, algorithm design

Bioinformatics
- Has three components
  - Creation of databases
  - Development of algorithms to analyse data
  - Use of these tools for analysing biological data

Databases: Types of Queries 1/2
- 1. Given a sequence (fragment), find sequences in the database that are similar to it
- 2. Given a protein structure (or fragment), find protein structures in the database that are similar to it
- 3. Given sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures
- 4. Given a protein structure, find sequences in the database that correspond to similar structures

Databases: Given sequence, find structure
- 3. Given sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures. But how?
  - Easy: Find similar sequences with known structure!
  - But: There might be similar structures whose sequence is not similar!
- 4. Given a protein structure, find sequences in the database that correspond to similar structures. But how?
  - Easy: Find similar structures and hence sequences
  - But: There are so many more sequences with unknown structure that the above method will have only very limited success
- 1 and 2 are solved, 3 and 4 are active fields of research

Databases: Types of Queries 2/2
- E.g. for which proteins of known structure involved in diseases of disrupted purine biosynthesis in humans are there related proteins in yeast?
- Solution: Virtual databases that provide transparent access to a number of underlying data sources and query and analysis tools

Databases: Curation and Quality
- Problems:
  - Given that there are primary and secondary databases,
    - how to control updates,
    - how to propagate change,
    - how to maintain consistency?
  - Contents (experimental results, annotations, supplementary information) all have their own sources of error
  - Older data were limited by older techniques

Databases: Annotation
- Experimental data (e.g. raw DNA sequence) needs to be enriched with annotations
  - Source of data
  - Investigators responsible
  - Relevant publication
  - Feature tables (e.g. coding regions)
- Problems:
  - (often) lack of controlled and coherent vocabulary
  - Computer parseable
  - Automated annotation needed
    - SwissProt = ca. 540,000 annotated sequences
    - TrEMBL = ca. 40 million unannotated sequences
  - Maintenance of annotations (what if error detected?)

Computers and Computer Science
- Relevant areas:
  - Artificial Intelligence
  - Machine Learning
    - Neural networks, rule-based learning
  - Data mining
    - Association rules
  - Software Engineering
    - Design, implementation, testing of software
  - Programming
    - Object-oriented: C++, Java
    - Imperative: C, Modula, Pascal, Cobol, Fortran
    - Logic: Prolog
    - Functional: ML
    - Scripting: Perl, Python
  - Statistics
  - Database theory
    - Design and maintenance of databases
    - How to index sequences, time series, 3D structures
  - Information Visualisation
    - Graph drawing, diagrams, cartoons, 3D graphics
  - Algorithm design
    - Complexity of algorithms
    - Efficient data structures

Programming
- We will use Python
  - Scripting language
  - Supports string processing well
  - Widely used in bioinformatics

Biological Classification and Nomenclature
- Back in the 18th century, Linnaeus, a Swedish naturalist, classified living things according to a hierarchy: Kingdom, Phylum, Class, Order, Family, Genus, Species
- Generally only genus and species are used for identification
  - Homo sapiens
  - Drosophila melanogaster
  - Bos taurus
- Linnaeus’ classification is based on observed similarity
- Widely reflects biological ancestry

Classification of Humans and Fruit Flies
          Human       Fruit fly
Kingdom:  Animalia    Animalia
Phylum:   Chordata    Arthropoda
Class:    Mammalia    Insecta
Order:    Primates    Diptera
Family:   Hominidae   Drosophilidae
Genus:    Homo        Drosophila
Species:  sapiens     melanogaster

Homology = derived from common ancestor
- Characteristics derived from a common ancestor are called homologous
  - E.g. eagle’s wing and human’s arm
- Other apparently similar characteristics may have arisen independently by convergent evolution
  - E.g. eagle’s wing and bee’s wing. The most recent common ancestor of eagles and bees did not have wings
- Homologous characters may diverge functionally
  - E.g.
bones in the human middle ear and jaws of primitive fish

Sequence analysis and Homology
- Sequence analysis gives unambiguous evidence for relationship of species
- For higher organisms sequence analysis and the classical tools of comparative anatomy, palaeontology, and embryology are often consistent
- For microorganisms there are problems
  - Classical methods: how to describe features
  - Sequence analysis: lateral gene transfer

Domains of Life
- Ribosomal RNA is present in all organisms
- Based on small-subunit ribosomal RNA (16S/18S), life is divided into
  - Bacteria
    - No nucleus (prokaryote)
    - E.g. Mycobacterium tuberculosis and E. coli
  - Archaea
    - No nucleus (prokaryote)
    - Few organisms, living in hostile environments (thermophiles, halophiles, sulphur reducers, methanogens)
  - Eukarya
    - Has a nucleus contained in a membrane
    - Nucleus contains chromosomes
    - Internal compartments called organelles for specialised biological processes
    - Area outside nucleus and organelles called cytoplasm
    - E.g. yeast and human beings

Eukaryotic cell

Domains of Life

Example: Use of sequences to determine phylogenetic relationships
- Use ExPASy (www.expasy.ch) to search for pancreatic ribonuclease for horse (Equus caballus), minke whale (Balaenoptera acutorostrata), red kangaroo (Macropus rufus)

>sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTF
VHEPLADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKY
PNCAYQTSQKERHIIVACEGNPYVPVHFDASVEVST

- Use sequence alignment to determine evolutionary relationship

Sequence alignment
1. Global match: align all of one with all of the other sequence (mismatches, insertions, deletions)

And.--so,.from.hour.to.hour.we.ripe.and.ripe
||||    ||||||||||||||||||||||||   ||||||
And.then,.from.hour.to.hour.we.rot-.and.rot-

2.
Local match: find region in one sequence that matches the other (mismatches, insertions, deletions; ends can be ignored)

  My.care.is.loss.of.care,.by.old.care.done,
    |||||||||    |||||||||||||   |||||| ||
Your.care.is.gain.of.care,.by.new.care.won

Sequence alignment
3. Motif search: find matches of short sequence in long sequence
Option: perfect, 1 mismatch, mismatches+gaps+insertions+deletions

        match
         ||||
for the watch to babble and to talk is most tolerable

Sequence alignment
4. Multiple sequence alignment

No.sooner.---met.--------.but.they.look’d
No.sooner.look’d.--------.but.they.lo-v’d
No.sooner.lo-v’d.--------.but.they.sigh’d
No.sooner.sigh’d.--------.but.they.--asked.one.another.the.reason
No.sooner.knew.the.reason.but.they.-------------sought.the.remedy
No.sooner.               .but.they.

Example: Multiple alignment
- Use sequence alignment to determine evolutionary relationship…
- Example: horse, whale and kangaroo
- Expected: horse and whale are placental mammals, kangaroo is a marsupial
- Multiple alignment with CLUSTAL-W (http://www.genome.jp/tools/clustalw)
  - multiple sequence alignment computer program
  - main parameters: gap opening/extension penalty

FASTA format
>sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ
KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF
DASVEVST
>sp|P00673|RNP_BALAC Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Balaenoptera acutorostrata (Minke whale) (Lesser rorqual).
RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ
KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF
DNSV
>sp|P00686|RNP_MACRU Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Macropus rufus (Red kangaroo) (Megaleia rufa).
ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQE
NVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEGQYVPVHFDA
YV

Multiple Alignment with ClustalW (http://www.genome.jp/tools/clustalw)

CLUSTAL W (1.82) multiple sequence alignment

sp|P00674|RNP_HORSE  KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ 60
sp|P00673|RNP_BALAC  RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ 60
sp|P00686|RNP_MACRU  -ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ 59
                      *:** **:*****:  :......*** **  *.**.* ***:***:**.   *.*:* *

sp|P00674|RNP_HORSE  KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF 120
sp|P00673|RNP_BALAC  KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF 120
sp|P00686|RNP_MACRU  ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF 118
                     :*: ****::***:*.* : **:** *..****** *:**: :::*******  ******

sp|P00674|RNP_HORSE  DASVEVST 128
sp|P00673|RNP_BALAC  DNSV---- 124
sp|P00686|RNP_MACRU  DAYV---- 122
                     *  *

Example: Number of Aligned Residues
Horse and minke whale:         95
Minke whale and red kangaroo:  82
Horse and red kangaroo:        75
Conclusion: Horse and whale share the most identical residues

New Example: Elephant and Mammoth
- Mitochondrial cytochrome b from
  - Siberian woolly mammoth (Mammuthus primigenius) preserved in arctic permafrost
  - African elephant (Loxodonta africana)
  - Indian elephant (Elephas maximus)
- Q: To which one is the mammoth more closely related?
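The residue counts quoted for horse, whale and kangaroo above can be reproduced from the aligned rows of the ClustalW output. The following is a minimal sketch: the helper name `identical_residues` and the dictionary `ALIGNED` are made up for illustration, and the sequences are the three aligned blocks above joined together, with '-' marking a gap.

```python
from itertools import combinations

# Aligned pancreatic ribonuclease sequences ('-' is a gap column).
ALIGNED = {
    "horse": (
        "KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ"
        "KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF"
        "DASVEVST"
    ),
    "whale": (
        "RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ"
        "KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF"
        "DNSV----"
    ),
    "kangaroo": (
        "-ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ"
        "ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF"
        "DAYV----"
    ),
}


def identical_residues(a: str, b: str) -> int:
    """Count alignment columns where both rows show the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # A column with a gap in both rows is not counted as a match.
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


for (name_a, seq_a), (name_b, seq_b) in combinations(ALIGNED.items(), 2):
    print(f"{name_a} / {name_b}: {identical_residues(seq_a, seq_b)}")
# horse / whale: 95
# horse / kangaroo: 75
# whale / kangaroo: 82
```

Counting identities like this only works on rows that are already aligned; the gap placement itself comes from the alignment program and its gap penalties.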

African elephant: sp|P24958|CYB_LOXAF
Mammoth:          sp|P92658|CYB_MAMPR
Indian elephant:  sp|O47885|CYB_ELEMA

sp|P24958|CYB_LOXAF  MTHIRKSHPLLKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
sp|P92658|CYB_MAMPR  MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
sp|O47885|CYB_ELEMA  MTHTRKFHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
                     *** ** ***:**:**********************************************

sp|P24958|CYB_LOXAF  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
sp|P92658|CYB_MAMPR  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
sp|O47885|CYB_ELEMA  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
                     ************************************************************

sp|P24958|CYB_LOXAF  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
sp|P92658|CYB_MAMPR  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFA 180
sp|O47885|CYB_ELEMA  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
                     **************************************:*********************

sp|P24958|CYB_LOXAF  LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
sp|P92658|CYB_MAMPR  LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILFLL 240
sp|O47885|CYB_ELEMA  FHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
                     :********:***********************************************:**

sp|P24958|CYB_LOXAF  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
sp|P92658|CYB_MAMPR  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
sp|O47885|CYB_ELEMA  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSILI 300
                     ******************************************************:*****

sp|P24958|CYB_LOXAF  LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYIIIGQMASILYFS 360
sp|P92658|CYB_MAMPR  LGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIIIGQMASILYFS 360
sp|O47885|CYB_ELEMA  LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEHPYIIIGQMASILYFS 360
                     **:*************************: *** **********:***************

sp|P24958|CYB_LOXAF  IILAFLPIAGVIENYLIK 378
sp|P92658|CYB_MAMPR  IILAFLPIAGMIENYLIK 378
sp|O47885|CYB_ELEMA  IILAFLPIAGMIENYLIK 378
                     **********:*******

Example: Elephant and Mammoth
- Mammoth and African elephant have 10 mismatches, mammoth and Indian elephant 14. Significant?
- Q1: Can we tell from these sequences alone that they are closely related?
- Q2: Differences are small. Do they come from selection, random noise or drift?
- Strategies are needed for judging differences and similarities

Excursion: Similarity and Homology
- Important difference:
  - Similarity is the measurement of resemblance of sequences
  - Homology: common ancestor
- Similarity is gradual, homology is either true or false
- Similarity = now, homology = past events
- Homology is only very rarely directly observed (e.g. lab population, clinical study of viral infection)
- Homology is inferred from sequence similarity

Example: Homology/Similarity
- The assertion that the cytochrome b sequences are homologues means that there is a common ancestor
- BUT:
  1. Maybe cytochrome b functionally requires so many conserved residues that it will occur in this form in many species (in fact, this is not the case here)
  2. Maybe cytochrome b has to function this way in elephant-like species, but in fact started out from different ancestors (i.e. convergent evolution)
  3. Maybe mammoth and African elephant have fewer mismatches only because the Indian elephant’s DNA mutated faster
  4. Maybe all of them acquired cytochrome b through a virus (horizontal gene transfer)
- If the mammoth and elephant sequences are homologues, are the ribonuclease sequences homologues as well? There the difference is much bigger

Examples: Conclusion
- Classical methods confirm that for pancreatic ribonuclease (horse, whale, kangaroo) inferring homology from similarity is justified
- But whether the mammoth is closer to the African or the Indian elephant is too close to call (non-significant)
- Problems with inferring phylogeny from gene and protein sequence comparison
  - Wide range of variation (possibly below statistical significance)
  - Different rates of evolution for different branches of the evolutionary tree
  - Even if there is a relationship: which sequence came first?
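As a rough worked example of why the elephant question is "too close to call", the mismatch counts quoted earlier (10 and 14) can be converted into percent identity over the 378 aligned residues. The helper `percent_identity` is a name chosen for this sketch, and it assumes every mismatch is a single substituted column with no gaps.

```python
def percent_identity(length: int, mismatches: int) -> float:
    """Share of aligned positions that are identical, in percent."""
    return 100.0 * (length - mismatches) / length


SEQUENCE_LENGTH = 378  # residues in each cytochrome b sequence of the alignment

for pair, mismatches in [
    ("mammoth vs. African elephant", 10),
    ("mammoth vs. Indian elephant", 14),
]:
    print(f"{pair}: {percent_identity(SEQUENCE_LENGTH, mismatches):.2f}% identical")
# mammoth vs. African elephant: 97.35% identical
# mammoth vs. Indian elephant: 96.30% identical
```

The two values differ by about one percentage point, which on its own says nothing about whether the difference is statistically meaningful.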

Inferring Phylogenies with SINES and LINES
- Phylogeneticist’s dream of features:
  - ‘All-or-none’ character
  - Irreversible appearance
- Solution: SINES and LINES (Short and Long Interspersed Nuclear Elements)
  - Repetitive, non-coding sequences in eukaryotic genomes
  - >30% in human genome, >50% in some plants
  - SINES: 70-500 base pairs long, up to 10^6 copies
  - LINES: up to 7000 base pairs, up to 10^5 copies
  - They enter the genome by reverse transcription of RNA

A practical example: Fatherhood
The picture shows a Southern blot of DNA from different family members, probed using a mini-satellite. You can work out which of F1 and F2 is the father of child C by observing which bands they have in common. (Reproduced from "Essential Medical Genetics" by M. Connor and M. Ferguson-Smith, with permission from Blackwell Science.)

Why SINES are useful in phylogeny
- Either present or absent
- Inserted at random in the non-coding portion of the genome
  - i.e. a SINE has no important function, so that convergent evolution can be excluded
  - Presence of a SINE in two species and absence in a third implies that the first two species are more closely related
- SINE insertion appears to be irreversible
  - Temporal order
  - Presence of a SINE in two species and absence in a third implies that the ancestor of the first two species is younger than the ancestor of all three

Example revisited
- Q: What is the closest land-based relative of the whales?
- Classical palaeontology links Cetacea (whales, dolphins, porpoises) with Artiodactyla (including e.g. cattle)
- Belief that Cetaceans diverged before Artiodactyla split into the suborders Suiformes (e.g. pigs), Tylopoda (e.g. camels, llamas), Ruminantia (e.g.
deer, cattle, goats, sheep, antelopes, giraffe)

Example revisited
- Sequence comparison results
  - Based on mitochondrial DNA, pancreatic ribonuclease, fibrinogen, and others
  - Closest relatives of whales are hippopotamuses (share 4 SINES)
  - These two are closest to Ruminantia

Searching for Similar Sequences with PSI-BLAST
- Any search method for sequences should be
  - Sensitive: pick up distant relationships
  - Selective: reported relationships are true
- Example: database with (among others) 1000 globin sequences
  - Globin family (oxygen transport) of proteins occurs in many species
  - Proteins have same function and structure
  - But there are pairs of members of the family sharing less than 10% identical residues
- Figure: sequence database with 1000 globin sequences, 900 search results
  - True positives: 700 out of 900 are really globins
  - False positives: 200 out of 900 are not globins
  - False negatives: 300 out of 1000 are not found

Searching for Distant Relationships with PSI-BLAST
- How can we find distant relationships without increasing the false positives?
- PSI-BLAST: Position-Specific Iterated Basic Local Alignment Search Tool
  - Identifies conserved patterns within the sequences
  - Improves sensitivity and selectivity
- Score via intermediaries may be better than score from direct comparison
  - Figure: A and B 50%, B and C 50%, but A and C only 10%

PSI-BLAST Example
- Human PAX-6 gene (SwissProt ID P26367) has homologues in many different species (human, Drosophila, etc.)
- Transcription factor (TF) for eye development
- Mutations in:
  - Human: no or deformed iris
  - Drosophila: no eyes; if expressed in wing or leg: ectopic eyes
- PSI-BLAST at NCBI site (www.ncbi.nlm.nih.gov)

Result
- Description of sequence
- Max score: linked to data that show where sequences match
- Total score: includes scores from non-contiguous portions of the subject sequence that match the query
- Query coverage
- Identity: the highest percentage of identical residues
- E-value
- Accession number: linked to GenBank record

Result
BLASTP 2.2.28+
RID: 6D2U321501N
Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
33,121,465 sequences; 11,555,699,950 total letters
Query= gi|6174889|sp|P26367.2|PAX6_HUMAN RecName: Full=Paired box protein Pax-6; AltName: Full=Aniridia type II protein; AltName: Full=Oculorhombin
Length=422

Sequences producing significant alignments:                        Score (Bits)  E Value
ref|NP_000271.1| paired box protein Pax-6 isoform a [Homo sap...   870           0.0
ref|XP_004264012.1| PREDICTED: paired box protein Pax-6 isofo...   869           0.0
ref|XP_003910122.1| PREDICTED: paired box protein Pax-6 isofo...   869           0.0
ref|XP_004683008.1| PREDICTED: paired box protein Pax-6 isofo...   869           0.0
ref|XP_005064880.1| PREDICTED: paired box protein Pax-6 isofo...   868           0.0
ref|NP_001035735.1| paired box protein Pax-6 [Bos taurus] >re...   868           0.0
gb|AAA59962.1| oculorhombin [Homo sapiens]                         868           0.0
ref|NP_037133.1| paired box protein Pax-6 [Rattus norvegicus]...   868           0.0
gb|EAW68233.1| paired box gene 6 (aniridia, keratitis), isofo...   869           0.0
...

Introduction to Protein Structure
- Proteins play a variety of roles:
  - Structural (viral coat proteins, horny outer layer of human and animal skin, cytoskeleton)
  - Catalysis of chemical reactions (enzymes)
  - Transport and storage (e.g. haemoglobin)
  - Regulation (e.g.
hormones)
  - Receptor and signal transduction
  - Genetic transcription
  - Recognition (cell adhesion molecules)
  - Antibodies and other proteins of the immune system

Proteins
- Are large molecules
- Only a small part, the active site, is functional
- Evolve by structural changes produced by mutations in the amino acid sequence
- Ca. 21,000 human protein structures are now known
- Overall 90,000 protein structures in PDB
- Can be obtained by X-ray crystallography or nuclear magnetic resonance (NMR)

Structure of Proteins
- Backbone and side chain

Residue i-1, Residue i, Residue i+1

   S(i-1) S(i)   S(i+1)      Side chain (variable)
   |      |      |
…N-Cα-C-N-Cα-C-N-Cα-C-…      Main chain (constant)
      ||     ||     ||
      O      O      O

- Polypeptide chain folds into a curve in space
- Common structural features
  - Alpha-helix
  - Beta-sheet
  - Turns and loops

Hierarchy of Architecture
- Primary structure: Amino acid sequence
- Secondary structure: Helices, sheets, loops, hydrogen-bonding pattern of main chain
- Tertiary structure: Assembly and interactions of helices, sheets, etc.
- Quaternary structure: Assembly of monomers
- Evolution can merge proteins
  - E.g.: 5 enzymes in E.
coli = 1 protein in the fungus Aspergillus nidulans; they catalyze successive steps in the biosynthesis of aromatic amino acids
  - E.g.: Globins form tetramers in mammalian haemoglobin and dimers in the ark clam Scapharca inaequivalvis

Protein Structure
- DHAP to GAP in glycolysis
- Triosephosphate isomerase from Bacillus stearothermophilus
- Highly efficient enzyme appearing in most species

Extra layer of Architecture: supersecondary structure
- Alpha-helix hairpin
- Beta hairpin
- Beta-alpha-beta unit
- = Patterns of interaction between helices and sheets

Hierarchy of Architecture
- Supersecondary structures:
  - Alpha-helix hairpin
  - Beta hairpin
  - Beta-alpha-beta unit
- Domains:
  - Compact unit, single chain, independent stability
- Modular proteins:
  - Multi-domain
  - Copies of related domains or “mix-and-match”

Classification of Protein Structure
- All Alpha: mostly alpha helices
- All Beta: mostly beta sheets
- Alpha+Beta: Helices and sheets in different parts of the molecule, no beta-alpha-beta units
- Alpha/Beta: Helices and sheets assembled from beta-alpha-beta units
  - Alpha/Beta linear
  - Alpha/Beta barrel
- Little or no secondary structure

SCOP: Structural Classification of Proteins
- top
- CLASS: All alpha (284), All Beta (174), Alpha+Beta (376), Alpha/Beta (147)
- FOLD: Trypsin-like serine proteases (1), Immunoglobulin-like (23)
- SUPERFAMILY = evolutionary related, similar structure, not necessarily similar sequence: Transglutaminase (1), Immunoglobulin (6)
- FAMILY = set of domains with similar sequence: C1 set domains (antibody constant), V set domains (antibody variable)

Pymol
- Engrailed homeodomain (1enh)
  - Transcription factor important in development
  - Used to study protein folding
- Utrophin calmodulin homology domain (1bhd)
  - Actin binding
  - Closely related to dystrophin, whose lack causes muscular dystrophies (weak
muscles)
- Cytochrome c, rice (1ccr)
  - Electron transport across mitochondrial membrane
- DNA-binding domain of HIN recombinase (1hcr)
- Engrailed homeodomain (1enh)
- Fibronectin III domain (1fna)
  - Found on cell surface
- Mannose-binding protein (1npl)
- Barnase (1brn)
  - Cleaves RNA and is lethal if intracellular and not inhibited by barstar
- TATA-box-binding protein (1cdw)
- OB-domain from Lys-tRNA synthetase (1bbw)
- Scytalone dehydratase (3std)
- Alcohol dehydrogenase, NAD-binding domain (1ee2)
  - Breakdown of alcohol into simpler compounds
- Adenylate kinase (3adk)
  - Energy production
- Chemotaxis receptor methyltransferase (1af7)
- Thiamine phosphate synthase (2tps)
- Pancreatic spasmolytic polypeptide (2psp)

Protein Structure Prediction and Engineering
- If the sequence of amino acids contains enough information to specify the three-dimensional structure of proteins, it should be possible to devise an algorithm for prediction
- Secondary structure prediction:
  - Which segments of the sequence are helices, which strands?
- Fold recognition:
  - Given a library of known structures with their sequences and a sequence with unknown structure, can we find the structure that is most similar?
- Homology modelling
  - Given two homologous sequences, one with and one without structure.
  - If between 30 and 50% of the residues are identical, the structure can serve as a model

Critical Assessment of Structure Prediction (CASP)

Chicken lysozyme          KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGS
Baboon alpha-lactalbumin  KQFTKCELSQNLY--DIDGYGRIALPELICTMFHTSGYDTQAIVEND-ES

Chicken lysozyme          TDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVS
Baboon alpha-lactalbumin  TEYGLFQISNALWCKSSQSPQSRNICDITCDKFLDDDITDDIMCAKKILD

Chicken lysozyme          DGN-GMNAWVAWRNRCKGTDVQA-WIRGCRL
Baboon alpha-lactalbumin  I--KGIDYWIAHKALC-TEKL-EQWL--CE-K

Clinical Implications of Sequencing
- Fast and reliable diagnosis of disease and risk:
  - Easy diagnosis (with symptoms)
  - In advance of appearance (e.g. Huntington)
  - In utero diagnosis (e.g. cystic fibrosis: thick secretions in lung)
  - Genetic counselling
- Customized treatment (predict response to therapy/side effects)
  - E.g. childhood leukaemia is treated with the toxic drug 6-mercaptopurine. A small fraction of patients used to die as they lack the enzyme thiopurine methyltransferase.
- Identify drug targets
  - Nowadays targets are: ½ receptors, ¼ enzymes, ¼ hormones
  - 7% have unknown targets
- Gene therapy
  - Replace defective genes or supply gene products (insulin for diabetes and Blood Factor VIII for haemophilia)
- However: Most diseases do not have a single genetic cause!

Quick check
By now you should
- Have read chapter 1
- Know the main data sources (sequence and structure)
- Know the role that bioinformatics plays
- Understand the difference between homology and similarity
- Understand what sequence comparison and alignment are
- Understand how they can be useful for phylogenetic studies
- Understand primary, secondary, tertiary structure
- Be able to assess the assumptions made and the quality of data