Download Sequence - BIOTEC - Biotechnology Center TU Dresden
Introduction
Based on Chapter 1 of Lesk, Introduction to Bioinformatics
Michael Schroeder, BioTechnological Center, TU Dresden (Biotec)

Contents
  Molecular biology primer
  The role of computer science
  Phylogeny
  Sequence searching
  Protein structure
  Clinical implications
  Read chapter 1

By Michael Schroeder, Biotec

26 June 2000: Draft of human genome sequenced!
  1953: Watson and Crick discover the structure of DNA
  2000: Draft of human genome is published
  “The most wondrous map ever produced by human kind”
  “One of the most significant scientific landmarks of all time, comparable with the invention of the wheel or the splitting of the atom”

High-throughput biomedicine
  Microarrays
    Measure activity of thousands of genes at the same time
    Example: Cancer. Compare activity with and without drug treatment
    Result: Hundreds of candidate drug targets
  RNAi (Nobel Prize 2006, Fire and Mello)
    Knock down genes and observe effect
    Example: Infectious diseases. Which proteins orchestrate entry into cell?
    Result: Hundreds of candidate proteins
  Atomic force microscopes (Nobel laureate Binnig)
    Pull protein out of membrane and measure force
    Example: Eye diseases resulting from misfolding
    Result: Hundreds of candidate residues

Drug Discovery
  [Figure: new drugs per year and R&D spending ($ billion) by year, 1960 to 1995]
  Challenge: Longer time to market, fewer drugs, exploding costs
  Approach: Use of compound libraries and high-throughput screening

HTS and Bioinformatics
  High-throughput technologies have completely changed the work of biomedical researchers
  Challenge: Interpret (often large) results of screens
  Approach: Before running secondary assays, use bioinformatics and IT to assemble all possible information

Good News
  [Figure: number of PubMed abstracts by year, 1960 to 2000]
  [Figure: number of data sources by year, 2000 to 2005, from the Molecular Biology Database List at Nucleic Acids Research]
  >1,000,000 sequences
  >16,000,000 articles
  >30,000 3D structures
  >700 DBs/tools

Bad News: Data != Knowledge
  How to analyse data, how to integrate data?
  Computer science to the rescue…

Example: computer science is key for sequencing
  Human genome is a string of length 3,200,000,000
  Shotgun sequencing: Break multiple copies of string into shorter substrings
  Example: three copies of the string "shotgunsequencing" are broken into the fragments
    cing en encing equ gun ing ns otgu seq sequ sh sho shot tg uenc un
  Computing problem: Assemble strings

Computer science key for sequencing
  Fragments ordered by position:
    sh sho shot otgu tg gun un ns seq sequ equ uenc encing en cing ing
  QUESTION: How can you handle long repetitive sequences?
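The "assemble strings" problem can be made concrete with a small sketch. The code below is a greedy overlap-merge heuristic, not the method used by any real assembler, and the four fragments are a hypothetical, simplified fragment set (longer than the ones on the slide) chosen so that the overlaps are unambiguous.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap.

    Greedy heuristic for illustration only; it assumes a non-empty list
    and does not guarantee the shortest possible superstring.
    """
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Join the best pair, writing the shared part only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0]


# Hypothetical fragments of "shotgunsequencing" (not the slide's exact set).
reads = ["shotgun", "gunseq", "sequenc", "encing"]
print(greedy_assemble(reads))
```

With very short fragments, or fragments drawn from a long repeat, several merges have the same overlap and the greedy choice becomes ambiguous, which is the difficulty the question on repetitive sequences points at.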
  Example of a repetitive string: Heeeeelllllllllllooooooo
  QUESTION: Why was a draft announced? When was the final version ready?

[Figure: sequenced genomes] Yersinia pestis, Arabidopsis thaliana, Buchnera sp. APS, Caenorhabditis elegans, Campylobacter jejuni, Helicobacter pylori, rat, Chlamydia pneumoniae, Mycobacterium leprae, Rickettsia prowazekii, mouse, Aquifex aeolicus, Vibrio cholerae, Drosophila melanogaster, Neisseria meningitidis Z2491, Plasmodium falciparum, Saccharomyces cerevisiae, Salmonella enterica, Archaeoglobus fulgidus, Borrelia burgdorferi, Bacillus subtilis, Mycobacterium tuberculosis, Escherichia coli, Thermoplasma acidophilum, Pseudomonas aeruginosa, Ureaplasma urealyticum, Thermotoga maritima, Xylella fastidiosa
  Breakthrough of the year 2000

Next quest: Sequencing a genome for $1000

Quantity and quality of data lead to ambitious goals
  Understand integrative aspects of the biology of organisms
  Interrelate sequence, three-dimensional structure, interactions, and function of proteins, nucleic acids and protein-nucleic acid complexes
  Travel in time backward (deduce events in evolutionary history) and forward (deliberate modification of biological systems)
  Applications in medicine, agriculture, and other scientific fields

Scenario
  New virus (e.g. SARS) and goal to develop treatment
  Scientists isolate genetic material of virus
  Screen genome for relationships with previously studied viruses [10]
  From the virus’ DNA they compute the proteins it produces [1]
  Compute the proteins’ three-dimensional structure and thereby obtain clues about their functions
    Screen for similar protein sequences with known structure [15]
    If any are found, then interpret the differences (homology modelling) [25]
    Else predict structure from sequence [55]
  Identify or design a small molecule blocking relevant active sites of the protein [50]
  Design antibodies to neutralize the virus [50]
  Index of problem difficulty: <30: solution exists already, >30: we cannot solve this (yet)

Life in Time and Space
  Life
    A biological organism is a naturally occurring, self-reproducing device that effects controlled manipulations of matter, energy and information
  Time
    Species evolve through natural mutation, recombination of genes in sexual reproduction, or direct gene transfer
    Read the past in contemporary genomes
  Space
    Species occupy local ecosystems
    Species are composed of organisms
    Organisms are composed of cells
    Cells are composed of molecules

DNA – the molecule of life
  [Figure; source: http://www.ornl.gov/hgmis]

Proteins
  20 naturally occurring amino acids in proteins
  Non-polar
    G glycine, A alanine, P proline, V valine, I isoleucine, L leucine, F phenylalanine, M methionine
  Polar
    S serine, C cysteine, T threonine, N asparagine, Q glutamine, H histidine, Y tyrosine, W tryptophan
  Charged
    D aspartic acid, E glutamic acid, K lysine, R arginine
  Other classification
    H, F, Y, W are aromatic and play a role in membrane proteins
  Distinguish atg = adenine-thymine-guanine and ATG = Alanine-Threonine-Glycine

The genetic code
  First position                Second position                  Third position
  (5’ end)      T          C          A           G              (3’ end)
  T             TTT Phe    TCT Ser    TAT Tyr     TGT Cys        T
                TTC Phe    TCC Ser    TAC Tyr     TGC Cys        C
                TTA Leu    TCA Ser    TAA Stop    TGA Stop       A
                TTG Leu    TCG Ser    TAG Stop    TGG Trp        G
  C             CTT Leu    CCT Pro    CAT His     CGT Arg        T
                CTC Leu    CCC Pro    CAC His     CGC Arg        C
                CTA Leu    CCA Pro    CAA Gln     CGA Arg        A
                CTG Leu    CCG Pro    CAG Gln     CGG Arg        G
  A             ATT Ile    ACT Thr    AAT Asn     AGT Ser        T
                ATC Ile    ACC Thr    AAC Asn     AGC Ser        C
                ATA Ile    ACA Thr    AAA Lys     AGA Arg        A
                ATG Met*   ACG Thr    AAG Lys     AGG Arg        G
  G             GTT Val    GCT Ala    GAT Asp     GGT Gly        T
                GTC Val    GCC Ala    GAC Asp     GGC Gly        C
                GTA Val    GCA Ala    GAA Glu     GGA Gly        A
                GTG Val    GCG Ala    GAG Glu     GGG Gly        G

Protein Structure
  DNA: Nucleotides are very similar and hence the structure of DNA is very uniform
  Proteins: Great variety in three-dimensional conformation to support diverse structure and functions
  If heated, a protein “unfolds” to a biologically inactive structure; in normal conditions the protein folds

Paradox
  Translation from DNA sequence to amino acid sequence is very simple to describe, but requires immensely complicated machinery (ribosome, tRNA)
  The folding of the protein sequence into its three-dimensional structure is very difficult to describe, but occurs spontaneously

Central Dogma
  DNA sequence determines protein sequence
  Protein sequence determines protein structure
  Protein structure determines protein function

Observables and Data Archives
  Databases in molecular biology cover
    Nucleic acid and protein sequences
    Macromolecular structures and functions
  Archival databanks of biological information
    DNA and protein sequences including annotations
    Nucleic acid and protein structures including annotations
    Protein expression patterns
  Derived databases
    Sequence motifs (“signatures” of protein families)
    Mutations and variants in DNA and protein sequences
    Classification or relationships (e.g. hierarchy of structures)
  Bibliographic databases (PubMed with 17M abstracts)
  Collections of links to web sites of databases

What is Bioinformatics
  Bioinformatics is the marriage of biology and information technology
  Bioinformatics is an integrated multidisciplinary field
  Covers computational tools and methods for managing, analysing and manipulating sets of biological data
  Disciplines include: biochemistry, genetics, structural biology, artificial intelligence, machine learning, software engineering, statistics, database theory, information visualisation, algorithm design

Bioinformatics has three components
  Creation of databases
  Development of algorithms to analyse data
  Use of these tools for analysing biological data

Databases: Types of Queries 1/2
  1. Given a sequence (fragment), find sequences in the database that are similar to it
  2. Given a protein structure (or fragment), find protein structures in the database that are similar to it
  3. Given the sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures
  4. Given a protein structure, find sequences in the database that correspond to similar structures

Databases: Given sequence, find structure
  3. Given the sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures. But how?
    Easy: Find similar sequences with known structure!
    But: There might be similar structures whose sequence is not similar!
  4. Given a protein structure, find sequences in the database that correspond to similar structures. But how?
    Easy: Find similar structures and hence sequences
    But: There are so many more sequences with unknown structure that the above method will have only very limited success
  1 and 2 are solved, 3 and 4 are active fields of research

Databases: Types of Queries 2/2
  E.g. for which proteins of known structure involved in diseases of disrupted purine biosynthesis in humans are there related proteins in yeast?
  Solution: Virtual databases that provide transparent access to a number of underlying data sources and query and analysis tools

Databases: Curation and Quality
  Problems:
    Given that there are primary and secondary databases, how to control updates, how to propagate change, how to maintain consistency?
    Contents (experimental results, annotations, supplementary information) all have their own sources of error
    Older data were limited by older techniques

Databases: Annotation
  Experimental data (e.g. raw DNA sequence) needs to be enriched with annotations
    Source of data
    Investigators responsible
    Relevant publication
    Feature tables (e.g. coding regions)
  Problems:
    (Often) lack of controlled and coherent vocabulary
    Computer parseable
    Automated annotation needed
      SwissProt = ca. 130,000 annotated sequences
      TrEMBL = ca. 850,000 unannotated sequences
    Maintenance of annotations (what if error detected?)
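Looking back at the genetic code table and at the scenario step "from the virus' DNA they compute the proteins it produces", the lookup is simple enough to sketch in a few lines. This is a minimal sketch under stated assumptions: it reads a single frame starting at the first base, uses the standard code with T instead of U as in the table, and stops at the first stop codon. The names `CODON_TABLE` and `translate` are illustrative, not taken from the slides or from any library.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids in table order: first, second, third base each run
# through T, C, A, G, with the third base changing fastest. "*" marks a stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon until the first stop codon.

    Assumes the reading frame starts at position 0 and that the input
    contains only A, C, G, T (upper or lower case); anything else raises
    a KeyError.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Lower-case input is DNA (atg = adenine-thymine-guanine); the upper-case
# output is protein (M = methionine), matching the slide's distinction.
print(translate("atggccaaatga"))
```

Building the table from the ordered amino-acid string keeps the example short; a real pipeline would also have to choose among reading frames and handle ambiguous bases.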

Computers and Computer Science
  Relevant areas:
  Artificial Intelligence
  Machine Learning
    Neural networks, rule-based learning
  Datamining
    Association rules
  Software Engineering
    Design, implementation, testing of software
  Programming
    Object-oriented: C++, Java
    Imperative: C, Modula, Pascal, Cobol, Fortran
    Logic: Prolog
    Functional: ML
    Scripting: Perl, Python
  Statistics
  Database theory
    Design and maintenance of databases
    How to index sequences, time series, 3D structures
  Information Visualisation
    Graph drawing, diagrams, cartoons, 3D graphics
  Algorithm design
    Complexity of algorithms
    Efficient data structures

Programming
  We will use Python
    Scripting language
    Supports string processing well
    Widely used in bioinformatics

Biological Classification and Nomenclature
  Back in the 18th century, Linnaeus, a Swedish naturalist, classified living things according to a hierarchy: Kingdom, Phylum, Class, Order, Family, Genus, Species
  Generally only genus and species are used for identification
    Homo sapiens
    Drosophila melanogaster
    Bos taurus
  Linnaeus’ classification is based on observed similarity
  It widely reflects biological ancestry

Classification of Humans and Fruit Flies
            Human        Fruit fly
  Kingdom:  Animalia     Animalia
  Phylum:   Chordata     Arthropoda
  Class:    Mammalia     Insecta
  Order:    Primates     Diptera
  Family:   Hominidae    Drosophilidae
  Genus:    Homo         Drosophila
  Species:  sapiens      melanogaster

Homology = derived from common ancestor
  Characteristics derived from a common ancestor are called homologous
    E.g. eagle’s wing and human’s arm
  Other apparently similar characteristics may have arisen independently by convergent evolution
    E.g. eagle’s wing and bee’s wing. The most recent common ancestor of eagles and bees did not have wings
  Homologous characters may diverge functionally
    E.g. bones in the human middle ear and jaws of primitive fish

Sequence analysis and Homology
  Sequence analysis gives unambiguous evidence for relationship of species
  For higher organisms, sequence analysis and the classical tools of comparative anatomy, palaeontology, and embryology are often consistent
  For microorganisms there are problems
    Classical methods: how to describe features
    Sequence analysis: lateral gene transfer

Domains of Life
  Ribosomal RNA is present in all organisms
  Based on 16S ribosomal RNAs, life is divided into
  Bacteria
    No nucleus (prokaryote)
    E.g. Mycobacterium tuberculosis and E. coli
  Archaea
    No nucleus (prokaryote)
    Few organisms living in hostile environments (thermophiles, halophiles, sulphur reducers, methanogens)
  Eukarya
    Has a nucleus contained in a membrane
    Nucleus contains chromosomes
    Internal compartments called organelles for specialised biological processes
    Area outside nucleus and organelles called cytoplasm
    E.g. yeast and human beings

Eukaryotic cell
  [Figure]

Domains of Life
  [Figure]

Example: Use of sequences to determine phylogenetic relationships
  Use ExPASy (www.expasy.ch/cgi-bin/sprot-search-ful) to search for pancreatic ribonuclease for horse (Equus caballus), minke whale (Balaenoptera acutorostrata), red kangaroo (Macropus rufus)
  sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse).
  KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTF
  VHEPLADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKY
  PNCAYQTSQKERHIIVACEGNPYVPVHFDASVEVST
  Use sequence alignment to determine evolutionary relationship

Sequence alignment
  Global match: align all of one sequence with all of the other (mismatches, insertions, deletions)
    And.--so,.from.hour.to.hour.we.ripe.and.ripe
    ||||    ||||||||||||||||||||||||   ||||||
    And.then,.from.hour.to.hour.we.rot-.and.rot-
  Local match: find a region in one sequence that matches the other (mismatches, insertions, deletions; ends can be ignored)
      My.care.is.loss.of.care,.by.old.care.done,
        |||||||||    |||||||||||||   |||||| ||
    Your.care.is.gain.of.care,.by.new.care.won

Sequence alignment
  Motif search: find matches of a short sequence in a long sequence
  Options: perfect match, 1 mismatch, mismatches + gaps + insertions + deletions
            match
             ||||
    for the watch to babble and to talk is most tolerable

Sequence alignment
  Multiple sequence alignment
    No.sooner.---met.--------.but.they.look’d
    No.sooner.look’d.--------.but.they.lo-v’d
    No.sooner.lo-v’d.--------.but.they.sigh’d
    No.sooner.sigh’d.--------.but.they.--asked.one.another.the.reason
    No.sooner.knew.the.reason.but.they.-------------sought.the.remedy
    No.sooner.               .but.they.

Example: Multiple alignment
  Use sequence alignment to determine evolutionary relationship…
  Example: horse, whale and kangaroo
  Expected: horse and whale are placental mammals, kangaroo is a marsupial
  Multiple alignment with CLUSTAL-W (www.ebi.ac.uk/clustalw)

FASTA format
>sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ
KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF
DASVEVST
>sp|P00673|RNP_BALAC Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Balaenoptera acutorostrata (Minke whale) (Lesser rorqual).
RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ
KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF
DNSV
>sp|P00686|RNP_MACRU Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Macropus rufus (Red kangaroo) (Megaleia rufa).
ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQE
NVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEGQYVPVHFDA
YV

Multiple Alignment with ClustalW (www.ebi.ac.uk/clustalw)
CLUSTAL W (1.82) multiple sequence alignment

sp|P00674|RNP_HORSE  KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ 60
sp|P00673|RNP_BALAC  RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ 60
sp|P00686|RNP_MACRU  -ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ 59
                      *:** **:*****:  :......*** **  *.**.* ***:***:**.   *.*:* *

sp|P00674|RNP_HORSE  KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF 120
sp|P00673|RNP_BALAC  KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF 120
sp|P00686|RNP_MACRU  ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF 118
                     :*: ****::***:*.* : **:** *..****** *:**: :::*******  ******

sp|P00674|RNP_HORSE  DASVEVST 128
sp|P00673|RNP_BALAC  DNSV---- 124
sp|P00686|RNP_MACRU  DAYV---- 122
                     *  *

Example: Number of Aligned Residues
  Horse and Minke whale: 95
  Minke whale and Red kangaroo: 82
  Horse and Red kangaroo: 75
  Conclusion: Horse and whale share the most identical residues

Example: Elephant and Mammoth
  Mitochondrial cytochrome b from
    Siberian woolly mammoth (Mammuthus primigenius), preserved in arctic permafrost
    African elephant (Loxodonta africana)
    Indian elephant (Elephas maximus)

African elephant: sp|P24958|CYB_LOXAF
Mammoth: sp|P92658|CYB_MAMPR
Indian elephant: sp|O47885|CYB_ELEMA

CYB_LOXAF  MTHIRKSHPLLKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
CYB_MAMPR  MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
CYB_ELEMA  MTHTRKFHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
           *** ** ***:**:**********************************************

CYB_LOXAF  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
CYB_MAMPR  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
CYB_ELEMA  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
           ************************************************************

CYB_LOXAF  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
CYB_MAMPR  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFA 180
CYB_ELEMA  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
           **************************************:*********************

CYB_LOXAF  LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
CYB_MAMPR  LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILFLL 240
CYB_ELEMA  FHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
           :********:***********************************************:**

CYB_LOXAF  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
CYB_MAMPR  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
CYB_ELEMA  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSILI 300
           ******************************************************:*****

CYB_LOXAF  LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYIIIGQMASILYFS 360
CYB_MAMPR  LGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIIIGQMASILYFS 360
CYB_ELEMA  LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEHPYIIIGQMASILYFS 360
           **:*************************: *** **********:***************

CYB_LOXAF  IILAFLPIAGVIENYLIK 378
CYB_MAMPR  IILAFLPIAGMIENYLIK 378
CYB_ELEMA  IILAFLPIAGMIENYLIK 378
           **********:*******

Example: Elephant and Mammoth
  Mammoth and African elephant have 10 mismatches, mammoth and Indian elephant 14.
  Significant?

Similarity and Homology
  Important difference:
    Similarity is the measurement of resemblance of sequences
    Homology: common ancestor
  Similarity is gradual, homology is either true or false
  Similarity = now, homology = past events
  Homology is only very rarely directly observed (e.g. lab population, clinical study of viral infection)
  Homology is inferred from sequence similarity

Example: Homology/Similarity
  The assertion that the cytochrome b sequences are homologues means that there is a common ancestor
  BUT:
  1. Maybe cytochrome b functionally requires so many conserved residues that similar sequences will occur in many species (in fact, this is not the case here)
  2. Maybe cytochrome b has to function this way in elephant-like species, but in fact started out from different ancestors (i.e. convergent evolution)
  3. Maybe mammoth and African elephant have fewer mismatches only because the Indian elephant’s DNA mutated faster
  4. Maybe all of them acquired cytochrome b through a virus (horizontal gene transfer)

Example: Conclusion
  Classical methods confirm that for pancreatic ribonuclease inferring homology from similarity is justified
  But whether mammoths are closer to African or Indian elephants is too close to call
  Problems with inferring phylogeny from gene and protein sequences
    Wide range of variation (possibly below statistical significance)
    Different rates of evolution for different branches of the evolutionary tree

Inferring Phylogenies with SINES and LINES
  Requirements:
    ‘All-or-none’ character
    Irreversible appearance
  Solution: SINES and LINES (Short and Long Interspersed Nuclear Elements)
    Repetitive, non-coding sequences in eukaryotic genomes
    >30% in human genome, >50% in some plants
    SINES = 70-500 base pairs long, up to 10^6 copies
    LINES up to 7000 base pairs, up to 10^5 copies
    They enter the genome by reverse transcription of RNA

A practical example: Fatherhood
  The picture shows a Southern blot of DNA from different family members, probed using a minisatellite. You can work out which of F1 and F2 is the father of child C by observing which bands they have in common.
  (Reproduced from "Essential Medical Genetics" by M. Connor and M. Ferguson-Smith, with permission from Blackwell Science.)

Why SINES are useful in phylogeny
  Either present or absent
  Inserted at random in the non-coding portion of the genome, i.e. a SINE has no important function, so that convergent evolution can be excluded
    Presence of a SINE in two species and absence in a third implies that the first two species are more closely related
  SINE insertion appears to be irreversible
    Temporal order
    Presence of a SINE in two species and absence in a third implies that the ancestor of the first two species is younger than the ancestor of all three

Example revisited
  What is the closest land-based relative of the whales?
  Classical palaeontology links Cetacea (whales, dolphins, porpoises) with Artiodactyla (including e.g. cattle)
  Belief that Cetaceans diverged before Artiodactyla split into the suborders Suiformes (e.g. pigs), Tylopoda (e.g. camels, llamas), Ruminantia (e.g. deer, cattle, goats, sheep, antelopes, giraffe)
  Sequence comparison results
    Based on mitochondrial DNA, pancreatic ribonuclease, fibrinogen, and others
    Closest relatives of whales are hippopotamuses (they share 4 SINES)
    These two are closest to Ruminantia

Searching for Similar Sequences with PSI-Blast
  Any search method for sequences should be
    Sensitive: also pick up distant relationships
    Selective: reported relationships are true
  Example: database with (among others) 1000 globin sequences
    Globin family (oxygen transport) of proteins occurs in many species
    Proteins have the same function and structure
    But there are pairs of members of the family sharing less than 10% identical residues
  [Figure: sequence database containing 1000 globin sequences; a search returns 900 results]
    True positives: 700 out of 900 search results are really globins
    False positives: 200 out of 900 are not globins
    False negatives: 300 out of 1000 are not found

Searching for Distant Relationships with PSI-BLAST
  How can we find distant relationships without increasing the false negatives?
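The numbers in the globin example (1000 globins in the database, 900 search results, 700 of them true globins) can be turned into the two quality measures named above. This is a small sketch; mapping "sensitive" to the fraction of true family members found and "selective" to the fraction of reported hits that are correct is an interpretation of the slide, and `search_quality` is a hypothetical helper name.

```python
def search_quality(true_pos, false_pos, false_neg):
    """Return (sensitivity, selectivity) of a database search.

    sensitivity: fraction of real family members that the search found
    selectivity: fraction of reported hits that are real family members
    """
    sensitivity = true_pos / (true_pos + false_neg)
    selectivity = true_pos / (true_pos + false_pos)
    return sensitivity, selectivity


# Globin example: 700 true positives, 200 false positives, 300 false negatives.
sens, sel = search_quality(true_pos=700, false_pos=200, false_neg=300)
print(f"sensitivity = {sens:.2f}, selectivity = {sel:.2f}")
# prints: sensitivity = 0.70, selectivity = 0.78
```

Loosening the search threshold typically raises sensitivity at the cost of selectivity, which is why finding distant relatives without that trade-off is posed as a problem.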
  PSI-BLAST: Position-Specific Iterated Basic Local Alignment Search Tool
  Identifies patterns within the sequences
  Score via intermediaries may be better than score from direct comparison
  [Figure: A and B share 50%, B and C share 50%, A and C share only 10%]

PSI-BLAST Example
  Human PAX-6 gene (SwissProt ID P26367) has homologues in many different species
  PSI-Blast at NCBI site www.ncbi.nlm.nih.gov

Result
BLASTP 2.2.6 [Apr-09-2003]
RID: 1062117117-16602-2157828.BLASTQ3
Query= gi|6174889|sp|P26367|PAX6_HUMAN Paired box protein Pax-6 (Oculorhombin) (Aniridia, type II protein). (422 letters)
Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
1,509,571 sequences; 486,132,453 total letters

Results of PSI-Blast iteration 1
Sequences with E-value BETTER than threshold

                                                                           Score    E
Sequences producing significant alignments:                                (bits)   Value
gi|4505615|ref|NP_000271.1| paired box gene 6 isoform a; Paired box h...   781      0.0
gi|189353|gb|AAA59962.1| oculorhombin >gi|189354|gb|AAA59963.1| oculo...   780      0.0
gi|6981334|ref|NP_037133.1| paired box homeotic gene 6 [Rattus norveg...   778      0.0
gi|26389393|dbj|BAC25729.1| unnamed protein product [Mus musculus]         776      0.0
gi|7305369|ref|NP_038655.1| paired box gene 6; small eye; Dickie's sm...   776      0.0
gi|383296|prf||1902328A PAX6 gene                                          775      0.0
gi|4580424|ref|NP_001595.2| paired box gene 6 isoform b; Paired box h...   775      0.0
gi|18138028|emb|CAC80516.1| paired box protein [Mus musculus]              773      0.0
gi|2576237|dbj|BAA23004.1| PAX6 protein [Gallus gallus]                    770      0.0
gi|27469846|gb|AAH41712.1| Similar to paired box gene 6 [Xenopus laevis]   768      0.0
…

Introduction to Protein Structure
  Proteins play a variety of roles:
    Structural (viral coat proteins, horny outer layer of human and animal skin, cytoskeleton)
    Catalysis of chemical reactions (enzymes)
    Transport and storage (e.g. haemoglobin)
    Regulation (e.g. hormones)
    Receptor and signal transduction
    Genetic transcription
    Recognition (cell adhesion molecules)
    Antibodies and other proteins of the immune system

Proteins
  Are large molecules
  Only a small part – the active site – is functional
  Evolve by structural changes produced by mutations in the amino acid sequence
  Ca. 40,000 protein structures are now known
  Can be obtained by X-ray crystallography or nuclear magnetic resonance (NMR)

Structure of Proteins
  Backbone and sidechain
       Residue i-1   Residue i    Residue i+1
         S(i-1)        S(i)         S(i+1)          Sidechain (variable)
           |            |             |
    …N - Cα - C - N - Cα - C - N - Cα - C - …      Mainchain (constant)
              ||           ||           ||
              O            O            O
  Polypeptide chain folds into a curve in space
  Common structural features
    Alpha-helix
    Beta-sheet

Hierarchy of Architecture
  Primary structure: Amino acid sequence
  Secondary structure: Helices, sheets, loops, hydrogen-bonding pattern of main chain
  Tertiary structure: Assembly and interactions of helices, sheets, etc.
  Quaternary structure: Assembly of monomers
  Evolution can merge proteins
    Five enzymes in E. coli that catalyze successive steps in the biosynthesis of aromatic amino acids correspond to one protein in Aspergillus nidulans
    Globins form tetramers in mammalian haemoglobin and dimers in the ark clam Scapharca inaequivalvis

Protein Structure
  Triosephosphate isomerase from Bacillus stearothermophilus
  Highly efficient enzyme appearing in most species

Hierarchy of Architecture: supersecondary structure
  [Figure: Alpha-helix hairpin, Beta hairpin, Beta-alpha-beta unit]

Hierarchy of Architecture
  Supersecondary structures:
    Alpha-helix hairpin
    Beta hairpin
    Beta-alpha-beta unit
  Domains:
    Compact unit, single chain, independent stability
  Modular proteins:
    Multi-domain
    Copies of related domains or “mix-and-match”

Classification of Protein Structure
  All Alpha: mostly alpha helices
  All Beta: mostly beta sheets
  Alpha+Beta: Helices and sheets in different parts of the molecule, no beta-alpha-beta units
  Alpha/Beta: Helices and sheets assembled from beta-alpha-beta units
    Alpha/Beta linear
    Alpha/Beta barrel
  Little or no secondary structure

SCOP: Structural Classification of Proteins
  top
  CLASS
    All Alpha (218), All Beta (144), Alpha+Beta (279), Alpha/Beta (136)
  FOLD
    Trypsin-like serine proteases (1), Immunoglobulin-like (23)
  SUPERFAMILY = evolutionarily related, similar structure, not necessarily similar sequence
    Transglutaminase (1), Immunoglobulin (6)
  FAMILY = set of domains with similar sequence
    C1 set domains (antibody constant), V set domains (antibody variable)

Pymol
  [Figure]

Engrailed homeodomain (1enh)
  Transcription factor important in development
  Used to study protein folding
Utrophin calmodulin homology domain (1bhd)
  Actin binding
  Closely related to dystrophin, whose lack causes muscular dystrophies (weak muscles)
Cytochrome c, rice (1ccr)
  Electron transport across mitochondrial membrane
DNA-binding domain of HIN recombinase (1hcr)

Fibronectin III domain (1fna)
  Found on cell surface
Mannose-binding protein (1npl)
Barnase (1brn)
  Cleaves RNA and is lethal if intracellular and not inhibited by barstar
TATA-box-binding protein (1cdw)

OB-domain from Lys-tRNA synthetase (1bbw)
Scytalone dehydratase (3std)
Alcohol dehydrogenase, NAD-binding domain (1ee2)
  Breakdown of alcohol into simpler compounds
Adenylate kinase (3adk)
  Energy production

Chemotaxis receptor methyltransferase (1af7)
Thiamine phosphate synthase (2tps)
Pancreatic spasmolytic polypeptide (2psp)

Protein Structure Prediction and Engineering
  If the sequence of amino acids contains enough information to specify the three-dimensional structure of proteins, it should be possible to devise an algorithm for prediction
  Secondary structure prediction:
    Which segments of the sequence are helices, which strands?
  Fold recognition:
    Given a library of known structures with their sequences and a sequence with unknown structure, can we find the structure that is most similar?
  Homology modelling:
    Given two homologous sequences, one with and one without known structure: if more than 50% of the residues are identical, the structure can serve as a model

Critical Assessment of Structure Prediction (CASP)
  Chicken lysozyme           KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGS
  Baboon alpha-lactalbumin   KQFTKCELSQNLY--DIDGYGRIALPELICTMFHTSGYDTQAIVEND-ES

  Chicken lysozyme           TDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVS
  Baboon alpha-lactalbumin   TEYGLFQISNALWCKSSQSPQSRNICDITCDKFLDDDITDDIMCAKKILD

  Chicken lysozyme           DGN-GMNAWVAWRNRCKGTDVQA-WIRGCRL
  Baboon alpha-lactalbumin   I--KGIDYWIAHKALC-TEKL-EQWL--CE-K

Clinical Implications
  Fast and reliable diagnosis of disease and risk:
    With symptoms
    In advance of appearance (e.g. Huntington)
    In utero (e.g. cystic fibrosis: mutation in the cystic fibrosis transmembrane conductance regulator (CFTR), which is a chloride ion channel)
    Genetic counselling
  Customized treatment
    E.g. childhood leukaemia is treated with the toxic drug 6-mercaptopurine. A small fraction of patients used to die as they lack the enzyme thiopurine methyltransferase.
  Identify drug targets
    ½ are receptors, ¼ are enzymes, ¼ are hormones
    7% have unknown targets
  Gene therapy
    Replace defective genes or supply gene products (insulin for diabetes and Blood Factor VIII for haemophilia)
  However: Most diseases do not have a single genetic cause!

Quick check
  By now you should
    Have read chapter 1
    Know the main data sources (sequence and structure)
    Know the role that bioinformatics plays
    Understand the difference between homology and similarity
    Understand what sequence comparison and alignment are
    Understand how they can be useful for phylogenetic studies
    Understand primary, secondary, tertiary structure
    Be able to assess the assumptions made and the quality of data
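
As a self-check on "what sequence comparison and alignment are", the counting behind the pancreatic ribonuclease comparison can be sketched in Python, the language chosen for the course. The snippet assumes the sequences are already aligned (equal length, "-" for gaps) and uses only the short final block of the ClustalW output (DASVEVST, DNSV----, DAYV----) as input; `count_identities` is a hypothetical helper, not part of ClustalW.

```python
from itertools import combinations


def count_identities(a, b, gap="-"):
    """Count aligned columns in which both sequences have the same residue.

    Assumes a and b are rows of the same alignment, so they have equal
    length; columns containing a gap never count as identical.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x == y and x != gap)


# Final block of the horse / minke whale / red kangaroo alignment.
rows = {
    "horse": "DASVEVST",
    "whale": "DNSV----",
    "kangaroo": "DAYV----",
}

for (name1, seq1), (name2, seq2) in combinations(rows.items(), 2):
    print(name1, name2, count_identities(seq1, seq2))
```

Applied to the full aligned rows instead of this last block, the same count yields the kind of pairwise totals quoted for the three species; whether a difference between two such totals is significant is a separate question, as the elephant and mammoth example shows.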