Introduction to bioinformatics
Sylvia B. Nagl

What is bioinformatics?
• an emerging interdisciplinary research area
• deals with the computational management and analysis of biological information: genes, genomes, proteins, cells, ecological systems, medical information, robots, artificial intelligence...

The Core of Bioinformatics to date
• Relationships between sequence, 3D structure, and protein functions
  Example sequence: TDQAAFDTNIVTLTRFVM EQGRKARGTGEMTQLLNS LCTAVKAISTAVRKAGIA HLYGIAGSTNVTGDQVKK LDVLSNDLVINVLKSSFA TCVLVTEEDKNAIIVEPE KRGKYVVCFDPLDGSSNI DCLVSIGTIFGIYRKNST DEPSEKDALQPGRNLVAA GYALYGSATMLV
• Properties and evolution of genes, genomes, proteins, metabolic pathways in cells
• Use of this knowledge for prediction, modelling, and design

“The holy grail of bioinformatics”
GCTCCTCACTGTCTGTGTTTATTC TTTTAGCTTCTTCAGATCTTTTAG TCTGAGGAAGCCTGGCATGTGCA AATGAAGTTAACCTAA...
> 500,000 genes sequenced to date
Expected number of unique protein structures: ~700-1,000

Basic concepts
• conceptual foundations of bioinformatics: evolution, protein folding, protein function
• bioinformatics builds mathematical models of these processes to infer relationships between components of complex biological systems

Information processing in cells
Diagram labels: nucleic acids, proteins, coding regions, regulatory sites, transcripts
One-to-many mappings! Context-dependence!

Global approaches: Toward a new Systems Biology
Diagram labels: Global cell state; Genome; Protein population: proteomics; Genome activation patterns: transcriptomics
• How does the spatial and temporal organisation of living matter give rise to biological processes?
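As a deliberately simplified illustration of the nucleic acid to protein mapping named under "Information processing in cells", the sketch below transcribes and translates a short DNA string with the standard genetic code. The function names and the toy input are hypothetical, and the code assumes a single reading frame starting at the first base. It ignores regulatory sites, splicing, and every other form of the context-dependence the slides stress, so it shows only the simplest one-to-one case.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base T
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Return the mRNA-style string for a coding-strand DNA string (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(dna: str) -> str:
    """Translate coding-strand DNA in frame 0, stopping at the first stop codon.

    Assumes the input contains only the letters A, C, G, T.
    Trailing bases that do not fill a complete codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy input, not taken from any database entry.
toy_dna = "ATGGCCTGGTAA"
toy_mrna = transcribe(toy_dna)
toy_protein = translate(toy_dna)
```

Here `toy_protein` holds the one-letter amino acid string for the toy input; real coding regions would first have to be located within the genome, which is itself a prediction problem.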
Organisation: tissue, cells, molecular complexes (imaging, EM, X-ray, NMR)

Global approaches: Toward a new Systems Biology
Diagram labels: Perturbation, Living cell, Dynamic response, Biological knowledge (computerised), Sequence information, Structural information, Bioinformatics, Mathematical modelling, Simulation, “Virtual cell”
• Basic principles
• Practical applications

We do not know yet whether the information in the genome is sufficient to reconstruct an entire biological system. Information on building blocks is not enough; information on their interactions is essential.
Diagram labels: External environment, Internal environment, Metabolic net, Genetic networks, DNA, hRNA, mRNAs, proteins

Bioinformatics in context
Diagram labels: Mathematics/computer science; Genomics; Molecular biology; Ethical, legal, and social implications; Bioinformatics; Biophysics; Molecular evolution

Current challenges to users
• Potential hurdles:
  • methods are in flux and not fully developed
  • scattered and heterogeneous resources
• Remedies:
  • Web resources
  • navigation guides
  • integration of tools and databanks
http://www.biochem.ucl.ac.uk/~nagl/bioinformatics.html

Sequence homology search of the genome of Plasmodium falciparum
Target identification for antimalarial drugs

The search for new antimalarial drugs
• Malaria is one of the leading causes of morbidity and mortality in the tropics.
• 300 to 500 million estimated clinical cases and 1.5 million to 2.7 million deaths per year.
• Nearly all fatal cases are caused by Plasmodium falciparum.
• The parasite's resistance to conventional antimalarial drugs such as chloroquine is growing at an alarming rate.
• P. falciparum has a plastid-like organelle, called the apicoplast, acquired by endosymbiosis of an alga. Jomaa et al. (1999)
• Self-replicating, maternally inherited (35 kb, circular DNA).
• Comparative genome analysis: search for orthologs. The apicoplast contains enzymes found in plant and bacterial, but not animal, metabolic pathways.
• Potential target for antimalarial drugs: DOXP reductoisomerase
Jomaa et al. (1999) Science 285: 1573-1576

Biological databases
The challenge (Boguski, 1999)
In 1995, the number of genes in the database started to exceed the number of papers on molecular biology and genetics in the literature!

Data types
• primary data (primary database): sequence
  • DNA: AATGCGTATAGGC
  • amino acid: DMPVERILEALAVE
• secondary data (secondary db): secondary protein structure, e.g., alpha-helices, beta-strands; “motifs”: regular expressions, blocks, profiles, fingerprints
• tertiary data (tertiary db): tertiary protein structure; atomic co-ordinates; domains, folding units

Primary biological databases
• Nucleic acid: EMBL, GenBank, DDBJ (DNA Data Bank of Japan)
• Protein: PIR, MIPS, SWISS-PROT, TrEMBL, NRL-3D

International nucleotide data banks
• EMBL: Europe (EMBL, EBI)
• GenBank: USA (NLM, NCBI)
• DDBJ: Japan (NIG, CIB)
• International Advisory Meeting; Collaborative Meeting
• Also shown: TrEMBL, NRDB

GenBank file format
[example entry shown on slides, not reproduced]

SWISS-PROT file format
[example entry shown on slides, not reproduced]

Other primary protein databases
• TrEMBL (translated EMBL)
  • in SWISS-PROT format
  • rapid access to sequence data from genome projects
  • computer-annotated supplement to SWISS-PROT
  • translations of all coding sequences (CDS) in EMBL
• SP-TrEMBL
• REM-TrEMBL: immunoglobulins, T-cell receptors, short fragments, synthetic and patented sequences

Other primary protein databases
The Protein Information Resource (PIR)
• integrated system of protein sequence databases and derived related databases, e.g., alignment databases
• rapid searching, comparison, and pattern matching of protein sequences
• retrieval of descriptive, bibliographic, feature, and concurrent cross-reference information
• aims to be comprehensive and consistently annotated

PIR: related databases
NRL-3D Sequence-Structure Database
• produced by PIR from sequence and annotation information extracted from three-dimensional structures in the Protein Databank (PDB)
• allows keyword and similarity searches

PIR: related databases
PATCHX integrated with PIR
• a non-redundant database of protein sequences produced by MIPS, the European branch of PIR-International
The PIR Protein Sequence Database and PATCHX together provide the most complete collection of protein sequence data currently available in the public domain.

Composite protein sequence dbs
• NRDB: PIR, SP, PDB, GenPept
• OWL: PIR, SP, GenBank, NRL-3D
• MIPSX (PIR+PATCHX): PIR, SP, MIPSOwn, NRL-3D, MIPSH, PIRMOD, MIPSTrn, EMTrans, GBTrans, Kabat, PseqIP
• SP+TrEMBL: TrEMBL, SP

OWL composite database
• By accession number
• By database code
• By text
• By sequence
• By title
• By author
• By query language
• By regular expression
OWL only released every 6-8 weeks
Direct OWL access: OWL Blast server

Two other useful sites
INFOBIOGEN: The Public Catalog of Databases
http://www.infobiogen.fr/services/dbcat/
KEGG: Kyoto Encyclopedia of Genes and Genomes
http://www.genome.ad.jp/kegg/
Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to computerize current knowledge of molecular and cellular biology in terms of the information pathways that consist of interacting molecules or genes, and to provide links from the gene catalogs produced by genome sequencing projects.

Sequence Retrieval System (SRS)
Database browser that allows users to
• retrieve
• link
• access
entries from all interconnected resources. Users can formulate queries across a range of different database types.
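The ideas of a non-redundant composite database and of querying by regular expression can be sketched together in a few lines. Everything below is a toy under stated assumptions: the entries and accession codes are invented, the function names are hypothetical, redundancy is defined as exact sequence identity, and the source priority order is an illustrative choice rather than the documented rule of OWL, NRDB, or MIPSX. The query pattern is the commonly cited N-glycosylation site motif, N-{P}-[ST]-{P}, written as a Python regular expression.

```python
import re

# Hypothetical toy entries: (source database, accession, sequence).
# None of these are real records.
ENTRIES = [
    ("PIR", "PIR_0001", "MKTAYIAKQR"),      # duplicate of SP_0001's sequence
    ("SP", "SP_0001", "MKTAYIAKQR"),
    ("PIR", "PIR_0002", "GSTNVTGDQV"),
    ("NRL-3D", "NRL_0001", "AANCSLLPGA"),
]

# Assumed priority: when two sources hold the same sequence,
# the entry from the source listed first is kept.
PRIORITY = ["SP", "PIR", "NRL-3D"]


def build_composite(entries, priority):
    """Merge entries into a non-redundant set keyed by exact sequence."""
    rank = {name: position for position, name in enumerate(priority)}
    composite = {}
    # Visit higher-priority sources first so their entries win.
    for source, accession, sequence in sorted(entries, key=lambda e: rank[e[0]]):
        composite.setdefault(sequence, (source, accession))
    return composite


def search_by_regex(composite, pattern):
    """Return sorted accessions whose sequence contains a match for pattern."""
    regex = re.compile(pattern)
    return sorted(
        accession
        for sequence, (_source, accession) in composite.items()
        if regex.search(sequence)
    )


composite = build_composite(ENTRIES, PRIORITY)

# N-{P}-[ST]-{P}: Asn, any residue except Pro, Ser or Thr, any residue except Pro.
hits = search_by_regex(composite, r"N[^P][ST][^P]")
```

Exact-match deduplication is the simplest possible redundancy criterion. Real composite databases also have to decide how to treat near-identical sequences and fragments, which is one reason their contents differ.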
Guide to Protein Databases:
http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture1/index.html
http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture2/index.html
With thanks to Dr Roman Laskowski.