Download slides - Indiana University Computer Science Department
Overview of I519/I617 & Introduction to Bioinformatics
Yuzhen Ye ([email protected])
School of Informatics and Computing, IUB

Structure of I519
Two classes and one lab each week
Python & C (& R)
Textbook: Understanding Bioinformatics
Homework assignments (~5 in total)
Grading:
  – midterm exam (30%) + final exam (25%) + assignments (30%) + class project (15%)
Course webpage: http://mendel.informatics.indiana.edu/~yye/lab/teaching/fall2011-I519.php

What’s bioinformatics

What’s Bioinformatics
“Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to assess relationships among members of large data sets; the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information.” (NCBI)
“I do not think all biological computing is bioinformatics, e.g. mathematical modelling is not bioinformatics, even when connected with biology-related problems. In my opinion, bioinformatics has to do with management and the subsequent use of biological information, particular genetic information.” (Durbin)

Bioinformatics vs Computational Biology
Almost interchangeable
Computational biology may be broader
  – Computational biology is an interdisciplinary field that applies the techniques of computer science, applied mathematics and statistics to address biological problems (Wikipedia)
  – Includes bioinformatics

Impacts of Bioinformatics
On biological sciences (and medical sciences)
  – Large scale experimental techniques
  – Information growth
On computational sciences
  – Biology has become a large source of new algorithmic and statistical problems!

Related Fields
Proteomics/genomics/metagenomics/comparative genomics/structural genomics
Chemical informatics
Health informatics/Biomedical informatics
Complex systems
Systems biology
Biophysics
Mathematical biology
  – tackles biological problems using methods that need not be numerical and need not be implemented in software or hardware

Bioinformatics Problems/Applications
Figure from “Bioinformatics dummies”

Biology primer

Biology Primer
Eggs, cell divisions, multicellular organisms (Figure 1-1, Molecular Biology of the Cell)
Underlying the diversity of life is a striking unity: DNA is the universal genetic language; cells are the basic units of structure and function

Cells are the Basic Unit of Life
Cell Theory
  – All organisms are made up of cells
  – The cell is the basic living unit of organization for all organisms
  – All cells come from pre-existing cells by division
  – Cells contain hereditary information, which is passed from cell to cell during cell division
  – All cells are basically the same in chemical composition
  – All energy flow (metabolism & biochemistry) of life occurs within cells
Organisms can be of single cells or multiple cells (multicellular organisms)
  – Most living organisms are single cells (e.g., E. coli, yeast)
  – Multicellular organisms (e.g., a human has more than 10^13 cells. Have no idea about this number? The world population as of July 2008 was 6.684 billion; 1 billion = 10^9)

Cell Structures
Animal cell structure: http://hyperphysics.phy-astr.gsu.edu/hbase/biology/imgbio/cellhlabel.gif
Prokaryotic cell structure: http://micro.magnet.fsu.edu/cells/procaryotes/images/procaryote.jpg

Scale Down to the Atomic Level
Cell (Figures 9-1 and 9-2, Molecular Biology of the Cell)

The Central Dogma
The flow of genetic information in cells is from DNA to RNA to protein.
All cells, from bacteria to humans, express their genetic information in this way, a principle so fundamental that it is termed the central dogma of molecular biology.
[Diagram labels: DNA, transcription, RNA, translation, protein; retrovirus; RNA virus]

DNA and Replication
(Figure 1-2, Molecular Biology of the Cell, Fifth Edition)

From DNA (to RNA) to Protein

The Genetic Code

Genome
Definition
  – The genome of an organism is its whole hereditary information and is encoded in the form of DNA (or, for some viruses, RNA)
  – Chromosome: structure composed of a long DNA molecule and associated proteins; human has 46 chromosomes
DNA sequences can be determined by various sequencing techniques
“Sequence first. Ask questions later” (Cell. 2002 Oct 4;111(1):13-6)

Three (Super)Kingdoms
Characteristic                        Archaea   Bacteria  Eukaryotes
Predominately multicellular           No        No        Yes
DNA structure                         circular  circular  linear
Cytoplasm is compartmentalized        No        No        Yes
Introns are present in most genes     No        No        Yes
Photosynthesis with chlorophyll       No        Yes       Yes
Histone proteins present in cell      Yes       No        Yes

Organisms at Pivotal Positions in the Tree of Life
Fly: 2000
Worm: 1998
E. coli: 1997
Cell. 2002 Oct 4;111(1):13-6, Figure 1. Concepts in Phylogeny as It Relates to Comparative Genomics. (A) Tree of select organisms (large font: whole-genome sequence obtained or slated for sequencing) and the higher t represent, drawn to emphasize major innovations in our evolutionary history. Notice that there is something of an evol

Model Organisms
A model organism is a species that is extensively studied to understand particular biological phenomena, with the expectation that discoveries made in the model organism will provide insight into the workings of other organisms.
Genetic models (with short generation times, such as the fruit fly and nematode worm), experimental models, and genomic models, with a pivotal position in the tree of life

Escherichia coli (E. coli)
A common gut bacterium and the most widely used organism in molecular genetics
Some strains of E. coli are capable of causing disease under certain conditions
Different strains of E. coli have been extensively studied
Whole genomes of several E. coli strains have been sequenced (e.g., K-12, O157:H7, HS)

The Genome of E. coli K-12
Circular DNA: a single, closed loop
Protein-coding genes
RNA genes
The whole genome was sequenced in 1997
Total 4,639,221 bp
Figure 1-29 Molecular Biology of the Cell, Fifth Edition (© Garland Science 2008)

Caenorhabditis elegans
C. elegans is a eukaryote (a nematode, or roundworm)
Has a small genome (~97 megabases; whole genome sequenced in 1998)
C. elegans is easy to maintain in the laboratory (in petri dishes) and has a fast and convenient life cycle
  – The life span is 2-3 weeks
  – It is a tiny (1 mm in length) and transparent organism, and the developmental pattern of all 959 of its somatic cells has been traced
    • somatic cell: any cell of a plant or animal other than cells of the germ line (from Greek soma, body)

Caenorhabditis elegans (Cont.)
Discovery of the mechanism of RNA interference in C. elegans (1998)
  – Andrew Fire and Craig C. Mello shared the Nobel Prize in Physiology or Medicine in 2006
  – Silencing was triggered efficiently by injected dsRNA, but weakly or not at all by sense or antisense single-stranded RNAs

Drosophila melanogaster (fruit fly)
It has been used as a model organism for over 100 years and is widely used to study genetics and developmental biology
  – Small and has a simple diet
  – Short life cycle: about two weeks
  – Has large polytene chromosomes, whose barcode patterns of light and dark bands allow genes to be mapped accurately
It was chosen in 1990 as one of the model organisms to be studied under the auspices of the federally funded Human Genome Project
Whole genome sequenced in 2000
>10 Drosophila genomes have been sequenced
FlyBase: http://flybase.org/

Species Classification
Classification is the arrangement of organisms into orderly groups based on their similarities
Also known as taxonomy
Provides an accurate and uniform naming system

Linnaean System of Classification
Carolus Linnaeus (the “father of taxonomy”) introduced the first widely accepted hierarchical scheme, which consists today of 7 categories (kingdom, phylum, class, order, family, genus, and species), not including domain
Species is the most basic unit of biological classification (means “kind” in Latin)
  – Each species is different, and reproduces itself faithfully
  – Heredity is a central part of the definition of life
The Linnaean system uses two Latin name categories, genus and species, to designate each type of organism
  – Salmonella saintpaul (which caused the latest food-borne disease)
  – Capitalize the genus, but not the species; italicized in print

Homo sapiens
Domain: Eukaryotes
Kingdom: Metazoa (many-celled animal)
Phylum: Chordata (characterized by a notochord, nerve cord, and gill slits)
  (subphylum: Vertebrata)
Class: Mammalia (warm-blooded vertebrates)
Order: Primates
Family: Hominidae
Genus: Homo
Species: sapiens
Mnemonic: King Philip Came Over For Gooseberry Soup (kingdom, phylum, class, order, family, genus, species)
http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy

Gene/Protein Family
A protein/gene family is a group of evolutionarily related proteins/genes
Genes/proteins of the same family typically have similar functions (and, for proteins, similar structures) and share sequence similarity
There are far more genes/proteins than families, which shows the advantage of grouping genes/proteins into families

Evolution of Genes
New genes are generated from preexisting genes
  – Intragenic mutation (a gene is modified by changes in its DNA sequence, such as errors during DNA replication)
  – Gene duplication: the two copies of a gene may then diverge in the course of evolution
  – Segment shuffling
  – Horizontal transfer

More on what’s bioinformatics

Analysis of Gene/Protein Families: Key Problems in Bioinformatics
Homolog detection
Alignment (the residue-level mapping among homologous genes/proteins)
Application of the alignments
  – Detect the conserved residues (functional sites)
  – Prediction of protein structures
  – Motif finding (cis-elements)
Phylogeny
Function annotation
None of these problems has been solved!

Is Protein A Related/Similar to Protein B?
Sequence similarity (alignment!)
Structure similarity (structural comparison)
Co-expression (microarray data analysis)
Any type of correlation (operon structure, etc.)
You will see this question again and again!
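
To make the "sequence similarity (alignment!)" question concrete, here is a minimal sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch approach). It is an illustration only, not the method used in the course: the function names and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for readability, and real protein comparisons normally use substitution matrices and more careful gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not recommended defaults.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both sequences share a residue."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


score, x, y = needleman_wunsch("ACGT", "ACT")
print(score, x, y, percent_identity(x, y))  # 1 ACGT AC-T 75.0
```

Percent identity over the alignment is only one crude answer to "is A similar to B?"; the other lines of evidence on the slide (structure, co-expression, genomic context) need different data and different methods.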

Guilty by Association

Computational Abstractions: Biological Sequences as Strings
[Diagram labels: DNA, RNA, Protein; “A string in a four-letter alphabet”; Phylotype]

Computational Abstractions: Networks (and Others) as Graphs
Protein-protein interaction network
Protein structures presented as graphs
Gene functions presented as graphs (Gene Ontology)
Metabolic pathways as graphs (directed)

More than Implementation
Find old/new biological problems
  – Remember biology has become a large source of new algorithmic and statistical problems
Formulate as a computational problem
  – Define inputs and outputs
  – (though there are many papers that work on well-defined bioinformatics problems)
Apply existing algorithms and/or tools to solve your problem
Develop new ones if necessary
Implement your algorithms with appropriate programming language(s)

Where Can I Get the Biological Data?
Sequences
  – NCBI GenBank
  – Swiss-Prot
Structures
  – PDB
Genomes
  – NCBI, IMG, GOLD
  – Specialized genome resources
    • Ensembl: selected eukaryotic genomes; not true anymore, as release 19 (July 2013) includes a total of 6440 genomes!
Others
  – KEGG, NCBI SRA, etc.

Dealing with Databases
Databases are the backbone of bioinformatics research
Flat files were the first type of database and are still used today
Relational databases are good for searching purposes
Databases can contain data and annotations of data
  – Primary and derived (secondary) data

Buzz Word: Big Data
“Big data is new and “ginormous” and scary – very, very scary. No, wait. Big data is just another name for the same old data marketers have always used, and it’s not all that big, and it’s something we should be embracing, not fearing. No, hold on. That’s not it, either. What I meant to say is that big data is as powerful as a tsunami, but it’s a deluge that can be controlled . . . in a positive way, to provide business insights and value. Yes, that’s right, isn’t it?”
Ref: http://www.forbes.com/sites/lisaarthur/2013/08/15/what-is-big-data/

Biologists Join Big-data Club
“Biologists are joining the big-data club. With the advent of high-throughput genomics, life scientists are starting to grapple with massive data sets, encountering challenges with handling, processing and moving information that were once the domain of astronomers and high-energy physicists”
“Much of the construction in big-data biology is virtual, focused on cloud computing — in which data and software are situated in huge, off-site centres that users can access on demand, so that they do not need to buy their own hardware and maintain it on site.”
Biology: The big challenges of big data, Nature 498, 255–260 (13 June 2013)

Big Data 2 Big Knowledge (BD2K)
“I’m talking enormous quantities—think tera-, peta-, and even exa-bytes. The challenge presented by this revolution is the need to develop and implement hardware and software that can store, retrieve, and analyze this mountain of complex data—and transform it into knowledge that can improve our understanding of human health and disease.”
A post by Dr. Francis Collins (July 23, 2013; NIH Director’s Blog)

Different ways of doing computing
As a user, you have many choices
  – Download the tools to your local machine
  – Run the tool on a supercomputer
    • Yes, IU has several powerful supercomputers (newest addition is BigRedII).
  – Use a web server
  – Use a Cloud
    • Galaxy
  – An app on your smart phone?
    • See a survey at https://cs.wmich.edu/elise/courses/cs603-bio/SII-12/Presentation1-Jason.pdf
Similarly, as a developer, you also have many choices

Readings
Biology primer (available at the course website)
Anything about Python and/or C (if you have no programming experience at all)
Biology: The big challenges of big data
What’s in the textbook?
  – Chapter 1 (The Nucleic Acid World)
  – Chapter 2 (Protein Structure)
  – Chapter 3 (Dealing With Databases)
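
As a warm-up for the Chapter 1 reading, the "biological sequences as strings" abstraction and the central dogma from the biology primer can be tried directly in Python. The sketch below is illustrative only: the function names are hypothetical, the input is assumed to be the coding (sense) strand of DNA written 5' to 3' in upper case, and the table encodes the standard genetic code (some organisms and organelles use variants).

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((x + y + z for x in BASES for y in BASES for z in BASES), AMINO_ACIDS)
)

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(dna):
    """Coding-strand DNA to mRNA: same sequence with T replaced by U."""
    return dna.replace("T", "U")


def translate(dna):
    """Translate coding-strand DNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


gene = "ATGGCCTAA"
print(transcribe(gene))          # AUGGCCUAA
print(translate(gene))           # MA
print(reverse_complement(gene))  # TTAGGCCAT
```

Treating DNA as a plain string over a four-letter alphabet is what lets ordinary string operations (replace, slice, reverse) stand in for transcription and translation in this toy setting.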