
BIOINFORMATICS
CSC 2500 Survey of Information Science

Katherine W. McCain
Associate Dean and Professor
College of Information Science & Technology
Drexel University

Overview of Today’s Session
• What is Bioinformatics
• Bioinformatics History and Growth
• The BIO in Bioinformatics
• The INFO in Bioinformatics

What is Bioinformatics?
• Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. (NCBI)
• Bioinformatics is a combination of Computer Science, Information Technology, and Genetics to determine and analyse genetic information (lead editorial, BITS Journal)

What is Bioinformatics?
• Bioinformatics is conceptualizing biology in terms of macromolecules…and then applying “informatics” techniques…to understand and organize the information associated with these molecules, on a large scale (Gerstein’s lab at Yale)
• Bioinformatics is the use of computers to aid in extracting/processing biological information from a large set of data points (Brad Jameson, MCP Hahnemann University)
• Bioinformatics is the generation, visualization, analysis, storage and retrieval of large quantities of biological information (Michael Agostino)

What is Bioinformatics?
Bioinformatics is the application of computer science to the management and analysis of biological information. In genome projects, bioinformatics includes:
• the development of methods to search databases quickly,
• the creation of algorithms to analyse DNA and protein sequences, and
• the creation of algorithms to predict gene structure from DNA sequences
(adapted from DoubleTwist tutorial)

More Definitions (from Gibas & Jambeck)
• Computational Biology: the application of quantitative analytical techniques in modeling biological systems.
• Informatics: the representation, organization, manipulation, distribution, maintenance, and use of information, particularly in digital form.
• Bioinformatics is frequently considered a subset of Computational Biology

Subdisciplines of Bioinformatics
• the development of new algorithms and statistics with which to assess relationships among members of large data sets;
• the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and
• the development and implementation of tools that enable efficient access and management of different types of information.

A “Map” of Bioinformatics
From the website of Cynthia Gibas, Ph.D., faculty member at VaTech and author of a good book on Computational Methods for Bioinformatics
http://gibas.biotech.vt.edu/

Brief History of Bioinformatics (focusing on the molecular end)
1956     Sanger sequences first protein: bovine insulin
1959     Journal of Molecular Biology published Vol. 1
1966     Holley et al sequence first nucleic acid: yeast alanine tRNA
1967     Dayhoff publishes Atlas of Protein Sequences and Structure
1972     Protein Databank established: X-ray crystallographic protein structures
1974     Nucleic Acids Research published Vol. 1
1975     Sanger & Coulson publish technique for sequencing DNA (cited in ~750 articles)
1977     Sanger publishes nucleotide sequence of bacteriophage øX174

Brief History of Bioinformatics (continued)
1981     GenBank, EMBL, DDBJ established
1985     Computer Applications in the Biosciences (now Bioinformatics) published Vol. 1
1987     SWISSPROT protein sequence database established
1990     Human Genome Project begun (DoE, NIH, many labs in US and elsewhere)
1994-95  Craig Venter establishes TIGR; sequences H. influenzae using “shotgun approach”
1996     Affymetrix produces the first commercial DNA chip
1998     Craig Venter establishes Celera (for-profit company)
2001     Human genome is “published” (DoE/NIH and Celera)

Growth of the Bioinformatics Literature in Medline
[Figure: chart with yearly ticks from 1965 to 2001 on the horizontal axis and a vertical axis running from 0 to 7000; the points for 1967, 1983, and 2001 are labeled.]

The three institutions (GenBank, European Molecular Biology Laboratory, and DNA Data Bank of Japan) together have contributed 100 gigabases to GenBank. These 100,000,000,000 bases, or "letters" of the genetic code, represent both individual genes and partial and complete genomes of over 165,000 organisms. One hundred billion bases is about equal to the number of nerve cells in a human brain and a bit less than the number of stars in the Milky Way.

MAJOR CHALLENGES FOR BIOINFORMATICS (Mike Agostino)
• Data growth
• Data volume (disk space, memory)
• Data Retrieval
• Data Reduction and Visualization
• Data Integration
• Rapidly changing field
• New field: tools could be better

The BIO in Bioinformatics

Typical Animal Cell Structure
http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookCELL2.html

Where is the genetic information?
In the nucleus of the cell (in eukaryotes), in the form of long double strands of DNA that are coiled and bundled into chromosomes.

The Mammalian Cell Nucleus
http://spectorlab.cshl.edu/domains.html

The Genome: Organizational Hierarchy
• Genome: The total set of genes carried by an individual or cell.
• Chromosome: The DNA of eukaryotes is subdivided into chromosomes, each of which has a long length of DNA associated with various proteins. Each chromosome has a characteristic length and banding pattern.
• Humans have 23 pairs of chromosomes in the nucleus
• Drosophila has 4 pairs of chromosomes in the nucleus
http://homepages.uel.ac.uk/V.K.Sieber/human.htm
http://www.bio.psu.edu/flylabs/karotype.htm

Where ELSE is the genetic information?
In the mitochondrion, in the form of a loop of mtDNA. There are MANY mitochondria per cell (vs only one nucleus & 1 set of chromosomes). mtDNA has important forensic uses (maternal inheritance means that relatives can provide reference samples). See the website of Mitotyping Technologies (a company that specializes in mtDNA forensics) for an overview.

The Mitochondrion (1)
Fawcett, A Textbook of Histology, Chapman and Hall, 12th edition, 1994

The Mitochondrion (2)
About 90 percent of the energy needed by the body's tissues is made by the mitochondria, which store it in a molecule known as ATP, the end result of a long chain of chemical events. ATP then has to be carried out of the mitochondria into the main part of the cell.

The Mitochondrion (3)
From: MITOCHONDRIAL and METABOLIC DISORDERS - a primary care physician’s guide, “The Spectrum of Mitochondrial Disease”. Robert K. Naviaux, MD, PhD
http://biochemgen.ucsd.edu/mmdc/ep-3-10.pdf

How many genes?
Organism        Genome size   Date   # Genes
E. coli         4.6 Mb        1997   4,200
Yeast           12.1 Mb       1996   6,034
Nematode        97 Mb         1998   12,099
Mustard plant   100 Mb        2000   25,000
Fruit Fly       137 Mb        2000   13,061
Human           3000 Mb       2001   39,000

Chromosome Features

The Central Dogma
DNA → RNA → Protein
• Replication: DNA strand unwinds and new matching strands are constructed on the unwound templates
• Transcription: DNA strand unwinds and a single strand of RNA is constructed on one of the unwound templates
• Translation: the sequence of nucleotides in the RNA strand is “translated” into a sequence of amino acids that are joined to make a protein
It’s really much more complicated than this……

Kinds of Nucleic Acids
DNA (deoxyribonucleic acid): double stranded molecule built from 4 kinds of nucleotides (adenine, guanine, cytosine, thymine) and a backbone of sugar/phosphate molecules.
RNA (ribonucleic acid): single stranded molecule built from 4 kinds of nucleotides (adenine, guanine, cytosine, uracil)
• messenger RNA (mRNA): carries the genetic information as a sequence of codons (3 nucleotide sequences) from the nucleus to the cytoplasm
• ribosomal RNA (rRNA): RNA in a cell organelle (ribosome) that binds to both mRNA and tRNA; the ribosome is the site of protein synthesis
• transfer RNA (tRNA): many kinds. Each binds to a specific amino acid
Complementary DNA (cDNA) is DNA that has been “reverse engineered” by building a DNA strand on a template of mRNA taken from the cytoplasm

The Genome: Organizational Hierarchy
DNA: The genetic material of all cells and many viruses. A polymer of nucleotides. Each nucleotide consists of a sugar and a phosphate group (the “backbone”) linked to one of four bases: adenine, cytosine, guanine or thymine. Two complementary strands are wound in a right-handed helix and held together by hydrogen bonds between complementary base pairs. The sequence of bases encodes genetic information.
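The Central Dogma steps and the base-pairing rule described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the short example sequence are assumptions made up for the sketch, and it ignores everything the slides flag as "much more complicated" (introns, control segments, start-codon search). The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G order.
# "*" marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Base pairing: adenine with thymine, guanine with cytosine.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(DNA_COMPLEMENT)[::-1]


def transcribe(coding_strand: str) -> str:
    """Transcription: the mRNA matches the coding strand with U in place of T."""
    return coding_strand.replace("T", "U")


def translate(mrna: str) -> str:
    """Translation: read codons (3 nucleotides each) until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop reading signal
            break
        protein.append(amino_acid)
    return "".join(protein)


# Illustrative sequence (hypothetical, chosen only for the example).
coding_dna = "ATGGTGCATCTGACTCCTGAGGAGTAA"
mrna = transcribe(coding_dna)
print(mrna)                            # AUGGUGCAUCUGACUCCUGAGGAGUAA
print(translate(mrna))                 # MVHLTPEE
print(reverse_complement("ATGC"))      # GCAT
```

Each amino acid is shown by its one-letter code, so the printed protein string has one letter per codon read before the stop signal.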
http://www.accessexcellence.org/AB/GG/dna_molecule.html

The Genome: Organizational Hierarchy
We speak of “base pairs” because DNA is double-stranded, with adenine pairing with thymine and guanine pairing with cytosine. The strands are directional and complementary: they run in opposite directions, so one strand is read starting from the “top” and the other starting from the “bottom.” Sequences are always read from the 5’ end to the 3’ end (the two “connectors” of the sugar backbone).
http://www.accessexcellence.org/AB/GG/dna_molecule.html

DNA Structure & Replication
Check out the RealPlayer video: http://www.ucsd.tv/sciencematters/lesson1-col.shtml

From Gene to Protein

More Complexity…
The genetic information transcribed from the DNA strand to the RNA strand is a “gene” plus: a “transcription unit.” It consists of several parts:
• control segments (e.g. start reading here, stop reading here)
• introns: internal noncoding regions of the RNA transcript that are removed
• exons: the parts of the RNA transcript that contain the information to code for a protein strand. The pieces are “spliced” back together after the introns are removed
Thus a cDNA strand that has been built (reverse engineered) on the mRNA exon sequence has only part of the information in the chromosomal DNA strand that was the “original” template.

Two Views of the Gene

The Genome: Organizational Hierarchy
Gene: Specific segments of DNA that control cell structure and function; the functional units of inheritance. A sequence of DNA bases usually codes for a polypeptide sequence of amino acids: 3 nucleotides → one amino acid (or a start or stop reading signal)

A sequence-level view of a gene: the mRNA sequence of beta globin
Depicted on the next page is the sequence of RNA bases in the transcript of the human beta-globin gene. This is the “message” that goes from the DNA strand in the nucleus (the information on the gene) to the ribosome (rRNA) in the cytoplasm.
The letters in magenta are the two introns, which are removed from the transcript in the maturation process resulting in mRNA. The letters in blue are the bases at either end of the introns (GU...AG), which are used as "endpoints" in the splicing process.

Proteins
• Proteins are linear polymers (chains or strings) of amino acids joined by peptide bonds in a specific sequence.
• They carry out most functional activities in the cell; major classes of proteins include enzymes, hormones, receptors (binding sites for signaling molecules) and antibodies.
• 3D coordinates (e.g. determined by x-ray crystallography or NMR) are stored in databases
• 3D structure is visualized using programs such as MolScript & RasMol.

Protein Structure
Each protein chain is made up of a string of amino acids. Different amino acids have different properties (some attract each other, some repel), and these give rise to the higher level structures
http://www.genome.gov/Pages/Hyperion//DIR/VIP/Glossary/Illustration/protein.shtml
http://www.yangene.com/images/protein1c.jpg

Quaternary structure of Hemoglobin
The hemoglobin molecule is composed of 4 subunits (individual amino acid, or polypeptide, chains), each with an iron-containing heme group tucked inside
http://www.uic.edu/classes/bios/bios100/lectf03am/hemoglobin.jpg

Protein Synthesis

Mutations
Mutations ultimately derive from incorrect, un-repaired DNA replication
• Point mutations (Single Nucleotide Polymorphisms): changes in a single base of the coding triplet. May or may not have an effect on amino acid coding (because of code redundancy).
• Frameshift mutations can occur: inserting or deleting 1 base shifts ALL the codons that follow
• Segmental mutation: larger scale changes in the sequence within a single chromosome (insertion, deletion, inversion, duplication of longer nucleotide sequences)
• Chromosome duplication, deletion (e.g. Down syndrome)

Sickle Cell Disease is a genetic disorder involving a change in a single DNA nucleotide.
The change in nucleotide alters the hemoglobin molecule (protein), and the result is a change in the shape of the red blood cell. This affects the ability of the red blood cells to move through the blood vessels and, ultimately, blood flow is reduced and tissue damage results
http://www.nlm.nih.gov/medlineplus/ency/imagepages/1212.htm

A change in nucleotide sequence = change in mRNA = change in amino acid = change in shape of hemoglobin
imiloa.wcc.hawaii.edu/.../present/lcture17/sld022.htm

Reading the Genetic Code
The goal of the various genome projects has been to chart the sequence of nucleotides in the chromosomes of various organisms (humans, yeasts, fruit flies, mice, etc.) and to connect the genetic information with other biological information: protein structure → function, physiological processes, diseases and drug treatment, etc.

How do you sequence the genome?
Basically, you need to cut up the chromosomal DNA into pieces that are short enough (e.g. 2K to 150K base pairs depending on method), determine the nucleotide sequence of each piece, and then figure out how to string the pieces back together in the proper order. This is aided by computer processing.
http://nema.cap.ed.ac.uk/teaching/genomics/Genomics3.html

Random Sequencing Strategy
• Randomly chunk the entire genome into pieces
• Make multiple copies of the pieces
• Sequence each piece
• Look for sequence overlap to put all the sequences in order (automated)
Celera assembled 27 million sequence chunk records using the most powerful non-military computer in the world.
There are still lots of gaps AND we don’t know what most of the sequences represent in terms of genes and their expression

Expressed Sequence Tag
It is possible to “reverse engineer” a gene by working backwards from the mRNA to a strand of DNA with the complementary base sequence (cDNA). A partial sequence derived from cDNA is called an Expressed Sequence Tag.
It may or may not represent the complete original genetic message for a protein, and it certainly does not represent the complete gene as it existed in the nuclear DNA (only exons are present). ESTs DO represent genes that are active in a particular cell at a particular time, as evidenced by high levels of mRNA production. ESTs can be used to identify genes because they will hybridize to (match up with) known DNA sequences.

What about ESTs?
• If you are looking at liver cells, you will be able to study the genes that are active in the liver, though you are likely to end up with lots of ESTs for abundant messages and few if any copies of the rare messages
• Only a subset of all the genes are turned on in liver cells, and only a subset of the liver genes may be active in your sample.
• So you have to sequence a LOT of ESTs AND figure out how they fit together to approximate the genome of interest.

The INFO in Bioinformatics
Major organizations:
• Public/non-profit: NCBI, EBI, TIGR
• Private sector: Incyte Pharm, Millennium Pharm, Affymetrix
Databases: sequences (e.g. genes, proteins, ESTs), structures, images, biological & medical info, publications, etc., for single organisms or broad collections
Software: database entry, searching, sequence alignment, pattern recognition, clustering, tree-building, mapping, visualization

Computers & Biology: perfect together
• Collecting & processing signals detected by lab equipment
• Tracking samples and managing experiments (industrial strength)
• Storing, searching, retrieving data in public databases (e.g. GenBank, PubMed)
• Data mining in large data collections: looking for rules and patterns
• Annotation: assigning functional meaning to uncharacterized data and linking different data collections
• Simulation of biological systems at all levels (interacting proteins to interacting populations)
Does all of this “count” as BIOINFORMATICS?
DATABASE ISSUES
• Primary (archival) vs secondary (curated)
• Public access vs private/fee-based access
• Cross-organism vs single organism
• Flat file/relational/OODB
• Federated? Combined in a Data Warehouse?
• Annotation (metadata, verbal indexing of known or predicted information about the gene or protein)
• Vocabulary standards (or lack of them)

GenBank
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA (and RNA) sequences. There were approximately 37,893,844,733 bases in 32,549,400 sequence records as of Feb 2004. (And 100 gigabases in Aug 2005.)
GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at the NCBI (National Center for Biotechnology Information). These three organizations exchange data on a daily basis.

GenBank
• Many journals require submission of sequence information to a database prior to publication so that an accession number may appear in the paper.
• Counterpart databases for proteins include the Protein Data Bank (sequence + 3D structures (X-ray, NMR)), the Protein Information Resource (protein sequences), and Swiss-Prot (curated protein sequences)

GenBank is one of a number of interrelated public databases that support bioinformatics and related research. ENTREZ is the gateway.

Growth of GenBank

Sequence Analysis
• Similarity searches: matching your sequence to those in the database (e.g. using BLAST, a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA)
• Alignment and multi-alignment of sequences
• Detection of protein coding regions
• Statistical analyses based on linguistic approaches for the identification of functional elements such as promoters, splicing sites, etc.
• Prediction of secondary structures in nucleic acids and protein sequences
• Prediction of protein tertiary structure
• Molecular evolutionary studies.

Sequence Similarity
• Genes may share high sequence similarity across their entire length.
• Genes may show sequence similarity that is limited to a certain region: some parts of a protein will be similar and other parts will be different.
• Genes may share similar motifs, meaning that they encode regions of similar amino acid sequence that aren't located right next to each other in the linear sequence of the protein. The sequence lying between these regions of similarity can be quite different. When they fold up, however, proteins sharing a motif form similar three-dimensional structures (for example, "zinc finger" or "leucine zipper" motifs).

Bioinformatics Research Sequence
By looking for genes in model organisms that are similar to a given human gene, researchers can learn about the protein the human gene encodes and search for drugs to block it. The MLH1 gene, which is associated with colon cancer in humans, is used in this example.

What are Microarrays?
Microarrays, or “gene chips,” are essentially an orderly array of samples of cDNA (with known nucleotide sequence) on a glass or nylon membrane. By exposing the samples on the chip to an unknown DNA sequence, you may be able to identify the unknown because it binds to the known cDNA “probe.” The strength of binding is generally indicated by fluorescence of the cDNA probe spots that have been targeted by the unknown sample.
Processing, storing, and analyzing the data produced are a major concern of bioinformatics. With microarrays, it is possible to study the expression of 10K genes at a time!

Applications of Microarrays
http://www.gene-chips.com/
• Gene discovery
• Disease diagnosis
• Drug discovery: Pharmacogenomics
• Toxicological research: Toxicogenomics
See http://www.ebi.ac.uk/microarray/ for a discussion of the informatics of microarrays.
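The probe-binding idea behind microarrays can be sketched as a simple string search. This is a deliberately simplified illustration in Python, not how array data are actually processed: it assumes binding means an exact match between the sample and the reverse complement of a probe, whereas real hybridization tolerates mismatches and is read out as a fluorescence intensity. The probe names and sequences are hypothetical.

```python
# Base pairing: adenine with thymine, guanine with cytosine.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the strand that would pair with `dna`, written 5' to 3'."""
    return dna.translate(DNA_COMPLEMENT)[::-1]


def hybridizing_probes(probes: dict[str, str], sample: str) -> list[str]:
    """Names of probes whose complementary sequence occurs in the sample.

    Simplifying assumption: a probe "binds" only on an exact match.
    """
    return [
        name
        for name, probe_sequence in probes.items()
        if reverse_complement(probe_sequence) in sample
    ]


# Hypothetical chip with two probe spots of known sequence.
chip = {
    "probe_A": "ATGGTGCAT",
    "probe_B": "GGGTTTCCC",
}

# Hypothetical unknown sample; it contains the complement of probe_A only.
unknown_sample = "TTATGCACCATAA"

print(hybridizing_probes(chip, unknown_sample))  # ['probe_A']
```

Because every probe sequence is known in advance, the spot that binds tells you which sequence is present in the unknown sample.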
Bioinformatics and Microarray Data
• Data management: LIMS
• Database design
• Algorithms for mining
See http://www.ebi.ac.uk/microarray/ for a discussion of the informatics of microarrays.

What are scientists doing with these new data? [stolen from Mike Agostino]
Going for the low hanging fruit
• I’ve been looking for genes related to my favorite gene
• I’ve been looking for the rest of this gene
• I know a disease maps to a particular place
Going for the high altitude fruit
• Does the overall organization of the genome tell us anything about gene expression?
• What are the functions of all these genes?
• Are there clusters of genes that are significant?
• Are any genes missing?

In Summary: Types of Data [stolen from Mike Agostino]
DNA
• Sequence
  - Genomic: BIG, largely uncharacterized
  - Genes: smaller pieces, highly characterized
  - cDNA: messenger RNA (mRNA) copies
  - ESTs: “expressed sequence tags”
  - SNPs: single nucleotide polymorphisms
• Mapping
  - Chromosomal location
Protein
• Sequence: determined or derived
• Structure: crystals or other methods reveal 3-D shape
Expression
• Where/when genes are active and how much

Questions?
www.nevtron.si/borderline/old.html
