* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Bioinformatics
Epigenetics of human development wikipedia , lookup
Polycomb Group Proteins and Cancer wikipedia , lookup
Point mutation wikipedia , lookup
Genomic library wikipedia , lookup
Human genome wikipedia , lookup
Non-coding DNA wikipedia , lookup
Minimal genome wikipedia , lookup
Designer baby wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Gene expression profiling wikipedia , lookup
Genome (book) wikipedia , lookup
Metagenomics wikipedia , lookup
Public health genomics wikipedia , lookup
History of genetic engineering wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Microevolution wikipedia , lookup
Biology and consumer behaviour wikipedia , lookup
Pathogenomics wikipedia , lookup
Genome evolution wikipedia , lookup
Helitron (biology) wikipedia , lookup
Mark Gerstein, Yale University gersteinlab.org/courses/452 (last edit in fall 2005) 1 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu BIOINFORMATICS Introduction Biological Data + Computer Calculations 2 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Bioinformatics What is Bioinformatics? Cor e • One idea for a definition? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale. • Bioinformatics is a practical discipline with many applications. 3 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • (Molecular) Bio - informatics What is the Information? Molecular Biology as an Information Science DNA -> RNA -> Protein -> Phenotype -> DNA • Molecules Sequence, Structure, Function • Processes Mechanism, Specificity, Regulation • Central Paradigm for Bioinformatics Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype • Large Amounts of Information Standardized Statistical •Most cellular functions are performed or facilitated by proteins. •Primary biocatalyst •Cofactor transport/storage •Mechanical motion/support •Genetic material •Information transfer (mRNA) •Protein synthesis (tRNA/mRNA) •Some catalytic activity •Immune protection •Control of growth/differentiation (idea from D Brutlag, Stanford, graphics from S Strobel) 4 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • Central Dogma of Molecular Biology Molecular Biology Information - DNA Coding or Not? Parse into genes? 4 bases: AGCT ~1 K in a gene, ~2 M in genome ~3 Gb Human atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca gcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatac atggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtg aaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatcca gcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattc ttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaact ggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgca ggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgt gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact gcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgca tcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacct gcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgtt gttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatc aaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacact gaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgca gacgctggtatcgcattaactgattctttcgttaaattggtatc . . . . . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaa caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtgg cgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgctt gctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgg gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact acaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaacc aatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtc ggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaa aaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg 5 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • Raw DNA Sequence Molecular Biology Information: Protein Sequence ACDEFGHIKLMNPQRSTVWY but not BJOUXZ • Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain • ~200 K known protein sequences d1dhfa_ d8dfr__ d4dfra_ d3dfr__ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF d1dhfa_ d8dfr__ d4dfra_ d3dfr__ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESI TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF d1dhfa_ d8dfr__ d4dfra_ d3dfr__ VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKA ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV d1dhfa_ d8dfr__ d4dfra_ d3dfr__ -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP -G---RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-----.IMVIGGGRVYEQFLPKA -P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV 6 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • 20 letter alphabet Molecular Biology Information: Macromolecular Structure Almost all protein (RNA Adapted From D Soll Web Page, Right Hand Top Protein from M Levitt web page) 7 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • DNA/RNA/Protein Molecular Biology Information: Protein Structure Details 200 residues/domain -> 200 CA atoms, separated by 3.8 A Avg. Residue is Leu: 4 backbone atoms + 4 sidechain atoms, 150 cubic A • => ~1500 xyz triplets (=8x200) per protein domain 10 K known domain, ~300 folds ATOM ATOM ATOM ATOM ATOM ATOM ATOM ATOM ATOM ATOM ATOM ATOM 1 2 3 4 5 6 7 8 9 10 11 12 C O CH3 N CA C O CB OG N CA C ACE ACE ACE SER SER SER SER SER SER ARG ARG ARG 0 0 0 1 1 1 1 1 1 2 2 2 9.401 10.432 8.876 8.753 9.242 10.453 10.593 8.052 7.294 11.360 12.548 13.502 30.166 30.832 29.767 29.755 30.200 29.500 29.607 30.189 31.409 28.819 28.316 29.501 60.595 60.722 59.226 61.685 62.974 63.579 64.814 63.974 63.930 62.827 63.532 63.500 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 49.88 50.35 50.04 49.13 46.62 41.99 43.24 53.00 57.79 36.48 30.20 25.54 1GKY 1GKY 1GKY 1GKY 1GKY 1GKY 1GKY 1GKY 1GKY 1GKY 1GKY 1GKY 67 68 69 70 71 72 73 74 75 76 77 78 1444 1445 1446 1447 1448 1449 1450 CB CG CD CE NZ OXT LYS LYS LYS LYS LYS LYS LYS 186 186 186 186 186 186 186 13.836 12.422 11.531 11.452 10.735 16.887 22.263 22.452 21.198 20.402 21.104 23.841 57.567 58.180 58.185 56.860 55.811 56.647 1.00 1.00 1.00 1.00 1.00 1.00 55.06 53.45 49.88 48.15 48.41 62.94 1GKY1510 1GKY1511 1GKY1512 1GKY1513 1GKY1514 1GKY1515 1GKY1516 ... ATOM ATOM ATOM ATOM ATOM ATOM TER 8 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • Statistics on Number of XYZ triplets Molecular Biology Information: Whole Genomes Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Venter, J. C. (1995). "Wholegenome random sequencing and assembly of Haemophilus influenzae rd." Genome sequence now Science 269: 496-512. accumulate so quickly that, (Picture adapted from TIGR website, in less than a week, a single http://www.tigr.org) laboratory can produce • Integrative Data more bits of data than 1995, HI (bacteria): 1.6 Mb & 1600 genes done Shakespeare managed in a 1997, yeast: 13 Mb & ~6000 genes for yeast lifetime, although the latter 1998, worm: ~100Mb with 19 K genes make better reading. Small, K. V., Fraser, C. M., Smith, H. O. & 1999: >30 completed genomes! 2003, human: 3 Gb & 100 K genes... -- G A Pekso, Nature 401: 115-116 (1999) 9 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • The Revolution Driving Everything Bacteria, 1.6 Mb, ~1600 genes [Science 269: 496] 1997 Eukaryote, 13 Mb, ~6K genes [Nature 387: 1] Genomes highlight the Finiteness of the “Parts” in Biology 1998 real thing, Apr ‘00 Animal, ~100 Mb, ~20K genes [Science 282: 1945] 2000? Human, ~3 Gb, ~100K genes [???] ‘98 spoof 10 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu 1995 Young/Lander, Chips, Abs. Exp. Brown, marray, Rel. Exp. over Timecourse Also: SAGE; Samson and Church, Chips; Aebersold, Protein Expression Snyder, Transposons, Protein Exp. 11 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Gene Expression Datasets: the Transcriptome Yeast Expression Data in Academia: levels for all 6000 genes! Can only sequence genome once but can do an infinite variety of these array experiments at 10 time points, 6000 x 10 = 60K floats telling signal from background (courtesy of J Hager) 12 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Array Data Systematic Knockouts Winzeler, E. A., Shoemaker, D. D., Astromoff, A., Liang, H., Anderson, K., Andre, B., Bangham, R., Benito, R., Boeke, J. D., Bussey, H., Chu, A. M., Connelly, C., Davis, K., Dietrich, F., Dow, S. W., El Bakkoury, M., Foury, F., Friend, S. H., Gentalen, E., Giaever, G., Hegemann, J. H., Jones, T., Laub, M., Liao, H., Davis, R. W. & et al. (1999). Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901-6 2 hybrids, linkage maps Hua, S. B., Luo, Y., Qiu, M., Chan, E., Zhou, H. & Zhu, L. (1998). Construction of a modular yeast twohybrid cDNA library from human EST clones for the human genome protein linkage map. Gene 215, 143-52 For yeast: 6000 x 6000 / 2 ~ 18M interactions 13 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Other WholeGenome Experiments • Information to understand genomes Metabolic Pathways (glycolysis), traditional biochemistry Regulatory Networks Whole Organisms Phylogeny, traditional zoology Environments, Habitats, ecology The Literature (MEDLINE) • The Future.... (Pathway drawing from P Karp’s EcoCyc, Phylogeny from S J Gould, Dinosaur in a Haystack) 14 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Molecular Biology Information: Other Integrative Data • (Molecular) Bio - informatics • One idea for a definition? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale. • Bioinformatics is a practical discipline with many applications. 15 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu What is Bioinformatics? GenBank Growth Year 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 GenBank Data Base Pairs Sequences 680338 606 2274029 2427 3368765 4175 5204420 5700 9615371 9978 15514776 14584 23800000 20579 34762585 28791 49179285 39533 71947426 55627 101008486 78608 157152442 143492 217102462 215273 384939485 555694 651972984 1021211 1160300687 1765847 2008761784 2837897 3841163011 4864570 8604221980 7077491 16 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Large-scale Information: Internet Hosts As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial (Internet picture adapted from D Brutlag, Stanford) Num. Protein Domain Structures Structures in PDB • Driving Force in Bioinformatics 1979 1981 1983 1985 1987 1989 1991 4500 4000 3500 3000 2500 2000 1500 1000 500 0 1980 1993 1995 140 120 100 80 60 40 20 0 1985 1990 1995 CPU Instruction Time (ns) • CPU vs Disk & Net 17 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Large-scale Information: Explonential Growth of Data Matched by Development of Computer Technology 3000 2500 2000 Per Year Cumulative 1500 1000 500 0 1998 2000 2002 2004 18 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Number of Papers PubMed publications with title “microarray” Features per chip transistors oligo features 19 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Features per Slide (courtesy of Finn Drablos) 20 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Bioinformatics is born! 21 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Weber Cartoon • (Molecular) Bio - informatics • One idea for a definition? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale. • Bioinformatics is a practical discipline with many applications. 22 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu What is Bioinformatics? • Different Sequences Have the Same Structure • Organism has many similar genes • Single Gene May Have Multiple Functions • Genes are grouped into Pathways • Genomic Sequence Redundancy due to the Genetic Code • How do we find the similarities?..... Cor e Integrative Genomics genes structures functions pathways expression levels regulatory systems …. 23 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Organizing Molecular Biology Information: Redundancy and Multiplicity 24 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Molecular Parts = Conserved Domains, Folds, &c Extra 25 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu A Parts List Approach to Bike Maintenance How many roles can these play? How flexible and adaptable are they mechanically? What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)? What are the common parts - types of parts (nuts & washers)? Extra Where are the parts located? 26 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu A Parts List Approach to Bike Maintenance Total in Databank New Submissions New Folds 27 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Vast Growth in (Structural) Data... but number of Fundamentally New (Fold) Parts Not Increasing that Fast World of Structures is even more Finite, providing a valuable simplification 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 … ~100000 genes ~1000 folds (human) (T. pallidum) 1 2 3 4 5 6 7 8 9 10 11 Same logic for pathways, functions, sequence families, blocks, motifs.... Global Surveys of a Finite Set of Parts from Many Perspectives Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from, ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related resources: COGS, ProDom, Pfam, Blocks, Domo, WIT, CATH, Scop.... 12 13 14 15 … ~1000 genes 28 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu 1 • (Molecular) Bio - informatics • One idea for a definition? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale. • Bioinformatics is a practical discipline with many applications. 29 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu What is Bioinformatics? • Databases Building, Querying Object DB • Text String Comparison Text Search 1D Alignment Significance Statistics Alta Vista, grep • Finding Patterns AI / Machine Learning Clustering Datamining • Geometry Robotics Graphics (Surfaces, Volumes) Comparison and 3D Matching (Vision, recognition) • Physical Simulation Newtonian Mechanics Electrostatics Numerical Algorithms Simulation 30 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu General Types of “Informatics” techniques in Bioinformatics Bioinformatics as New Paradigm for Scientific Computing Prediction based on physical principles EX: Exact Determination of Rocket Trajectory Emphasizes: Supercomputer, CPU • Biology Cor e Classifying information and discovering unexpected relationships EX: Gene Expression Network Emphasizes: networks, “federated” database 31 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • Physics Vs. Chemical Understanding, Mechanism, Molecular Biology 32 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Statistical Analysis vs. Classical Physics Bioinformatics, Genomic Surveys • Finding Genes in Genomic DNA introns exons promotors • Characterizing Repeats in Genomic DNA Statistics Patterns • Duplications in the Genome Large scale genomic alignment 33 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Bioinformatics Topics -Genome Sequence non-exact string matching, gaps How to align two strings optimally via Dynamic Programming Local vs Global Alignment Suboptimal Alignment Hashing to increase speed (BLAST, FASTA) Amino acid substitution scoring matrices • Multiple Alignment and Consensus Patterns How to align more than one sequence and then fuse the result in a consensus representation Transitive Comparisons HMMs, Profiles Motifs Bioinformatics Topics -Protein Sequence • Scoring schemes and Matching statistics How to tell if a given alignment or match is statistically significant A P-value (or an e-value)? Score Distributions (extreme val. dist.) Low Complexity Sequences • Evolutionary Issues Rates of mutation and change 34 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • Sequence Alignment • Secondary Structure “Prediction” via Propensities Neural Networks, Genetic Alg. Simple Statistics TM-helix finding Assessing Secondary Structure Prediction • Tertiary Structure Prediction Fold Recognition Threading Ab initio • Function Prediction Active site identification • Structure Prediction: Protein v RNA • Relation of Sequence Similarity to Structural Similarity 35 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Bioinformatics Topics -Sequence / Structure • Basic Protein Geometry and Least-Squares Fitting Distances, Angles, Axes, Rotations • Calculating a helix axis in 3D via fitting a line LSQ fit of 2 structures Molecular Graphics • Calculation of Volume and Surface How to represent a plane How to represent a solid How to calculate an area Docking and Drug Design as Surface Matching Packing Measurement • Structural Alignment Aligning sequences on the basis of 3D structure. DP does not converge, unlike sequences, what to do? Other Approaches: Distance Matrices, Hashing Fold Library 36 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Topics -- Structures Keys, Foreign Keys SQL, OODBMS, views, forms, transactions, reports, indexes Joining Tables, Normalization • Natural Join as "where" selection on cross product • Array Referencing (perl/dbm) Forms and Reports Cross-tabulation • Protein Units? What are the units of biological information? • sequence, structure • motifs, modules, domains How classified: folds, motions, pathways, functions? Topics -Databases • Clustering and Trees Basic clustering • UPGMA • single-linkage • multiple linkage Other Methods • Parsimony, Maximum likelihood Evolutionary implications • Visualization of Large Amounts of Information • The Bias Problem sequence weighting sampling 37 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu • Relational Database Concepts and how they interface with Biological Information • Expression Analysis Time Courses clustering Measuring differences Identifying Regulatory Regions • Large scale cross referencing of information • Function Classification and Orthologs • The Genomic vs. Singlemolecule Perspective • Genome Comparisons Ortholog Families, pathways Large-scale censuses Frequent Words Analysis Genome Annotation Trees from Genomes Identification of interacting proteins • Structural Genomics Folds in Genomes, shared & common folds Bulk Structure Prediction • Genome Trees 38 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Topics -- Genomics • Molecular Simulation Geometry -> Energy -> Forces Basic interactions, potential energy functions Electrostatics VDW Forces Bonds as Springs How structure changes over time? • How to measure the change in a vector (gradient) Molecular Dynamics & MC Energy Minimization • • • • Parameter Sets Number Density Poisson-Boltzman Equation Lattice Models and Simplification 39 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Topics -- Simulation 40 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Bioinformatics Spectrum • (Molecular) Bio - informatics • One idea for a definition? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale. • Bioinformatics is a practical discipline with many applications. 41 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu What is Bioinformatics? Cor e • Understanding How Structures Bind Other Molecules (Function) • Designing Inhibitors • Docking, Structure Modeling (From left to right, figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center). 42 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Major Application I: Designing Drugs 43 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Major Application II: Finding Homologs Cor e • Find Similar Ones in Different Organisms • Human vs. Mouse vs. Yeast Easier to do Expts. on latter! (Section from NCBI Disease Genes Database Reproduced Below.) Best Sequence Similarity Matches to Date Between Positionally Cloned Human Genes and S. cerevisiae Proteins Human Disease MIM # Human Gene GenBank BLASTX Acc# for P-value Human cDNA Yeast Gene GenBank Yeast Gene Acc# for Description Yeast cDNA Hereditary Non-polyposis Colon Cancer Hereditary Non-polyposis Colon Cancer Cystic Fibrosis Wilson Disease Glycerol Kinase Deficiency Bloom Syndrome Adrenoleukodystrophy, X-linked Ataxia Telangiectasia Amyotrophic Lateral Sclerosis Myotonic Dystrophy Lowe Syndrome Neurofibromatosis, Type 1 120436 120436 219700 277900 307030 210900 300100 208900 105400 160900 309000 162200 MSH2 MLH1 CFTR WND GK BLM ALD ATM SOD1 DM OCRL NF1 U03911 U07418 M28668 U11700 L13943 U39817 Z21876 U26455 K00065 L19268 M88162 M89914 9.2e-261 6.3e-196 1.3e-167 5.9e-161 1.8e-129 2.6e-119 3.4e-107 2.8e-90 2.0e-58 5.4e-53 1.2e-47 2.0e-46 MSH2 MLH1 YCF1 CCC2 GUT1 SGS1 PXA1 TEL1 SOD1 YPK1 YIL002C IRA2 M84170 U07187 L35237 L36317 X69049 U22341 U17065 U31331 J03279 M21307 Z47047 M33779 DNA repair protein DNA repair protein Metal resistance protein Probable copper transporter Glycerol kinase Helicase Peroxisomal ABC transporter PI3 kinase Superoxide dismutase Serine/threonine protein kinase Putative IPP-5-phosphatase Inhibitory regulator protein Choroideremia Diastrophic Dysplasia Lissencephaly Thomsen Disease Wilms Tumor Achondroplasia Menkes Syndrome 303100 222600 247200 160800 194070 100800 309400 CHM DTD LIS1 CLC1 WT1 FGFR3 MNK X78121 U14528 L13385 Z25884 X51630 M58051 X69208 2.1e-42 7.2e-38 1.7e-34 7.9e-31 1.1e-20 2.0e-18 2.1e-17 GDI1 SUL1 MET30 GEF1 FZF1 IPL1 CCC2 S69371 X82013 L26505 Z23117 X67787 U07163 L36317 GDP dissociation inhibitor Sulfate permease Methionine metabolism Voltage-gated chloride channel Sulphite resistance protein Serine/threoinine protein kinase Probable copper transporter 44 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Major Application II: Finding Homologues • • • • Cross-Referencing, one thing to another thing Sequence Comparison and Scoring Analogous Problems for Structure Comparison Comparison has two parts: (1) (2) Optimally Aligning 2 entities to get a Comparison Score Assessing Significance of this score in a given Context • Integrated Presentation Align Sequences Align Structures Score in a Uniform Framework 45 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Major Application II: Finding Homologues (cont.) • Overall Occurrence of a Certain Feature in the Genome e.g. how many kinases in Yeast • Compare Organisms and Tissues Expression levels in Cancerous vs Normal Tissues • Databases, Statistics (Clock figures, yeast v. Synechocystis, adapted from GeneQuiz Web Page, Sander Group, EBI) 46 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Cor Major Application I|I: e Overall Genome Characterization EX-2: Occurrence of 1-4 salt bridges in genomes of thermophiles v mesophiles 0.70 EK(3) EK(4) 0.60 0.50 LOD value EX-1: Occurrence of functions per fold & interactions per fold over all genomes 0.40 0.30 0.20 0.10 0.00 MP MG EC SC HP 10 to 45 Mesophile SS HI MT MJ AF AA OT 65 85 83 95 98 Thermophile Physiological temperature in C 47 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu What do you get from large-scale datamining? Global statistics on the population of proteins (Bioinfo-1) [next class joins intro & seqs.] 48 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu End of class 2005,10.23 • (Molecular) Bio - informatics • One idea for a definition? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale. • Bioinformatics is a practical discipline with many applications. 49 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu What is Bioinformatics? • Digital Libraries Automated Bibliographic Search of the biological literature and Textual Comparison Knowledge bases for biological literature • Motif Discovery Using Gibb's Sampling • Methods for Structure Determination Computational Crystallography • Refinement NMR Structure Determination • Distance Geometry • Metabolic Pathway Simulation • The DNA Computer 50 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Are They or Aren’t They Bioinformatics? (#1) • (YES?) Digital Libraries Automated Bibliographic Search and Textual Comparison Knowledge bases for biological literature • (YES) Motif Discovery Using Gibb's Sampling • (NO?) Methods for Structure Determination Computational Crystallography • Refinement NMR Structure Determination • (YES) Distance Geometry • (YES) Metabolic Pathway Simulation • (NO) The DNA Computer 51 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Are They or Aren’t They Bioinformatics? (#1, Answers) • Gene identification by sequence inspection Prediction of splice sites • DNA methods in forensics • Modeling of Populations of Organisms Ecological Modeling • Genomic Sequencing Methods Assembling Contigs Physical and genetic mapping • Linkage Analysis Linking specific genes to various traits 52 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Are They or Aren’t They Bioinformatics? (#2) • (YES) Gene identification by sequence inspection Prediction of splice sites • (YES) DNA methods in forensics • (NO) Modeling of Populations of Organisms Ecological Modeling • (NO?) Genomic Sequencing Methods Assembling Contigs Physical and genetic mapping • (YES) Linkage Analysis Linking specific genes to various traits 53 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Are They or Aren’t They Bioinformatics? (#2, Answers) • RNA structure prediction Identification in sequences • Radiological Image Processing Computational Representations for Human Anatomy (visible human) • Artificial Life Simulations Artificial Immunology / Computer Security Genetic Algorithms in molecular biology • Homology modeling • Determination of Phylogenies Based on Nonmolecular Organism Characteristics • Computerized Diagnosis based on Genetic Analysis (Pedigrees) 54 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Are They or Aren’t They Bioinformatics? (#3) • (YES) RNA structure prediction Identification in sequences • (NO) Radiological Image Processing Computational Representations for Human Anatomy (visible human) • (NO) Artificial Life Simulations Artificial Immunology / Computer Security (NO?) Genetic Algorithms in molecular biology • (YES) Homology modeling • (NO) Determination of Phylogenies Based on Nonmolecular Organism Characteristics • (NO) Computerized Diagnosis based on Genetic Analysis (Pedigrees) 55 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Are They or Aren’t They Bioinformatics? (#3, Answers) • Issues that were uncovered Does topic stand alone? Is bioinformatics acting as tool? How does it relate to lab work? • Relationship to other disciplines Medical informatics Synthetic biology Systems biology • Biological question is important, not the specific technique -- but it has to be computational Using computers to understand biology vs using biology to inspire computation • Some new ones (2005) Disease modeling [are you modeling molecules?] Enzymology (kinetics and rates?) [is it a simulation or is it interpreting 1 expt.? ] Genetic algs used in gene finding HMMs used in gene finding • vs. Genetic algs used in speech recognition HMMs used in speech recognition Semantic web used for representing biological information 56 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu Further Thoughts in 2005 on the "Boundary of Bioinformatics"