Bioinformatics Research Centre
University of Glasgow
David Gilbert
www.brc.dcs.gla.ac.uk
Department of Computing Science, University of Glasgow
David Gilbert: [email protected] BRC Glasgow

Bioinformatics
• Bio - Molecular Biology
• Informatics - Computer Science
• Bioinformatics - the study of the application of
– molecular biology, computer science, artificial intelligence, statistics and mathematics
– to model, organise, understand and discover interesting information associated with the large scale molecular biology databases,
– to guide assays for biological experiments.
(Computational Biology - USA).

Bioinformatics in context: a new discipline?
[Figure labels: Computing; Maths & Stats; ?Psychology?; Physical Sciences; Life sciences]

Bioinformatics in context (applications)

How can we analyse the flood of data?
Data: don't just store it, analyse it!
By comparing sequences, one can find out about things like
• How organisms are related & evolution
• How proteins function
• Population variability
• How diseases occur

Separating sheep from goats...

Dirty data?
Big Horn Sheep [Ovis canadensis]
The Big Horn Sheep [Ovis canadensis] is a large North American species with a brown coat, which turns to bluish-grey in winter. It is so named from the size of the horns of the ram, which often measure over 1 m/3.3 ft round the curve.
Classification: Ovis canadensis is in family Bovidae, order Artiodactyla

Data, information, knowledge …
• data : nucleotide sequence
• information : where are the “genes”.
[Figure: annotated gene region; labels: control statement, TATA box, start, gene, Termination (stop), control statement]
Found using a classifier, pattern or rule which has been mined/discovered
• knowledge : facts and rules
If a gene X has a weak psi-blast assignment to a function F
– and that gene is in an expression cluster
– and sufficient members of that cluster are known to have function F,
then believe assignment of F to X.

Some projects at the Bioinformatics Research Centre

Rat-Mouse-Human

Indexing
Ela Hunt [email protected]
• String indexing structures can be used to index DNA, proteins, XML and phylogenetic trees
• All data is read once, and the index is created on disk
• The index reduces the search space of the query (we read only a percentage of the disk)

Distributed databases and computation
Cardiovascular Functional Genomics
• £5.4 million project, 5 UK Universities: Glasgow, Leicester, Edinburgh, Oxford, Imperial; + Maastricht
• Led by Clinicians
• Combined studies:
– scientific models of disease (Rat)
– parallel studies of patients
– large family and population DNA collections
• 3 pronged approach
– Targeted transcript sequencing
– Microarray gene expression profiling
– Comparative genome analysis.
• Data generated at each of the 5 sites & made available for analysis:
• Issues of distributed data and computation.
• Mapping gene sequences Rat Mouse Human – an added layer of complexity in the computation.
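The Indexing slide above says that a string index is built in one pass over the data and then narrows the part of the data a query has to touch. A minimal sketch of that idea follows, using an in-memory suffix array as a stand-in; the slides do not say which index structure or on-disk layout the BRC work uses, so the structure, the function names `build_suffix_array` and `find_occurrences`, and the toy DNA string are all assumptions made for illustration.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Simple O(n^2 log n) construction; adequate for a short illustration,
    not for genome-scale data.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return every start position of pattern in text, in ascending order.

    Two binary searches locate the block of suffixes that begin with
    pattern, so only O(log n) suffixes are inspected instead of the
    whole text.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


if __name__ == "__main__":
    dna = "GATTACAGATTACA"  # toy sequence, not real data
    index = build_suffix_array(dna)
    print(find_occurrences(dna, index, "TTA"))
```

The index is built once and reused for every query; each query then costs two binary searches rather than a scan of the full sequence, which is the "reduced search space" the slide refers to.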
Wellcome Trust: Cardiovascular Functional Genomics
[Figure: project sites Glasgow, Edinburgh, Leicester, Oxford, London and Netherlands, with shared data and public curated data]

BRIDGES: BioMedical Research Informatics Delivered by Grid Enabled Services
• National e-Science Centre, Bioinformatics Research Centre, IBM UK Life Sciences
• Incrementally develop and explore database integration over 6 geographically distributed research sites within the framework of the large Wellcome Trust biomedical research project Cardiovascular Functional Genomics.
• Three classes of integration will be developed to support a sophisticated bioinformatics infrastructure supporting:
– data sources (both public and project generated),
– bioinformatics analysis and visualisation tools,
– research activities combining shared and private data.
• The inclusion of patient records and animal experiment data means that privacy and access control are particular concerns.
• An exploration of index factories accelerating sequence processing will test the hypothesis that the Grid makes a new class of e-Science indexes feasible. Both OGSA-DAI and IBM DiscoveryLink technology will be employed and a report will identify how each performed in this context.

Functional Genomics
~44,000 GENES
~33% OF GENES HAVE UNKNOWN FUNCTION

Ali Al-Shahib, Chao He, Mark Girolami
Solution……
• Solve the problem of the twilight zone (sequence alignments below 30% sequence identity)
• How?
• Predict protein function using an alternative method to BLAST:
• Predict protein functional class from sequence, structural and phylogenetic features using machine learning
• Combination of these (computationally and statistically) would provide the biologists like yourselves with the most accurate functional prediction of proteins that fall in the twilight zone.
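The twilight-zone criterion above (alignments below 30% sequence identity) can be made concrete with a small calculation. The sketch below assumes two sequences that have already been aligned to equal length with "-" for gaps, and it counts identity over the columns where neither sequence has a gap; that denominator is one common convention among several and is an assumption here, as are the function names and the toy sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Keep only the columns where both sequences have a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0

    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def in_twilight_zone(aligned_a, aligned_b, threshold=30.0):
    """True when identity falls below the threshold quoted on the slide."""
    return percent_identity(aligned_a, aligned_b) < threshold


if __name__ == "__main__":
    # Toy aligned fragments, invented for illustration only.
    print(percent_identity("ACDE", "ACDF"))
    print(in_twilight_zone("ACDEFGHIKL", "AMNPQRSTVW"))
```

Pairs flagged by `in_twilight_zone` are the ones for which the slide proposes machine learning over sequence, structural and phylogenetic features instead of relying on alignment alone.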
Molecular Evolution: A Phylogenetic Approach
Rod Page [email protected]
Human gene duplication
Locating genome duplications
Q: did one or more genome-wide events affect all gene families?
[Figure: gene trees and a species tree over Human, Mouse, Reptiles + Birds, Lungfish, Teleosts, Sharks & Rays and Lamprey; annotation: "happened somewhere here"]

TOPS Protein topology
David Gilbert, Juris Viksna, Gilleain Torrance (BRC, Glasgow), David Westhead and Ioannis Michalopoulos (Leeds)
BBSRC/EPSRC funded

Pattern search: TIM Barrel

Structure comparison
2bop (probe) against (subset of) CATH

TOPS comparison server: www.tops.leeds.ac.uk
[Figure labels: PDB file; TOPS diagram (graph); (v.fast); (slower); Pairwise comparison to structures in database; Matches to motif library]

Protein design
Design of a Novel Globular Protein Fold with Atomic-Level Accuracy
Brian Kuhlman,1 Gautam Dantas,1 Gregory C. Ireton,4 Gabriele Varani,1,2 Barry L. Stoddard,4 David Baker1,3
“A major challenge of computational protein design is the creation of novel proteins with arbitrarily chosen three-dimensional structures. Here, we used a general computational strategy that iterates between sequence design and structure prediction to design a 93-residue α/β protein called Top7 with a novel sequence and topology. Top7 was found experimentally to be folded and extremely stable, and the x-ray crystal structure of Top7 is similar (root mean square deviation equals 1.2 angstroms) to the design model. The ability to design a new protein fold makes possible the exploration of the large regions of the protein universe not yet observed in nature.”
1 Department of Biochemistry, University of Washington, Seattle, WA 98195, USA.
2 Department of Chemistry, University of Washington, Seattle, WA 98195, USA.
3 Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA.
4 Division of Basic Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, WA 98109, USA
Science. 2003 Nov 21;302(5649):1364-8

Protein design
Generation of starting models.
“The target structure for the de novo design process can range from a detailed backbone model to a back-of-the-envelope sketch.”
“Because we aimed to create a novel protein fold, we selected a topology not present in the PDB according to the Topology of Protein Structure (TOPS) server (17).”

Use of TOPS for protein design
User = [email protected]
Submitted at 20:29:51 on 3/06/03
Structure code = top7a type = PDB (user declared), Database = atlas
Details of sheets etc (including all connected SSEs):
Sheet: [6,7,4,1,2]
======================================================
Comparison time : 43 sec
Domain   Code            Rank
top7a    target_query    0
1bbi00   4.10.100.10.1   7
1pi200   4.10.100.10.1   7
1sro00   2.40.29.10.1    7
1atx00   2.20.20.10.1    9
2sh100   2.20.20.10.1    9
1vcc00   3.30.66.10.1    11
1hpm02   3.10.140.10.1   12
1csp00   2.40.50.40.1    13
2snv01   2.40.10.20.3    13
3tss02   2.40.50.50.3    13
1bcpF0   2.40.50.50.2    14
1bovA0   2.40.50.30.2    14
1tle00   2.10.25.10.1    14
1cdb00   2.60.40.10.1    15
1ckmA3   4.10.87.10.1    15
1kxf01   2.40.10.20.3    15
1svpA1   2.40.10.20.3    15
2pkaX0   2.40.10.20.1    15
1apo00   2.10.25.10.6    16
1ate00   2.10.40.10.1    16
1aww00   2.30.30.10.1    16
1cuk01   2.40.50.80.1    16
[Figure labels: Top7a; NEEheEC; 1:2A 1:4A 2:4R 4:6R 4:7A 6:7A 1:4R 4:6R]

Use of TOPS for protein design

Systems biology – some definitions
• Systems biology is the study of all the elements in a biological system (all genes, mRNAs, proteins, etc) and their relationships one to another in response to
perturbations.
• Systems approaches attempt to study the behaviour of all of the elements in a system and relate these behaviours to the systems or emergent properties

A Framework for Systems Biology (Ideker, Galitski & Hood, 2001)
• Define all of the components of the system
• Systematically perturb and monitor components of the system
• Reconcile the experimentally observed responses with those predicted by the model
• Design and perform new perturbation experiments to distinguish between multiple or competing model hypotheses

New database technologies for storing the output from high-throughput biological experiments
Andrew Jones
• Proteomics – study the set of proteins expressed in a sample
• Complex, variable output:
– High-resolution images
– Numerical data generated by lab. equipment and software
– Human annotation
• The data is not suitable for storage in a standard relational database
• Storage, retrieval and exchange of data is important
• XML (Extensible Markup Language) is being investigated for storing such data

• Maintained by National Library of Medicine
• Free of charge, since 1997
• > 10 million references since 1971
• > 4000 biomedical journals
• > 80% in English
• > 80% have an abstract
"Biochemical Network Data Mined from Scientific Texts"
Te Ren (PhD student) with CXR Biosciences.

Data complexity
Methionine Biosynthesis in E.coli
[Figure: network diagram of genes, enzymes and metabolites joined by "expression", "codes for" and "catalyzes" links. Labels in this part: L-aspartate; aspartate biosynth.; aspartate kinase II/homoserine dehydrogenase II (2.7.2.4); ATP, ADP; metBL operon; asd; metL; aspartate semialdehyde dehydrogenase (1.2.1.11); L-Aspartate-4-P; NADPH; H+; NADP+; Pi; L-Aspartate semialdehyde; lysine biosynth.]
[Figure labels, methionine biosynthesis network: genes metA, metB, metC, metE, metH, metJ, metR; enzymes homoserine-O-succinyltransferase (2.3.1.46), cystathionine-gamma-synthase (4.2.99.9), cystathionine-beta-lyase (4.4.1.8), Cobalamin-dependent and Cobalamin-independent homocysteine transmethylase (2.1.1.13, 2.1.1.14); further EC numbers 1.1.1.3 and 2.5.1.6; metabolites L-Homoserine, alpha-succinyl-L-Homoserine, Cystathionine, Homocysteine, L-Methionine, S-Adenosyl-L-Methionine; cofactors and by-products NADPH, H+, NADP+, Succinyl SCoA, HSCoA, L-Cysteine, Succinate, H2O, Pyruvate, NH4+, 5-Methyl THF, THF, ATP, Pi, PPi; regulators Aporepressor, Holorepressor, metR activator; link types "expression", "codes for", "catalyzes", "represses", "up-regulates", "inhibits", "is part of"; branch to threonine biosynth.]

Biochemical networks
• Pathway navigation
• Pathway comparison
• Pathway motif discovery
• Pathway simulation
[Figure labels: DNA chip experiment; Transcription profiles; Visualization; Clustering; Clusters of co-regulated genes; Functional meaning?]
[Figure labels: Pathway extraction in metabolic reaction graph; Putative metabolic pathways; Matching against metabolic pathway database; Known pathways; Novel pathways]
• High-level abstraction inferred from low-level descriptions
• Novel pathways from gene expression experiments

A Software System for Pattern Matching and Motif Discovery in Biochemical Networks
Sebastian Oehm [email protected]
• Design a suitable data model using bipartite graphs
• Define patterns and develop algorithms for pattern matching in biochemical networks
• Define pathway motifs and develop algorithms for motif searching in biochemical networks
• Develop algorithms for automated motif discovery
• Develop algorithms to search for the largest common part of two or more biochemical networks
• Develop a measure of similarity for pathway comparison
[Figure: methionine biosynthesis in two organisms, drawn as alternating compounds and EC numbers.
S.cerevisiae: L-Aspartate → 2.7.2.4 → L-aspartyl-4-P → 1.2.1.11 → L-aspartic semialdehyde → 1.1.1.3 → L-Homoserine → 2.3.1.31 → O-acetyl-homoserine → 4.2.99.10 → Homocysteine → 2.1.1.14 → L-Methionine → 2.5.1.6 → S-Adenosyl-L-Methionine
E.coli: L-Aspartate → 2.7.2.4 → L-aspartyl-4-P → 1.2.1.11 → L-aspartic semialdehyde → 1.1.1.3 → L-Homoserine → 2.3.1.46 → Alpha-succinyl-L-Homoserine → 4.2.99.9 → Cystathionine → 4.4.1.8 → Homocysteine → 2.1.1.14 → L-Methionine → 2.5.1.6 → S-Adenosyl-L-Methionine]

Biochemical Pathway Simulator
A Software Tool for Simulation & Analysis of Biochemical Networks
DTI ‘Beacon’ project, £0.9M, 4 years
Muffy Calder, David Gilbert, Walter Kolch, Keith van Rijsbergen, Brian Ross, Oliver Sturm

Not a toy problem!
[Figure labels: Experimental Data; Analysis]

Complexity: real bioinformatics
Closing the loop from wet lab to in-silico
[Figure: signalling cascade (Mitogens, Growth factors, Receptor, Ras, Raf kinase, MEK, ERK, phosphorylation marks "P", cytoplasmic substrates, Elk, SAP, Gene) alongside an abstract model, and a loop with labels: Bio Lab/Literature; Literature; Text miner; Rules; Database; Pathway Editor; Simulator; Concurrency theory; Apoptosis; DATA; Analysis; Human feedback (in-the-loop); Bioinformatics: Tools, database, interface; Web portal; Lab; User Interface; Database]

MAPK: Proliferation (cell division) vs Differentiation (neurite outgrowth) in PC12 cell model
NGF (50 ng/ml): differentiation into nerve cell type
EGF (50 ng/ml): proliferation; cell division stimulated without neurite outgrowth

Dynamic Behaviour of the Network
[Figure: three pathway panels built from Receptor, Ras, cAMP, PKA, Raf-1, B-Raf, MEK1,2 and ERK1,2, ending in Cell growth, Growth arrest and Cell growth respectively]
• Cell growth: Raf-1 is expressed in all cells, and its activation induces ERK activation.
• Growth arrest: Many receptors that activate ERK also elevate cAMP levels, leading to activation of PKA. PKA inhibits Raf-1 and blocks ERK activation.
• Cell growth: However, cAMP induces activation of B-Raf. In cells which express B-Raf, cAMP activates the ERK pathway despite Raf-1 inhibition.

Mobility
Sometimes a signal sent in a communications network can change the connections or topology of that network. In the example below, a cell-phone is being carried out of range of Cell 1. The base station must send the frequency of the appropriate new Cell (Cell 2) to the phone. The phone connects to Cell 2 and discards its previous link to Cell 1.
[Figure: the phone, Cell 1, Cell 2 and the Base before and after hand-over; labels: Conversation; Frequency Cell 2]

In biochemical networks, a protein can be granted or denied the opportunity to interact with certain other molecules by exchange factors, effectively changing the network topology dynamically. In the example below, the protein Ras is bound to a molecule of GDP, which renders Ras inactive. A molecule of SoS can interact with this Ras-GDP complex, causing the GDP to be exchanged for GTP. The Ras-GTP complex is active, permitting interaction with the protein Raf.
[Figure labels: Ras; Raf; SoS; GDP; GTP]

Reusable Subcomponents of a Solution for Offline Integration of 3rd party Databases
[Figure labels: Integrator; Schema Translator; Record Matcher; Record Merger; Extracted Lit. Data; aMaze DB; MAPK source data; cAMP PK source data; Input Schemas; Trans Local Schemas; Record Matching Rules; Default Values; Cross-ref Index; Conflict Resolution Rules; Target Schema; Integrated Database]
• By-products of the total process may correspond to other reusable sub-services
– Schema Translation – various schema definition languages are translated into one common, interpretable schema language.
– Record Matching – builds a cross reference index that identifies records about a “same entity” and records the source and location of the matching records. Two or more records may match.

Validation
Current Bottlenecks in Drug Development
Drug target discovery: What is a good drug target? How do we select it?
Drug target validation: Does hitting the target change the biological response?
Side effects: What else is affected when the selected target is hit?
Lead Compound Selection: Which compounds should be taken further for development? What properties should the drug have?
Validation
Current Bottlenecks in Drug Development
A robust Pathway Simulation Software can help to …
Drug target discovery: What is a good drug target? How do we select it?
– Select targets by defining their topology & function in the regulatory networks.
Drug target validation: Does hitting the target change the biological response?
– Validate the target by predicting how the biological response should change.
Side effects: What else is affected when the selected target is hit?
– Predict side effects to allow early and targeted testing.
Lead Compound Selection: Which compounds should be taken further for development? What properties should the drug have?
– Predict the optimal drug profile to improve selection criteria.

Validation
What we propose …
PC12 cell model of neuronal differentiation
[Figure: EGF and NGF signalling through Ras, Rap, Raf-1, B-raf, MEK and ERK; labels: Transient ERK activity, proliferation; Sustained ERK activity, differentiation]
Target Validation: Predict & test the effect of Raf-1 and B-Raf inhibitors on the biological response to EGF vs. NGF.
Lead Compound Selection: Predict & test which inhibitory efficacy is necessary and sufficient to achieve the desired biological response.
Bionanotechnology & Bioinformatics
[Figure labels: Nanofab & cell culture; Fab methodology; Physical substrate; Biochemical environment (other cells + biochemicals); Measured cell behaviour; Dynamic behaviour; Morphology; Adhesion; Cell shape; Gene expression; Model of cell behaviour; Bioinformatics; Genetic engineering; External databases; Proteome; Other pathway data]

Machine Learning for Bioinformatics
• Classification
• Clustering
• Characterisation
• Techniques:
– ensemble methods
– decision trees
– inductive logic programming
– pattern discovery
– Statistical approaches
– SVMs

Cancer Classification Problem (Golub et al 1999)
ALL: acute lymphoblastic leukemia (lymphoid precursors)
AML: acute myeloid leukemia (myeloid precursor)

Machine Learning Approach
[Figure labels: Gene Expression Profiles; Machine Learning (C4.5, SVM, k-NN, ANN); Classifier; ALL; AML]

Biological Data: Distributed and Heterogeneous!!
[Figure labels: Protein; Sequence (LPSYVDWRSA ECGGCWAFSA TSGSLISLSE NTRGCDGGYI GGINTEENYP GAVVDIKSQG IATVEGINKI QELIDCGRTQ TDGFQFIIND YTAQDGDCDV, blocks in extracted order); Structure; Function; Microarray analysis; Gene expression; Morphology]

Integrative Machine Learning
Aik Choon Tan
(Pratt Emotif)

What kind of computational approaches do we use?
• Operations over
– sequences (match)
– trees (e.g. suffix trees, supertree, joining, ...)
– graphs (sub-graph isomorphism, maximal common subgraph, path searching)
• Data modelling, databases, data conversion
• Machine learning, knowledge discovery, pattern discovery,...
• Clustering
• Theorem proving, concurrency analysis,…
• Integration: data, knowledge
• Data visualisation
• Web services, Grid, Coarse Grain parallelism, eScience,...
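Of the graph operations listed above, path searching is the easiest to illustrate. The sketch below encodes the E.coli methionine route from the earlier pathway-comparison slide as a directed bipartite graph (compounds alternate with EC-numbered reactions) and finds a route with breadth-first search. The graph encoding, the function names `build_graph` and `shortest_path`, and the choice of breadth-first search are assumptions for illustration; the slides do not describe the algorithms the BRC software actually uses.

```python
from collections import deque

# E.coli route as it appears on the pathway-comparison slide:
# compounds and EC numbers alternate along the chain.
ECOLI_ROUTE = [
    "L-Aspartate", "2.7.2.4", "L-aspartyl-4-P", "1.2.1.11",
    "L-aspartic semialdehyde", "1.1.1.3", "L-Homoserine", "2.3.1.46",
    "Alpha-succinyl-L-Homoserine", "4.2.99.9", "Cystathionine", "4.4.1.8",
    "Homocysteine", "2.1.1.14", "L-Methionine", "2.5.1.6",
    "S-Adenosyl-L-Methionine",
]


def build_graph(route):
    """Turn an alternating compound/reaction chain into an adjacency dict."""
    graph = {node: [] for node in route}
    for source, target in zip(route, route[1:]):
        graph[source].append(target)
    return graph


def shortest_path(graph, source, target):
    """Breadth-first search; returns the node list, or None if unreachable."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


if __name__ == "__main__":
    graph = build_graph(ECOLI_ROUTE)
    path = shortest_path(graph, "L-Aspartate", "L-Methionine")
    print(" -> ".join(path))
    print("reactions used:", path[1::2])
```

Because compounds and reactions are separate node types, every second node on a returned path is a reaction, which is what makes the bipartite model convenient for reading off the enzymes along a route.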
Latest from BRC
• New Systems Biology lab (March 9)
• Web services, www.brc.dcs.gla.ac.uk
• Research teams:
– Databases & Visualisation (Ela Hunt)
– Grid & eScience (Richard Sinnott)
– Functional genomics (David Leader)
– Machine learning (Mark Girolami)
– Structural bioinformatics (Pawel Herzyk)
– Systems biology (David Gilbert)
• Teaching: MScIT Bioinformatics Strand

BRC members
• Investigators:
– Yves Deville (Biochemical Networks) dcs
– David Gilbert (Systems biology, Protein structure) dcs
– Mark Girolami (Machine learning) dcs
– Pawel Herzyk (Protein structure) ibls
– Ela Hunt (Database indexing, Data integration, Visualisation,…) dcs
– David Leader (Visualisation tools) ibls
– Gerhard May (Signalling pathways) ibls
– Rod Page (Phylogenetic trees) ibls
– Richard Sinnott (Grid computing / eScience) dcs
– Juris Viksna (Graph algorithms) dcs
• Research Assistants: Micha Bayer, Rainer Breitling, Neil Hanlon, Derek Houghton, Richard Orton, Evangelos Pafilis, Oliver Sturm, Gilleain Torrance
• Research students: Ali Al-Shahib, David Cook, Iain Darroch, Amelie Gormand, Susan Fairley, Robert Japp, Andrew Jones, Julie Morrison, Te Ren, Aik Choon Tan, Tim Troup, Mallika Veeramalai
• Executive Assistant: Margaret Jackson
• Associated: Malcolm Atkinson, Ernst Wit, John McClure, Mathis Riehle, Des Higham, Oliver Sand

Funding sources
EPSRC, BBSRC, MRC, Wellcome Trust, DTI, Scottish Enterprise, Synergy, Carnegie Trust, Royal Society, Daiwa Foundation, SHEFC, EU

Scottish Bioinformatics Forum
• Network of Bioinformatics researchers and industries in Scotland
• A vehicle for developing Scotland as a Centre of Bioinformatics Excellence
• Nodes in Glasgow, Edinburgh, Dundee, Aberdeen, ...
• Promoting collaborative research
• Development of a Bioinformatics educational programme
• www.sbforum.org, [email protected]
Visionary Meeting, 27 May (Zoology Building)
Keynote: Prof Thornton, Director of the European Bioinformatics Institute
www.brc.dcs.gla.ac.uk/events.html

Sun GridEngine
Bioinformatics Research Centre
Davidson Building: 15 workstations + visitors’ facilities
[Figure: network diagram; labels: File server; Database server; Unix App server; Microsoft App server; Web server; firewall; Cluster; Scotgrid+ 2x100 CPU; 1TB; 3TB; 5 TB; 17 Lilybank Gardens; Kelvin Building; Boyd-Orr Building (backup)]

www.brc.dcs.gla.ac.uk

Where we are
[Figure: campus map; labels: Vet School; Beatson Institute; Department of Computing Science; BRC (in Davidson Building); Functional Genomics; Centre for Cell Engineering (Joseph Black); NeSC Hub; Medicine & Therapeutics]

BRC location

Bioinformatics Research Centre (230 m²)
[Figure: floor plan; labels: Gardiner lab (wet lab); Visitors’ area]

The Future
Closing the loop from wet lab to in silico!
Collaboration!
www.brc.dcs.gla.ac.uk