Biological databases play a central role in bioinformatics. They offer scientists the opportunity to access sequence and structure data for tens of thousands of sequences from a broad range of organisms. We will look at GO, CONSURF, and PFAM.

The Gene Ontology (GO) Project: Structured Vocabularies for Molecular Biology and Their Application to Genome and Expression Analysis

The focus of the Gene Ontology project is three-fold.
– First, the project goal is to compile the Gene Ontologies: structured vocabularies describing domains of molecular biology. The three domains under development were chosen as ones that are shared by all organisms: Molecular Function, Biological Process, and Cellular Component.
– Second, the project supports the use of these structured vocabularies in the annotation of gene products. Gene products are associated with the most precise GO term supported by the experimental evidence. Structured vocabularies are hierarchical, allowing both attributions and queries to be made at different levels of specificity.
– Third, the gene product-to-GO annotation sets are provided by participating groups to the public through open access to the GO database and Web resource. Thus, the community can access standardized annotations of gene products across multiple species and resources.

We will describe the current ontologies and what is beyond the scope of the Gene Ontology project, address how GO vocabularies are constructed and related to genes and gene products, and conclude with a discussion of how researchers can access, browse, and use the GO project in the course of their own research.

What are Ontologies and Why do we Need Them?

Ontologies, in one sense used today in the fields of computer science and bioinformatics, are “specifications of a relational vocabulary”. The three areas are considered orthogonal to each other, i.e., they are treated as independent domains.
The ontologies are developed to include all terms falling into these domains without consideration of whether the biological attribute is restricted to certain taxonomic groups. Therefore, biological processes that occur only in plants (e.g., photosynthesis) or mammals (e.g., lactation) are included.

How does GO work?

What information might we want to capture about a gene product?
– What does the gene product do?
– Why does it perform these activities?
– Where does it act?

GO: Three ontologies

For a gene product:
– What does it do? Molecular Function
– What processes is it involved in? Biological Process
– Where does it act? Cellular Component

The 3 Gene Ontologies

– Molecular Function = elemental activity/task: the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
– Biological Process = biological goal or objective: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
– Cellular Component = location or complex: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme

Molecular Function

Molecular Function refers to the elemental activity or task performed, or potentially performed, by individual gene products. Enzymatic activities such as “nuclease,” as well as structural activities such as “structural constituent of chromatin,” are included in Molecular Function. An example of a broad functional term is “transporter” (enabling the directed movement of substances, such as macromolecules, small molecules, and ions, into, out of, or within a cell).
An example of a more detailed functional term is “protein-glutamine gamma-glutamyltransferase,” which cross-links adjacent polypeptide chains by the formation of the N6-(L-isoglutamyl)-L-lysine isopeptide; the gamma-carboxymide groups of peptide-bound glutamine residues act as acyl donors, and the 6-amino-groups of peptidyl- and peptide-bound lysine residues act as acceptors, to give intra- and inter-molecular N6-(5-glutamyl)lysine cross-links.

Biological Process

Biological Process refers to the broad biological objective or goal in which a gene product participates. Biological Process includes the areas of development, cell communication, physiological processes, and behavior. An example of a broad process term is “mitosis” (the division of the eukaryotic cell nucleus to produce two daughter nuclei that, usually, contain the identical chromosome complement to their mother). An example of a more detailed process term is “calcium-dependent cell-matrix adhesion” (the binding of a cell to the extracellular matrix via adhesion molecules that require the presence of calcium for the interaction).

Cellular Component

Cellular Component refers to the location of action for a gene product. This location may be a structural component of a cell, such as the nucleus. It can also refer to a location as part of a molecular complex, such as the ribosome.

How are GO Vocabularies Constructed?

GO vocabularies are updated and modified on a regular basis. A small number of GO curators are empowered to make additions to and deletions from GO. A monthly snapshot of XML format files of GO vocabularies is saved and posted on the GO Web site.

Example: Gene Product = hammer

Function (what)            Process (why)
Drive nail (into wood)     Carpentry
Drive stake (into soil)    Gardening
Smash roach                Pest Control
Clown’s juggling object    Entertainment

What’s in a name?
– The same name can be used to describe different concepts

Molecular Function
– A single reaction or activity, not a gene product
– A gene product may have several functions
– Sets of functions make up a biological process

What’s in a name?
– Glucose synthesis
– Glucose biosynthesis
– Glucose formation
– Glucose anabolism
– Gluconeogenesis
All refer to the process of making glucose from simpler components.

– A concept can also be described using different names
→ Comparison is difficult, in particular across species or across databases

Ontology Structure

Ontologies can be represented as graphs, where the nodes are connected by edges.
– Nodes = concepts in the ontology
– Edges = relationships between the concepts

The Gene Ontology is structured as a hierarchical directed acyclic graph (DAG).
– Terms can have more than one parent and zero, one or more children
– Terms are linked by two relationships: is-a and part-of

Simple hierarchies (Trees)    Directed Acyclic Graphs
Single parent                 One or more parents

Directed Acyclic Graphs (DAG)
[Figure: example DAG with the nodes protein complex, organelle, mitochondrion, fatty acid beta-oxidation multienzyme complex, [other protein complexes], and [other organelles], linked by is-a and part-of edges.]

Parent-Child Relationships

A child is a subset of a parent’s elements. The cell component term Nucleus has 5 children:
– Nucleoplasm
– Nuclear envelope
– Nucleolus
– Chromosome
– Perinuclear space

True Path Rule

The path from a child term all the way up to its top-level parent(s) must always be true.

cell
  [part-of] cytoplasm
  [part-of] chromosome
    [is-a] nuclear chromosome
    [is-a] cytoplasmic chromosome
    [is-a] mitochondrial chromosome
  [part-of] nucleus
    [part-of] nuclear chromosome

What’s in a GO term?

term: gluconeogenesis
id: GO:0006094
definition: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol.
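The DAG structure and the True Path Rule above can be illustrated with a small sketch. It is not how the GO database is implemented; the edge list is one reading of the cell/chromosome example from the slide, and the gene names (`geneA`, `geneB`, `geneC`) and function names are hypothetical. The idea shown is that a gene product annotated to a precise term can also be found by a query at any less specific ancestor term.

```python
from collections import defaultdict

# (child, relationship, parent) edges; one reading of the slide's example.
EDGES = [
    ("cytoplasm", "part-of", "cell"),
    ("chromosome", "part-of", "cell"),
    ("nucleus", "part-of", "cell"),
    ("nuclear chromosome", "is-a", "chromosome"),
    ("nuclear chromosome", "part-of", "nucleus"),  # second parent: a DAG, not a tree
    ("mitochondrial chromosome", "is-a", "chromosome"),
]

# Hypothetical annotations: each gene product gets its most precise term.
ANNOTATIONS = [
    ("geneA", "nuclear chromosome"),
    ("geneB", "mitochondrial chromosome"),
    ("geneC", "cytoplasm"),
]


def build_parents(edges):
    """Map each term to the set of its direct parents (any relationship)."""
    parents = defaultdict(set)
    for child, _relationship, parent in edges:
        parents[child].add(parent)
    return parents


def ancestors(term, parents):
    """Return every term reachable by following parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents):
    """Apply the True Path Rule: an annotation to a term also holds for
    all of that term's ancestors."""
    genes_by_term = defaultdict(set)
    for gene, term in annotations:
        genes_by_term[term].add(gene)
        for ancestor in ancestors(term, parents):
            genes_by_term[ancestor].add(gene)
    return genes_by_term


if __name__ == "__main__":
    parents = build_parents(EDGES)
    genes_by_term = propagate(ANNOTATIONS, parents)
    for term in ("nuclear chromosome", "chromosome", "nucleus", "cell"):
        print(term, sorted(genes_by_term[term]))
```

Because "nuclear chromosome" has two parents, a query at either "chromosome" or "nucleus" returns `geneA`, which a single-parent tree could not express.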
No GO Areas

GO covers ‘normal’ functions and processes
– No pathological processes
– No experimental conditions
NO evolutionary relationships
NO gene products
NOT a system of nomenclature

Annotation of gene products with GO terms

Mitochondrial P450
– Cellular component: mitochondrial inner membrane (GO:0005743)
– Biological process: electron transport (GO:0006118)
– Molecular function: monooxygenase activity (GO:0004497)
[Figure labels: substrate + O2 = CO2 + H2O; product]

Why modify the GO?
– GO reflects current knowledge of biology
– Adding new organisms can make existing term arrangements incorrect
– Not everything was perfect from the outset

What can scientists do with GO?
• Access gene product functional information
• Find how much of a proteome is involved in a process/function/component in the cell
• Map GO terms and incorporate manual annotations into own databases
• Provide a link between biological knowledge and …
  • gene expression profiles
  • proteomics data

Microarray analysis
Whole genome analysis (J. D. Munkvold et al., 2004)

…analysis of high-throughput data according to GO

[Figure: microarray time-course data (attacked vs. control) grouped by GO category: defense response, immune response, response to stimulus, Toll regulated genes, JAK-STAT regulated genes, puparial adhesion, molting cycle, hemocyanin, amino acid catabolism, lipid metabolism, peptidase activity, protein catabolism. Credit: Bregje Wertheim at the Centre for Evolutionary Genomics.]

Functional categories in eukaryotic proteomes. The categories were derived from functional classification systems, including the Gene Ontology project. (Figure 37 in Lander, Linton, et al., 2001)

Distribution of the molecular functions of the 26,383 human proteins. Each slice lists the numbers and percentages (in parentheses) of gene functions assigned to a given category of molecular function.
The outer circle shows the assignment to molecular function categories in the Gene Ontology (GO) (Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium (2000) Nature Genet. 25: 25-29), and the inner circle shows the assignment to Celera's Panther molecular function categories. (Figure 15 in Venter, Adams, et al., 2001)

CONSURF

– Conservation scores of residues in proteins
– Two versions available
– Pre-compiled results are available

PFAM

Hundreds of thousands of protein sequences are now known and the deluge of data shows no signs of slowing. The sequence analysis of proteins may seem like a perpetual (continuous) task. However, the majority of protein sequences appear to fall into a few thousand protein families (Chothia, 1992). Very often these families are representative of proteins at the domain level, where domains are discrete structural units that are frequently found in different protein contexts.

Pfam is a database of such protein domain families (Sonnhammer et al., 1997; Bateman et al., 2002), with each family represented by multiple sequence alignments and profile hidden Markov models (HMMs). In addition, each family has associated annotation, literature references, and links to other databases. The entries in Pfam are available via the Web and in flatfile format.
FUNCTIONAL DOMAIN

[Figure: databases available for download, labeled by structure, domain, and function: SCOP database, CONSURF database, PFAM database, PDB, GO database.]

PIR

Entrez

– A client-server system for retrieval of information related to molecular biology
– Can be used
  – via web page
  – via "embedded" client in other software
– Provided by the National Center for Biotechnology Information, part of the National Library of Medicine (NIH)

Entrez Databases

– PubMed: the biomedical literature. The PubMed database contains Medline abstracts as well as links to full text articles on sites maintained by journal publishers
– Nucleotide sequence database (GenBank)
– Protein sequence database
– Structure: three-dimensional macromolecular structures
– Genome: complete genome assemblies
– PopSet: population study data sets
– OMIM: Online Mendelian Inheritance in Man
– Taxonomy: organisms in GenBank
– Books: online books
– ProbeSet: Gene Expression Omnibus (GEO)
– 3D Domains: domains from Entrez Structure

Entrez essentials

– Semi-automated entry of information into databases
– The links between databases are critical to usefulness

Entrez literature searching

– Can find papers on a given subject
– Can find papers on a specific gene
– Can find papers related to a given paper
– Can switch between literature and sequence databases
– PubMed has links to publishers’ websites to view full text of articles
– PubMed Central has free full text copies

Entrez sequence searching

– Can find sequences for a given gene or protein
– Can download a copy of the sequence

Example Entrez Session

Goal: Find literature and sequences for cystic fibrosis genes
– Use OMIM with keyword searching.
– Switch to the Protein database to see the sequence.
– Change to GenPept format to save the sequence.
– Switch to the Nucleotide database to see the sequence.
– Use the neighbor feature to find related articles.
– Use MeSH terms to find similar articles.
– Search the Nucleotide database by gene name.
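The session above is described for the web page. As a rough sketch of what an "embedded" client might do for the first few steps, the snippet below only builds query URLs for NCBI's E-utilities (the `esearch` and `efetch` endpoints) and makes no network request. The use of E-utilities, the database names (`omim`, `protein`), the search term, and the record ID `12345` are assumptions chosen for illustration; they do not come from the slides, and the ID is a placeholder rather than a real record.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities (assumed here as the programmatic
# counterpart of the Entrez web page).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db, term, retmax=5):
    """Build a keyword-search URL for one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}esearch.fcgi?{query}"


def efetch_url(db, record_id, rettype, retmode="text"):
    """Build a URL that retrieves one record in a chosen output format."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}efetch.fcgi?{query}"


if __name__ == "__main__":
    # Step 1: keyword search in OMIM.
    print(esearch_url("omim", "cystic fibrosis"))
    # Steps 2-3: fetch a protein record in GenPept format ("gp").
    # "12345" is a placeholder ID, not a real record.
    print(efetch_url("protein", "12345", rettype="gp"))
```

Keeping URL construction separate from the request itself makes each step of the session easy to inspect before anything is sent to the server.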
[Screenshots: Example Entrez Session]

Block Diagram for Entrez Literature Searching

[Figure: block diagram with the elements Results of Previous Search, Additional Search Criterion, Entrez Search Engine, Results of Search (List), Item Selection, Displayed Item, Item Display, and Desired Output Format.]