Introduction to the GO: a user’s guide
NCSU GO Workshop
29 October 2009

Genomic Annotation
Genome annotation is the process of attaching biological information to genomic sequences. It consists of two main steps:
1. identifying functional elements in the genome: “structural annotation”
2. attaching biological information to these elements: “functional annotation”
Biologists often use the term “annotation” when they are referring only to structural annotation.

[Figure: Structural annotation: DNA annotation (CHICK_OLF6); protein annotation (TRAF 1, 2 and 3; TRAF 1 and 2). Functional annotation: catenin. Data from Ensembl Genome browser.]

Structural & Functional Annotation
Structural Annotation:
• Open reading frames (ORFs) predicted during genome assembly
• predicted ORFs require experimental confirmation
• the Sequence Ontology (SO) provides a structured controlled vocabulary for sequence annotation
Functional Annotation:
• annotation of gene products = Gene Ontology (GO) annotation
• initially, predicted ORFs have no functional literature and GO annotation relies on computational methods (rapid)
• functional literature exists for many genes/proteins prior to genome sequencing
• GO annotation does not rely on a completed genome sequence!

Introduction to GO
1. Bio-ontologies
2. the Gene Ontology (GO)
   • a GO annotation example
   • GO evidence codes: literature biocuration & computational analysis
   • ND vs no GO
   • sources of GO
3. Using the GO
4. The gene association file

1. Bio-ontologies

Bio-ontologies
• Bio-ontologies are used to capture biological information in a way that can be read by both humans and computers.
   • necessary for high-throughput “omics” datasets
   • allows data sharing across databases
• Objects in an ontology (eg. genes, cell types, tissue types, stages of development) are well defined.
• The ontology shows how the objects relate to each other.
Bio-ontologies: http://www.obofoundry.org/

Ontologies
• relationships between terms
• digital identifier (computers)
• description (humans)

2. The Gene Ontology

Functional Annotation
• Gene Ontology (GO) is the de facto method for functional annotation
• Widely used for functional genomics (high throughput)
• Many tools available for gene expression analysis using GO
• The GO Consortium homepage: http://www.geneontology.org

GO Mapping Example
NDUFAB1 (UniProt P52505)
Bovine NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa

Each row below gives the GO:ID (unique), the GO term name and the GO evidence code; rows are grouped by aspect (ontology).

Biological Process (BP or P)
GO:0006633   fatty acid biosynthetic process   TAS
GO:0006120   mitochondrial electron transport, NADH to ubiquinone   TAS
GO:0008610   lipid biosynthetic process   IEA

Molecular Function (MF or F)
GO:0005504   fatty acid binding   IDA
GO:0008137   NADH dehydrogenase (ubiquinone) activity   TAS
GO:0016491   oxidoreductase activity   TAS
GO:0000036   acyl carrier activity   IEA

Cellular Component (CC or C)
GO:0005759   mitochondrial matrix   IDA
GO:0005747   mitochondrial respiratory chain complex I   IDA
GO:0005739   mitochondrion   IEA
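The four parts of each annotation in the NDUFAB1 mapping example (GO ID, term name, aspect, evidence code) fit a small record type. The following Python sketch is not part of the original slides: the `Annotation` type and the helpers `group_by_aspect` and `without_electronic` are hypothetical names introduced for illustration. The data are the ten NDUFAB1 annotations from the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    """One GO annotation for a gene product (illustrative record type)."""
    go_id: str     # unique GO identifier, e.g. "GO:0006633"
    term: str      # GO term name
    aspect: str    # "P" (process), "F" (function) or "C" (component)
    evidence: str  # GO evidence code, e.g. "IDA", "TAS", "IEA"


# The NDUFAB1 (UniProt P52505) annotations from the mapping example.
NDUFAB1 = [
    Annotation("GO:0006633", "fatty acid biosynthetic process", "P", "TAS"),
    Annotation("GO:0006120",
               "mitochondrial electron transport, NADH to ubiquinone",
               "P", "TAS"),
    Annotation("GO:0008610", "lipid biosynthetic process", "P", "IEA"),
    Annotation("GO:0005504", "fatty acid binding", "F", "IDA"),
    Annotation("GO:0008137", "NADH dehydrogenase (ubiquinone) activity",
               "F", "TAS"),
    Annotation("GO:0016491", "oxidoreductase activity", "F", "TAS"),
    Annotation("GO:0000036", "acyl carrier activity", "F", "IEA"),
    Annotation("GO:0005759", "mitochondrial matrix", "C", "IDA"),
    Annotation("GO:0005747", "mitochondrial respiratory chain complex I",
               "C", "IDA"),
    Annotation("GO:0005739", "mitochondrion", "C", "IEA"),
]


def group_by_aspect(annotations):
    """Return a dict mapping each aspect letter to its annotations."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann.aspect].append(ann)
    return dict(grouped)


def without_electronic(annotations):
    """Drop annotations whose evidence code is IEA (electronic)."""
    return [ann for ann in annotations if ann.evidence != "IEA"]


if __name__ == "__main__":
    for aspect, anns in group_by_aspect(NDUFAB1).items():
        print(aspect, [ann.go_id for ann in anns])
    print(len(without_electronic(NDUFAB1)), "annotations are not IEA")
```

Keeping the evidence code on every record makes it possible to see how much of a protein's annotation rests on electronic inference alone.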
GO Evidence Codes

Direct Evidence Codes
IDA - inferred from direct assay
IEP - inferred from expression pattern
IGI - inferred from genetic interaction
IMP - inferred from mutant phenotype
IPI - inferred from physical interaction

Indirect Evidence Codes
inferred from literature:
IGC - inferred from genomic context
TAS - traceable author statement
NAS - non-traceable author statement
IC - inferred by curator
inferred by sequence analysis:
RCA - inferred from reviewed computational analysis
IS* - inferred from sequence*
   ISS - inferred from sequence or structural similarity
   ISA - inferred from sequence alignment
   ISO - inferred from sequence orthology
   ISM - inferred from sequence model
IEA - inferred from electronic annotation

Other
NR - not recorded (historical)
ND - no biological data available

Biocuration of literature
• detailed function
• “depth”
• slower (manual)

Sequence analysis
• rapid (computational)
• “breadth” of coverage
• less detailed

Biocuration of Literature: detailed gene function
P05147
• Find a paper about the protein. PMID: 2976880
• Read paper to get experimental evidence of function
• Use most specific term possible
• experiment assayed kinase activity: use IDA evidence code

Unknown Function vs No GO
ND: no data
• Biocurators have tried to add GO but there is no functional data available
• Previously: “process_unknown”, “function_unknown”, “component_unknown”
• Now: “biological process”, “molecular function”, “cellular component”
No annotations (including no “ND”): biocurators have not annotated
• this is important for your dataset: what % has GO?

Sources of GO
1. Primary sources of GO: from the GO Consortium (GOC) & GOC members
   • most up to date
   • most comprehensive
2. Secondary sources: other resources that use GO provided by GOC members
   • public databases (eg. NCBI, UniProtKB)
   • genome browsers (eg. Ensembl)
   • array vendors (eg. Affymetrix)
   • GO expression analysis tools
Different tools and databases display the GO annotations differently. Since GO terms are continually changing and GO annotations are continually added, need to know when GO annotations were last updated.

Secondary Sources of GO annotation
EXAMPLES:
• public databases (eg. NCBI, UniProtKB)
• genome browsers (eg. Ensembl)
• array vendors (eg. Affymetrix)
CONSIDERATIONS:
• What is the original source?
• When was it last updated?
• Are evidence codes displayed?

For more information about GO
• GO Evidence Codes: http://www.geneontology.org/GO.evidence.shtml
• gene association file information: http://www.geneontology.org/GO.format.annotation.shtml
• tools that use the GO: http://www.geneontology.org/GO.tools.shtml
• GO Consortium wiki: http://wiki.geneontology.org/index.php/Main_Page

3. Using the GO

Use GO Browsers for:
• searching for GO terms
• searching for gene product annotation
• filtering sets of annotations and downloading results
• creating/using GO slims

GO Browsers
QuickGO Browser (EBI GOA Project)
http://www.ebi.ac.uk/ego/
• Can search by GO Term or by UniProt ID
• Includes IEA annotations
AmiGO Browser (GO Consortium Project)
http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
• Can search by GO Term or by UniProt ID
• Does not include IEA annotations

Use GO for...
• Determining which classes of gene products are over-represented or under-represented.
• Grouping gene products by biological function.
• Relating a protein’s location to its function.
• Focusing on particular biological pathways and functions (hypothesis-driven data interrogation).
http://www.geneontology.org/

However...
• many of these tools do not support non-model organisms
• the tools have different computing requirements
• may be difficult to determine how up-to-date the GO annotations are
Need to evaluate tools for your system.

Evaluating GO tools
Some criteria for evaluating GO Tools:
1. Does it include my species of interest (or do I have to “humanize” my list)?
2. What does it require to set up (computer usage/online)?
3. What was the source for the GO (primary or secondary) and when was it last updated?
4. Does it report the GO evidence codes (and is IEA included)?
5. Does it report which of my gene products has no GO?
6. Does it report both over/under represented GO groups and how does it evaluate this?
7. Does it allow me to add my own GO annotations?
8. Does it represent my results in a way that facilitates discovery?

4. Gene association files

The gene association (ga) file
• standard file format used to capture GO annotation data
• tab-delimited file containing 15* fields of information:
   • information about the gene product (database, accession, name, symbol, synonyms, species)
   • information about the function: GO ID, ontology, reference, evidence, qualifiers, context (with/from)
   • data about the functional annotation: date, annotator
* 2 additional fields will soon be added to capture information about isoforms and other ontologies. (additional column added to this example)

[Figure: example gene association file, with columns labeled as gene product information, function information, and metadata (when & who).]

Gene association files
GO Consortium ga files
• many organism specific files
• also includes EBI GOA files
EBI GOA ga files
• UniProt file contains GO annotation for all species represented in UniProtKB
AgBase ga files
• organism specific files
• AgBase GOC file: submitted to GO Consortium & EBI GOA
• AgBase Community file: GO annotations not yet submitted or not supported
• all files are quality checked
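Because the gene association file is tab-delimited with a fixed column order, it can be read with a few lines of code, and the "what % has GO?" question can then be answered for a list of accessions. The following Python sketch is not part of the original slides. It assumes the 15-column GAF 1.0 layout published by the GO Consortium; `parse_gaf` and `go_coverage` are hypothetical helper names; and the sample rows, including the accessions `X00001` and `X00002`, the reference `REF:example`, the dates and the `EXAMPLE` annotator, are fabricated placeholders. Only P52505, its GO IDs and their evidence codes come from the slides.

```python
# Column names of the 15-field gene association file (GAF 1.0 order).
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence", "With_From", "Aspect", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_By",
]


def parse_gaf(text):
    """Parse gene association file text into a list of dicts."""
    rows = []
    for line in text.splitlines():
        # Lines starting with "!" are comments or header lines.
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.split("\t")
        if len(fields) < len(GAF_COLUMNS):
            raise ValueError(f"expected 15 fields, got {len(fields)}")
        # Any extra trailing fields (newer formats) are ignored here.
        rows.append(dict(zip(GAF_COLUMNS, fields)))
    return rows


def go_coverage(rows, accessions):
    """Report which accessions have GO, have only ND, or have no GO."""
    wanted = set(accessions)
    if not wanted:
        raise ValueError("no accessions supplied")
    annotated = {r["DB_Object_ID"] for r in rows}
    informative = {r["DB_Object_ID"] for r in rows if r["Evidence"] != "ND"}
    return {
        "no_go": sorted(wanted - annotated),
        "nd_only": sorted((wanted & annotated) - informative),
        "percent_with_go": 100.0 * len(wanted & annotated) / len(wanted),
    }


def _row(acc, symbol, go_id, evidence, aspect):
    """Build one placeholder GAF line (illustrative values only)."""
    return "\t".join([
        "UniProtKB", acc, symbol, "", go_id, "REF:example", evidence, "",
        aspect, "", "", "protein", "taxon:9913", "20091029", "EXAMPLE",
    ])


SAMPLE_GAF = "\n".join([
    "!gaf-version: 1.0",
    _row("P52505", "NDUFAB1", "GO:0005759", "IDA", "C"),
    _row("P52505", "NDUFAB1", "GO:0008610", "IEA", "P"),
    # ND annotations point at the root term of the ontology.
    _row("X00001", "HYPO1", "GO:0003674", "ND", "F"),
])

if __name__ == "__main__":
    rows = parse_gaf(SAMPLE_GAF)
    print(go_coverage(rows, ["P52505", "X00001", "X00002"]))
```

Reporting "ND only" separately from "no GO" keeps the distinction made earlier: the first means a biocurator looked and found no functional data, the second means nobody has annotated the gene product yet.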