Ontologies, data standards and controlled vocabularies
Why use standards and CVs?
• Essential in high-throughput biology for sorting
through vast amounts of data
• To use the same data labels universally
• To enable quick retrieval of data
• To enable easy comparison of data
• To remove ambiguities
What’s in a name?
• What is a cell?
Ambiguities in naming
• The same name can be used to describe different
concepts
• Different names can also be used to describe the
same concept, e.g.:
– Glucose synthesis
– Glucose biosynthesis
– Glucose formation
– Glucose anabolism
– Gluconeogenesis
• All refer to the process of making glucose
• This makes it difficult to compare the information
• Solution: use ontologies and data standards
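As a rough illustration of what a controlled vocabulary buys you, the glucose synonyms listed above can all be resolved to one canonical term before any comparison is made. This is a minimal sketch: the lookup table and the normalise function are made up for this example, and the identifier GO:0006094 is an assumption that should be checked against the current GO release.

```python
# Toy controlled vocabulary: one canonical term plus the synonyms from the slide.
# The identifier is an assumption for illustration; verify it against GO before use.
CANONICAL = {"id": "GO:0006094", "name": "gluconeogenesis"}

SYNONYMS = [
    "glucose synthesis",
    "glucose biosynthesis",
    "glucose formation",
    "glucose anabolism",
    "gluconeogenesis",
]

# Lookup table from a lower-cased label to the canonical term.
LABEL_TO_TERM = {label: CANONICAL for label in SYNONYMS}


def normalise(label):
    """Return the canonical term id for a free-text label, or None if unknown."""
    term = LABEL_TO_TERM.get(label.strip().lower())
    return term["id"] if term else None


labels = ["Glucose synthesis", "GLUCONEOGENESIS", "glucose formation ", "glycolysis"]
print([normalise(label) for label in labels])
```

Labels that are not in the vocabulary come back as None, which makes unmapped free text easy to spot.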
Ontologies
• An ontology is a formal specification of
terms and the relationships between them;
ontologies are widely used in biology and
bioinformatics (e.g. taxonomy)
• The relationships are important and
represented as graphs
• Ontology terms should have definitions
• Ontologies are machine-readable
• They are needed for ordering and
comparing large data sets
Gene Ontology (GO)
• http://www.geneontology.org
• Many annotation systems are organism-specific or
use different levels of granularity
• GO introduced a standard vocabulary, first used for
mouse, fly and yeast but now generic
• Three ontologies: molecular function, biological
process and cellular component
GO Ontologies
• Molecular function: tasks performed by a gene
product, e.g. G-protein coupled receptor
• Biological process: broad biological goals
accomplished by one or more gene products, e.g.
G-protein signaling pathway
• Cellular component: part(s) of a cell of which a
gene product is a component; includes the extracellular
environment of cells, e.g. nucleus, membrane
GO hierarchy
Relationships:
– “is-a”
– “part of”
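The two relationship types can be pictured as labelled edges in a graph. The sketch below is a toy fragment only: the term names and edges are illustrative and are not copied from a GO release, and the ancestors function is a hypothetical helper written for this example.

```python
# Toy ontology fragment: each term maps to a list of (relationship, parent) pairs.
# Term names and edges are illustrative, not taken from a GO release.
PARENTS = {
    "nucleolus": [("part_of", "nucleus")],
    "nucleus": [("is_a", "organelle"), ("part_of", "cell")],
    "organelle": [("is_a", "cellular component")],
    "cell": [("is_a", "cellular component")],
    "cellular component": [],
}


def ancestors(term, graph=PARENTS):
    """Collect every term reachable by following is_a / part_of edges upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("nucleolus")))
```

Because a term can have more than one parent, the structure is a graph rather than a simple tree, which is why the traversal keeps a set of terms already seen.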
How do gene products get GO terms?
• Electronic annotation:
– Through mappings to other biological entities and
then automatic inference to proteins
• Manual annotation:
– Model organism databases
– Gene Ontology Annotation (GOA) project
• Evidence codes: attached to all GO
annotations to show the source
Evidence Codes
IEA    Inferred from Electronic Annotation
IDA    Inferred from Direct Assay
IMP    Inferred from Mutant Phenotype
IPI    Inferred from Protein Interaction
IEP    Inferred from Expression Pattern
IGI    Inferred from Genetic Interaction
ISS*   Inferred from Sequence or Structural Similarity
IGC    Inferred from Genomic Context
RCA    Reviewed Computational Analysis
TAS    Traceable Author Statement
NAS    Non-traceable Author Statement
IC     Inferred from Curator Judgement
ND     No Data available
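In practice the evidence code is what lets you decide how much to trust an annotation. The sketch below stores the codes listed above in a lookup table and sorts them into three groups. The grouping is an assumption made for this example (IEA as electronic, the five assay-based codes as experimental, everything else as other curated), and evidence_group is a hypothetical helper.

```python
# Evidence codes and descriptions as listed on the slide.
EVIDENCE_CODES = {
    "IEA": "Inferred from Electronic Annotation",
    "IDA": "Inferred from Direct Assay",
    "IMP": "Inferred from Mutant Phenotype",
    "IPI": "Inferred from Protein Interaction",
    "IEP": "Inferred from Expression Pattern",
    "IGI": "Inferred from Genetic Interaction",
    "ISS": "Inferred from Sequence or Structural Similarity",
    "IGC": "Inferred from Genomic Context",
    "RCA": "Reviewed Computational Analysis",
    "TAS": "Traceable Author Statement",
    "NAS": "Non-traceable Author Statement",
    "IC": "Inferred from Curator Judgement",
    "ND": "No Data available",
}

# Assumed grouping for this sketch, not an official classification.
EXPERIMENTAL = {"IDA", "IMP", "IPI", "IEP", "IGI"}


def evidence_group(code):
    """Return a coarse group name for an evidence code."""
    if code not in EVIDENCE_CODES:
        raise KeyError(f"unknown evidence code: {code}")
    if code == "IEA":
        return "electronic"
    if code in EXPERIMENTAL:
        return "experimental"
    return "other curated"


for code in ("IEA", "IDA", "TAS"):
    print(code, "-", EVIDENCE_CODES[code], "-", evidence_group(code))
```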
Electronic annotation: GO mappings
Cross-references on a UniProt entry are mapped to GO terms:
• Fatty acid biosynthesis (SwissProt keyword) → GO:fatty acid biosynthesis (GO:0006633)
• EC:6.4.1.2 (EC number) → GO:acetyl-CoA carboxylase activity (GO:0003989)
• IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry) → GO:acetyl-CoA carboxylase activity (GO:0003989)
• MF_00527: Putative 3-methyladenine DNA glycosylase (HAMAP) → GO:DNA repair (GO:0006281)
Camon et al. BMC Bioinformatics. 2005; 6 Suppl 1:S17
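The mapping idea can be sketched as a simple table lookup: each external identifier found on a protein entry is translated to a GO term, and the resulting annotation is tagged IEA. The four mappings are the ones shown above; the key prefixes (KW:, InterPro:, HAMAP:), the protein identifier EXAMPLE_PROTEIN and the function name are assumptions made for this example, not the format of any real mapping file.

```python
# External identifier -> (GO id, GO term name), taken from the examples on the slide.
# The key prefixes are an assumption of this sketch.
XREF_TO_GO = {
    "KW:Fatty acid biosynthesis": ("GO:0006633", "fatty acid biosynthesis"),
    "EC:6.4.1.2": ("GO:0003989", "acetyl-CoA carboxylase activity"),
    "InterPro:IPR000438": ("GO:0003989", "acetyl-CoA carboxylase activity"),
    "HAMAP:MF_00527": ("GO:0006281", "DNA repair"),
}


def electronic_annotations(protein_id, xrefs):
    """Infer GO annotations (evidence IEA) from a protein's cross-references."""
    annotations = []
    seen_terms = set()
    for xref in xrefs:
        hit = XREF_TO_GO.get(xref)
        if hit is None or hit[0] in seen_terms:
            continue  # no mapping, or this GO term was already inferred
        seen_terms.add(hit[0])
        annotations.append({
            "protein": protein_id,
            "go_id": hit[0],
            "go_name": hit[1],
            "evidence": "IEA",
            "source": xref,
        })
    return annotations


# Hypothetical protein carrying two cross-references that map to the same term.
for row in electronic_annotations("EXAMPLE_PROTEIN", ["EC:6.4.1.2", "InterPro:IPR000438"]):
    print(row)
```

Two different cross-references can point at the same GO term, so the sketch keeps only the first hit per term.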
Automatic transfer of annotations to orthologs
Ensembl GO term projection via gene homology
• COMPARA: homologies between different species are calculated
• GO terms are projected from MANUAL annotation only
(IDA, IEP, IGI, IMP, IPI)
• Only one-to-one and apparent one-to-one orthologies are used
• Species shown: Cow, Mouse, Rat, Drosophila, Dog, Chicken, Anopheles
http://www.ensembl.org/info/data/compara
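The projection rules stated above (manual evidence codes only, one-to-one or apparent one-to-one orthologs only) amount to two filters. This is a sketch under stated assumptions, not the Ensembl implementation: the gene names, the orthology type labels and the project function are all hypothetical.

```python
# Evidence codes that qualify an annotation for projection, as listed on the slide.
PROJECTABLE = {"IDA", "IEP", "IGI", "IMP", "IPI"}

# Orthology type labels are hypothetical names chosen for this sketch.
ONE_TO_ONE = {"one_to_one", "apparent_one_to_one"}


def project(annotations, orthologs):
    """Copy qualifying GO annotations from a gene to its one-to-one orthologs.

    annotations: iterable of (gene, go_id, evidence_code)
    orthologs:   iterable of (source_gene, target_gene, orthology_type)
    Returns a list of (target_gene, go_id, source_gene).
    """
    projected = []
    for gene, go_id, evidence in annotations:
        if evidence not in PROJECTABLE:
            continue  # electronic or non-experimental annotations are not projected
        for source, target, kind in orthologs:
            if source == gene and kind in ONE_TO_ONE:
                projected.append((target, go_id, gene))
    return projected


# Hypothetical input data.
annotations = [
    ("mouse_geneA", "GO:0006281", "IDA"),
    ("mouse_geneA", "GO:0003989", "IEA"),
]
orthologs = [
    ("mouse_geneA", "rat_geneA", "one_to_one"),
    ("mouse_geneA", "dog_geneA1", "one_to_many"),
]
print(project(annotations, orthologs))
```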
Manual annotation: GOA Project
• Largest open-source contributor of annotations to GO
• Member of the GO Consortium since 2001
• Provides annotation for more than 130,000 species
• GOA’s priority is to annotate the human proteome
• GOA is responsible for human, chicken, bovine and
many other annotations for the GO Consortium
• Annotation is done through reading of the literature
Reference Genomes
• Comprehensive annotation of a set of disease-related proteins in human
• Generate a reliable set of GO annotations for the 12 selected genomes
• Empowers comparative methods used in first pass annotation of other proteomes.
Arabidopsis thaliana
Caenorhabditis elegans
Danio rerio (zebrafish)
Dictyostelium discoideum
Drosophila melanogaster
Escherichia coli
Homo sapiens
Saccharomyces cerevisiae
Mus musculus
Schizosaccharomyces pombe
Gallus gallus
Rattus norvegicus
Accessing GO data (1)
http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
Accessing GO data (2)
QuickGO browser
Human Insulin Receptor (P06213)
http://www.ebi.ac.uk/quickgo
Accessing GO data (3)
Gene Association Files
http://www.geneontology.org/GO.current.annotations.shtm
Accessing GO data (3)
Gene Association File example
Downloading GOA data
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/
http://www.ebi.ac.uk/GOA/downloads.html
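Once a gene association file has been downloaded, it can be read as plain tab-separated text. The sketch below assumes the standard GAF layout (15 columns in GAF 1.0, 17 in GAF 2.x, comment lines starting with "!"). The column names are this example's own labels, and the sample record is made up for illustration: its reference and date are placeholders, not a real annotation.

```python
# Column labels chosen for this sketch; the order follows the GAF 2.x layout.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
]


def parse_gaf(lines):
    """Yield one dict per annotation line, skipping '!' comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        if len(fields) < 15:
            raise ValueError(f"expected at least 15 columns, got {len(fields)}")
        # Pad GAF 1.0 rows so every record has the same keys.
        fields += [""] * (len(GAF_COLUMNS) - len(fields))
        yield dict(zip(GAF_COLUMNS, fields))


# Illustrative record only: the reference and date are placeholders.
sample = [
    "!gaf-version: 2.0",
    "\t".join([
        "UniProtKB", "P06213", "INSR", "", "GO:0005524", "PMID:0000000",
        "IDA", "", "F", "Insulin receptor", "", "protein",
        "taxon:9606", "20080101", "UniProtKB", "", "",
    ]),
]

for record in parse_gaf(sample):
    print(record["db_object_id"], record["go_id"], record["evidence_code"])
```

For a real file, the same function can be fed an open file handle, since it only iterates over lines.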
Uses of GO 1
Functional annotation of proteins
Uses of GO 2
Find functional information on interaction proteins (IntAct)
Uses of GO 3
Microarray data analysis
Analysis of high-throughput data
GO classification
Larkin JE et al, Physiol Genomics, 2004
Cunliffe HE et al, Cancer Res, 2003
Proteomics data analysis
GO classification
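A GO classification of a gene or protein list is, at its simplest, a tally of how many entries carry each term. The sketch below shows only that counting step, with no statistics. The gene names, their term assignments and the classify function are hypothetical.

```python
from collections import Counter

# Hypothetical gene -> GO term names; not taken from any real data set.
GENE_TO_TERMS = {
    "geneA": ["signal transduction", "transport"],
    "geneB": ["signal transduction"],
    "geneC": ["DNA repair"],
}


def classify(genes, gene_to_terms):
    """Count how many genes in the list are annotated to each GO term."""
    counts = Counter()
    unannotated = []
    for gene in genes:
        terms = gene_to_terms.get(gene)
        if not terms:
            unannotated.append(gene)
            continue
        counts.update(set(terms))  # count each term once per gene
    return counts, unannotated


counts, unannotated = classify(["geneA", "geneB", "geneC", "geneD"], GENE_TO_TERMS)
for term, n in counts.most_common():
    print(term, n)
print("no annotation:", unannotated)
```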
Other Ontologies:
Open Biomedical Ontologies
http://obo.sourceforge.net
• Central location for accessing well-structured
controlled vocabularies and ontologies for use
in the biological and medical sciences.
• Provides simple format for ontologies that can
encode terms, relationships between terms and
definitions of terms including those taken from
external ontologies.
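To give a feel for the format, the sketch below reads a few [Term] stanzas written in the OBO flat-file style (tag: value lines, with "!" introducing a trailing comment). The identifiers and text in the sample are invented, and the parser is deliberately minimal: it ignores escaping rules, trailing modifiers and stanza types other than [Term], so it is a starting point and should not be treated as a full OBO reader.

```python
# Invented sample in OBO flat-file style; the EX: identifiers are not real.
OBO_TEXT = """\
format-version: 1.2

[Term]
id: EX:0000001
name: cell

[Term]
id: EX:0000003
name: organelle

[Term]
id: EX:0000002
name: nucleus
def: "Illustrative definition text." [EX:ref]
is_a: EX:0000003 ! organelle
relationship: part_of EX:0000001 ! cell
"""


def parse_obo_terms(text):
    """Return {term id: {tag: [values]}} for every [Term] stanza in the text."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # Start of a stanza; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            if current is not None:
                stanzas.append(current)
            continue
        if current is None:
            continue  # header line or a stanza type we skip
        tag, _, value = line.partition(":")
        value = value.split(" ! ")[0].strip()  # drop the trailing comment
        current.setdefault(tag.strip(), []).append(value)
    return {s["id"][0]: s for s in stanzas if "id" in s}


terms = parse_obo_terms(OBO_TEXT)
print(terms["EX:0000002"]["name"], terms["EX:0000002"]["relationship"])
```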
Scope of Open Biomedical Ontologies
• Anatomy
• Animal natural history and life history
• Chemical
• Development
• Ethology
• Evidence codes
• Experimental conditions
• Genomic and proteomic
• Metabolomics
• OBO relationship types
• Phenotype
• Taxonomic classification
Ontology Lookup Service (OLS)
• Single point of query for currently 47 ontologies.
• Ontologies are updated daily from CVS
repositories, including the OBO CVS repository
and the PRIDE CVS repository.
• A tool that offers interactive and programmatic
interfaces for queries on term names, synonyms,
relationships, annotations and database cross-references.
• Originally developed for using ontologies in
PRIDE.
The issue faced
• These relationships have
consequences when
querying a database
annotated using the
ontology.
• What happens when I ask
for PRIDE experiments
describing the proteome of
brain tissue?
Using Ontologies in PRIDE
For an experiment you want to define:
– Species: Newt / NCBI Taxonomy ID
– Tissue / organ / cell type: BRENDA Tissue
ontology, Cell Type ontology;
– Sub-cellular component: Gene Ontology: GO;
– Disease: Human Disease: DOID;
– Genotype: GO;
– Sample Processing: PSI Ontology;
– Mass Spectrometry: PSI-MS Ontology;
– Protein Modifications: PSI-MOD Ontology
OLS usage examples
• http://www.ebi.ac.uk/ontology-lookup/
• What is the accession for “mitochondrion” in GO? In MeSH?
– search by term name in a specific ontology or across all
• I’m looking for a term to annotate my protocol step but I’m not
sure what term to use.
– browse an ontology
• I’m looking for all the experiments done on liver tissue.
– get all child terms of liver and query on those as well
• My data set was annotated with GO version 123, but that was a
long time ago.
– get updated term names for the identifiers you have and see if any have
been made obsolete
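The liver example above can be sketched as a query expansion: collect the term and all of its child terms, then match experiments against the whole set. The sketch does not call OLS itself. The tissue hierarchy, the experiment identifiers and both functions are hypothetical stand-ins for what a lookup service and an annotated database would return.

```python
# Hypothetical tissue hierarchy: parent term -> direct child terms.
CHILDREN = {
    "liver": ["liver lobe", "hepatocyte"],
    "liver lobe": ["left liver lobe"],
}

# Hypothetical experiments, each annotated with one tissue term.
EXPERIMENTS = {
    "exp1": "liver",
    "exp2": "hepatocyte",
    "exp3": "left liver lobe",
    "exp4": "brain",
}


def descendants(term, children=CHILDREN):
    """Return every term below the given term in the hierarchy."""
    found = set()
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def experiments_for(term):
    """Find experiments annotated with the term or any of its child terms."""
    wanted = {term} | descendants(term)
    return sorted(exp for exp, tissue in EXPERIMENTS.items() if tissue in wanted)


print(experiments_for("liver"))
```

Querying on the literal string "liver" alone would return only exp1, which is the problem the child-term expansion is meant to solve.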
Standards for data exchange
• Systems Biology Markup Language (SBML) –
computer-readable format for representing
models of networks
• Biological Pathways Exchange (BioPAX) –
format for representing pathways
• Proteomics Standards Initiative (PSI, MIAPE)
• Microarray standards: MIAME and MAGE
MIAPE/MIAME principles
• Enough information to:
– Remove ambiguity in the experiment
– Allow easy interpretation of results
– Allow the experiment to be repeated
– Enable comparison across similar experiments
• Use controlled vocabularies
Using ontologies and standards
• So much data in different places: need to
organize and share it
• Used for data retrieval and comparison:
easier to query
• Used for data integration and exchange:
standard representation
• Used for evaluation: need a “gold
standard”