Download PowerPoint Presentation - Ontologies for biological annotation

Transcript
Linking Multiple Ontologies:
The OBO Foundry Approach
Chris Mungall
NIAID Cell Ontology Workshop
May 2008
Outline
• Introduction to ontologies
– The OBO perspective
– Case study in the Gene Ontology
• The OBO Foundry: goals and principles
• The OBO relation ontology
• Organization of ontologies in OBO
• Modularity
– An example from CL
• Linking CL to the OBO Foundry
What is an ontology?
• A computable
representation of
some domain
– What kinds of
things exist?
– What are the
relations that hold
between them?
[Figure: example graph. Heart is_a Cavitated organ; Heart part_of Cardiovascular System; Mitral valve part_of Heart; Aortic valve part_of Heart]
Aspects of an ontology
• Identifiers
– Uniquely identify a class / term
• E.g. CL:0000037 is ID for the term “hematopoietic
stem cell”
– Identifier metadata
• Terminological aspects
– Names and synonyms/alternate labels
• CL:0000037 has “hemopoietic progenitor cell”
as a related synonym and “hemopoietic stem cell”
as exact synonym
• Logical aspects
– Relations
– Definitions
Provenance
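The identifier and terminological aspects above can be pictured as a small record. This is a minimal sketch only: the `Term` class, its field names and the `split_id` helper are hypothetical and are not part of any OBO tool; the identifier and synonyms are the ones quoted on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One ontology class: an identifier plus terminological metadata."""
    id: str                                        # e.g. "CL:0000037"
    name: str                                      # primary label
    synonyms: dict = field(default_factory=dict)   # synonym text -> scope


def split_id(term_id):
    """Split an identifier such as 'CL:0000037' into (prefix, local part)."""
    prefix, local = term_id.split(":", 1)
    return prefix, local


def synonyms_with_scope(term, scope):
    """Return the synonyms of a term that carry the given scope, sorted."""
    return sorted(text for text, s in term.synonyms.items() if s == scope)


# Values taken from the slide; the scope labels EXACT / RELATED are
# written here as plain strings for illustration.
hsc = Term(
    id="CL:0000037",
    name="hematopoietic stem cell",
    synonyms={
        "hemopoietic progenitor cell": "RELATED",
        "hemopoietic stem cell": "EXACT",
    },
)

print(split_id(hsc.id))
print(synonyms_with_scope(hsc, "EXACT"))
```

Keeping the identifier separate from the labels is what lets a term be renamed without breaking existing annotations.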
Some ontologies and their
uses
• The Gene Ontology
– Annotation of gene products
– Analyzing high-throughput datasets
• Anatomical ontologies (including CL)
– Experimental metadata
– Image annotation
– Indicating location of gene expression
– Creating Phenotypic descriptions
• Others
– NLP
– Annotating information models
Origins of OBO: The Gene
Ontology (GO)
• 3 ontologies for annotating genes and
gene products
Ontology              # terms   # links
Molecular function       7889      9225
Biological process      13978     25065
Cellular component       2034      3894
• These ontologies are organised as a collection of related
terms, constituting nodes in a graph
– Gradually incorporating other logical axioms
Annotation and GO
• GO Annotations:
– Associations between genes and GO terms, with evidence
– Met17 : “methionine metabolism” GO:0006555
• 222,000 genes and gene products have high quality
annotations to GO terms
– 3.4m including automated predictions
– 66,000 publications curated
• Variety of analysis tools
– http://www.geneontology.org/GO.tools.shtml#micro
GO and high-throughput biology:
Over-representation of GO terms for
gene sets
GO::TermFinder
Sherlock et al
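Over-representation of a GO term in a gene set is commonly scored with a hypergeometric tail probability. The sketch below shows that calculation using only the Python standard library. It is an illustration of the general technique, not a description of how GO::TermFinder is implemented; the function name and parameter names are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement
    from `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers (not from the slides): 10 genes in the genome, 3 annotated
# to the term, and a gene set of 3 genes that are all annotated.
p = enrichment_p_value(population=10, annotated=3, sample=3, hits=3)
print(p)
```

A small p-value suggests that the term is annotated to more genes in the set than expected by chance.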
GO and the need for OBO
• GO terms implicitly reference kinds of entities outwith
the scope of GO (chemical, cell, anatomy)
– Methionine biosynthesis
– Neural crest cell migration
– Cardiac muscle morphogenesis
– Regulation of vascular permeability
• OBO was born from the need to create source quality
ontologies for GO term ‘cross-products’
– Define composite classes in terms of simpler ones
The Open Biomedical
Ontologies (OBO) Foundry
• A collection of orthogonal reference
ontologies in the biological/biomedical
domain
• The OBO Foundry: Each is committed
to an agreed upon set of principles
governing best practices in ontology
development
Some OBO ontologies
• Gene Ontology
• ChEBI - chemical entities
• OBI - investigations
• PATO, MP - phenotypes
• CL - cells
• ENVO - environment and habitat
• SO - sequence features
• Model organism anatomy
– ZFA
– Fly_anat
– Dicty_anat
– Mouse_anat
– …
• DO - Human diseases
• CARO - common anatomy
• OBO Relation Ontology
• FMA - human anatomy
OBO Foundry: criteria, v1
• Open
• Well-defined exchange format
– E.g. OBO or OWL
• Uses identifiers according to OBO ID policy
• Ontology Life-cycle / versioning
• Has clearly specified and delineated content
• Has unambiguous definitions
• Uses or extends relations in the OBO Relation Ontology
• Well documented
• Has a plurality of users (and a mail list & issue tracker)
• Developed collaboratively
• Orthogonal, modular
http://obofoundry.org/
OBO Relation Ontology
• Edges can link nodes…
– Within ontologies
– Across ontologies
• The precise meaning of the relation is
important
– Relations have formal definitions
– Rules for composing relations together
– http://obofoundry.org/ro/
Is_a
• X is_a Y
– If something is an instance of X (at time t),
then it is also an instance of Y (at t)
• Transitive
– B1 B cell is_a B cell
– B cell is_a lymphocyte
– Therefore B1 B cell is_a lymphocyte
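Transitivity means the full set of is_a ancestors can be computed by following asserted links. A minimal sketch, assuming the ontology is held as a plain dictionary from a class name to its asserted is_a parents (the dictionary layout and the function name are hypothetical):

```python
def ancestors(term, is_a):
    """All classes reachable from `term` by following is_a links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Asserted links from the slide.
IS_A = {
    "B1 B cell": ["B cell"],
    "B cell": ["lymphocyte"],
}

print(ancestors("B1 B cell", IS_A))
```

The link from B1 B cell to lymphocyte is never asserted; it falls out of the traversal.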
Part_of
• Instance level part_of relation is primitive
• Between classes:
– X part_of Y :
• Every instance of X is part_of some instance of Y
• Paneth cell part_of intestine : YES
• Nucleus part_of Cell : YES
• Neuron part_of brain : NO
– (there are some neurons that are part of other parts of the
nervous system)
• Transitive
– X part_of Y, Y part_of Z
• Therefore, X part_of Z
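The class-level reading ("every instance of X is part_of some instance of Y") can be checked against a set of recorded instances. The sketch below is illustrative: the instance names, the two dictionaries and the function name are hypothetical, and only direct instance-level part_of links are consulted.

```python
def class_part_of(x, y, instance_of, part_of):
    """True if every recorded instance of class x is part_of some
    instance of class y. Vacuously true when x has no instances."""
    xs = [i for i, cls in instance_of.items() if cls == x]
    return all(
        any(instance_of.get(whole) == y for whole in part_of.get(i, ()))
        for i in xs
    )


# Hypothetical instance data: one neuron sits in a brain, another in a
# spinal cord, so "neuron part_of brain" fails at the class level.
INSTANCE_OF = {
    "n1": "neuron", "n2": "neuron",
    "brain1": "brain", "sc1": "spinal cord",
    "nuc1": "nucleus", "cell1": "cell",
}
PART_OF = {"n1": ["brain1"], "n2": ["sc1"], "nuc1": ["cell1"]}

print(class_part_of("neuron", "brain", INSTANCE_OF, PART_OF))
print(class_part_of("nucleus", "cell", INSTANCE_OF, PART_OF))
```

A check like this can only refute a class-level claim from data; asserting it in the ontology is a statement about all instances, recorded or not.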
Has_part
• Instance level inverse of part_of
• X has_part Y
– Every X has some Y as part
– Cell has_part nucleus : NO
– Nucleate erythrocyte has_part nucleus :
YES
Develops_from
• X develops_from Y
– Every instance of X was once a Y, or inherited a
significant portion of its matter from a Y
• Example: erythrocyte develops_from reticulocyte
• Transitive
– erythrocyte develops_from reticulocyte
– reticulocyte develops_from orthochromatic
erythroblast
• =>
– erythrocyte develops_from orthochromatic
erythroblast
Transformation and derivation
• Develops_from relation can be refined into
two cases:
– Transformation_of
• X transformation_of Y :
– Any instance of X was previously an instance of Y
– Example: erythrocyte transformation_of reticulocyte
– Derives_from
• X derives_from Y :
– Holds between distinct instances where X inherits matter
from Y
• Most OBO ontologies just use the
develops_from relation
Other relations
• Inherence
– Between a quality and an object
– E.g. between a specific shape and a cell
• Participation
– Between a process and an object
– E.g. between a B cell and an immune
process
Definitions state necessary
and sufficient conditions
• Links in the ontology graph state necessary
conditions for a class
• E.g. erythroid progenitor cell develops_from
megakaryocyte erythroid progenitor
– These characteristics may not be unique
• A definition should state necessary and
sufficient conditions for a class
– The characteristics must be unique to the defined
class
• E.g. “progenitor cell that is committed to the erythroid
lineage”
• Definition should be precise and (as far as
possible) translated / translatable to logical
Genus differentia definitions
• Of the form
– An X is a G that D
– G should be in the same ontology
– D is discriminating characteristics that differentiate
(in the classification sense) Xs from other Gs.
• Relations to terms in an ontology (the same ontology or a
different one)
• Example:
– A B cell is a lymphocyte that expresses an
immunoglobulin complex
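A genus-differentia definition has a regular shape, so it can be stored as structured data and rendered as text. A minimal sketch follows; the class name and fields are hypothetical, and the relation name "expresses" is borrowed from the B cell example elsewhere in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenusDifferentia:
    """'An X is a G that D', with D held as (relation, filler) pairs."""
    defined: str
    genus: str
    differentia: tuple

    def text(self):
        clauses = " and ".join(
            f"{relation} {filler}" for relation, filler in self.differentia
        )
        return f"A {self.defined} is a {self.genus} that {clauses}"


b_cell = GenusDifferentia(
    defined="B cell",
    genus="lymphocyte",
    differentia=(("expresses", "immunoglobulin complex"),),
)

print(b_cell.text())
```

Because the fillers are ordinary term references, they can point into the same ontology or into a different one.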
Orthogonality of ontologies
• No two ontologies should represent the same
kind of entity
– E.g. “B-cell” should only be represented in one
ontology
– Related entities should be coordinated across
ontologies
• GO: “B-cell differentiation”
• Exceptions:
– The term “cell” connects GO Cellular Component
(cell parts) and CL (cells)
• Advantages:
– Reduces redundancy and work
– Easier to make the union consistent
Some OBO terms..
bile, fat body, liver, obesity, liver development, hepatoma, oenocyte differentiation, hepatic artery, oenocyte, hepatocyte, insulin, glucose, glycogen, increased circulating glucose level, carbohydrate metabolism
[Figure: the same terms labelled by source ontology: FMA (adult human), FBbt (fly), MP (mammal phenotype), MA (mouse), GO (biological process), CL, DO, PRO, CHEBI]
How should we organize this?
Top-level organisation (BFO:
Basic Formal Ontology)
• General categories
– 3D things (continuants)
• Independent
– Cells, organs,
molecules
• Dependent
– Shapes, sizes,
concentrations, …
– 4D things (processes)
• Processes
• Useful organisational
principle for OBO
• is_a and part_of should
not cross top level
• Levels of granularity
(scale)
– Population
– Organism
– Organ
– Cell
– Molecule
• part_of relations can
cross levels
[Figure: the example terms from FMA, FBbt, MP, MA, GO, CL, DO, PRO and CHEBI arranged under three top-level groupings: Objects, Qualities etc, Processes]
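RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT) | OCCURRENT
GRANULARITY: ORGAN AND ORGANISM
– Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
– Dependent continuant: Organ Function (FMP, CPRO)
– Occurrent: Organism-Level Process (GO)
GRANULARITY: CELL AND CELLULAR COMPONENT
– Independent continuant: Cell (CL); Cellular Component (FMA, GO)
– Dependent continuant: Cellular Function (GO)
– Occurrent: Cellular Process (GO)
GRANULARITY: MOLECULE
– Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
– Dependent continuant: Molecular Function (GO)
– Occurrent: Molecular Process (GO)
Dependent continuant: Phenotypic Quality (PaTO)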
The OBO Foundry can help
with modular ontology design
• Biology is complex
– So our ontologies will be complex
– Multiple purposes
– Multiple means of classifying
• Separate out different aspects
– Modular approach
– Avoid multiple inheritance (>1 is_a parent)
• Don’t over-use is_a
• Don’t cross aspects with is_a
• Make complex descriptions from simpler
parts
Cysteine biosynthesis (trimmed): GO tangled polyhierarchy
Cysteine biosynthesis (trimmed): process axis
Cysteine biosynthesis (trimmed): chemical structure axis
Cysteine biosynthesis (trimmed) shown with ChEBI (trimmed)
[Figures not recovered in transcript]
Cysteine biosynthesis (trimmed) shown with ChEBI (trimmed)
We can do more than simply link terms: Cross-products
(aka logical definitions, computable genus-differentia definitions)
Cysteine biosynthesis GO:0019344 =
a biosynthetic process GO:0009058 } genus
that results_in_creation_of cysteine CHEBI:13536 } differentia
[Figure label: results_in_change_to]
Cysteine biosynthetic process =
biosynthetic process that results_in_change_to cysteine
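One way to write the cross-product above down is as an OBO flat-file stanza with `intersection_of` tags, one for the genus and one for the differentia. The sketch below only builds that text; the helper name is hypothetical, and the choice of `results_in_creation_of` as the relation follows the slide rather than any released GO file.

```python
def xp_stanza(term_id, name, genus_id, relation, filler_id):
    """Render a genus-differentia cross-product as OBO-style text."""
    lines = [
        "[Term]",
        f"id: {term_id}",
        f"name: {name}",
        f"intersection_of: {genus_id}",               # genus
        f"intersection_of: {relation} {filler_id}",   # differentia
    ]
    return "\n".join(lines)


# Identifiers as given on the slide.
stanza = xp_stanza(
    term_id="GO:0019344",
    name="cysteine biosynthetic process",
    genus_id="GO:0009058",
    relation="results_in_creation_of",
    filler_id="CHEBI:13536",
)

print(stanza)
```

The two `intersection_of` lines together read as "a biosynthetic process that results_in_creation_of cysteine".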
Let the computer do the work..
• Given cross-products, a reasoner can add all links
• Underlying representation is normalized
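The idea that a reasoner can add the links is sketched below with a toy classifier. It is a simplification, not a description of any real reasoner: each defined class has a single (genus, relation, filler) definition, the chemical hierarchy is a hypothetical trimmed fragment written with names rather than ChEBI identifiers, and asserted is_a links are assumed to be acyclic.

```python
def is_subclass(sub, sup, is_a):
    """Reflexive, transitive is_a test over asserted links (acyclic)."""
    if sub == sup:
        return True
    return any(is_subclass(parent, sup, is_a) for parent in is_a.get(sub, ()))


def infer_is_a(defs, is_a):
    """defs maps a class name to (genus, relation, filler).
    A is_a B is entailed when both use the same relation, A's genus is a
    subclass of B's genus and A's filler is a subclass of B's filler."""
    inferred = set()
    for child, (g1, r1, f1) in defs.items():
        for parent, (g2, r2, f2) in defs.items():
            if child == parent or r1 != r2:
                continue
            if is_subclass(g1, g2, is_a) and is_subclass(f1, f2, is_a):
                inferred.add((child, parent))
    return inferred


# Hypothetical trimmed chemical hierarchy (names only, for illustration).
IS_A = {
    "cysteine": ["sulfur amino acid"],
    "sulfur amino acid": ["amino acid"],
}
REL = "results_in_creation_of"
DEFS = {
    "cysteine biosynthesis": ("biosynthetic process", REL, "cysteine"),
    "sulfur amino acid biosynthesis": ("biosynthetic process", REL, "sulfur amino acid"),
    "amino acid biosynthesis": ("biosynthetic process", REL, "amino acid"),
}

for link in sorted(infer_is_a(DEFS, IS_A)):
    print(link)
```

The result includes every entailed link, including those implied by transitivity; a real tool would typically also remove the redundant ones before display.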
Example of is_a-overloading: OBO Cell Ontology (current)
[Figures: CL hierarchy, then CL alongside GO, with crossed-out is_a links and a "Has function ?" link; not recovered in transcript]
• Try not to assert too many is_a parents
• Reuse existing ontologies
• Non-is_a relation
How CL can use other OBO
ontologies
• GO Cellular component
– Mononuclear phagocyte
– B cell (expresses immunoglobulin complex)
• GO Biological process
– Photosynthetic cell
• PATO Qualities
– Spiny neuron
• CHEBI Chemical entities
– X secreting cell
• Anatomy Ontologies
– CNS neuron
• Molecular function, PRO
– CD4 positive cell
How CL is used by other
ontologies
Ontology | Example | Genus | Differentia
GO-BP | T cell differentiation | Cell differentiation | Results_in_acquisition_of_features_of T cell
GO-CC | Germ cell nucleus | Nucleus | Part_of germ cell
MP | Abnormal macrophage morphology | Abnormal morphology | Inheres_in macrophage
ZFA (zebrafish) | erythrocyte | erythrocyte | In_organism Danio

Ontology | Example | Relationship
OBI | | Has_part nucleus
DO (disease) | |
• Biological process x CL
• http://wiki.geneontology.org/index.php?XP:biological_process_xp_cell
– Uncovered inconsistencies between GO and CL
– Oenocyte differentiation is_a columnar/cuboidal
epithelial cell differentiation
• MP x CL
• http://wiki.geneontology.org/index.php/XP:mammalian_phenotype_xp
– Resulted in various fixes to MP
OBD: Ontology Annotation
Database
Summary
• The cell ontology is a representation of the
types of cell that exist
• The OBO Foundry provides
– Principles
– A framework for connecting ontologies
• There are many points of coordination
between CL and other OBO ontologies
• CL could benefit from the gradual introduction
of a modular approach
The Gene Ontology; and
beyond
• Curation of genes and gene
products
– Molecular function
– Biological process
– Cellular component
• Multiple databases using the same ontology (GO)
• What about curation of other
data types?
– Expression, transcriptomics
– Genetics, phenotypes and
disease
– Many others..
• OBO
– Open Bio-Ontologies
– Arose partly in response to
requirements outside scope
of GO
Islands of biological data
[Figure: separate islands labelled Anatomy ontologies, Phenotype ontologies, GO]
Connecting the islands
[Figure not recovered in transcript]
Amino acid cross-products in GO:
Bada et al : GO to ChEBI
http://www.berkeleybop.org/obol
• GO approach is retrospective
– Text based approaches to ‘decompose’ terms
• Obol
• Bada/Hunter
– Born of necessity
• OBO did not exist when GO started
– Hard work
• New ontologies should take the prospective
approach
– Separate out aspects from the outset
– No heuristic parsing necessary
Prospective approach:
Sequence Ontology
Separate
hierarchies created
from the outset
- cross-products
made from the
beginning
OBI: Ontology for Biomedical
Investigations
• Successor to MGED/FuGO
• Represents the realm of investigations
– Biomaterials
– Equipment
– Protocols
– Data transformations
• Makes maximal use of OBO
– PATO:
– ChEBI:
• Primary representation language is OWL
– Uses OWL translations at http://purl.org/obo/
Social Insect Behavior
Ontology
• 4 distinct hierarchies
– Anatomical entity
– Behavior
– Chemical entity
– Species
• Links
– derives_from, between
chemical and
anatomical entity
• Future plans
– Submit chemical terms
to ChEBI
– Upper level behavior
ontology?
Anatomy
• GO is relevant for all kingdoms of life
• Development of anatomical ontologies has
been less coordinated
– Cell & subcellular: one ontology applicable to all
– Gross Anatomy: multiple ontologies
• Vertebrate:
– MA + EMAP: Mouse
– FMA: Human (adult)
– EHDA: Human
– ZFA: Zebrafish
– TAO: teleost anatomy
– XAO: Xenopus
• Invertebrate:
– FBbt: Drosophila anatomy
– Tick anatomy
– Mosquito anatomy
Anatomy: Ongoing work
• CARO
– Upper level shared anatomical ontology
– Very general terms
• Teleost anatomy ontology
– Broader than zebrafish anatomy ontology
– Will include homology links
• Linking cells to gross anatomical entity
poster
poster
poster
– Purkinje cell part_of cerebellum
– Spans ontologies (CL + ssAO)
• BIRNLex
• Stages and development
talk
Using multiple ontologies: Pre
vs post composition
• Complex descriptions (aka cross-products)
can be composed from 2 or more terms
– By ontology editors (pre)
– By curators (post)
• Example:
– Liver hyperplasia
• Precomposed phenotype ontology
– MP:0005141 “liver hyperplasia” increased size of liver due to
increased hepatocyte cell number
• Post-composition at time of genotype curation
– PATO:0000644 “hyperplastic”
– MA:0000358 “liver”
• Which strategy to choose?
• Either strategy can be used
• Or mixed and matched
– Caveat:
• Pre-composed terms must have computable definitions
(cross-products)
• Currently created retrospectively
• Current progress :
– MP (Mammalian Phenotype):
• 4136/5760 xp defs, partially vetted
• Caveat: species-specificity
– WormPhenotype:
• 350/1569 xp defs
– PlantTrait:
• 340/765 xp defs, partially vetted
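The caveat that pre-composed terms need computable definitions is what makes the two strategies interchangeable. The sketch below shows the lookup in its simplest form. The dictionary layout and function name are hypothetical, and the cross-product recorded for MP:0005141 (quality PATO:0000644 inhering in entity MA:0000358) is an assumption made for illustration from the liver hyperplasia example, not a quotation of the actual MP definition.

```python
# Assumed xp definition: pre-composed term -> (quality, entity).
XP_DEFS = {
    "MP:0005141": ("PATO:0000644", "MA:0000358"),  # "liver hyperplasia"
}


def precomposed_for(quality, entity, xp_defs):
    """Pre-composed terms whose xp definition matches a post-composed
    (quality, entity) description exactly."""
    return sorted(
        term for term, definition in xp_defs.items()
        if definition == (quality, entity)
    )


# A curator post-composes "hyperplastic" + "liver" at annotation time.
print(precomposed_for("PATO:0000644", "MA:0000358", XP_DEFS))
```

With definitions like this in place, annotations made with either strategy can be queried together; a fuller version would also use is_a reasoning rather than exact matching.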
Other ontologies
• Envo + GAZ
– Environmental ontology and gazetteer
– Habitats:
• Host (anatomy)
• Geographical features (eg hydrothermal vents)
– Qualities, chemical entities
• BIRNLex
• Protein Ontology
– Links to/from GO
• Complexes
• Functions of ancestral proteins
Envo-based annotation in Phenote
Technical consequences of
modular approach
• Dependencies
– Technical issues
• Dependence on network?
• Formats - converters
– Social & management issues
– Change and versioning
• http://www.bioontologies.org/
• Managing dependencies
• http://obofoundry.org/wiki/index.php/Mappings
– Stable URLs for downloading ontologies in obo or owl:
http://purl.org/obo/
– OBO Identifier policy
Conclusions
• Be modular
– Distinct hierarchies
– Avoid is_a overloading
– Link to existing ontologies
• Rewards
– Standards
– Increases value of curated data
– Reduces duplication of effort and maximises
curation effort
– Ontologies are long term infrastructure
• It’s worth getting them right
Learning more
• http://www.bioontology.org
– National Center for Biomedical Ontology
– Browse and search OBO
– Coming soon: inter-ontology links
• http://obofoundry.org
– Principles and recommendations
– Participation
• Mailing lists
• Trackers
Restructuring
Cell.obo
OBO Cell Ontology
• Current version
– Overloading of is_a hierarchy
– Difficult to maintain
– Leads to “true path” violations
• Refactoring
– Replace is_a links with has_function
– Keep main axis structure-based (but not religiously so)
• For every term immediately under cell-by-function,
we made a new function term
• propagation of genome
• to circulate
• to secrete
• to metabolise
• to contract
• Electrical absorption
• Barrier
• Motility
• Structural
• to accumulate stuff
• signaling (mitogenic)
• to die
• Defense
• Transport
• to photosynthesize
• to support
• Valve
• to fix nitrogen
• Also create grouping
terms
• Replaced is_a links to cell-by-function terms
with has_function links to corresponding
function terms
• What do we do about the old cell-by-function terms?
• We can eliminate them..
• OR we can support them, but infer the ‘tangled DAG’
• Requires xp defs:
– Nitrogen fixing cell = cell THAT has_function nitrogen-fixing
• Future work / ongoing issues:
• Redundancy between cell functions & GO biological
process?
• Cell-by-lineage
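The option of keeping the old cell-by-function terms but inferring the tangled DAG can be sketched as follows. Everything here is illustrative: the has_function assertions, the grouping term "contractile cell" and the function name are hypothetical, and each grouping term is assumed to have an xp definition of the form "cell THAT has_function F".

```python
# Hypothetical asserted links: cell type -> set of function terms.
HAS_FUNCTION = {
    "heterocyst": {"to fix nitrogen"},
    "cardiac muscle cell": {"to contract"},
}

# xp definitions: grouping term = cell THAT has_function <function>.
XP_DEFS = {
    "nitrogen fixing cell": "to fix nitrogen",
    "contractile cell": "to contract",
}


def infer_cell_by_function(has_function, xp_defs):
    """Return inferred (cell, grouping) is_a links from has_function."""
    inferred = set()
    for cell, functions in has_function.items():
        for grouping, function in xp_defs.items():
            if function in functions:
                inferred.add((cell, grouping))
    return inferred


for link in sorted(infer_cell_by_function(HAS_FUNCTION, XP_DEFS)):
    print(link)
```

Editors then maintain only the has_function links, and the cell-by-function hierarchy is regenerated rather than hand-curated.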
Synchronizing ssAOs and CL
• Fly_anat, zfa, plant_anat all represent cell types
– Part_of links from cells to gross anatomy
• E.g. purkinje_cell part_of cerebellum
• Methodology
– Xrefs from ssAOs to CL IDs
– Treat as ss subtypes
– Use reasoner to stay in sync
– http://www.bioontology.org/wiki/index.php/CL:Aligning_species-specific_anatomy_ontologies_with_CL
– Examples:
• http://www.berkeleybop.org/obol/#fly_anatomy_xp_cell-obol
Transformation_of
• Class-level relation between continuant types
• Transitive
• Relation between two classes, in which instances retain their
identity yet change their classification by virtue of some kind of
transformation. Formally: C transformation_of C' if and only if
given any c and any t, if c instantiates C at time t, then for some
t', c instantiates C' at t' and t' is earlier than t, and there is no t2 such
that c instantiates C at t2 and c instantiates C' at t2
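The formal definition can be tested against recorded instance histories. The sketch below is a minimal check under stated assumptions: a history is a hypothetical list of (time, set of classes) observations per instance, and the function only inspects the data supplied, whereas the definition quantifies over all instances and times.

```python
def transformation_of(c_cls, c_prime, histories):
    """Check 'c_cls transformation_of c_prime' on recorded histories.
    histories maps an instance name to a list of (time, classes)."""
    for timeline in histories.values():
        # No time at which an instance is both C and C'.
        for _, classes in timeline:
            if c_cls in classes and c_prime in classes:
                return False
        # Whenever an instance is C at t, it was C' at some earlier t'.
        for t, classes in timeline:
            if c_cls in classes:
                was_c_prime_earlier = any(
                    c_prime in earlier and t_prime < t
                    for t_prime, earlier in timeline
                )
                if not was_c_prime_earlier:
                    return False
    return True


# One hypothetical cell observed at two time points.
HISTORIES = {
    "cell1": [(0, {"reticulocyte"}), (1, {"erythrocyte"})],
}

print(transformation_of("erythrocyte", "reticulocyte", HISTORIES))
print(transformation_of("reticulocyte", "erythrocyte", HISTORIES))
```

The second call fails because the reticulocyte observation has no earlier erythrocyte stage, which shows that the relation is not symmetric.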
Derives_from
• Holds between continuants
• transitive
• Derivation on the instance level (*derives_from*) holds between
distinct material continuants when one succeeds the other
across a temporal divide in such a way that at least a
biologically significant portion of the matter of the earlier
continuant is inherited by the later
• We say that one class C derives_from class C' if instances of C
are connected to instances of C' via some chain of instance-level derivation relations.
• Examples:
– osteocyte derives_from osteoblast