What is an Ontology
• An ontology is a set of terms, relationships and
definitions that capture the knowledge of a certain
domain. (common ontology ≠ common knowledge)
• Terms represent a controlled vocabulary, and
define the concepts of a domain.
• Terms are linked by relationships, which constitute
a semantic network.
• Ontologies augment natural language annotations
and can be more easily processed
computationally. (An ontology becomes the language of
the domain it describes, for communication,
coordination and collaboration.)
Why We Need Ontology in Bioinformatics
• Biologists need knowledge in order to perform their
work.
• Sequence comparison to infer the function.
• Biologists need knowledge for communication, but
such knowledge may be represented in different
ways.
• Different uses of the term "gene":
• The coding region of DNA
• A DNA fragment that can be transcribed and translated
into a protein
• A DNA region of biological interest that has a name and
carries a genetic trait or phenotype
The Gene Ontology (GO)
• Provides structured vocabularies for
describing gene products in the domain of
molecular biology.
• Enables a common understanding across model
organisms and between databases.
• Consists of three structurally unlinked
hierarchies (molecular function, biological
process and cellular component).
• 2 types of relationships between terms:
• is-a: subclass.
• part-of: physical part of, or subprocess of.
Why Gene Ontology?
• Without structured vocabularies, different sources can
refer to the same concept using different terms (e.g.,
cdc54 in yeast is MCM4 in mouse).
• What is a well-known shorthand in one research
community is gibberish in another. Contributions by one
research community may not be recognized by others.
• Without coordination, research work may be duplicated.
• The goal of the Gene Ontology Consortium is to
produce a controlled vocabulary that can be applied to
all eukaryotes even as knowledge of gene and protein
roles in cells is accumulating and changing.
Three GO Hierarchies
• Molecular function: elemental activity/task
(what)
(e.g., DNA-binding, polymerase, transcription factor)
(what a gene does at the biochemical level)
• Biological process: goal or objective
(why)
(e.g., mitosis, DNA replication, cell cycle control)
(A broad biological perspective – not currently a pathway)
• Cellular component: location within cellular
structures and macromolecular complex
(where)
(e.g., nucleus, ribosome, pre-replication complex)
(Each GO hierarchy has a DAG structure. A child
term may have many parent terms)
(Gene Ontology information can be accessed at
http://www.geneontology.org/)
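The DAG remark above (a child term may have many parent terms) can be sketched in a few lines of Python. This is a minimal illustration, not how any GO tool stores the hierarchy: the term names and edges in `PARENTS` are hypothetical placeholders, and `ancestors` is a helper introduced only for this example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy DAG: child term -> list of (relationship, parent term).
# These edges are placeholders and are not copied from the GO files.
PARENTS: Dict[str, List[Tuple[str, str]]] = {
    "term_D": [("is_a", "term_B"), ("part_of", "term_C")],  # two parents
    "term_B": [("is_a", "term_A")],
    "term_C": [("is_a", "term_A")],
    "term_A": [],  # root of this toy hierarchy
}


def ancestors(term: str, parents: Dict[str, List[Tuple[str, str]]]) -> Set[str]:
    """Return every term reachable from `term` by following parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:  # a DAG can reach one parent by two paths
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("term_D", PARENTS)))
```

Because term_D has two parents, a gene annotated to term_D is implicitly associated with term_B, term_C and their shared ancestor term_A.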
Example: Gene Ontology Hierarchy
[Figure: part of the biological process hierarchy, drawn as a DAG
rooted at Biological process (GO:0008150).
Edge labels: i = is-a, P = part-of.
Terms shown: Development (GO:0007275), Cell growth (GO:0008151),
Cell aging (GO:0007569), Programmed (GO:0012501),
Behavior (GO:0007610), Physiological (GO:0007582),
Communication (GO:0007154), Cell death (GO:0008219),
Cellular process (GO:0009987), Induction (GO:0012502),
Apoptosis (GO:0006915), HS response (GO:0009626),
Autophagic cell death (GO:0048102).]
Gene Annotation Using GO Terms
• Association of GO terms with gene products
based on evidence from literature reference or
computational analysis.
• The creation of GO and the association of GO
terms with gene products (gene annotation)
are two independent operations.
• A gene can be associated with one or more
GO terms (gene categories), and one category
normally has many genes (many-to-many
relationship between genes and GO terms)
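The many-to-many relationship can be pictured as two lookup tables built from the same list of (gene, term) pairs. The sketch below uses made-up gene and term labels (`geneA`, `T1`, ...) purely for illustration; `build_maps` is a hypothetical helper, not part of any GO software.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Placeholder (gene, GO term) pairs; the labels are invented for this sketch.
PAIRS = [
    ("geneA", "T1"),
    ("geneA", "T2"),  # one gene, several terms
    ("geneB", "T1"),  # one term, several genes
    ("geneC", "T3"),
]


def build_maps(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Build gene -> terms and term -> genes indexes from annotation pairs."""
    gene_to_terms: Dict[str, Set[str]] = defaultdict(set)
    term_to_genes: Dict[str, Set[str]] = defaultdict(set)
    for gene, term in pairs:
        gene_to_terms[gene].add(term)
        term_to_genes[term].add(gene)
    return dict(gene_to_terms), dict(term_to_genes)


gene_to_terms, term_to_genes = build_maps(PAIRS)
print(sorted(gene_to_terms["geneA"]))
print(sorted(term_to_genes["T1"]))
```

The second index (term to genes) is the one needed later when counting how many genes in a list fall into a given category.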
Gene Product Associations to an Ontology
[Figure: gene products from yeast, fly and mouse linked to the ontology.
Field labels shown: ID, Term, Definition, Ontology, Synonyms;
Is-a | Part-of, Node1 ID, Node2 ID;
GO ID, DB ID, Evidence code, Reference Citation, NOT.]
Example: Part of Molecular Function
Example: Part of Biological Process
Example: Part of Cellular Component
Genes of a Biological Process Tend to Be Co-Regulated
[Figure: gene names shown against their biological process]
Use Gene Ontology (GO) to Annotate Genes
• GO URL: http://www.geneontology.org/
• Two concepts:
• Gene Ontology: Provides structured vocabularies
for describing gene products in the domain of
molecular biology (all species share the same gene
ontology)
• Annotations: Association of GO terms with gene
products based on evidence from literature reference
or computational analysis (each species has a
separate annotation file)
The Gene Ontology (GO)
• GO file:
http://www.geneontology.org/ontology/gene_ontology.obo
• An example of GO term
[Term]
id: GO:0000001 (A unique id for the GO term)
name: mitochondrion inheritance (The name of the GO term)
namespace: biological_process (see next slide)
def: "The distribution of mitochondria, including the mitochondrial
genome, into daughter cells after mitosis or meiosis, mediated by
interactions between mitochondria and the cytoskeleton."
[PMID:10873824, PMID:11389764, SGD:mcc] (A detailed description
of the GO term)
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
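A [Term] stanza like the one above is simple enough to read with a few lines of Python. The sketch below is only an illustration under stated assumptions: it handles a single stanza held in a string (with the slide's parenthetical comments removed), and `parse_term` is a hypothetical helper rather than an official parser for the ontology file.

```python
from typing import Dict, List

# The example stanza from the slide, without the explanatory parentheses.
STANZA = """[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc]
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
"""


def parse_term(stanza: str) -> Dict[str, List[str]]:
    """Collect 'tag: value' lines of one stanza into a tag -> values dict."""
    fields: Dict[str, List[str]] = {}
    for line in stanza.splitlines():
        line = line.strip()
        if not line or line.startswith("["):  # skip blanks and "[Term]"
            continue
        tag, _, value = line.partition(": ")
        if tag == "is_a":
            value = value.split(" ! ")[0]  # drop the trailing name comment
        fields.setdefault(tag, []).append(value.strip())
    return fields


term = parse_term(STANZA)
print(term["id"][0], term["name"][0], term["is_a"])
```

Values are kept in lists because a tag such as is_a can occur more than once in a stanza.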
Gene Annotation Using GO Terms
• http://www.geneontology.org/GO.current.annotations.shtml
• Select the annotation file for a particular species
• An example of an annotation entry for yeast
SGD  S000004660  AAC1  GO:0005743  SGD_REF:S000050955|PMID:2167309  TAS  C  ADP/ATP translocator  YMR056C  gene  taxon:4932
“AAC1” is the gene name
“GO:0005743” is the GO id, we can link it to the corresponding item in the
ontology file
“SGD_REF:S000050955|PMID:2167309” is where this annotation comes from
“C” means this annotation belongs to the “cellular component” namespace
“ADP/ATP translocator” is a brief description of the gene product
“YMR056C” is another name for this gene
“taxon:4932” means this is a yeast gene
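The field-by-field reading above can be sketched in Python. This is a hedged illustration: it assumes the entry is one tab-separated line in the column order of GO annotation files, with two empty columns (qualifier and with/from) that the slide does not show. The names in `FIELDS` and the helper `parse_annotation` are labels chosen for this example.

```python
from typing import Dict

# Assumed column order of a tab-separated GO annotation line (first 13 columns).
FIELDS = [
    "db", "db_object_id", "symbol", "qualifier", "go_id", "reference",
    "evidence", "with_from", "aspect", "name", "synonym", "object_type",
    "taxon",
]

# One-letter aspect codes and the namespace each stands for.
ASPECT = {
    "F": "molecular_function",
    "P": "biological_process",
    "C": "cellular_component",
}

# The yeast entry from the slide, with empty qualifier and with/from columns.
LINE = (
    "SGD\tS000004660\tAAC1\t\tGO:0005743\t"
    "SGD_REF:S000050955|PMID:2167309\tTAS\t\tC\t"
    "ADP/ATP translocator\tYMR056C\tgene\ttaxon:4932"
)


def parse_annotation(line: str) -> Dict[str, str]:
    """Split one annotation line on tabs and label the columns."""
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))


record = parse_annotation(LINE)
print(record["symbol"], record["go_id"], ASPECT[record["aspect"]])
```

Only the gene symbol and the GO id are needed to link an entry to the ontology file; the evidence code and reference say where the annotation comes from.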
Gene Annotation Using GO Terms
Given a list of genes L from a specific species Sj
1) go to http://www.geneontology.org/GO.current.annotations.shtml
2) select and download the annotation file Fj for Sj
For each gene Gi in list L
3) find the annotation entry Ek for Gi in Fj
4) find the GO term id from entry Ek
5) go to http://www.geneontology.org/ontology/gene_ontology.obo
6) find the GO term in the ontology file, the GO term provides more
detailed annotation for this gene
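Steps 3) to 6) amount to two lookups per gene. A minimal Python sketch follows, assuming the annotation file and the ontology file have already been read into memory. `GENE_X` and `GENE_Y` are hypothetical genes, the `annotate` helper is introduced only for illustration, and the ontology excerpt holds just the one term quoted in this document.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (gene symbol, GO id) pairs as they might be read from annotation file Fj.
# AAC1 is the entry from the slides; GENE_X is a made-up example.
ANNOTATION_ROWS = [
    ("AAC1", "GO:0005743"),
    ("GENE_X", "GO:0000001"),
]

# GO id -> term name, as read from the ontology file (tiny excerpt).
TERM_NAMES = {"GO:0000001": "mitochondrion inheritance"}


def annotate(
    genes: Iterable[str],
    annotation_rows: Iterable[Tuple[str, str]],
    term_names: Dict[str, str],
) -> Dict[str, List[Tuple[str, str]]]:
    """For each gene, list (GO id, term name) pairs found in the two files."""
    ids_by_gene: Dict[str, List[str]] = defaultdict(list)
    for symbol, go_id in annotation_rows:  # step 3: entries for each gene
        ids_by_gene[symbol].append(go_id)  # step 4: GO ids of the entries
    result: Dict[str, List[Tuple[str, str]]] = {}
    for gene in genes:
        result[gene] = [
            # step 6: look up each id in the ontology
            (go_id, term_names.get(go_id, "unknown term"))
            for go_id in ids_by_gene.get(gene, [])
        ]
    return result


annotated = annotate(["AAC1", "GENE_X", "GENE_Y"], ANNOTATION_ROWS, TERM_NAMES)
for gene, terms in annotated.items():
    print(gene, terms)
```

Genes with no annotation entry get an empty list, and ids missing from the ontology excerpt are reported as "unknown term" rather than dropped.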
Use of GO to Annotate Genes
Problem: Given a list of n genes, are they
significantly associated with a specific GO term?
Solution: Calculate the p-value.
Notations
Total number of genes in the data set : N
Total number of genes assigned to term T: M
Number of genes in the list: n
Number of genes in the list and assigned to term T: m
How to Assess Overrepresentation
of a GO Term?
Genes on an array:
Total number of genes (N): 2,285
Number of genes – cell cycle (M): 161
Genes in a cluster:
Number of genes in the cluster (n): 147
Number of genes – cell cycle (m): 25
Is the GO term (i.e., cell cycle) significantly
overrepresented in the cluster?
Hyper-geometric Distribution
Suppose M of the N genes in the data set are
associated with term T. If n genes are drawn at
random from the data set, what is the probability
that exactly m of the selected n genes are
associated with T?

\[
\Pr(m \mid N, M, n) = \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}}
\]
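The hyper-geometric probability can be evaluated directly with exact binomial coefficients. The sketch below uses Python's `math.comb` and the cell cycle numbers from the earlier slide (N = 2,285, M = 161, n = 147, m = 25). The function name `hypergeom_pmf` is chosen for this example.

```python
from math import comb


def hypergeom_pmf(m: int, N: int, M: int, n: int) -> float:
    """Probability that exactly m of n randomly drawn genes belong to term T,
    when M of the N genes in the data set belong to T."""
    if m < 0 or m > n:
        return 0.0  # impossible counts; also keeps comb() arguments valid
    # comb() returns exact integers, so the ratio is rounded only once.
    return comb(M, m) * comb(N - M, n - m) / comb(N, n)


# Cell cycle example from the slides.
N, M, n, m = 2285, 161, 147, 25
print(f"Pr(m = {m}) = {hypergeom_pmf(m, N, M, n):.3e}")

# Sanity check: the probabilities over all possible m add up to 1.
total = sum(hypergeom_pmf(i, N, M, n) for i in range(0, min(M, n) + 1))
print(f"sum over all m = {total:.6f}")
```

For comparison, the number of cell cycle genes expected in the cluster by chance is n × M / N = 147 × 161 / 2,285, or roughly 10.4, well below the 25 observed.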
P-Value
Based on the hyper-geometric distribution, the
probability of having m or more genes associated with T
in a random list of n genes can be calculated by summing
the probabilities of the list having m, m+1, …, min(M, n)
genes associated with T. So the p-value of
over-representation is as follows:

\[
p = \sum_{i=m}^{\min(M,n)} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}
\]
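The over-representation p-value is the upper tail of the hyper-geometric distribution, summed from the observed count m up to min(M, n). A minimal Python sketch follows, again using the cell cycle numbers from the slides. `go_enrichment_pvalue` is a name introduced here, and the code makes no multiple-testing correction, which a real analysis over many GO terms would probably need.

```python
from math import comb


def go_enrichment_pvalue(N: int, M: int, n: int, m: int) -> float:
    """Probability of seeing m or more term-T genes in a random list of n
    genes, when M of the N genes in the data set belong to T."""
    upper = min(M, n)
    # Sum exact integer numerators first, then divide once by C(N, n).
    tail = sum(
        comb(M, i) * comb(N - M, n - i)
        for i in range(max(m, 0), upper + 1)
    )
    return tail / comb(N, n)


# Cell cycle example: 25 of the 147 cluster genes are cell cycle genes,
# against 161 of the 2,285 genes on the array.
p = go_enrichment_pvalue(N=2285, M=161, n=147, m=25)
print(f"p-value = {p:.3e}")
```

A small p-value means that a random list of 147 genes would rarely contain 25 or more cell cycle genes, so the term is over-represented in the cluster.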
MAPPFinder
• A tool for mapping
gene expression
data to the GO
hierarchies.
• Part of the free
software package
GenMAPP.
• Available at
http://www.genmapp.org/.
(Doniger et al., 2003)
MAPPFinder Sample Output
(Doniger et al., 2003)
GoMiner
• A client-server application using Java (data on the server side).
• Available at http://discover.nci.nih.gov/gominer/.
(Zeeberg et al., 2003)
Onto-Express
• A web application for GO-based microarray data
analysis (http://vortex.cs.wayne.edu/Projects.html).
• The input to Onto-Express is a list of Affymetrix
probe IDs, GenBank sequence accessions or
UniGene cluster IDs.
• Part of the integrated Onto-Tools, including:
– Onto-Compare: compare commercial arrays.
– Onto-Design: help array design (probe selection).
– Onto-Translate: provide mapping of different IDs.
[Figure: Onto-Express sample output with columns GO, # genes and p
(genes linked to poor breast cancer outcome)]