Lecture 8: Annotating Gene Lists

Biological Interpretation of
Microarray Data
Helen Lockstone
DTC Bioinformatics Course
9th February 2010
Overview
• Interpreting microarray results
– Gene lists to biological knowledge
• The Gene Ontology Consortium
– Defined terms to describe gene function
• Functional analysis tools
– Methods
– DAVID/GSEA
Microarray Pipeline
Design and perform experiment
Process and normalise data
Statistical analysis
Differentially expressed genes
Biological interpretation
Biological Interpretation
• An obvious way to gain biological insight is to assess the differentially expressed genes in terms of their known function(s)
• Requires an automated and objective (statistical) approach
• Functional profiling or pathway analysis
Early functional analyses
• Manually annotate list of differentially expressed (DE) genes
• Extremely time-consuming, not systematic, user-dependent
• Group together genes with similar function
• Conclude that the functional categories with the most DE genes are important in the disease/condition under study
• BUT this may not be the right conclusion
GO and functional analysis
Functional category    Number of sig genes
Immune response        40
Metabolism             20
Transcription          20
Energy production      10
Neurotransmission      5
Protein transport      5
TOTAL                  100
Immune response category contains 40% of all
significant genes - by far the largest category.
Reasonable to conclude that immune response may be
important in the condition being studied?
However ….
• What if 40% of the genes on the array were
involved in immune response?
• Only detected as many significant immune
response genes as expected by chance
• Need to consider not only the number of
significant genes for each category, but also
total number on the array
Same example, relative to array
Functional category    Number of genes on array    Actual number of significant genes    Expected number of significant genes
Immune response        8000                        40                                    40
Metabolism             4000                        20                                    20
Transcription          2000                        20                                    10
Energy production      4000                        10                                    20
Neurotransmission      200                         5                                     1
Protein transport      1800                        5                                     9
ALL                    20000                       100
Expected number of significant genes for category X = (num sig genes ÷ total genes on array) * (num genes in category X on array)
• Now, transcription and neurotransmission categories
appear more interesting as many more significant
genes were observed than expected by chance
• Largest categories are not necessarily the most
interesting!
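The expected-count calculation for this example can be sketched in Python as follows. This is a minimal illustration using the numbers from the example table; the function and variable names are chosen here for illustration and do not come from any particular tool.

```python
# Example figures from the slides: category -> (genes on array, significant genes)
categories = {
    "Immune response": (8000, 40),
    "Metabolism": (4000, 20),
    "Transcription": (2000, 20),
    "Energy production": (4000, 10),
    "Neurotransmission": (200, 5),
    "Protein transport": (1800, 5),
}
TOTAL_ON_ARRAY = 20000
TOTAL_SIGNIFICANT = 100


def expected_significant(n_in_category, total_significant, total_on_array):
    """Significant genes expected in a category if significance were unrelated to function."""
    return total_significant / total_on_array * n_in_category


def enrichment_table(cats, total_significant, total_on_array):
    """Return {category: (observed, expected, observed / expected)}."""
    table = {}
    for name, (on_array, observed) in cats.items():
        expected = expected_significant(on_array, total_significant, total_on_array)
        table[name] = (observed, expected, observed / expected)
    return table


result = enrichment_table(categories, TOTAL_SIGNIFICANT, TOTAL_ON_ARRAY)
for name, (obs, exp, ratio) in result.items():
    print(f"{name:<18} observed={obs:>3} expected={exp:>5.1f} ratio={ratio:.2f}")
```

A ratio above 1 means more significant genes were observed than expected; whether the excess is statistically significant still needs a formal test.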
Major bioinformatic developments
• Requires annotating entire set of genes
• The Gene Ontology Consortium
(www.geneontology.org)
• Automated, statistical approaches for
annotating gene lists and performing
functional profiling
The Gene Ontology Consortium
GO Consortium
• Developed three structured and controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner
• Has become a major resource for microarray data interpretation
The Gene Ontology
• Molecular Function: basic activity or task
– e.g. catalytic activity, calcium ion binding
• Biological Process: broad objective or goal
– e.g. signal transduction, immune response
• Cellular Component: location or complex
– e.g. nucleus, mitochondrion
GO Structure
• Hierarchical structure (strictly a directed acyclic graph, since a term can have more than one parent)
• Annotated with most specific annotation, forming path to top of hierarchy
• Genes annotated with all relevant terms
• Annotations based on published studies and also electronic inferences
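The idea that a gene annotated with a specific term is implicitly annotated with every term on the path to the top can be sketched as below. The parent map is a toy hierarchy invented for illustration, not real GO data, and GENE_A and GENE_B are hypothetical gene names.

```python
# Toy parent map (child term -> parent terms); illustrative only, not real GO data.
PARENTS = {
    "synaptic transmission": ["cell-cell signaling"],
    "cell-cell signaling": ["cell communication"],
    "cell communication": ["biological process"],
    "immune response": ["response to stimulus"],
    "response to stimulus": ["biological process"],
    "biological process": [],
}


def ancestors(term, parents):
    """Return the term itself plus every term on its path(s) to the root."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(direct_annotations, parents):
    """Expand each gene's most specific annotations to include all ancestor terms."""
    return {
        gene: set().union(*(ancestors(term, parents) for term in terms))
        for gene, terms in direct_annotations.items()
    }


# Hypothetical genes with their most specific annotations.
direct = {"GENE_A": ["synaptic transmission"], "GENE_B": ["immune response"]}
for gene, terms in propagate(direct, PARENTS).items():
    print(gene, sorted(terms))
```

Because every gene ends up in its broad ancestor categories as well, related categories often share genes, which matters when interpreting enrichment results.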
GO Terms
• GO ID: GO:0007268
• GO term: synaptic transmission
• Ontology: biological process
• Definition: The process of communication
from a neuron to a target (neuron, muscle, or
secretory cell) across a synapse
Graphical view
http://www.ncbi.nlm.nih.gov/sites/entrez
Functional Profiling Tools
Functional profiling tools
Identify GO categories with significantly more DE genes than expected by chance (i.e. over-represented among DE genes relative to representation on array as a whole)
Hypergeometric distribution or Fisher’s exact test
Correct for testing multiple GO categories
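A minimal sketch of this over-representation test, using only the Python standard library and the example counts from earlier in the lecture. The upper-tail hypergeometric probability is equivalent to a one-sided Fisher's exact test. The slides do not name a multiple-testing method, so Benjamini-Hochberg is used here purely as one common choice; all function names are illustrative.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which belong to the category (one-sided Fisher's exact test)."""
    upper = min(K, n)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Example counts from the slides: category -> (genes on array, significant genes)
categories = {
    "Immune response": (8000, 40),
    "Metabolism": (4000, 20),
    "Transcription": (2000, 20),
    "Energy production": (4000, 10),
    "Neurotransmission": (200, 5),
    "Protein transport": (1800, 5),
}
N_ARRAY, N_SIG = 20000, 100

names = list(categories)
raw = [hypergeom_upper_tail(categories[c][1], categories[c][0], N_SIG, N_ARRAY) for c in names]
pvalues = dict(zip(names, raw))
adjusted = dict(zip(names, benjamini_hochberg(raw)))

for name in names:
    print(f"{name:<18} p={pvalues[name]:.4g} adjusted={adjusted[name]:.4g}")
```

In practice a statistics library would normally be used instead of hand-written sums; the point here is only to show what the tools compute for each category.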
Functional profiling tools
Khatri and Draghici. Ontological analysis of gene expression data: current
tools, limitations, and open problems. Bioinformatics (2005)
21(18):3587-95
Functional profiling tools
• Freely-available stand-alone/web-based tools
– User-friendly graphical interface and simple to use
– Extensive documentation, plus tutorials/technical support
• Reduces a large number of DE genes to a smaller
number of significantly enriched GO categories
– more easily interpreted in biological context
• Considering sets of genes increases power
– individual genes could be false positives but a set of functionally
related genes all showing significant changes is more robust
DAVID Results
Advantages
• Increasingly support data (probe IDs) from different microarray platforms
• Accept various probe/gene identifiers
• Web-based tools automatically retrieve most up-to-date GO annotations
• Most automatically map from probe IDs to a gene ID; multiple significant probes for one gene could otherwise skew results
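Collapsing probes to genes before counting can be sketched as follows. The probe IDs, gene names and mapping are hypothetical placeholders; a real analysis would take the mapping from the array platform's annotation file.

```python
def collapse_probes_to_genes(significant_probes, probe_to_gene):
    """Map significant probe IDs to unique gene IDs so each gene is counted once.

    Returns (sorted unique genes, probes with no gene mapping).
    """
    genes = set()
    unmapped = []
    for probe in significant_probes:
        gene = probe_to_gene.get(probe)
        if gene is None:
            unmapped.append(probe)
        else:
            genes.add(gene)
    return sorted(genes), unmapped


# Hypothetical mapping: two probes target GENE_A.
probe_to_gene = {"probe_001": "GENE_A", "probe_002": "GENE_A", "probe_003": "GENE_B"}
significant = ["probe_001", "probe_002", "probe_003", "probe_999"]

genes, unmapped = collapse_probes_to_genes(significant, probe_to_gene)
print(genes)     # GENE_A appears once even though two of its probes were significant
print(unmapped)  # probes without annotation cannot contribute to the analysis
```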
Further considerations
• Reference list must be appropriate for
accurate statistical analysis
• Up/down regulated genes can be submitted
separately or as a combined list
• Unannotated genes cannot be used in the
analysis; gene ontology evolving; well-studied
systems over-represented
Gene set enrichment analysis
• Majority of tools based on idea of identifying
GO categories significantly enriched in list of
differentially expressed genes
• Requires some threshold to define genes as
‘significant’
• Recent tool called GSEA takes a different
approach by considering all assayed genes
GSEA: Key Features
• Ranks all genes on array based on their differential
expression
• Identifies gene sets whose member genes are
clustered either towards top or bottom of the ranked
list (i.e. up- or down regulated)
• Enrichment score calculated for each category
• Permutation test to identify significantly enriched
categories
• Extensive gene sets provided via MSigDB (Molecular Signatures Database): GO, chromosome location, KEGG pathways, transcription factor or microRNA target genes
GSEA
• Each gene category tested by traversing ranked list
• Enrichment score starts at 0, weighted increment when a member gene encountered, weighted decrement otherwise
• Enrichment score: point where running sum is most different from zero
[Figure: genes ranked from most significantly up-regulated (top), through unchanged genes, to most significantly down-regulated (bottom); sample columns: Disease, Control]
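The running-sum idea can be sketched as below. This is an illustration rather than the GSEA implementation: it assumes the commonly described formulation in which each member gene adds a step weighted by its ranking metric and each non-member subtracts a constant step, so that the running sum returns to zero at the end of the list. Gene names and scores are made up.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes: gene IDs ordered from most up- to most down-regulated.
    scores: dict mapping gene ID -> ranking metric.
    Returns (ES, running-sum profile), where ES is the profile value
    furthest from zero.
    """
    members = [g for g in ranked_genes if g in gene_set]
    n_miss = len(ranked_genes) - len(members)
    hit_norm = sum(abs(scores[g]) ** weight for g in members)
    if not members or n_miss == 0 or hit_norm == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    profile = []
    for gene in ranked_genes:
        if gene in gene_set:
            running += abs(scores[gene]) ** weight / hit_norm  # weighted step up
        else:
            running -= 1.0 / n_miss                            # constant step down
        profile.append(running)
    return max(profile, key=abs), profile


# Made-up ranking metric for six hypothetical genes.
scores = {"g1": 3.0, "g2": 2.0, "g3": 1.0, "g4": -1.0, "g5": -2.0, "g6": -3.0}
ranked = sorted(scores, key=scores.get, reverse=True)

es_up, _ = enrichment_score(ranked, scores, {"g1", "g2"})
es_down, _ = enrichment_score(ranked, scores, {"g5", "g6"})
print(es_up, es_down)
```

A set clustered at the top of the list gives a large positive score, and a set clustered at the bottom gives a large negative score.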
GSEA algorithm
GSEA: Permutation Test
• Randomise data (groups), rank genes again and repeat test 1000 times
• Null distribution of 1000 ES for gene set
• FDR q-value computed, corrected for gene set size and testing multiple gene sets
[Figure: null distribution of enrichment scores, with the actual ES marked]
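A simplified sketch of the permutation step for a single gene set. It assumes a difference-in-means ranking metric and a toy, deterministic expression table, both invented for illustration. It stops at a nominal p-value and does not compute the normalisation for gene set size or the FDR q-value across gene sets.

```python
import random
from statistics import mean


def rank_metric(expression, labels):
    """Difference in mean expression (disease minus control) for each gene."""
    metric = {}
    for gene, values in expression.items():
        disease = [v for v, lab in zip(values, labels) if lab == "disease"]
        control = [v for v, lab in zip(values, labels) if lab == "control"]
        metric[gene] = mean(disease) - mean(control)
    return metric


def enrichment_score(metric, gene_set):
    """Running-sum score: weighted step up for members, constant step down otherwise."""
    ranked = sorted(metric, key=metric.get, reverse=True)
    hit_norm = sum(abs(metric[g]) for g in ranked if g in gene_set)
    n_miss = sum(1 for g in ranked if g not in gene_set)
    if hit_norm == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene in ranked:
        running += abs(metric[gene]) / hit_norm if gene in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_pvalue(expression, labels, gene_set, n_perm=1000, seed=0):
    """Shuffle group labels, re-rank genes and rebuild the null distribution of ES."""
    rng = random.Random(seed)
    observed = enrichment_score(rank_metric(expression, labels), gene_set)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(enrichment_score(rank_metric(expression, shuffled), gene_set))
    # Compare against null scores with the same sign as the observed score.
    same_sign = [es for es in null if (es >= 0) == (observed >= 0)]
    extreme = sum(1 for es in same_sign if abs(es) >= abs(observed))
    return observed, (extreme + 1) / (len(same_sign) + 1)


# Toy data: 20 hypothetical genes, 6 disease and 6 control samples.
genes = [f"gene{i}" for i in range(20)]
gene_set = set(genes[:5])
labels = ["disease"] * 6 + ["control"] * 6
expression = {}
for i, gene in enumerate(genes):
    # Deterministic pseudo-noise in [0, 1]; set members are shifted up in disease.
    row = [((i * 7 + j * 13) % 11) / 10.0 for j in range(12)]
    if gene in gene_set:
        row = [v + 3.0 if lab == "disease" else v for v, lab in zip(row, labels)]
    expression[gene] = row

observed_es, p_value = permutation_pvalue(expression, labels, gene_set)
print(f"ES={observed_es:.3f} nominal p={p_value:.4f}")
```

Permuting the sample labels rather than the genes keeps the correlation structure between genes intact, which is the motivation for randomising the groups.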
Biological Interpretation
• Due to GO hierarchy, several related categories may
contain a subset of genes that is driving the significant
enrichment score so will all be significant
• Interpretation still requires substantial work
– search literature and public databases
– likely functional consequences of the changes
– are the genes identified as significant within each GO
category up- or down-regulated?
– genes within a category can have opposite effects e.g.
apoptosis would include genes that induce or repress
apoptosis
Biological Interpretation
• Too many categories found significant
– Size filter
– More stringent significance threshold
– Related categories (redundancy)
• No significant categories
– Relax significance level slightly
– e.g. 0.25 recommended by GSEA as exploratory
analysis
• No significant genes
– GSEA most suitable
Commercial Tool Suites
• Ingenuity Pathway Analysis (Ingenuity Systems, CA)
– Developed own extensive ontology over past 10 years
– Includes gene interactions, disease/drug information
– PhD-level curators mining the literature
– Used by many pharmaceutical companies
For more information
• Gene Ontology: http://www.geneontology.org
• Affymetrix: http://www.affymetrix.com
• DAVID: http://david.abcc.ncifcrf.gov
• GSEA: http://www.broad.mit.edu/gsea/
• Ingenuity: http://www.ingenuity.com/products/pathways_analysis.html
• NCBI: http://www.ncbi.nlm.nih.gov/