Lecture 8: Annotating Gene Lists
Biological Interpretation of Microarray Data
Helen Lockstone
DTC Bioinformatics Course
9th February 2010

Overview
• Interpreting microarray results
– Gene lists to biological knowledge
• The Gene Ontology Consortium
– Defined terms to describe gene function
• Functional analysis tools
– Methods
– DAVID/GSEA

Microarray Pipeline
Design and perform experiment → Process and normalise data → Statistical analysis → Differentially expressed genes → Biological interpretation

Biological Interpretation
• An obvious way to gain biological insight is to assess the differentially expressed genes in terms of their known function(s)
• Requires an automated and objective (statistical) approach
• Functional profiling or pathway analysis

Early functional analyses
• Manually annotate list of differentially expressed (DE) genes
• Extremely time-consuming, not systematic, user-dependent
• Group together genes with similar function
• Conclude that the functional categories with the most DE genes are important in the disease/condition under study
• BUT may not be the right conclusion

GO and functional analysis

Functional category    Number of sig genes
Immune response        40
Metabolism             20
Transcription          20
Energy production      10
Neurotransmission      5
Protein transport      5
TOTAL                  100

Immune response category contains 40% of all significant genes, by far the largest category. Reasonable to conclude that immune response may be important in the condition being studied?

However …
• What if 40% of the genes on the array were involved in immune response?
• Only detected as many significant immune response genes as expected by chance
• Need to consider not only the number of significant genes for each category, but also total number on the array

Same example, relative to array

Functional category    Number of genes on array    Actual number of significant genes    Expected number of significant genes
Immune response        8000                        40                                    40
Metabolism             4000                        20                                    20
Transcription          2000                        20                                    10
Energy production      4000                        10                                    20
Neurotransmission      200                         5                                     1
Protein transport      1800                        5                                     9
ALL                    20000                       100                                   100

Expected number of significant genes for category X = (num sig genes ÷ total genes on array) × (num genes in category X on array)

• Now, transcription and neurotransmission categories appear more interesting, as many more significant genes were observed than expected by chance
• Largest categories are not necessarily the most interesting!

Major bioinformatic developments
• Requires annotating entire set of genes
• The Gene Ontology Consortium (www.geneontology.org)
• Automated, statistical approaches for annotating gene lists and performing functional profiling

The Gene Ontology Consortium

GO Consortium
• Developed three structured and controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner
• Has become a major resource for microarray data interpretation

The Gene Ontology
• Molecular Function: basic activity or task
– e.g.
catalytic activity, calcium ion binding
• Biological Process: broad objective or goal
– e.g. signal transduction, immune response
• Cellular Component: location or complex
– e.g. nucleus, mitochondrion

GO Structure
• Hierarchical tree
• Annotated with most specific annotation, forming path to top of tree
• Genes annotated with all relevant terms
• Annotations based on published studies and also electronic inferences

GO Terms
• GO ID: GO:0007268
• GO term: synaptic transmission
• Ontology: biological process
• Definition: The process of communication from a neuron to a target (neuron, muscle, or secretory cell) across a synapse

Graphical view
http://www.ncbi.nlm.nih.gov/sites/entrez

Functional Profiling Tools
• Identify GO categories with significantly more DE genes than expected by chance (i.e. over-represented among DE genes relative to representation on array as a whole)
• Hypergeometric Distribution or Fisher’s Exact Test
• Correct for testing multiple GO categories

Khatri and Draghici. Ontological analysis of gene expression data: current tools, limitations, and open problems.
Bioinformatics (2005) 21(18):3587-95

Functional profiling tools
• Freely-available stand-alone/web-based tools
– User-friendly graphical interface and simple to use
– Extensive documentation, plus tutorials/technical support
• Reduces a large number of DE genes to a smaller number of significantly enriched GO categories
– more easily interpreted in biological context
• Considering sets of genes increases power
– individual genes could be false positives, but a set of functionally related genes all showing significant changes is more robust

DAVID Results

Advantages
• Increasingly support data (probe IDs) from different microarray platforms
• Accept various probe/gene identifiers
• Web-based tools automatically retrieve most up-to-date GO annotations
• Most automatically map from probe IDs to a gene ID; multiple significant probes for one gene could otherwise skew results

Further considerations
• Reference list must be appropriate for accurate statistical analysis
• Up/down-regulated genes can be submitted separately or as a combined list
• Unannotated genes cannot be used in the analysis; gene ontology evolving; well-studied systems over-represented

Gene set enrichment analysis
• Majority of tools based on idea of identifying GO categories significantly enriched in list of differentially expressed genes
• Requires some threshold to define genes as ‘significant’
• Recent tool called GSEA takes a different approach by considering all assayed genes

GSEA: Key Features
• Ranks all genes on array based on their differential expression
• Identifies gene sets whose member genes are clustered either towards top or bottom of the ranked list (i.e.
up- or down-regulated)
• Enrichment score calculated for each category
• Permutation test to identify significantly enriched categories
• Extensive gene sets provided via MSigDB
– GO, chromosome location, KEGG pathways, transcription factor or microRNA target genes

GSEA
[Figure: Disease vs Control; genes ranked from most significantly up-regulated, through unchanged genes, to most significantly down-regulated]
• Each gene category tested by traversing ranked list
• Enrichment score starts at 0, weighted increment when a member gene encountered, weighted decrement otherwise
• Enrichment score: point where most different from zero

GSEA algorithm

GSEA: Permutation Test
• Randomise data (groups), rank genes again and repeat test 1000 times
• Null distribution of 1000 ES for gene set
[Figure: null distribution of enrichment scores; actual ES]
• FDR q-value computed
– corrected for gene set size and testing multiple gene sets

Biological Interpretation
• Due to GO hierarchy, several related categories may contain a subset of genes that is driving the significant enrichment score, so will all be significant
• Interpretation still requires substantial work
– search literature and public databases
– likely functional consequences of the changes
– are the genes identified as significant within each GO category up- or down-regulated?
– genes within a category can have opposite effects, e.g. apoptosis would include genes that induce or repress apoptosis

Biological Interpretation
• Too many categories found significant
– Size filter
– More stringent significance threshold
– Related categories (redundancy)
• No significant categories
– Relax significance level slightly, e.g.
0.25 recommended by GSEA for exploratory analysis
• No significant genes
– GSEA most suitable

Commercial Tool Suites
• Ingenuity Pathway Analysis (Ingenuity Systems, CA)
– Developed own extensive ontology over past 10 years
– Includes gene interactions, disease/drug information
– PhD-level curators mining the literature
– Used by many pharmaceutical companies

For more information
• Gene Ontology: http://www.geneontology.org
• Affymetrix: http://www.affymetrix.com
• DAVID: http://david.abcc.ncifcrf.gov
• GSEA: http://www.broad.mit.edu/gsea/
• Ingenuity: http://www.ingenuity.com/products/pathways_analysis.html
• NCBI: http://www.ncbi.nlm.nih.gov/
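
Worked example: over-representation test

As a rough illustration of the over-representation idea from the functional profiling slides, the sketch below recomputes the expected counts from the "Same example, relative to array" table and attaches a one-sided hypergeometric p-value to each category. The function names and the Bonferroni correction are assumptions made for illustration: the lecture names only the hypergeometric distribution or Fisher's exact test and says to correct for testing multiple categories, without specifying a method. Real tools may differ in detail.

```python
from math import comb

# Gene counts copied from the lecture's "relative to array" table:
# category -> (genes on array, significant genes observed)
CATEGORIES = {
    "Immune response": (8000, 40),
    "Metabolism": (4000, 20),
    "Transcription": (2000, 20),
    "Energy production": (4000, 10),
    "Neurotransmission": (200, 5),
    "Protein transport": (1800, 5),
}
TOTAL_ON_ARRAY = 20000
TOTAL_SIGNIFICANT = 100


def expected_significant(n_in_category,
                         n_significant=TOTAL_SIGNIFICANT,
                         n_on_array=TOTAL_ON_ARRAY):
    """Expected number of significant genes in a category by chance.

    Same formula as the slide: (sig genes / genes on array) * genes in category.
    """
    return n_significant * n_in_category / n_on_array


def hypergeom_upper_tail(k, n_in_category,
                         n_significant=TOTAL_SIGNIFICANT,
                         n_on_array=TOTAL_ON_ARRAY):
    """P(X >= k): probability of drawing at least k category genes when
    n_significant genes are drawn at random from the array."""
    total = comb(n_on_array, n_significant)
    max_hits = min(n_in_category, n_significant)
    favourable = sum(
        comb(n_in_category, i) * comb(n_on_array - n_in_category, n_significant - i)
        for i in range(k, max_hits + 1)
    )
    return favourable / total


def bonferroni(p_value, n_tests):
    """Bonferroni adjustment (an assumed choice of multiple-testing correction)."""
    return min(1.0, p_value * n_tests)


if __name__ == "__main__":
    for name, (on_array, observed) in CATEGORIES.items():
        expected = expected_significant(on_array)
        p = hypergeom_upper_tail(observed, on_array)
        adjusted = bonferroni(p, len(CATEGORIES))
        print(f"{name:<20} observed={observed:>3} expected={expected:>5.1f} "
              f"p={p:.3g} adjusted={adjusted:.3g}")
```

Running the script prints one row per category. Transcription and neurotransmission get small p-values, while immune response, despite being the largest category, does not, which matches the conclusion drawn on the slide.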
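
Worked example: GSEA-style running enrichment score

The running-sum description on the GSEA slides (start at 0, step up when a member gene is encountered, step down otherwise, report the point furthest from zero) can be sketched as follows. This is a simplified illustration and not the GSEA implementation: the function name, the toy gene IDs and scores, and the exact weighting (increment proportional to |score| raised to `weight` for member genes, equal-sized decrement for non-members) are assumptions. The permutation test and FDR calculation are omitted.

```python
# Toy ranked list (hypothetical gene IDs), ordered from most up-regulated
# to most down-regulated, with a made-up differential expression score.
TOY_RANKED = ["g1", "g2", "g3", "g4", "g5", "g6"]
TOY_SCORES = {"g1": 3.0, "g2": 2.0, "g3": 1.0, "g4": -1.0, "g5": -2.0, "g6": -3.0}


def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Return the running-sum value that deviates most from zero.

    ranked_genes: gene IDs sorted from most up- to most down-regulated.
    scores: dict mapping gene ID -> ranking metric.
    gene_set: collection of gene IDs in the category being tested.
    weight: exponent applied to |score| for member genes (assumed default 1).
    """
    members = set(gene_set)
    is_hit = [gene in members for gene in ranked_genes]
    n_hits = sum(is_hit)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    hit_weights = [
        abs(scores[gene]) ** weight if hit else 0.0
        for gene, hit in zip(ranked_genes, is_hit)
    ]
    total_hit_weight = sum(hit_weights)
    if total_hit_weight == 0:
        raise ValueError("member genes all have zero score")

    running = 0.0
    best = 0.0
    for w, hit in zip(hit_weights, is_hit):
        if hit:
            running += w / total_hit_weight   # weighted increment for a member gene
        else:
            running -= 1.0 / n_misses         # decrement for a non-member gene
        if abs(running) > abs(best):
            best = running                    # keep the point furthest from zero
    return best


if __name__ == "__main__":
    for gene_set in ({"g1", "g2"}, {"g5", "g6"}, {"g1", "g6"}):
        es = enrichment_score(TOY_RANKED, TOY_SCORES, gene_set)
        print(sorted(gene_set), round(es, 3))
```

A positive score indicates that the set's genes cluster towards the top of the ranked list (up-regulated), and a negative score that they cluster towards the bottom. In a full analysis this value would be compared with a null distribution obtained by permuting the group labels, as described on the permutation test slide.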