Gene Set Enrichment Analysis
Petri Törönen
petri(DOT)toronen(AT)helsinki.fi

What, Why, How…
• Gene expression data/analysis
• Problems with gene expression data analysis
• Earlier solutions
• My solution
• Comparisons
• Conclusions / Warnings

Genome-wide gene expression
• Genome-wide Gene Expression (GE) analysis is a standard lab tool
• Various methods
• Aim: to understand biological differences across the samples at gene level
• If you don’t work with GE data:
  – Gene Set Methods can be used with most other large-scale data sets

Typical pipelines
[Diagram: three parallel pipelines]
• Shared first steps, in order: Generate the GE data; Pre-processing (Normalization etc.); Define Differentially Expressed genes
• Later steps, depending on the pipeline: Find over-represented biological processes; Cluster selected genes; Generate a classification of samples using GE profiles of genes
• End points: Draw biological conclusions; Classify unknown samples

What can go wrong?
• Is the definition of Differentially Expressed genes always reasonable?
  – datasets with large noise levels
  – p-value thresholds
  – sudden jump to signif. regulation
  – genes with weak regulation
• Is the set of Diff. Expr. genes the main goal?
  => Biological Processes are usually more informative.

What can go wrong?
Analysis of data with one threshold.
Biological process with weak regulation goes unnoticed.

Solution
• Analyze sets of genes instead of genes
• Gene Set: Genes belonging to the same pathway, biological process, complex and/or Gene Ontology class
• Benefits: Group of genes is less sensitive to error than a single gene*
• Benefits: Easy interpretation of the results
• Something to support the gene-based analysis

Gene set analysis pipeline
[Diagram]
• Inputs: Expression data; Sample labels (class data); Pre-defined gene sets
• Gene level: Generate the GE data; Pre-processing (Normalization etc.); Define continuous Diff. Expr. score for genes; Generate permuted data
• Gene set level: Calculate a gene set score for each gene set; Look for gene sets that show stronger signal in real data than in permuted data

Methods for gene set scoring
• Average based methods
• Rank based methods
• Other methods (omitted here)

Average based methods
• Calculate the average regulation of gene set (Tian et al. PNAS)
• Can something go wrong with it?

Rank based methods
• Steps:
  – order genes with differential expression
  – test every possible threshold in the ordered list
  – look for over-(/under-)representation of gene set above the threshold
  – select the strongest score
• Expression values are (often) discarded!
• Iterative Group Analysis, Kolmogorov-Smirnov test (KS), modified KS (Gene Set Enrichment Analysis package, MIT)
[Figure: gene expression data, analyzed gene classes, analyzed subset above the threshold. Black = class member, White = not a member]

Permutations
• Needed to evaluate significance
• Two types:
  – Row Randomization: mix labels gene set / gene class
  – Column Randomization: mix sample labels, used to calculate diff. expr.
• Column Randomization preferred
[Figure: Row rand. vs. Col. rand]

Summary of methods
• Average-based methods are weak with noncoherent regulation
• Rank-based methods usually omit gene expression values => steps between all genes equally significant

My brilliant proposal
• Combine two method groups:
  – Order genes with diff. expr. scores
  – Test every threshold position
  – At each threshold calculate the difference score (Diff)
• Scale the difference with STD and average estimates (Törönen et al. 2009)
• Get a Z-score scaling for difference => Gene Set Z-score (GSZ)

My brilliant proposal (continued)
• An over-representation (hypergeometric) score weighted with diff. expr. score
• GSZ compares the Diff to the mean and STD we obtain when the class is randomly distributed in the ordered list.
• Considers both: variance in the expr. values and variance in the number of gene set members in the list

My brilliant proposal (continued)
• Many popular Gene Set scoring methods are variants of the GSZ method:
  – hypergeometric testing
  – Pearson correlation
  – Max-Mean (Efron, Tibshirani)
  – Random Sets (Newton et al.)
[Figure: GSZ profile from ALL data (Chiaretti et al.) for one GO class vs. 7 quantiles (0, 5, 25, 50, 75, 95, 100) from 500 permutations. Different positions correspond to other competing methods.]

Evaluation
• Stability of the scores as threshold goes through the gene list?
• Red line: Strongest signal from positive data (across all GO classes)
• Blue lines: various quantiles (same as before) across all GO classes
• Compare with KS and modified KS (right column; MIT, PNAS and Nature Gen.)
• Same data, same permutation!!
[Figure: GSZ with diff. parameter values. Third box shows default parameter values.]
Pay attention to stability of blue lines.
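The threshold-scanning idea from the proposal slides can be sketched roughly as below. This is not the published GSZ implementation: the slides' formula for Diff did not survive, so Diff is assumed here to be the sum of scores of set members above the threshold minus the sum for non-members. The mean and STD are estimated empirically by shuffling set membership in the ordered list, not with the analytic estimates of Törönen et al. 2009. The names `gsz_like_profile`, `n_perm` and `prior_var` are hypothetical.

```python
import numpy as np

def gsz_like_profile(scores, in_set, n_perm=200, prior_var=0.1, seed=0):
    """Z-scaled Diff at every threshold for one gene set (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    order = np.argsort(-scores)          # strongest positive scores first
    s = scores[order]

    def diff_profile(members):
        # Assumed Diff: member score sum minus non-member score sum,
        # accumulated over the genes above each threshold position.
        return np.cumsum(np.where(members, s, -s))

    observed = diff_profile(in_set[order])
    # Null model: the class is randomly distributed in the ordered list.
    null = np.array([diff_profile(rng.permutation(in_set))
                     for _ in range(n_perm)])
    mean, var = null.mean(axis=0), null.var(axis=0)
    # prior_var is a hypothetical stabiliser for thresholds with tiny variance.
    return (observed - mean) / np.sqrt(var + prior_var)

# Toy data (made up): 200 genes, the gene set sits at the top of the list.
scores = np.linspace(3, -3, 200)
in_set = np.zeros(200, dtype=bool)
in_set[:15] = True
profile = gsz_like_profile(scores, in_set)
gsz = profile[np.argmax(np.abs(profile))]   # strongest score over all thresholds
```

Because the expression scores enter Diff directly, a step between two strongly regulated genes moves the profile more than a step between two weakly regulated ones, which a pure rank statistic would not capture.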
More evaluation
• GSZ is also stable against gene set size variations
  – most methods are not
• Several Gene Set scoring methods were tested with artificial positive and random datasets
  – GSZ showed the best overall ability to separate the two dataset types
• Methods were evaluated by splitting the real data into two halves and testing how well the results match
  – GSZ was best in predicting its own results from the other half
  – GSZ was best in predicting the summary of all methods from the other half

More evaluation (continued)
• Compare different gene set scoring functions
• Test with two popular datasets against GO classes
• Calculate the empirical -log(p-values) for the strongest GO classes from each method
• Blue line = GSZ, green line = T-test, red = KS, magenta = iGA, cyan = modified KS
[Figure panels: Class data; p53 dataset; ALL dataset; Pooled data]

More evaluation (continued)
• Select biologically relevant GO classes as biologically positive
• Look how many such classes each method finds across the top ranks (GSZ = blue line)
[Figure: Here ALL dataset. GSZ outperforms others at bigger ranks. Similar results were obtained with p53 dataset.]

Comparison with other programs
• Selected SignalPathway (green line), GSEA (cyan) and GSA (black) for comparison
• Evaluation was done again using the biologically positive classes
• Comparing programs is less clear (more variables)
[Figure: Here again ALL dataset. Similar results with p53. GSZ outperforms others at large]

Summary
• GSZ, weighted over-representation score
• Math link to many other popular methods
• Stable across GO class sizes and across gene list positions
• Good performance in artificial datasets
• Best performance with many evaluations from two real datasets

Other applications
• siRNA data vs. gene IDs (discussed)
• Linkage data vs. biological processes (discussed)
• BLAST result list vs. descriptions (in usage)
• BLAST result list vs. GO classes (in usage)

Warnings
• Quality of gene expression data
• Enough samples for permutations
• Each gene should occur only once in the expression data
• Filter genes without annotations (with GO data)
• Use Column Permutations
• Quality of gene sets / annotations

Wake up!!
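As a rough illustration of the "Use Column Permutations" warning, the sketch below shuffles the sample labels, recomputes the gene scores each time, and compares the real gene set score with the permuted ones. Several parts are assumptions made only for the example: the Welch-style t statistic as the diff. expr. score, the plain average as the gene set score, the toy data, and the function names `t_scores` and `column_permutation_pvalue`.

```python
import numpy as np

def t_scores(expr, labels):
    """Welch-style t statistic per gene (rows = genes, columns = samples)."""
    a, b = expr[:, labels], expr[:, ~labels]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se

def column_permutation_pvalue(expr, labels, in_set, n_perm=500, seed=0):
    """Empirical two-sided p-value for one gene set via sample-label shuffling."""
    rng = np.random.default_rng(seed)

    def set_score(lab):
        # Average-based gene set score, chosen only to keep the example short.
        return t_scores(expr, lab)[in_set].mean()

    observed = set_score(labels)
    null = np.array([set_score(rng.permutation(labels))
                     for _ in range(n_perm)])
    # Add-one correction keeps the empirical p-value above zero.
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, p

# Toy data (made up): 300 genes, 10 + 10 samples, 20 set genes shifted upward.
rng = np.random.default_rng(1)
expr = rng.normal(size=(300, 20))
labels = np.array([True] * 10 + [False] * 10)
expr[:20, :10] += 1.0
in_set = np.zeros(300, dtype=bool)
in_set[:20] = True
observed, p = column_permutation_pvalue(expr, labels, in_set)
```

Shuffling columns keeps the gene-to-gene correlations intact, which row randomization does not. The smallest reachable p-value is limited by the number of distinct label arrangements, hence the warning about having enough samples.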