Supplementary Material for: GSEA: A Gene Set Approach to Analyzing Molecular Profiles

Author List

February 7, 2005.

Contents of Supplemental Material:

1 Additional figures and tables for examples.
2 Detailed description of the GSEA method.
  2.1 Description of complementary statistics.
  2.2 Description of GSEA output.
  2.3 Theoretical properties of gene tag and sample label permutation.
3 Additional applications of GSEA.
  3.1 Diabetes.
  3.2 Downs.
4 Summary of top enrichment results for examples in paper.
  4.1 Gender S1
  4.2 Gender S2
  4.3 P53 S2
  4.4 P53 S3
  4.5 Leukemia S1
  4.6 Leukemia S2
  4.7 Lung A S2.
  4.8 Lung B S2.
5 Defining gene sets and gene set databases.
6 Running GSEA with the GSEAPACK R package.
7 Running GSEA under GenePattern.

1 Additional figures and tables for examples.

This section includes supplemental figures SF0-SF2 and tables ST1-ST5 not included in the main body of the paper.

6/27/17 page 1 478163561

Figure SF0. This figure compares the empirical null distribution for 5 selected gene sets (Diabetes example) before and after scaling normalization. This normalization is accomplished by dividing each null (and observed) ES score by the mean of the positive or negative scores for that gene set according to their sign. This procedure appropriately aligns the null distributions for gene sets of different sizes, prior to multiple hypotheses testing, and is motivated by the asymptotic multiplicative scaling of the Kolmogorov-Smirnov distribution as a function of size.

Figure SF1. This figure compares the empirical null and observed distributions in the Diabetes example for a randomly generated collection of 1000 gene sets (top) and the functional gene sets (S2 database) before and after normalization (i.e., area under positive and negative density distributions equal to one). The random gene sets (top) obtain roughly equal numbers of positive and negative enrichment scores. Thus, the separate normalization of positive and negative scores makes little difference. In contrast, when the S2 gene sets are used (bottom) a larger number of sets attain negative scores. This reflects the fact that the behavior of curated and experimental gene sets is not necessarily balanced for all phenotypes. The independent normalization of positive and negative scores helps to reduce this natural imbalance when comparing observed and null score distributions.
Similar imbalance can be produced by the data itself, e.g., when the numbers of genes whose expression is positively and negatively correlated with the phenotype are unbalanced.

Figure SF3. This figure shows the enrichment plots for the chr21 gene set in the Downs syndrome dataset using un-weighted (p=0), weighted (p=1) and over-weighted (p=2) enrichment statistics. We used GSEA to analyze gene expression profiles from bone marrow of individuals with Downs syndrome (DS, n=14) and control individuals (n=25) [Aravind add ref]. When we probe the dataset with GSEA and the un-weighted p=0 statistic using the set of all the 243 genes on chr21, we find a small enrichment signal. For p=1 the enrichment statistic is much higher but the set is still not significant when adjusting for multiple hypothesis testing (FDR = 0.8). For p=2 the set achieves an even higher score and significance (FDR < 0.25). The strength of the chr21 signal in this dataset is carried by only about 20% of the genes in the set and thus requires boosting of their contribution to the score via the squaring of the correlation (p=2).

DATASET: BOSTON. PHENOTYPE: Non-responders vs Responders. GENE SET: ANN ARBOR ‘non-responders’. ES = 0.54, NP = 0.001
DATASET: BOSTON. PHENOTYPE: Non-responders vs Responders. GENE SET: ANN ARBOR ‘responders’. ES = 0.51, NP = 0.015

Figure SF2a. Using the signatures of the top 100 gene markers of clinical response from the Ann Arbor lung dataset to define GSEA query gene sets for ‘responders’ and ‘non-responders’, we assessed their enrichment in the Boston lung dataset. The figure contains the plot of the running enrichment score, the maximum ES score and its corresponding nominal p-value.
DATASET: ANN ARBOR. PHENOTYPE: Non-responders vs Responders. GENE SET: BOSTON ‘non-responders’. ES = 0.59, NP < 0.001
DATASET: ANN ARBOR. PHENOTYPE: Non-responders vs Responders. GENE SET: BOSTON ‘responders’. ES = 0.48, NP = 0.005

Figure SF2b. Using the signatures of the top 100 gene markers of clinical response from the Boston lung dataset to define GSEA query gene sets for ‘responders’ and ‘non-responders’, we assessed their enrichment in the Ann Arbor lung dataset. The figure contains the plot of the running enrichment score, the maximum ES score and its corresponding nominal p-value.

Table ST1. Gender Dataset Marker Genes

There are 6 significant marker genes at 5% in Females

GENE | LOCATION | DESCRIPTION
Hs.83623 | chrXq13.2 | Homo sapiens cDNA: FLJ21545 fis, clone COL06195
Hs.83623 | chrXq13.2 | Homo sapiens cDNA: FLJ21545 fis, clone COL06195
FAM16AX | chrXp22.32 | Family with sequence similarity 16, member A, X-linked
EIF1A | chrXp22.12 | Eukaryotic translation initiation factor 1A
DDX3X | chrXp11.4 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, X-linked
216342_x_at | chr20p13 | na

There are 12 significant marker genes at 5% in Males

GENE | LOCATION | DESCRIPTION
RPS4Y | chrYp11.31 | Ribosomal protein S4, Y-linked
DDX3Y | chrYq11.21 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
SMCY | chrYq11.222 | Smcy homolog, Y-linked (mouse)
EIF1AY | chrYq11.222 | Eukaryotic translation initiation factor 1A, Y-linked
EIF1AY | chrYq11.222 | Eukaryotic translation initiation factor 1A, Y-linked
USP9Y | chrYq11.21 | Ubiquitin specific protease 9, Y-linked (fat facets-like, Drosophila)
DDX3Y | chrYq11.21 | DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
CYorf15B | chrYq11.222 | Chromosome Y open reading frame 15B
USP34 | chr2p15 | Ubiquitin specific protease 34
Hs.433656 | na | Homo sapiens mRNA; cDNA DKFZp434I143 (from clone DKFZp434I143)
C1orf34 | Chr 1 | Chromosome 1 open reading frame 34
RAP1GA1 | Chr 1p36.1 | RAP1, GTPase activating protein 1

Table ST2. Dataset: Lymphoblastoid cell lines

GENE SET | SOURCE | ES | NES | NOM p-val | FDR q-val
Enriched in Males
s1:Testis expressed genes | Experimental GNF | 0.567 | 1.881 | < 0.001 | 0.075
Enriched in Females
s2:Female reproductive tissue expressed genes | Experimental GNF | -0.434 | -1.758 | 0.009 | 0.158
s2:Proteasome degradation genes | GenMAPP | -0.594 | -1.778 | 0.011 | 0.105

Table ST2. Results of the Gender example GSEA restricting the expression dataset to contain only autosomal genes. The testis germ cell gene set is still enriched in males and the uterus & ovarian expression gene set is still enriched in females.

Table ST3a

Dataset | Phenotype | # Common genes | # of Genes in Dataset | GSEA p-val | Hypergeometric p-val
Lung (Bhattacharjee et al) | Non-responders vs. Responders | 10 | 6645 | 0.015 | 2.06E-08
Melanoma | Non-responders vs. Responders | 7 | 8855 | 0.298 | 1.46E-05
Lymphoma (Shipp et al) | Non-responders vs. Responders | 6 | 8855 | 0.476 | 0.00012
GCM (Ramaswamy et al) | Tumor vs. Normal | 5 | 6645 | 0.11 | 0.0035
Lung (Garber et al) | Non-responders vs. Responders | 4 | 3997 | 0.016 | 0.0961
Breast Cancer (West et al) | ER negative vs. ER positive | 4 | 5391 | 0.197 | 0.0352
Astrocytoma (Khatua et al) | High grade vs. low grade | 4 | 6645 | 0.117 | 0.0161
Prostate (Febbo et al) | Non-responders vs. Responders | 3 | 8855 | 0.097 | 0.0254
Breast Cancer (vant Veer et al) | Non-responders vs. Responders | 3 | 6151 | 0.104 | 0.0600
Medulloblastoma (Pomeroy et al) | Non-responders vs. Responders | 2 | 5391 | 0.303 | 0.2740
Metastasis (Ramaswamy et al) | Metastatic vs. Non-metastatic | 1 | 6645 | 0.151 | 0.4380
Lymphoma (Monti et al) | Non-responders vs. Responders | 0 | 5391 | 0.512 | 0.8430

Assessment of marker overlap and GSEA enrichment of the Ann Arbor ‘responders’ gene set in other outcome related datasets. The target phenotype is shown in red. The overlap of the top 100 gene markers in the Ann Arbor ‘responders’ and the top 100 genes in each dataset distinction was quantified using a hypergeometric distribution and the p-value is reported in the last column. About half of the datasets show a significant overlap supported by a very small number of common marker genes. In contrast, the only datasets passing the GSEA test, using the enrichment of the top 100 Ann Arbor ‘responders’ as a single gene set queried against the ranked list with respect to the relevant phenotype, are other lung outcome datasets.

Table ST3b

Dataset | Phenotype | # Common genes | # of Genes in Dataset | GSEA p-val | Hypergeometric p-val
Lung (Banerjee et al) | Non-responders vs. Responders | 8 | 6645 | 0.001 | 1.54E-05
Melanoma | Non-responders vs. Responders | 7 | 8855 | 0.104 | 1.46E-05
Astrocytoma (Khatua et al) | High grade vs. low grade | 6 | 6645 | 0.09 | 0.0006
Lung (Garber et al) | Non-responders vs. Responders | 5 | 3997 | 0.031 | 0.034
Breast Cancer (West et al) | ER negative vs. ER positive | 5 | 5391 | 0.112 | 0.009
Breast Cancer (Vant Veer et al) | Non-responders vs. Responders | 5 | 6151 | 0.134 | 0.005
GCM (Ramaswamy et al) | Tumor vs. Normal | 5 | 6645 | 0.413 | 0.0035
Medulloblastoma (Pomeroy et al) | Non-responders vs. Responders | 3 | 5391 | 0.313 | 0.108
Lymphoma (Shipp et al) | Non-responders vs. Responders | 2 | 5391 | 0.398 | 0.274
Metastasis (Ramaswamy et al) | Metastatic vs. Non-metastatic | 1 | 6645 | 0.411 | 0.438
Prostate (Febbo et al) | Non-responders vs. Responders | 1 | 8855 | 0.512 | 0.307
Lymphoma (Monti et al) | Non-responders vs. Responders | 0 | 8855 | 0.413 | 0.677

Assessment of marker overlap and GSEA enrichment of the Ann Arbor ‘non-responders’ gene set in other outcome related datasets. This is the same procedure as was used in Table ST3a but using the ‘non-responders’ and corresponding phenotypes in the other datasets.

Table ST4

In the rows below, "(name displaced)" marks a gene set label and "(value displaced)" a numeric entry that the extracted text no longer places unambiguously. The displaced labels are listed after each block.

GS | ES | NES | NOM p-val | FDR q-val

Responders, Ann Arbor
SIG_BCR_Signaling_Pathway | -0.44916 | -1.5619 | 0.03156 | (value displaced)
CR_IMMUNE_FUNCTION | -0.47152 | -1.508 | 0.06238 | 1
RAP_UP | -0.33092 | -1.4583 | 0.03759 | 0.98011
ST_ADRENERGIC | -0.41027 | -1.45 | 0.05776 | 0.7742
GLUCOSE_UP | -0.40967 | -1.4397 | 0.06847 | 0.65725
SIG_PIP3_signaling_in_B_lymphocytes | -0.42587 | -1.4023 | 0.07266 | 0.68813
cell_growth_and_or_maintenance | -0.33859 | -1.3096 | 0.1133 | 0.95426
nos1Pathway | -0.40827 | -1.2902 | 0.1493 | 0.91955
ST_GRANULE_CELL_SURVIVAL_PATHWAY | -0.37556 | -1.2863 | 0.1556 | 0.83271
ST_MONOCYTE_AD_PATHWAY | -0.36374 | -1.2613 | 0.1515 | 0.83686
Wnt_Signaling | -0.33115 | -1.2403 | 0.1617 | 0.83842
ST_Differentiation_Pathway_in_PC12_Cells | -0.33406 | -1.2206 | 0.1772 | 0.83504
ST_T_Cell_Signal_Transduction | -0.37518 | -1.2042 | 0.2451 | 0.82426
(name displaced) | -0.34 | -1.1747 | 0.2224 | 0.86419
biopeptidesPathway | -0.31055 | -1.1454 | 0.2481 | 0.90438
tcrPathway | -0.31297 | -1.1096 | 0.321 | 0.97407
ghPathway | -0.33562 | -1.0992 | 0.3223 | 0.95137
HTERT_DOWN | -0.26974 | -1.0941 | 0.3012 | 0.91663
mprPathway | -0.33626 | -1.075 | 0.3748 | 0.92824
ST_JNK_MAPK_Pathway | -0.32634 | -1.0729 | 0.3602 | 0.88796

Responders, Boston
SIG_BCR_Signaling_Pathway | -0.47452 | -1.7812 | 0.004107 | 0.51251
SIG_PIP3_signaling_in_B_lymphocytes | -0.47245 | -1.6285 | 0.02367 | 0.69835
EMT_DOWN | -0.45166 | -1.5797 | 0.0426 | 0.63364
ST_ADRENERGIC | -0.43025 | -1.5347 | 0.02828 | 0.63162
(name displaced) | -0.4324 | -1.5079 | 0.02887 | 0.59296
SIG_CD40PATHWAYMAP | -0.40587 | -1.4118 | 0.04555 | 0.8328
tcrPathway | -0.36864 | -1.3817 | 0.1229 | 0.84425
MAP00280_Valine_leucine_and_isoleucine_degradation | -0.45691 | -1.3811 | 0.1507 | 0.74114
ST_GRANULE_CELL_SURVIVAL_PATHWAY | -0.38853 | -1.348 | 0.1217 | 0.7743
cxcr4Pathway | -0.38158 | -1.3339 | 0.1118 | 0.74918
pdgfPathway | -0.35555 | -1.3225 | 0.138 | 0.7186
gpcrPathway | -0.35006 | -1.3163 | 0.1504 | 0.67642
MAP00380_Tryptophan_metabolism | -0.37677 | -1.309 | 0.1607 | 0.64555
ST_MONOCYTE_AD_PATHWAY | -0.36001 | -1.2833 | 0.1663 | 0.67407
MAP00361_gamma_Hexachlorocyclohexane_degradation | -0.40236 | -1.273 | 0.1956 | 0.65948
GPCRs_Class_A_Rhodopsin-like | -0.35459 | -1.2577 | 0.158 | 0.66074
CR_TRANSCRIPTION_FACTORS | -0.30923 | -1.2505 | 0.1795 | 0.64271
MAP00071_Fatty_acid_metabolism | -0.36349 | -1.2462 | 0.2104 | 0.61947
biopeptidesPathway | -0.31535 | -1.2427 | 0.1585 | 0.59507
MAP00350_Tyrosine_metabolism | -0.40915 | -1.2312 | 0.2667 | 0.59377

Displaced in the Responders blocks: the labels ST_Gaq_Pathway and ST_Differentiation_Pathway_in_PC12_Cells, and the value 1.

Non-Responders, Ann Arbor
MAP00010_Glycolysis_Gluconeogenesis | 0.49907 | 1.793 | 0.007353 | 0.27622
ceramidePathway | 0.52954 | 1.724 | 0.008351 | 0.25968
INSULIN_2F_UP | 0.48759 | 1.6979 | 0.01062 | 0.21673
HTERT_UP | 0.47189 | 1.678 | 0.01724 | 0.18731
p53_signalling | 0.3679 | 1.6644 | 0.006579 | 0.17044
CR_TRANSPORT_OF_VESICLES | 0.50703 | 1.6579 | 0.01822 | 0.14963
breast_cancer_estrogen_signalling | 0.37436 | 1.5762 | 0.00489 | 0.22292
Glycolysis_and_Gluconeogenesis | 0.49041 | 1.5716 | 0.04825 | 0.20227
(name displaced) | 0.3946 | 1.5365 | 0.0437 | 0.23022
drug_resistance_and_metabolism | 0.33507 | 1.5111 | 0.01911 | 0.24449
raccycdPathway | 0.44724 | 1.475 | 0.05308 | 0.27751
MAP00240_Pyrimidine_metabolism | 0.49394 | 1.4467 | 0.09534 | 0.29952
RAP_DOWN | 0.41902 | 1.4202 | 0.1456 | 0.3243
PGC | 0.33268 | 1.4087 | 0.09011 | 0.32372
CR_CELL_CYCLE | 0.38487 | 1.4032 | 0.1188 | 0.31071
FRASOR_ER_DOWN | 0.4861 | 1.3772 | 0.1717 | 0.33149
fmlppathway | 0.40972 | 1.3672 | 0.09375 | 0.32909
vegfPathway | 0.46551 | 1.3557 | 0.1293 | 0.32915
p38mapkPathway | 0.35483 | 1.3555 | 0.08787 | 0.31204
LEU_DOWN | 0.40499 | 1.3407 | 0.192 | 0.32247

Non-Responders, Boston
(name displaced) | 0.43052 | 1.7897 | 0.006024 | 0.26871
(name displaced) | 0.5556 | 1.7463 | 0.03326 | 0.18368
INSULIN_2F_UP | 0.4398 | 1.7012 | 0.01841 | 0.16595
(name displaced) | 0.42546 | 1.6223 | 0.06765 | 0.20881
(name displaced) | 0.4453 | 1.5376 | 0.09958 | 0.27468
mRNA_splicing | 0.44383 | 1.5098 | 0.09877 | 0.27035
MAP00240_Pyrimidine_metabolism | 0.41786 | 1.4369 | 0.118 | 0.34991
LEU_UP | 0.32229 | 1.3734 | 0.09073 | 0.42796
HOXA9_UP | 0.38558 | 1.3695 | 0.1127 | 0.38965
Glycolysis_and_Gluconeogenesis | 0.41298 | 1.3166 | 0.1934 | 0.45523
RAP_DOWN | 0.33294 | 1.293 | 0.1797 | 0.46433
PGC | 0.27095 | 1.2576 | 0.191 | 0.49977
MAP00010_Glycolysis_Gluconeogenesis | 0.34689 | 1.2511 | 0.1856 | 0.47527
CR_TRANSPORT_OF_VESICLES | 0.34788 | 1.2205 | 0.2119 | 0.50251
FRASOR_ER_DOWN | 0.26967 | 1.1163 | 0.3275 | 0.71557
GLUCOSE_UP | 0.28339 | 1.1119 | 0.3172 | 0.68327
cell_motility | 0.26254 | 1.1109 | 0.3111 | 0.64538
MAP00230_Purine_metabolism | 0.26108 | 1.0585 | 0.3583 | 0.73718
GLUT_UP | 0.23111 | 1.0581 | 0.4103 | 0.69916
(name displaced) | 0.275 | 1.0376 | 0.3992 | 0.71112

Displaced in the Non-Responders blocks: the labels Proteasome_Degradation (twice), HTERT_UP, GLUT_DOWN, LEU_DOWN and P53_UP.

The top 20 enriched pathways for both ‘responder’ and ‘non-responder’ phenotypes using GSEA with the S2 (functional) database against the Ann Arbor and Boston lung datasets. Two of the enriched pathways at FDR < 0.25 are common on the non-responders side (telomerase and insulin 2F).

Table ST5

Permutation Type | Number of Gene Sets Enriched in Males (FDR < 0.25) | Number of Gene Sets Enriched in Females (FDR < 0.25)
Sample Label Permutation | 1 | 4
Gene Tag Permutation | 2 | 38

Gene tag permutation ignores the gene-gene correlation structure in the dataset and can produce overly optimistic results when assessing significance. This may lead to too many sets passing an FDR cutoff of 0.25. For example, the table shows the differences between phenotype label and gene tag permutations for the Gender dataset example. The large number of gene sets passing the test using gene tag permutations (38) is likely to include many false positives. This is an extreme case. In general, gene tag permutation produces about twice the number of significant gene sets compared with phenotype label permutation for the same FDR threshold.
Even though we do not recommend gene tag permutation as the default, it may be useful when the number of samples is too small to generate a sufficient number of phenotype label permutations.

2 Detailed description of the GSEA method.

Inputs to GSEA.

Expression dataset D with levels of N genes for k samples.

A sorted gene list L and an associated vector R of gene-phenotype correlations; alternatively, a ranking procedure to produce L (including a ranking metric M, e.g., t-test or signal to noise ratio, and either 1) a phenotype class vector or 2) a profile of interest C):
1. C(i) = 0 or 1 according to the phenotype of sample i (class vector)
2. C(i) = an expression level in sample i.

An independently derived gene set G of $N_h$ genes (e.g., genes in a pathway or a cytogenetic band of interest), or an entire database or collection of $N_G$ gene sets.

Results from GSEA.

An enrichment score ES(G, L, R) that estimates the “enrichment” of G at the extremes of L.

A nominal p-value that estimates the statistical significance of the ES.

When a collection of gene sets is used, GSEA also produces cross-gene-set normalized enrichment scores (NES), and two corrections to account for Multiple Hypotheses Testing in measuring statistical significance: Family Wise Error Rate (FWER) and False Discovery Rate (FDR).

Enrichment score calculation.

Legend:
$r_j$ = correlation of gene $g_j$ with C using metric M.
N = total number of genes in L.
$N_h$ = total number of genes in G.
I(x) = indicator function equal to 1 if argument x is true and 0 otherwise.

1. Read the input gene list L, or rank order the N genes in D according to the correlation R of the genes’ expression profiles with C using metric M to form the list $L = \{g_1, \ldots, g_N\}$.

2. Compute a random walk “running score” $S_p$ where every “hit” (gene in G) increases the score by $|r_j|^p / N_R$ and every “miss” (gene not in G) decreases the score by $1/(N - N_h)$:

$$S_p(G,L,R,i) = \sum_{j=1}^{i} \left[ \underbrace{I(g_j \in G)\,\frac{|r_j|^p}{N_R}}_{\text{hit}} - \underbrace{I(g_j \notin G)\,\frac{1}{N - N_h}}_{\text{miss}} \right], \qquad N_R = \sum_{j=1}^{N} I(g_j \in G)\,|r_j|^p$$

$$S_p^{+} = \max_{i=1,\ldots,N} S_p(G,L,R,i), \qquad S_p^{-} = \min_{i=1,\ldots,N} S_p(G,L,R,i)$$

$$ES(G,L,R) = \begin{cases} S_p^{+} & \text{if } |S_p^{+}| > |S_p^{-}| \\ S_p^{-} & \text{otherwise} \end{cases}$$

The ES is the maximum deviation of the running score from zero. When p=0, the ES is identical to the Kolmogorov-Smirnov (KS) statistic that measures the difference between two cumulative distribution functions (in this case, the hits and the misses). For a randomly distributed G, ES(G, L, R) will be relatively small, but a G with a non-random distribution may attain extreme values. This form of the statistic produces an un-weighted, rank-only based measure of enrichment, which is simple and elegant but has the disadvantage of producing high scores for some sets with a concentration of hits in the middle of the list. In most cases these gene sets would be considered false positives. In addition, this form is not very sensitive when the enrichment of a gene set derives from a small subset of hits near the top or bottom of the list. For these reasons we view p = 0 as a special case, which is more useful if one is interested in detecting gene sets with general non-random distributions and is willing to accept more false positives.

In typical applications we seek gene sets specifically enriched or concentrated at the top or bottom of the list. In this case we set p = 1 (the default setting), where S is weighted by the correlation of the genes in the gene list and the scores better reflect enrichment produced by complete or partial differential expression of the gene set at the top or bottom of the list. In most of the biological datasets that we have studied, this weighting scheme performed best at recovering known biological results while at the same time reducing the number of false discoveries.
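The running-score calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the GSEA program's own code: the function name `enrichment_score`, its argument names, and the tie-breaking rule (a tie between the positive and negative extremes returns the negative one) are assumptions made for this example.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Weighted running-sum enrichment score for one gene set.

    ranked_genes: gene identifiers in list order (the list L).
    correlations: gene-phenotype correlations r_j, same order as ranked_genes.
    gene_set:     identifiers of the genes in G.
    p:            weighting exponent (0 = un-weighted, 1 = default, 2 = over-weighted).
    """
    members = set(gene_set)
    hits = [g in members for g in ranked_genes]
    n_total = len(ranked_genes)
    n_hits = sum(hits)
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    # N_R: sum of |r_j|^p over the genes that belong to the set.
    weights = [abs(r) ** p if h else 0.0 for r, h in zip(correlations, hits)]
    n_r = sum(weights)
    if n_r == 0.0:
        raise ValueError("all hit weights are zero; the score is undefined")
    miss_step = 1.0 / (n_total - n_hits)

    running = s_max = s_min = 0.0
    for weight, is_hit in zip(weights, hits):
        running += weight / n_r if is_hit else -miss_step
        s_max = max(s_max, running)
        s_min = min(s_min, running)

    # ES is the extreme with the larger magnitude (assumed: ties go to s_min).
    return s_max if s_max > -s_min else s_min


# Toy example with six ranked genes (hypothetical data).
genes = ["a", "b", "c", "d", "e", "f"]
corr = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
print(enrichment_score(genes, corr, {"a", "c"}, p=1))  # 0.75
```

With p = 0 every hit contributes 1/N_h, so the same function also yields the un-weighted, KS-like statistic.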
In some selected cases, it may be more appropriate to set p = 2 to make the weight proportional to the square of the correlations. This is useful if one expects, for example, a potentially small subset of the gene set to be significantly enriched at the top or bottom of the list (e.g., see the Downs example). This setting is more sensitive to partial enrichment, but at the expense of producing more false positives and treating the distribution of hits more unevenly according to the correlations.

Estimating ES Significance.

The significance of an observed ES(G, L, R) for a gene set G is assessed by comparing it with the set of scores $ES_{null}(G, \pi)$ computed for the same dataset with the phenotype labels randomly permuted. The null hypothesis is that the phenotypes are interchangeable and that the enrichment is produced by chance.

Compute a set of randomly permuted phenotype labels $\{C_\pi\}$, one for each permutation $\pi$. For each $C_\pi$, reorder the gene list L to produce the corresponding $L_\pi$ and $R_\pi$, and compute a vector of corresponding ES values $ES_{null}(G, \pi) = ES(G, L_\pi, R_\pi)$.

Estimate a standard nominal p-value for G by counting how many values of $ES_{null}(G, \pi)$, positive or negative according to the sign of the observed ES(G, L, R), are as extreme as or more extreme than ES(G, L, R):

$$\text{nominal p-val}(G,L,R) = \frac{\#\{\pi : ES_{null}(G,\pi) \ge ES(G,L,R)\}}{\#\{\pi : ES_{null}(G,\pi) \ge 0\}}$$

for positive ES(G, L, R), and the corresponding expression for negative ones.

Adjusting for Multiple Hypothesis Testing (FWER and FDR).

Determine ES(G, L, R) for each gene set in the collection or database.

Compute a matrix $ES_{null}(G, \pi)$ for all G and a set of permutations $\pi$.

Rescaling (row) normalization. Since the distribution of each row in the $ES_{null}(G, \pi)$ matrix depends on gene set size, we normalize the ES for each gene set before adjusting for multiple hypotheses testing, to obtain normalized ES scores (NES). We do this with multiplicative rescaling: we separately divide the positive and negative $ES_{null}(G, \pi)$ values for a given gene set G by their mean. The “observed” ES(G, L, R) is also divided by the corresponding mean of the positive or negative $ES_{null}(G, \pi)$ values according to its sign. This normalization procedure is very effective at making all the null distributions for different gene sets collapse into one. In addition to its empirical effectiveness, this procedure is theoretically motivated by the asymptotic multiplicative scaling of the KS distribution as a function of size (von Mises, R. 1964, Mathematical Theory of Probability and Statistics, New York Academic Press). The matrix of normalized random NES values is $NES_{null}(G, \pi)$.

Family Wise Error Rate (FWER). To compute the FWER for a given set $\hat{G}$ we create a histogram of the maximum NES value over all gene sets G for each random permutation, and use this distribution to compute how many extreme values are equal to or greater than the observed one:

$$\text{FWER p-val}(\hat{G},L,R) = \frac{\#\{\pi : \max_{G} NES_{null}(G,\pi) \ge NES(\hat{G},L,R)\}}{\#\{\pi : \max_{G} NES_{null}(G,\pi) \ge 0\}}$$

for positive $ES(\hat{G}, L, R)$, and the corresponding expression for negative ones.

False Discovery Rate (FDR). To compute the FDR q-values we compute the ratio of the null and observed (positive and negative score) CDF distributions for a given gene set. For a gene set $\hat{G}$ with positive NES this is given by:

$$\text{FDR q-value}(\hat{G},L,R) = \frac{\#\{(G,\pi) : NES_{null}(G,\pi) \ge NES(\hat{G},L,R)\} \,/\, \#\{(G,\pi) : NES_{null}(G,\pi) \ge 0\}}{\#\{G : NES(G,L,R) \ge NES(\hat{G},L,R)\} \,/\, \#\{G : NES(G,L,R) \ge 0\}}$$

and the corresponding expression for gene sets with negative NES. Notice that the positive and negative sides are considered as independent CDFs and the counts are normalized accordingly. This normalization is particularly useful when the distributions of gene correlation or NES are skewed or unequal in terms of positive and negative entries.

The final report of GSEA results includes a list of the gene sets sorted by their NES values and columns for their nominal and FWER p-values and FDR q-values. The nominal p-values are not corrected for multiple testing and are usually quite optimistic.
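As a rough sketch of the three quantities described above (nominal p-value, rescaling normalization, and FDR q-value for a positive NES), the following Python functions operate on plain lists of scores. The function names and the list-of-lists layout of the null matrix are assumptions made for illustration, and the ratio is returned as computed, without any capping; the actual GSEA program may differ in such details.

```python
def _mean(values):
    return sum(values) / len(values)


def nominal_p_value(observed_es, null_es):
    """Fraction of same-sign null scores at least as extreme as the observed ES."""
    if observed_es >= 0:
        same_sign = [x for x in null_es if x >= 0]
        extreme = [x for x in same_sign if x >= observed_es]
    else:
        same_sign = [x for x in null_es if x < 0]
        extreme = [x for x in same_sign if x <= observed_es]
    if not same_sign:
        raise ValueError("no null scores with the same sign as the observed ES")
    return len(extreme) / len(same_sign)


def normalize_scores(observed_es, null_es):
    """Divide positive and negative scores by the mean of the same-sign nulls."""
    positives = [x for x in null_es if x >= 0]
    negatives = [x for x in null_es if x < 0]
    if not positives or not negatives:
        raise ValueError("need both positive and negative null scores")
    pos_mean = _mean(positives)
    neg_mean = abs(_mean(negatives))

    def scale(x):
        return x / pos_mean if x >= 0 else x / neg_mean

    return scale(observed_es), [scale(x) for x in null_es]


def fdr_q_value_positive(nes_hat, observed_nes, null_nes):
    """FDR q-value for a gene set with positive NES.

    observed_nes: observed NES, one per gene set.
    null_nes:     null NES values, one inner list per permutation.
    """
    null_pos = [x for row in null_nes for x in row if x >= 0]
    obs_pos = [x for x in observed_nes if x >= 0]
    null_frac = sum(x >= nes_hat for x in null_pos) / len(null_pos)
    obs_frac = sum(x >= nes_hat for x in obs_pos) / len(obs_pos)
    return null_frac / obs_frac
```

For example, with the hypothetical null scores `[0.2, 0.4, -0.1, -0.3, 0.6, 0.8]` and an observed ES of 0.6, two of the four non-negative null scores are at least 0.6, so `nominal_p_value` returns 0.5.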
Compared with the nominal p-values, FWER p-values tend to be more stringent and often yield no significant gene sets. For hypotheses generation and general use the FDR q-values may be more appropriate. We assess statistical significance using an FDR q-value threshold of 0.25 (corresponding to at most one out of four results being a false positive).

2.1 Description of complementary statistics.

The GSEA program computes several additional statistics that may be useful to the sophisticated user:

Tag %: The percentage of gene tags before (for positive ES) or after (for negative ES) the peak in the running enrichment score S. The larger the percentage, the more tags in the gene set contribute to the final enrichment score.

Gene %: The percentage of genes in the gene list L before (for positive ES) or after (for negative ES) the peak in the running enrichment score; it thus gives an indication of where in the list the enrichment is attained.

Signal strength: The enrichment signal strength, which combines the two previous statistics: (Tag %) x (1 – Gene %) x (N / (N – Nh)). The larger this quantity, the stronger the gene set as a whole. If the genes in the gene set are in the first Nh positions in the list, the signal strength is maximal, i.e., 1. If the genes are more spread out through the list, the signal strength decreases towards 0.

FDR (median): An additional FDR q-value computed by using a median null distribution. These values are in general more optimistic than the regular FDR q-values, as the median null is representative of the typical random permutation null rather than the extreme ones. For this reason, we don’t recommend it for common use. However, the FDR median is sometimes useful as a binary indicator function (zero vs. non-zero). When it is zero, it indicates that for those extreme NES values the observed scores are larger than the values obtained by at least half of the random permutations. One advantage of selecting gene sets in this manner (FDR median = 0) is that a predefined threshold is not required. In practice the gene sets selected in this way appear to be roughly the same as those for which the regular FDR is less than 0.25. For example, in the Leukemia ALL/AML example the FDR median is zero for the top 5 sets (4 of which have FDR < 0.25).

glob.p.val: A global nominal p-value for each gene set’s NES. This is estimated by computing the number of sets that are more extreme than the observed value, $\#\{G : NES_{null}(G,\pi) \ge NES(\hat{G},L,R)\}$, in each random permutation. Notice that in this calculation, in contrast with the FDR, the number of observed scores larger than that value, $\#\{G : NES(G,L,R) \ge NES(\hat{G},L,R)\}$, is not used.

2.2 Description of GSEAp output.

The results of the GSEA are stored in the “output.directory” specified by the user as part of the input parameters to the GSEAp program. The results files are:

Two tab-separated global results text files (one for each phenotype). These files are labeled according to the doc string prefix and the phenotype name from the CLS (class) file: <doc.string>.results.report.<phenotype>.txt

One set of global plots. They include a) gene list correlation profile, b) global observed and null densities, c) heat map for the entire sorted dataset, and d) p-values vs. NES plot. These plots are in a single JPEG file named <doc.string>.global.plots.<phenotype>.jpg. When the program is run interactively these plots appear in a window in the R GUI. An example of this set of global plots for the Leukemia S1 dataset is shown in Fig. x.

A variable number of tab-separated gene set results files, according to how many sets pass any of the significance thresholds (“nom.p.val.threshold,” “fwer.p.val.threshold,” and “fdr.q.val.threshold”) and how many are specified in the “topgs” parameter. These files are named: <doc.string>.<gene set name>.report.txt.
A variable number of gene set plots (one for each gene set report file). These plots include a) gene set running enrichment “mountain” plot, b) gene set null distribution and c) heat map for genes in the gene set. These plots are stored in a single JPEG file named <doc.string>.<gene set name>.jpg.

The format (columns) for the global result files is as follows.

GS : Gene set name.
SIZE : Number of genes in the set.
SOURCE : Set definition or source.
ES : Enrichment score.
NES : Normalized enrichment score (multiplicative rescaling).
NOM p-val : Nominal p-value (from the null distribution of the gene set).
FDR q-val : False discovery rate q-values.
FWER p-val : Family wise error rate p-values.
Tag % : Percent of gene set before running enrichment peak.
Gene % : Percent of gene list before running enrichment peak.
Signal : Enrichment signal strength.
FDR (median) : FDR q-values from the median of the null distributions.
glob.p.val : P-value using a global statistic (number of sets above the given set’s NES).

The rows are sorted by the NES values (from maximum positive or negative NES to minimum).

The format (columns) for the individual gene set result files is as follows.

# : Gene number in the (sorted) gene set.
PROBE_ID : The gene name or accession number in the dataset.
SYMBOL : Gene symbol from the gene annotation file.
DESC : Gene description (title) from the gene annotation file.
LIST LOC : Location of the gene in the sorted gene list.
S2N : Signal to noise ratio (correlation) of the gene in the gene list.
RES : Value of the running enrichment score at the gene location.
CORE_ENRICHMENT : Is this gene in the “core enrichment” section of the list? Yes or No variable specifying whether the gene location is before (positive ES) or after (negative ES) the running enrichment peak.

The rows are sorted by the gene location in the gene list.
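The Tag %, Gene % and Signal columns listed above can be illustrated with a short Python sketch. It assumes 1-based positions in the sorted gene list and treats "before the peak" as positions up to and including the peak and "after the peak" as positions strictly beyond it; these conventions, and the function name `complementary_stats`, are assumptions for this example rather than the program's documented behavior. Percentages are returned as fractions between 0 and 1.

```python
def complementary_stats(hit_positions, n_genes, peak_position, positive_es=True):
    """Return (tag fraction, gene fraction, signal strength) for one gene set.

    hit_positions: 1-based list positions of the genes in the set.
    n_genes:       total number of genes N in the sorted list.
    peak_position: 1-based position of the running enrichment score peak.
    positive_es:   True if the ES is positive, False if negative.
    """
    hit_positions = list(hit_positions)
    n_hits = len(hit_positions)
    if positive_es:
        # Tags and genes before (up to and including) the peak.
        tags = sum(1 for pos in hit_positions if pos <= peak_position)
        gene_fraction = peak_position / n_genes
    else:
        # Tags and genes after the peak.
        tags = sum(1 for pos in hit_positions if pos > peak_position)
        gene_fraction = (n_genes - peak_position) / n_genes
    tag_fraction = tags / n_hits
    # Signal strength = (Tag %) x (1 - Gene %) x (N / (N - Nh)).
    signal = tag_fraction * (1.0 - gene_fraction) * n_genes / (n_genes - n_hits)
    return tag_fraction, gene_fraction, signal


# A set whose 10 genes occupy the first 10 of 100 positions has maximal signal.
tag, gene, signal = complementary_stats(range(1, 11), 100, 10, positive_es=True)
print(tag, gene, round(signal, 6))
```

A set spread evenly through the list reaches its peak late, which raises Gene % and pushes the signal strength towards 0.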
The function call to GSEA returns a two-element list containing the two global result reports as data frames ($report1, $report2).

Fig. x. Global plots for the Leukemia S1 example. On the top left side there is the gene list correlation profile. On the top right side one can see the probability density for the observed and null distributions. On the bottom left there is the heat map for the entire dataset sorted by signal to noise, and on the bottom right one can see a plot of the p-values vs. NES.

Fig x. Gene set plots for the 5q31 gene set in the Leukemia S1 example. The left plot is a running enrichment “mountain” plot that also shows the gene tags and the correlation profile (similar to the first plot in the global set); the one at the center is the probability density null distribution for the particular gene set. This plot shows the ES, NES and p-values for the set at the bottom of the plot. The right plot is a heat map for those genes in the gene set sorted by correlation to the phenotype.

2.3 Theoretical properties of gene tag and sample label permutation.

In this section we compute some properties of the enrichment statistic under gene tag scrambling and sample label scrambling. We provide some average enrichment scores and give an indication of the effect of dependence between genes on a dataset.

Tag scrambling:

Given a rank ordered list of N genes, where $N_H$ of the genes belong to a gene set G, we define $F_G$ and $F_{G^C}$ as the empirical distributions of the ranks of genes in the gene set and the ranks of genes in the complement, respectively. Note that the complement has $N - N_H$ genes. In general $N \gg N_H$; however, we do not make this assumption in the following computations. Our randomization procedure consists of randomly choosing $N_H$ genes as the gene set. We refer to this as “scrambling the tags”.
The one-sided enrichment statistic for a given rank ordering is defined as follows

$$ES(N, N_H) = \max_{i = 1, \ldots, N} \left| F_G(i) - F_{G^C}(i) \right|, \qquad (1.1)$$

the two-sided statistic is the same except that it has a symmetric distribution about the origin. We now compute properties of the above statistic, such as its distribution and expectation with respect to tag scrambling. The following quantity will be important in characterizing properties of the enrichment statistic:

$$n = \frac{(N - N_H)\, N_H}{N}.$$

When $N \gg N_H$, $n$ is approximately $N_H$. A reasonable approximation of the distribution function of the enrichment statistic for $n > 8$ is

$$\Pr\left( ES(N, N_H) < \lambda \right) = \sum_{k = -\infty}^{\infty} (-1)^k \exp\left( -2 k^2 \lambda^2 n \right). \qquad (1.2)$$

The number of terms required for the above series to converge depends on $\lambda$. As $\lambda$ approaches zero, more terms are required. From the above equation, we can compute the following density function for the enrichment statistic:

$$p(\lambda) = 4 \sum_{k = -\infty}^{\infty} (-1)^{k+1} k^2 \lambda n \exp\left( -2 k^2 \lambda^2 n \right). \qquad (1.3)$$

The number of terms required for this series to converge increases as $\lambda$ decreases. The average enrichment score is simply the expectation with respect to the above density:

$$es = E_p\left[ ES(N, N_H) \right] = \int_0^1 \lambda\, p(\lambda)\, d\lambda = \sum_{k = -\infty,\ k \neq 0}^{\infty} 4 (-1)^{k+1} \left[ \frac{\sqrt{2\pi}\, \mathrm{erf}\left( \sqrt{2n}\, k \right)}{16\, k \sqrt{n}} - \frac{1}{4} \exp\left( -2 k^2 n \right) \right]. \qquad (1.4)$$

Figure (plots append) displays the concentration of the enrichment density as $N_H$ increases for $N = 7{,}000$. The following table lists the average enrichment score as a function of $N$ and $N_H$.

N        N_H    n        es
1,000    10     9.9      0.2761
1,000    50     47.5     0.1260
1,000    100    90.0     0.0916
1,000    500    250.0    0.0549
7,000    10     9.986    0.2749
7,000    50     49.64    0.1233
7,000    100    98.57    0.0875
7,000    500    464.3    0.0403
20,000   10     9.995    0.2748
20,000   50     49.88    0.1230
20,000   100    99.5     0.0871
20,000   500    487.5    0.0393

Label scrambling: As described in the body of the paper, we estimate significance by permuting the class labels, rather than randomizing the members of the gene sets. This method of assessing significance preserves the correlation between genes and generally yields larger p-values since genes are dependent.
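Before turning to the label scrambling simulations, the tag scrambling quantities above can be evaluated numerically. The sketch below assumes that (1.2) is the series sum over all integers k of (-1)^k exp(-2 k^2 lambda^2 n) and that (1.4) is the corresponding term-by-term integral involving erf; both readings are written out in the comments. The function names and the truncation lengths are illustrative choices.

```python
import math


def effective_n(n_genes, n_hit):
    """n = (N - N_H) * N_H / N."""
    return (n_genes - n_hit) * n_hit / n_genes


def es_cdf(lam, n, terms=200):
    """Assumed form of (1.2): sum over all k of (-1)^k exp(-2 k^2 lam^2 n).
    The +k and -k terms are equal, so only k >= 1 is looped over."""
    total = 1.0  # k = 0 term
    for k in range(1, terms + 1):
        total += 2.0 * (-1) ** k * math.exp(-2.0 * k * k * lam * lam * n)
    return total


def average_es(n, terms=100_000):
    """Assumed form of (1.4): sum over k != 0 of
    4 (-1)^(k+1) [ sqrt(2 pi) erf(sqrt(2n) k) / (16 k sqrt(n)) - exp(-2 k^2 n) / 4 ].
    The alternating 1/k part converges slowly, hence the many terms."""
    total = 0.0
    for k in range(1, terms + 1):
        sign = 1.0 if k % 2 == 1 else -1.0  # (-1)^(k+1)
        erf_part = (math.sqrt(2.0 * math.pi) * math.erf(math.sqrt(2.0 * n) * k)
                    / (16.0 * k * math.sqrt(n)))
        exp_part = 0.25 * math.exp(-2.0 * k * k * n)
        total += 2.0 * 4.0 * sign * (erf_part - exp_part)  # factor 2: +k and -k
    return total


for n_genes, n_hit in [(1000, 10), (1000, 500), (7000, 100)]:
    n = effective_n(n_genes, n_hit)
    print(n_genes, n_hit, round(n, 3), round(average_es(n), 4))
```

Under these assumptions the printed averages should agree with the tabulated values to about four decimals (for example, roughly 0.2761 for N = 1,000 and N_H = 10), which offers a quick consistency check between the formula and the table.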
We will use simulations rather than analytic results to compute the analogous quantities from the previous section, i.e., the density of the enrichment statistic and the average enrichment score. Note that these quantities will be data dependent. The enrichment score is now computed as a function of the dataset $D$, the gene set $G$, and a ranking procedure $R$: $ES(D, G, R)$. The null distribution of the enrichment score is computed using label permutations as described in section (methods) of the main body of the paper. We cannot analytically compute this distribution as we could for gene tag randomization, but we can numerically estimate it. The output of the label permutation procedure is a set of enrichment scores

$$S = \left\{ ES(D_{\pi_1}, G, R), \ldots, ES(D_{\pi_m}, G, R) \right\},$$

computed over $m$ label permutations $\pi_1, \ldots, \pi_m$. We can approximate the density of the enrichment statistic as

$$p(\lambda) \approx \mathrm{histogram}(S). \qquad (1.5)$$

The average enrichment score can be approximated as

$$es = E_p\left[ ES(D, G, R) \right] \approx \frac{1}{|S|} \sum_{i=1}^{\#\mathrm{bins}} \mathrm{bin}_i \, \left| S_{\mathrm{bin}_i} \right|, \qquad (1.6)$$

where $\#\mathrm{bins}$ is the number of bins in the histogram, $\mathrm{bin}_i$ is the average value of the $i$th bin, $|S_{\mathrm{bin}_i}|$ is the number of elements in the set of scores $S$ that fall in the range of the $i$th bin, and $|S|$ is the total number of scores.

3 Additional applications of GSEA.

This section describes additional applications of GSEA that were omitted from the main text because of size restrictions. The first illustrates the use of GSEA to assess the enrichment of a single gene set to test a specific hypothesis. The second shows the detailed results of applying GSEA to the original diabetes dataset that was used in Mootha et al. 2003. The third example illustrates the use of GSEA on a Down syndrome dataset where it is appropriate to use overweighting (square of correlation, p=2) in the enrichment score.

3.1 Sonic hedgehog (Shh) pathway.

In this example we use GSEA with a preselected gene set to assess its enrichment as part of a single hypothesis. The dataset consists of several human medulloblastoma samples.
Medulloblastomas, the most common malignant brain tumor of childhood, have two generally accepted histological subclasses, desmoplastic and classic, whose differences can be seen clearly under the microscope. Desmoplastic medulloblastomas have been linked to dysregulated signaling of the Sonic hedgehog (Shh) pathway by their occurrence in Gorlin’s syndrome, an autosomal dominant disorder due to germline mutations of the Shh receptor PTCH or of SuFu, a downstream member of the pathway (Johnson RL, et al. Science 1996; 272:1668-1671; Hahn H, et al. Cell 1996; 85:841-851; Taylor et al., Nat Genet 2002; 31:306-10). This GSEA application is based on the earlier work in (Pomeroy et al. 2002), where a cluster of Shh-regulated genes was found to be among the most highly expressed marker genes significantly associated with sporadic desmoplastic medulloblastomas. The appearance of these genes implied that sporadic desmoplastic medulloblastomas, like Gorlin’s syndrome tumors, are characterized by activation of the Shh signaling pathway. This identification was done by a careful manual examination of highly differentially expressed genes.

GSEA was used to evaluate the enrichment of Shh genes in the gene list ranked by correlation with the classic vs. desmoplastic distinction using dataset B of Pomeroy et al. 2002. The Shh gene set was defined by manual curation from the literature and from previous experimental results. This set of 21 probes was indeed found to be significantly associated with desmoplastic tumors (GSEA p-value = 0.018). The enrichment results are shown in figure SF3. Notice that not all the genes in the pathway show coordinated behavior, but enough of them cluster at the top of the list to provide significant enrichment in the desmoplastic phenotype.

Figure SF. GSEA results for the sonic hedgehog pathway (Shh) in medulloblastoma. This set of genes is enriched and attains a nominal p-value of 0.018.
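The single-hypothesis test used here (one preselected gene set, significance from label permutation) can be sketched on synthetic data as follows. This is a simplified illustration under stated assumptions, not the GSEA program: it uses an unweighted running score, a two-sided comparison of absolute scores for the nominal p-value, and made-up data in which 12 of 21 set genes are shifted in one class. All function and variable names are hypothetical.

```python
import numpy as np


def signal_to_noise(data, labels):
    """Per-gene (mean_0 - mean_1) / (sd_0 + sd_1); rows of data are genes."""
    a, b = data[:, labels == 0], data[:, labels == 1]
    return (a.mean(axis=1) - b.mean(axis=1)) / (a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1))


def enrichment_score(scores, in_set):
    """Signed maximum deviation of F_G - F_GC along the ranked gene list."""
    order = np.argsort(-scores)           # genes most correlated with class 0 first
    hits = in_set[order]
    running = np.cumsum(hits) / hits.sum() - np.cumsum(~hits) / (~hits).sum()
    return float(running[np.argmax(np.abs(running))])


def nominal_p_value(data, labels, in_set, n_perm=500, seed=0):
    """Fraction of label permutations whose |ES| reaches the observed |ES|
    (a simplifying assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    observed = enrichment_score(signal_to_noise(data, labels), in_set)
    null = np.array([
        enrichment_score(signal_to_noise(data, rng.permutation(labels)), in_set)
        for _ in range(n_perm)
    ])
    return observed, float(np.mean(np.abs(null) >= abs(observed)))


# Synthetic example: 500 genes, 10 + 10 samples, a 21-gene set of which
# only the first 12 genes are shifted in class 0.
rng = np.random.default_rng(42)
data = rng.normal(size=(500, 20))
labels = np.repeat([0, 1], 10)
in_set = np.zeros(500, dtype=bool)
in_set[:21] = True
data[:12, :10] += 1.5

observed, p_value = nominal_p_value(data, labels, in_set)
print("ES:", round(observed, 3), "nominal p-value:", p_value)
```

As in the Shh example, only part of the set needs to move coherently for the set as a whole to stand out against the permutation null.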
The enrichment results for SHH are shown in the table below. The full set of result files for this example can be found in the GSEA/Examples/PTCH folder.

3.2 Diabetes.

This example presents the detailed results of applying the new GSEA method described in this paper to the original dataset of Mootha et al. 2003. Type 2 diabetes mellitus is a complex human disease, with both genetic and environmental factors. Numerous pathways, such as insulin signaling, free fatty acid metabolism, glucose transport, and ATP production, have been implicated in both in vitro and in vivo models of the disease. However, microarray studies of skeletal muscle, one of the major sites of insulin-mediated glucose disposal, failed to reveal any consistent, robust insights into disease mechanisms. Skeletal muscle biopsies from diabetics and normal controls have not shown large differences in gene expression (Mootha et al. 2003).

The original GSEA method was used to systematically interrogate the enrichment of a large collection (approximately 150) of functional gene sets in differentially expressed genes in 17 samples of skeletal muscle biopsies from patients with normal glucose tolerance (NGT) and 17 samples from patients with type 2 diabetes mellitus (DM2). Using traditional single gene analysis, no single gene was significantly differentially expressed between these classes. As reported in Mootha et al. 2003, this result is consistent with previous studies of DM2 muscle. While no single gene may show significant expression differences, entire pathways might be different between these disease states, and GSEA may be able to detect differential pathway enrichment.

The new version of GSEA was applied to this same dataset using the functional gene set database S2, which includes many of the 150 gene sets from the original paper, but also about 250 additional sets. Results are shown in the table below.
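The reasoning above, that a pathway can differ between classes even when no single gene does, can be illustrated with synthetic data. The sketch below is not the diabetes dataset and not the GSEA statistic: it assumes independent Gaussian noise, a modest common shift (0.3 standard deviations, an arbitrary value) in a 112-gene set, a permutation-based family-wise cutoff on the maximum |t| for single genes, and the mean t-statistic as the set-level summary. All names are hypothetical.

```python
import numpy as np


def t_statistics(data, labels):
    """Welch-style t-statistic per gene; rows of data are genes."""
    a, b = data[:, labels == 0], data[:, labels == 1]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] + b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se


rng = np.random.default_rng(1)
n_genes, n_per_class, set_size = 1000, 17, 112
labels = np.repeat([0, 1], n_per_class)
data = rng.normal(size=(n_genes, 2 * n_per_class))
in_set = np.zeros(n_genes, dtype=bool)
in_set[:set_size] = True
data[:set_size, :n_per_class] += 0.3      # modest shift of the set in class 0

observed_t = t_statistics(data, labels)
observed_set_mean = observed_t[in_set].mean()

n_perm = 300
null_max_abs_t = np.empty(n_perm)
null_set_mean = np.empty(n_perm)
for i in range(n_perm):
    t = t_statistics(data, rng.permutation(labels))
    null_max_abs_t[i] = np.abs(t).max()   # controls error over all single genes
    null_set_mean[i] = t[in_set].mean()   # set-level summary under the null

single_gene_cutoff = np.quantile(null_max_abs_t, 0.95)
n_single_gene_hits = int(np.sum(np.abs(observed_t) > single_gene_cutoff))
set_p_value = float(np.mean(np.abs(null_set_mean) >= abs(observed_set_mean)))

print("single genes passing the family-wise cutoff:", n_single_gene_hits)
print("set-level permutation p-value:", set_p_value)
```

With a shift this small, individual genes rarely clear a multiple-testing cutoff, while the average over the set is far outside its permutation null.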
There are two gene sets that pass the FDR < 0.25 threshold and are enriched in NGTs: VOXPHOS (oxidative phosphorylation, FDR = 0.08) and Electron Transport Chain (FDR = 0.05). This is consistent with the results from Mootha et al. 2003 and is striking because the members of the oxidative phosphorylation set show only a modest decrease (~15%) in DM2 vs. NGT normal controls as individual gene markers. However, from the perspective of the entire set, the difference is very strong: 87 out of the 112 members of the OxPhos pathway are diminished in DM2 relative to NGT. The new GSEA method continues to effectively detect this difference, as reflected by the enrichment test.

The following table shows the top enrichment results for the Diabetes dataset. The full set of result files for this example can be found in the GSEA/Examples/Diabetes_S2 folder.

3.3 Downs syndrome.

Down syndrome was the first chromosomal disorder to have been clinically identified (Down London Hosp Clin Lect Rep 3:259, 1866). It is characterized by trisomy 21, and results in mental retardation, dysmorphic facies, and hypotonia. In this example, GSEA was applied to gene expression profiles obtained from bone marrow of 14 individuals with Down syndrome (DS), as well as from 25 normal controls. We sought to determine whether “chromosomal gene sets” showed enrichment either in individuals with Down syndrome or in the controls. When we tested all autosomal as well as X and Y chromosomal gene sets, four sets were enriched in DS samples (see table below): chr21, chr21q21, chr21q22 and chr7p21. Note that only the following bands are used to probe the data set, as we restricted to sets with at least XXX genes. These results clearly indicate that the genes on chromosome 21 are more highly expressed in individuals with DS, compared to controls. The results are consistent with the gene dosage hypothesis (J Neurol.
2002 Oct; 249(10):1347-56), which suggests that DS results from a loss of dosage compensation (i.e., high expression of chromosome 21 genes). The enrichment of chr21 and some of its cytogenetic bands is clearly at the top of the list, but they do not achieve significance at FDR < 0.25 unless one uses the p=2 over-weighting parameter in the enrichment score. Entire chromosomes or large cytogenetic bands (e.g. chr21q22) are not likely to produce strong enrichment results due to the difficulty of producing coordinated expression behavior in such a large set of genes. (Note that the Y chromosome in the Gender example in the main body of the paper is an exception because it is rather small and is also an all-or-nothing signal that produces overwhelming enrichment.) In this situation, the over-weighting of the correlations at the top or bottom of the list can expose a subtle biological signal and increase the likelihood that such sets achieve significance. Thus, setting p=2 in the enrichment score can be a useful tool but should be used with caution, as it can also produce undesirable false positives. This is the only example in this paper that requires the use of p=2.

The following table shows the top enrichment results for the Downs dataset. The full set of result files for this example can be found in the GSEA/Examples/Downs_S1 folder.

4 Summary of top enrichment results for examples in paper

4.1 Gender S1

The following table shows the top enrichment results for the Gender dataset using S1. The full set of result files for this example can be found in the GSEA/Examples/Gender_S1 folder.
Enriched in Male

GS         SIZE  SOURCE            ES        NES      NOM p-val  FWER p-val  Tag %  Gene %  Signal
chrY       67    Chromosome Y      -0.71693  -2.3868  0          0           0.358  0.101   0.323
chrYq11    27    Cytogenetic band  -0.82603  -2.2437  0          0           0.333  0.0074  0.331
chrYp11    27    Cytogenetic band  -0.68959  -2.1303  0          0           0.481  0.165   0.403
chr4q13    79    Cytogenetic band  -0.43339  -1.5385  0.01394    0.6026      0.266  0.122   0.234
chr13q13   35    Cytogenetic band  -0.45164  -1.5205  0.03061    0.6286      0.314  0.159   0.265
chr11q22   44    Cytogenetic band  -0.44014  -1.4724  0.05848    0.6927      0.364  0.216   0.286
chr21q22   198   Cytogenetic band  -0.34127  -1.4524  0.02464    0.7147      0.369  0.225   0.288
chr6q24    35    Cytogenetic band  -0.43521  -1.4252  0.06452    0.7477      0.286  0.204   0.228
chr21      243   Chromosome 21     -0.31274  -1.3836  0.02725    0.7808      0.346  0.225   0.271
chr9q21    68    Cytogenetic band  -0.34658  -1.3476  0.07475    0.8158      0.338  0.208   0.269
chr2q32    57    Cytogenetic band  -0.38534  -1.2976  0.1515     0.8539      0.281  0.173   0.233
chr6q23    51    Cytogenetic band  -0.3667   -1.2561  0.2096     0.8929      0.216  0.0973  0.195
chr15q14   26    Cytogenetic band  -0.36987  -1.218   0.2351     0.9009      0.615  0.344   0.404
chr7q21    97    Cytogenetic band  -0.33056  -1.1911  0.2242     0.9109      0.247  0.169   0.206
chr5p13    72    Cytogenetic band  -0.3143   -1.1796  0.2672     0.9139      0.319  0.198   0.257
chr7q11    79    Cytogenetic band  -0.32827  -1.1769  0.2429     0.9149      0.266  0.185   0.217
chr9p21    42    Cytogenetic band  -0.33074  -1.1763  0.2426     0.9159      0.333  0.171   0.277
chr10p12   39    Cytogenetic band  -0.35044  -1.1453  0.2897     0.9299      0.333  0.197   0.268
chr16q22   118   Cytogenetic band  -0.27433  -1.1296  0.2735     0.9339      0.153  0.0913  0.139
chrXq22    60    Cytogenetic band  -0.29782  -1.1127  0.3247     0.9379      0.35   0.259   0.26

Enriched in Female

GS         SIZE  SOURCE            ES        NES      NOM p-val  FWER p-val  Tag %  Gene %  Signal
chrXq13    56    Cytogenetic band  0.57017   2.096    0          0           0.286  0.143   0.246
chr6q15    36    Cytogenetic band  0.50534   1.57     0.03373    0.5676      0.472  0.218   0.37
chrXp22    124   Cytogenetic band  0.35171   1.49     0.02107    0.6867      0.258  0.129   0.226
chr12q23   71    Cytogenetic band  0.40227   1.4586   0.06757    0.7247      0.352  0.17    0.293
chr2q14    29    Cytogenetic band  0.50669   1.4565   0.06827    0.7267      0.414  0.177   0.341
chr2p11    67    Cytogenetic band  0.40493   1.4491   0.05437    0.7327      0.224  0.104   0.201
chrXq24    35    Cytogenetic band  0.43249   1.3527   0.1235     0.8058      0.314  0.141   0.27
chr12q22   26    Cytogenetic band  0.43967   1.2993   0.18       0.8408      0.308  0.133   0.267
chr11p11   71    Cytogenetic band  0.32448   1.2677   0.1299     0.8579      0.211  0.13    0.184
chr2q31    87    Cytogenetic band  0.32082   1.2242   0.2239     0.8939      0.437  0.278   0.317
chrXq21    54    Cytogenetic band  0.34284   1.2091   0.2329     0.8999      0.352  0.219   0.275
chr13q14   90    Cytogenetic band  0.30353   1.171    0.2463     0.9179      0.3    0.196   0.242
chr1p13    120   Cytogenetic band  0.29615   1.1565   0.2871     0.9209      0.283  0.194   0.229
chr12q13   262   Cytogenetic band  0.26886   1.1522   0.2754     0.9229      0.359  0.239   0.276
chr5q15    29    Cytogenetic band  0.35382   1.1496   0.298      0.9249      0.31   0.113   0.275
chr1p21    36    Cytogenetic band  0.33477   1.1481   0.2945     0.9259      0.222  0.0554  0.21
chr3q29    53    Cytogenetic band  0.34099   1.1467   0.2925     0.9259      0.377  0.222   0.294
chr13q22   25    Cytogenetic band  0.37601   1.146    0.2979     0.9269      0.28   0.114   0.248
chr12q15   30    Cytogenetic band  0.34899   1.1352   0.2777     0.9309      0.433  0.221   0.338
chr1p32    68    Cytogenetic band  0.30976   1.1283   0.292      0.9339      0.353  0.252   0.265

4.2 Gender S2

The following table shows the top enrichment results for the Gender dataset using S2. The full set of result files for this example can be found in the GSEA/Examples/Gender_S2 folder.

4.3 P53 S2

The following table shows the top enrichment results for the P53 dataset using S2. The full set of result files for this example can be found in the GSEA/Examples/P53_S2 folder.

4.4 P53 S3

The following table shows the top enrichment results for the P53 dataset using S3. The full set of result files for this example can be found in the GSEA/Examples/P53_S3 folder.

4.5 Leukemia S1

The following table shows the top enrichment results for the Leukemia dataset using S1.
The full set of result files for this example can be found in the GSEA/Examples/ALLAML_S1 folder.

4.6 Leukemia S2

The following table shows the top enrichment results for the Leukemia dataset using S2. The full set of result files for this example can be found in the GSEA/Examples/ALLAML_S2 folder.

4.7 Lung A S2.

The following table shows the top enrichment results for the Lung A dataset using S2. The full set of result files for this example can be found in the GSEA/Examples/Lung_A_S2 folder.

4.8 Lung B S2.

The following table shows the top enrichment results for the Lung B dataset using S2. The full set of result files for this example can be found in the GSEA/Examples/Lung_B_S2 folder.

5 Defining gene sets and gene set databases.

GSEA can easily be used in combination with any ordering technique and any annotation or other gene set source. The selection of genes to include in a gene set depends on the question being asked. For example, to test for the presence of a growth factor signal transduction pathway, the gene set might include ligands, receptors, and known intermediate molecules that transmit the signal to the nucleus. Activation of a pathway can be assessed by including genes known to be transcriptionally regulated by the pathway. In all cases, some genes will be highly specific to the pathway (e.g., PTCH and SuFu in the Shh pathway) whereas other genes will be more general (e.g., RAS and MAPK) and less likely to be differentially expressed across samples or conditions. Both general and specific genes can be included, although genes with low specificity for the pathway will potentially lower the sensitivity of the test. Gene sets can be culled from Gene Ontology, from compilations of pathways such as KEGG, GenMAPP, HumanCyc and CGAP, or from sequence databases such as TRANSFAC.
Gene sets can also be identified from a group of genes clustered together (i.e., co-expressed) in an experiment, genes previously implicated in disease pathophysiology, genes in the same cytogenetic band, etc. In some studies there may be limited previously curated information about pathways or biological processes. In other cases, one may want to build a systematic database of gene sets that represents biological processes relevant to a large class of biological systems (e.g., tumors of many types). In both cases, it is very helpful to define gene sets computationally, using an analysis algorithm that extracts relevant molecular signatures from a large gene expression compendium.

For the purposes of this paper we built a collection of databases of gene sets that can be used to probe microarray data sets:

Database S1 (chromosomal location): This database consists of 24 sets corresponding to the genes on each of the 24 human chromosomes, as well as 301 sets corresponding to cytogenetic bands. This database can be helpful in identifying effects related to epigenetic silencing, dosage compensation, copy number polymorphisms, and aneuploidy or other chromosomal deletions/amplifications.

Database S2 (functional): This database includes 475 metabolic and signaling pathways gleaned from 8 publicly available, manually curated databases. In addition, there are 51 sets representing gene expression signatures of genetic and chemical perturbations that have been culled from experimental results in the literature.

Database S3 (motif-based): Each set contains genes that lie downstream of a motif that is conserved across the human, mouse, rat, and dog genomes. The motifs are catalogued in [Xiaohui Xie, et al.] and represent known or likely regulatory elements in promoters and 3’-untranslated regions.
Database S4 (correlated): Correlation gene sets are groups of genes defined by computationally mining large-scale experimental datasets for co-expressed genes.

As some versions of these databases were built at different times, according to when the analysis for each example was performed, we provide the specific versions named after the example where they were used. This allows full reproducibility of each example. Up-to-date, microarray-specific “canonical” versions of these databases are also distributed with the GSEA software, and those are the ones recommended for use in new examples and applications. In addition, we are in the process of creating a web site where these databases can be created and downloaded on a continuous basis.

6 Running GSEA with the GSEAPACK R package.

The GSEA program is provided on this paper’s web site in two ways: as a standalone R package including documentation (GSEAPACK-1.0.zip), and as an analysis module in the GenePattern environment (ref). There is also a zip file (GSEA.Examples.zip) that contains all the data, R scripts and results of the examples described in the paper.

Running the R package: These are the instructions to run GSEA on your machine. You need to install release 2.0 or later of R. Copy the GSEAPACK-1.0.zip file to your computer. Install the GSEAPACK-1.0.zip package in your R environment by running the Rgui and then clicking on “Install package(s) from local zip files” in the “Packages” menu. Once this is completed, type library() at the R prompt and you should see a list of packages including an entry for GSEAPACK. Type library(help=GSEAPACK) to see all the functions included in the package. To run GSEA as a user you will typically only call the GSEA() main function. To load the package, type library("GSEAPACK") and then help("GSEA"). This opens the documentation page for the main GSEA function. Now you can do a demo run of the code by typing demo(allaml.demo).
This will execute a short run (a few random permutations) of the ALL/AML example. It will take a few minutes and should produce the outputs described in the “Description of GSEA output” section earlier in this document. This short run is intended only as a demo; to reproduce the results reported in the Leukemia example section of this paper one has to run 1000 permutations, which will take over an hour of CPU time. If the package installation fails, don’t panic; you can still try to run the code from the raw source files, as described below.

If you are ready to run the examples, do the following. Copy the file GSEA.Examples.zip to your machine. Expand the zip file in a location of your choice in your file system (check that the option to expand subfolders is active). In that location a tree of subdirectories should be created:

GSEA/Examples/ (R scripts and one folder for each example: ALLAML_S1 etc.)
GSEA/method/GSEA.R (the R program)
GSEA/AnnotationFiles/ (Affymetrix annotation files)
GSEA/GeneSetDatabases/ (gene set databases, e.g. s1.allaml.genesetdb.gmt)
GSEA/GSEAPACK-1.0.zip (copy of the GSEAPACK R package)

The R scripts that run each individual example are under GSEA/Examples. For example, the script “Run.ALLAML_S1.R” runs the Leukemia example. Before running this file (e.g. by cutting and pasting it into the RGui window), make sure you modify the file pathnames to be consistent with the location in your file system where you expanded the zip file and created the GSEA examples’ subfolders. These scripts load the GSEA program by performing a library("GSEAPACK") call. If for any reason you had problems installing or loading the GSEAPACK package, you can try to run the scripts in such a way that they load the R source program from GSEA/method/GSEA.R rather than from the installed package.
All you need to do is comment out the library("GSEAPACK") line (put a “#” in front of it) and un-comment the two lines of code below it: “GSEA.program.location…” and “source(…..)”. If you do this, make sure you modify the pathname to the GSEA.R location too. When you run those scripts you should obtain results identical to those reported in this document and included in the GSEA/Examples subfolders (the random number generator seeds are explicitly set). If you overwrite the result files when you run your version of the scripts, you can always get a copy of the originals from the zip file.

If you want to run a new dataset with GSEAPACK, the easiest way is to create a new directory under GSEA/Examples/<my dataset> and then copy and modify, for example, Run.ALLAML_S1.R to point to that directory and use the right files.

7 Running GSEA under GenePattern.