Supplementary Material for:
GSEA: A Gene Set Approach to Analyzing Molecular Profiles
Author List
February 7, 2005.
Contents of Supplemental Material:
1 Additional figures and tables for examples.
2 Detailed description of the GSEA method.
  2.1 Description of complementary statistics.
  2.2 Description of GSEA output.
  2.3 Theoretical properties of gene tag and sample label permutation.
3 Additional applications of GSEA.
  3.1 Diabetes.
  3.2 Downs.
4 Summary of top enrichment results for examples in paper
  4.1 Gender S1
  4.2 Gender S2
  4.3 P53 S2
  4.4 P53 S3
  4.5 Leukemia S1
  4.6 Leukemia S2
  4.7 Lung A S2.
  4.8 Lung B S2.
5 Defining gene sets and gene set databases.
6 Running GSEA with the GSEAPACK R package.
7 Running GSEA under GenePattern.

1 Additional figures and tables for examples.
This section includes supplemental figures SF0-SF2 and tables ST1-ST5 not
included in the main body of the paper.
6/27/17 page 1
478163561
Figure SF0. This figure compares the empirical null distribution for 5 selected
gene sets (Diabetes example) before and after scaling normalization. This
normalization is accomplished by dividing each null (and observed) ES score by
the mean of the positive or negative scores for that gene set according to their
sign. This procedure appropriately aligns the null distributions for gene sets of
different sizes, prior to multiple hypotheses testing, and is motivated by the
asymptotic multiplicative scaling of the Kolmogorov-Smirnov distribution as a
function of size.
Figure SF1. This figure compares the empirical null and observed distributions in
the Diabetes example for a randomly generated collection of 1000 gene sets
(top) and the functional gene sets (S2 database) before and after normalization
(i.e., area under positive and negative density distributions equal to one). The
random gene sets (top) obtain roughly equal numbers of positive and negative
enrichment scores. Thus, the separate normalization of positive and negative
scores makes little difference. In contrast, when the S2 gene sets are used
(bottom) a larger number of sets attain negative scores. This reflects the fact that
the behavior of curated and experimental gene sets is not necessarily balanced
for all phenotypes. The independent normalization of positive and negative
scores helps to reduce this natural imbalance when comparing observed and null
score distributions. Similar imbalance can be produced by the data itself, e.g.,
when the distribution of genes, whose expression is positively and negatively
correlated with phenotype, is unbalanced.
Figure SF3. This figure shows the enrichment plots for the chr21 gene set
in the Downs syndrome dataset using un-weighted (p=0), weighted (p=1)
and over-weighted (p=2) enrichment statistics. We used GSEA to analyze
gene expression profiles from bone marrow of individuals with Downs
syndrome (DS, n=14) and control individuals (n=25) [Aravind add ref].
When we probe the dataset with GSEA and the un-weighted p=0 statistic
using the set of all the 243 genes on chr21, we find a small enrichment
signal. For p=1 the enrichment statistic is much higher but the set is still
not significant when adjusting for multiple hypothesis testing (FDR = .8).
For p=2 the set achieves an even higher score and significance
(FDR<0.25). The strength of the chr21 signal in this data set is carried by
only about 20% of the genes in the set and thus requires boosting of their
contribution to the score via the squaring of the correlation (p=2).
DATASET: BOSTON
PHENOTYPE: Non-responders vs Responders
GENE SET: ANN ARBOR ‘non-responders’
ES = 0.54
NP = 0.001
DATASET: BOSTON
PHENOTYPE: Non-responders vs Responders
GENE SET: ANN ARBOR ‘responders’
ES = 0.51
NP = 0.015
Figure SF2a. Using the signatures of the top 100 gene markers of clinical
response from the Ann Arbor lung dataset to define GSEA query gene
sets for ‘responders’ and ‘non-responders’, we assessed their enrichment
in the Boston lung dataset. The figure contains the plot of the running
enrichment score, the maximum ES score and its corresponding nominal
p-value.
DATASET: ANN ARBOR
PHENOTYPE: Non-responders vs Responders
GENE SET: BOSTON ‘non-responders’
ES = 0.59
NP < 0.001
DATASET: ANN ARBOR
PHENOTYPE: Non-responders vs Responders
GENE SET: BOSTON ‘responders’
ES = 0.48
NP = 0.005
Figure SF2b. Using the signatures of the top 100 gene markers of clinical
response from the Boston lung dataset to define GSEA query gene sets for
‘responders’ and ‘non-responders’, we assessed their enrichment in the Ann
Arbor lung dataset. The figure contains the plot of the running enrichment score,
the maximum ES score and its corresponding nominal p-value.
Table ST1
Gender Dataset Marker Genes

There are 6 significant marker genes at 5% in Females
GENE          LOCATION     DESCRIPTION
Hs.83623      chrXq13.2    Homo sapiens cDNA: FLJ21545 fis, clone COL06195
Hs.83623      chrXq13.2    Homo sapiens cDNA: FLJ21545 fis, clone COL06195
FAM16AX       chrXp22.32   Family with sequence similarity 16, member A, X-linked
EIF1A         chrXp22.12   Eukaryotic translation initiation factor 1A
DDX3X         chrXp11.4    DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, X-linked
216342_x_at   chr20p13     na

There are 12 significant marker genes at 5% in Males
GENE          LOC          DESC
RPS4Y         chrYp11.31   Ribosomal protein S4, Y-linked
DDX3Y         chrYq11.21   DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
SMCY          chrYq11.222  Smcy homolog, Y-linked (mouse)
EIF1AY        chrYq11.222  Eukaryotic translation initiation factor 1A, Y-linked
EIF1AY        chrYq11.222  Eukaryotic translation initiation factor 1A, Y-linked
USP9Y         chrYq11.21   Ubiquitin specific protease 9, Y-linked (fat facets-like, Drosophila)
DDX3Y         chrYq11.21   DEAD (Asp-Glu-Ala-Asp) box polypeptide 3, Y-linked
CYorf15B      chrYq11.222  Chromosome Y open reading frame 15B
USP34         chr2p15      Ubiquitin specific protease 34
Hs.433656     na           Homo sapiens mRNA; cDNA DKFZp434I143 (from clone DKFZp434I143)
C1orf34       Chr 1        Chromosome 1 open reading frame 34
RAP1GA1       Chr 1p36.1   RAP1, GTPase activating protein 1
Table ST2
Dataset: Lymphoblastoid cell lines
GENE SET                                        SOURCE             ES       NES      NOM p-val   FDR q-val
Enriched in Males
s1:Testis expressed genes                       Experimental GNF   0.567    1.881    < 0.001     0.075
Enriched in Females
s2:Female reproductive tissue expressed genes   Experimental GNF   -0.434   -1.758   0.009       0.158
s2:Proteasome degradation genes                 GenMAPP            -0.594   -1.778   0.011       0.105
Results of the Gender example GSEA restricting the expression dataset to contain only
autosomal genes. The testis germ cell gene set is still enriched in males and the uterus &
ovarian expression gene set is still enriched in females.
Table ST3a
Dataset                           Phenotype                       # Common genes   # of Genes in Dataset   GSEA p-val   Hypergeometric p-val
Lung (Bhattacharjee et al)        Non-responders vs. Responders   10               6645                    0.015        2.06E-08
Melanoma                          Non-responders vs. Responders   7                8855                    0.298        1.46E-05
Lymphoma (Shipp et al)            Non-responders vs. Responders   6                8855                    0.476        0.00012
GCM (Ramaswamy et al)             Tumor vs. Normal                5                6645                    0.11         0.0035
Lung (Garber et al)               Non-responders vs. Responders   4                3997                    0.016        0.0961
Breast Cancer (West et al)        ER negative vs. ER positive     4                5391                    0.197        0.0352
Astrocytoma (Khatua et al)        High grade vs. low grade        4                6645                    0.117        0.0161
Prostate (Febbo et al)            Non-responders vs. Responders   3                8855                    0.097        0.0254
Breast Cancer (vant Veer et al)   Non-responders vs. Responders   3                6151                    0.104        0.0600
Medulloblastoma (Pomeroy et al)   Non-responders vs. Responders   2                5391                    0.303        0.2740
Metastasis (Ramaswamy et al)      Metastatic vs. Non-metastatic   1                6645                    0.151        0.4380
Lymphoma (Monti et al)            Non-responders vs. Responders   0                5391                    0.512        0.8430
Assessment of marker overlap and GSEA enrichment of the Ann Arbor ‘responders’ gene set in other
outcome related datasets. The target phenotype is shown in red. The overlap of the top 100 gene markers in
the Ann Arbor ‘responders’ and the top 100 genes in each dataset distinction was quantified using a
hypergeometric distribution and the p-value is reported in the last column. About half of the datasets show
a significant overlap supported by a very small number of common marker genes. In contrast the only
datasets passing the GSEA test, using the enrichment of the top 100 Ann Arbor ‘responders’ as a single
gene set queried against the ranked list with respect to the relevant phenotype, are other lung outcome
datasets.
Table ST3b
Dataset                           Phenotype                       # Common Genes   # of Genes in Dataset   GSEA p-val   Hypergeometric p-val
Lung (Banerjee et al)             Non-responders vs. Responders   8                6645                    0.001        1.54E-05
Melanoma                          Non-responders vs. Responders   7                8855                    0.104        1.46E-05
Astrocytoma (Khatua et al)        High grade vs. low grade        6                6645                    0.09         0.0006
Lung (Garber et al)               Non-responders vs. Responders   5                3997                    0.031        0.034
Breast Cancer (West et al)        ER negative vs. ER positive     5                5391                    0.112        0.009
Breast Cancer (Vant Veer et al)   Non-responders vs. Responders   5                6151                    0.134        0.005
GCM (Ramaswamy et al)             Tumor vs. Normal                5                6645                    0.413        0.0035
Medulloblastoma (Pomeroy et al)   Non-responders vs. Responders   3                5391                    0.313        0.108
Lymphoma (Shipp et al)            Non-responders vs. Responders   2                5391                    0.398        0.274
Metastasis (Ramaswamy et al)      Metastatic vs. Non-metastatic   1                6645                    0.411        0.438
Prostate (Febbo et al)            Non-responders vs. Responders   1                8855                    0.512        0.307
Lymphoma (Monti et al)            Non-responders vs. Responders   0                8855                    0.413        0.677
Assessment of marker overlap and GSEA enrichment of the Ann Arbor ‘non-responders’ gene set in other
outcome related datasets. This is the same procedure as was used in Table ST3a but using the ‘non-responders’ and corresponding phenotypes in the other datasets.
Table ST4

Responders, Ann Arbor
GS                                                    ES         NES       NOM p-val   FDR q-val
SIG_BCR_Signaling_Pathway                             -0.44916   -1.5619   0.03156     1
CR_IMMUNE_FUNCTION                                    -0.47152   -1.508    0.06238     1
RAP_UP                                                -0.33092   -1.4583   0.03759     0.98011
ST_ADRENERGIC                                         -0.41027   -1.45     0.05776     0.7742
GLUCOSE_UP                                            -0.40967   -1.4397   0.06847     0.65725
SIG_PIP3_signaling_in_B_lymphocytes                   -0.42587   -1.4023   0.07266     0.68813
cell_growth_and_or_maintenance                        -0.33859   -1.3096   0.1133      0.95426
nos1Pathway                                           -0.40827   -1.2902   0.1493      0.91955
ST_GRANULE_CELL_SURVIVAL_PATHWAY                      -0.37556   -1.2863   0.1556      0.83271
ST_MONOCYTE_AD_PATHWAY                                -0.36374   -1.2613   0.1515      0.83686
Wnt_Signaling                                         -0.33115   -1.2403   0.1617      0.83842
ST_Differentiation_Pathway_in_PC12_Cells              -0.33406   -1.2206   0.1772      0.83504
ST_T_Cell_Signal_Transduction                         -0.37518   -1.2042   0.2451      0.82426
ST_Gaq_Pathway                                        -0.34      -1.1747   0.2224      0.86419
biopeptidesPathway                                    -0.31055   -1.1454   0.2481      0.90438
tcrPathway                                            -0.31297   -1.1096   0.321       0.97407
ghPathway                                             -0.33562   -1.0992   0.3223      0.95137
HTERT_DOWN                                            -0.26974   -1.0941   0.3012      0.91663
mprPathway                                            -0.33626   -1.075    0.3748      0.92824
ST_JNK_MAPK_Pathway                                   -0.32634   -1.0729   0.3602      0.88796

Responders, Boston
GS                                                    ES         NES       NOM p-val   FDR q-val
SIG_BCR_Signaling_Pathway                             -0.47452   -1.7812   0.004107    0.51251
SIG_PIP3_signaling_in_B_lymphocytes                   -0.47245   -1.6285   0.02367     0.69835
EMT_DOWN                                              -0.45166   -1.5797   0.0426      0.63364
ST_ADRENERGIC                                         -0.43025   -1.5347   0.02828     0.63162
ST_Differentiation_Pathway_in_PC12_Cells              -0.4324    -1.5079   0.02887     0.59296
SIG_CD40PATHWAYMAP                                    -0.40587   -1.4118   0.04555     0.8328
tcrPathway                                            -0.36864   -1.3817   0.1229      0.84425
MAP00280_Valine_leucine_and_isoleucine_degradation    -0.45691   -1.3811   0.1507      0.74114
ST_GRANULE_CELL_SURVIVAL_PATHWAY                      -0.38853   -1.348    0.1217      0.7743
cxcr4Pathway                                          -0.38158   -1.3339   0.1118      0.74918
pdgfPathway                                           -0.35555   -1.3225   0.138       0.7186
gpcrPathway                                           -0.35006   -1.3163   0.1504      0.67642
MAP00380_Tryptophan_metabolism                        -0.37677   -1.309    0.1607      0.64555
ST_MONOCYTE_AD_PATHWAY                                -0.36001   -1.2833   0.1663      0.67407
MAP00361_gamma_Hexachlorocyclohexane_degradation      -0.40236   -1.273    0.1956      0.65948
GPCRs_Class_A_Rhodopsin-like                          -0.35459   -1.2577   0.158       0.66074
CR_TRANSCRIPTION_FACTORS                              -0.30923   -1.2505   0.1795      0.64271
MAP00071_Fatty_acid_metabolism                        -0.36349   -1.2462   0.2104      0.61947
biopeptidesPathway                                    -0.31535   -1.2427   0.1585      0.59507
MAP00350_Tyrosine_metabolism                          -0.40915   -1.2312   0.2667      0.59377

Non-Responders, Ann Arbor
GS                                                    ES         NES       NOM p-val   FDR q-val
MAP00010_Glycolysis_Gluconeogenesis                   0.49907    1.793     0.007353    0.27622
ceramidePathway                                       0.52954    1.724     0.008351    0.25968
INSULIN_2F_UP                                         0.48759    1.6979    0.01062     0.21673
HTERT_UP                                              0.47189    1.678     0.01724     0.18731
p53_signalling                                        0.3679     1.6644    0.006579    0.17044
CR_TRANSPORT_OF_VESICLES                              0.50703    1.6579    0.01822     0.14963
breast_cancer_estrogen_signalling                     0.37436    1.5762    0.00489     0.22292
Glycolysis_and_Gluconeogenesis                        0.49041    1.5716    0.04825     0.20227
Proteasome_Degradation                                0.3946     1.5365    0.0437      0.23022
drug_resistance_and_metabolism                        0.33507    1.5111    0.01911     0.24449
raccycdPathway                                        0.44724    1.475     0.05308     0.27751
MAP00240_Pyrimidine_metabolism                        0.49394    1.4467    0.09534     0.29952
RAP_DOWN                                              0.41902    1.4202    0.1456      0.3243
PGC                                                   0.33268    1.4087    0.09011     0.32372
CR_CELL_CYCLE                                         0.38487    1.4032    0.1188      0.31071
FRASOR_ER_DOWN                                        0.4861     1.3772    0.1717      0.33149
fmlppathway                                           0.40972    1.3672    0.09375     0.32909
vegfPathway                                           0.46551    1.3557    0.1293      0.32915
p38mapkPathway                                        0.35483    1.3555    0.08787     0.31204
LEU_DOWN                                              0.40499    1.3407    0.192       0.32247

Non-Responders, Boston
GS                                                    ES         NES       NOM p-val   FDR q-val
[unassigned]                                          0.43052    1.7897    0.006024    0.26871
[unassigned]                                          0.5556     1.7463    0.03326     0.18368
[unassigned]                                          0.4398     1.7012    0.01841     0.16595
[unassigned]                                          0.42546    1.6223    0.06765     0.20881
[unassigned]                                          0.4453     1.5376    0.09958     0.27468
mRNA_splicing                                         0.44383    1.5098    0.09877     0.27035
MAP00240_Pyrimidine_metabolism                        0.41786    1.4369    0.118       0.34991
LEU_UP                                                0.32229    1.3734    0.09073     0.42796
HOXA9_UP                                              0.38558    1.3695    0.1127      0.38965
Glycolysis_and_Gluconeogenesis                        0.41298    1.3166    0.1934      0.45523
RAP_DOWN                                              0.33294    1.293     0.1797      0.46433
PGC                                                   0.27095    1.2576    0.191       0.49977
MAP00010_Glycolysis_Gluconeogenesis                   0.34689    1.2511    0.1856      0.47527
CR_TRANSPORT_OF_VESICLES                              0.34788    1.2205    0.2119      0.50251
FRASOR_ER_DOWN                                        0.26967    1.1163    0.3275      0.71557
GLUCOSE_UP                                            0.28339    1.1119    0.3172      0.68327
cell_motility                                         0.26254    1.1109    0.3111      0.64538
MAP00230_Purine_metabolism                            0.26108    1.0585    0.3583      0.73718
GLUT_UP                                               0.23111    1.0581    0.4103      0.69916
[unassigned]                                          0.275      1.0376    0.3992      0.71112
Gene set names belonging to the six [unassigned] Boston rows (row order not recoverable): HTERT_UP,
Proteasome_Degradation, INSULIN_2F_UP, GLUT_DOWN, LEU_DOWN, P53_UP.

The top 20 enriched pathways for both ‘responder’ and ‘non-responder’ phenotypes using GSEA with
the S2 (functional) database against the Ann Arbor and Boston lung datasets. Two of the enriched pathways
at FDR < 0.25 are common on the non-responders side (telomerase and insulin 2F).
Table ST5
Permutation Type           Number of Gene Sets Enriched   Number of Gene Sets Enriched
                           in Males (FDR < 0.25)          in Females (FDR < 0.25)
Sample Label Permutation   1                              4
Gene Tag Permutation       2                              38
Gene tag permutation ignores the gene-gene correlation structure in the dataset and can produce overly
optimistic results when assessing significance. This may lead to too many sets passing an FDR cutoff of
0.25. For example, the table shows the differences between phenotype label and gene tag permutations for
the Gender dataset example. The large number of gene sets passing the test using gene tag permutations
(38) is likely to include many false positives. This is an extreme case. In general the gene tag permutation
produces about twice the number of significant gene sets compared with phenotype label permutations for
the same FDR threshold. Even though we do not recommend gene tag permutation as the default, it may be
useful when the number of samples is too small to generate a sufficient number of phenotype label
permutations.
2 Detailed description of the GSEA method.
Inputs to GSEA.
• Expression dataset D with levels of N genes for k samples.
• A sorted gene list L and an associated vector R of gene-phenotype
correlations; alternatively, a ranking procedure to produce L (including a
ranking metric M, e.g., t-test or signal-to-noise ratio, and either 1) a phenotype
class vector or 2) a profile of interest C):
1. C(i) = 0 or 1 according to the phenotype of sample i (class vector).
2. C(i) = an expression level in sample i.
• An independently derived gene set G of Nh genes (e.g., genes in
a pathway or a cytogenetic band of interest) or an entire database or
collection of NG gene sets.
Results from GSEA.
• An enrichment score ES(G, L, R) that estimates the “enrichment” of G at
the extremes of L.
• A nominal p-value that estimates the statistical significance of the ES.
• When a collection of gene sets is used, GSEA also produces cross-gene-set
normalized enrichment scores (NES), and two corrections to account
for Multiple Hypotheses Testing in measuring statistical significance:
Family Wise Error (FWER) and False Discovery Rate (FDR).
Enrichment score calculation.
Legend:
ri = Correlation of gene gi with C using metric M.
N = total number of genes in L.
Nh = total number of genes in G.
I(x) = indicator function equal to 1 if argument x is true and 0 otherwise.
1. Read input gene list L or rank order the N genes in D according to the
correlation R of the genes’ expression profiles with C using metric M to form the
list L = {g1,…,gN}.
2. Compute a random walk “running score” S where every “hit” (gene in G)
increases the score by |r_j|^p / N_R and every “miss” (gene not in G) decreases the
score by 1/(N – Nh):

$$ S_p(G,L,R,i) = \sum_{j=1}^{i} \left[ \underbrace{I(g_j \in G)\,\frac{|r_j|^p}{N_R}}_{\text{hit}} - \underbrace{I(g_j \notin G)\,\frac{1}{N - N_h}}_{\text{miss}} \right], \qquad N_R = \sum_{j=1}^{N} I(g_j \in G)\,|r_j|^p $$

$$ S_p^{+} = \max_{i=1,\dots,N} S_p(G,L,R,i), \qquad S_p^{-} = \min_{i=1,\dots,N} S_p(G,L,R,i) $$

$$ ES(G,L,R) = \begin{cases} S_p^{+} & \text{if } |S_p^{+}| \ge |S_p^{-}| \\ S_p^{-} & \text{if } |S_p^{+}| < |S_p^{-}| \end{cases} $$

The ES is the maximum deviation of the running score from zero.
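The running score and ES described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GSEAPACK code: the function names `running_score` and `enrichment_score` are hypothetical, the gene list is assumed to be already sorted by correlation, and ties between the maximum and minimum deviation are resolved in favor of the positive side.

```python
from typing import List, Sequence, Set


def running_score(genes: Sequence[str], corr: Sequence[float],
                  gene_set: Set[str], p: float = 1.0) -> List[float]:
    """Running score S_p(i), i = 1..N, for a list already sorted by correlation."""
    hits = [g in gene_set for g in genes]
    n, n_h = len(genes), sum(hits)
    if n_h == 0 or n_h == n:
        raise ValueError("gene set must contain some, but not all, listed genes")
    # N_R: sum of |r_j|^p over the hits (equals N_h when p = 0).
    n_r = sum(abs(r) ** p for r, h in zip(corr, hits) if h)
    if n_r == 0:
        raise ValueError("all hit correlations are zero; weights are undefined")
    miss_step = 1.0 / (n - n_h)
    score, walk = 0.0, []
    for r, h in zip(corr, hits):
        # A hit raises the score by |r_j|^p / N_R; a miss lowers it by 1/(N - N_h).
        score += abs(r) ** p / n_r if h else -miss_step
        walk.append(score)
    return walk


def enrichment_score(genes: Sequence[str], corr: Sequence[float],
                     gene_set: Set[str], p: float = 1.0) -> float:
    """ES: the signed maximum deviation of the running score from zero."""
    walk = running_score(genes, corr, gene_set, p)
    s_max, s_min = max(walk), min(walk)
    # Assumption: ties go to the positive side.
    return s_max if abs(s_max) >= abs(s_min) else s_min
```

For a toy list of four genes with correlations 2, 1, -1, -2, a set made of the first two genes reaches the maximum possible score at the second position, and a set made of the last two genes reaches the mirror-image negative score.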
When p=0, the ES is identical to the Kolmogorov-Smirnov (KS) statistic that
measures the difference between two cumulative distribution functions (in this
case, the hits and the misses). For a randomly distributed G, ES(G, L, R) will be
relatively small but a G with a non-random distribution may attain extreme
values. This form of the statistic produces an un-weighted, rank-only
measure of enrichment, which is simple and elegant but has the disadvantage of
producing high scores for some sets with a concentration of hits in the middle of
the list. In most cases these gene sets would be considered false positives. In
addition, this form is not very sensitive when the enrichment of a gene set
derives from a small subset of hits near the top or bottom of the list. For these
reasons we view p = 0 as a special case, which is more useful if one is
interested in detecting gene sets with general non-random distributions and one
is willing to accept more false positives.
In typical applications we seek gene sets specifically enriched or concentrated at
the top or bottom of the list. In this case we set p = 1 (the default setting), where S is
weighted by the correlation of the genes in the gene list and the scores better
reflect enrichment produced by complete or partial differential expression of the
gene set at the top or bottom of the list. In most of the biological datasets that we
have studied this weighting scheme performed best at recovering known
biological results while at the same time reducing the number of false discoveries.
In some selected cases, it may be more appropriate to set p = 2 to make the
weight proportional to the square of the correlations. This is useful if one expects,
for example, a potentially small subset of the gene set to be significantly enriched
at the top or bottom of the list (e.g., see the Downs example). This setting is
more sensitive to partial enrichment but at the expense of producing more false
positives and treating the distribution of hits more unevenly according to the
correlations.
Estimating ES Significance.
The significance of an observed ES(G, L, R) for a gene set G is assessed by
comparing it with the set of scores $ES^{null}_{G,\pi}$ computed for the same dataset with the
phenotype labels randomly permuted. The null hypothesis is that the phenotypes
are interchangeable and that the enrichment is produced by chance.
• Compute a set $\Pi$ of randomly permuted phenotype labels
$\{C_1, \dots, C_\pi, \dots, C_{|\Pi|}\}$. For each $C_\pi$ reorder the gene list L to produce the
corresponding $L_\pi$ and $R_\pi$ and compute a vector of corresponding ES
values:

$$ ES^{null}_{G,\pi} = ES(G, L_\pi, R_\pi) $$

• Estimate a standard nominal p-value for G by counting how many values of
$ES^{null}_{G,\pi}$, positive or negative according to the sign of the observed ES(G, L, R),
are equal to or more extreme than ES(G, L, R):

$$ \text{nominal p-val}(G,L,R) = \frac{\#\{\pi : ES^{null}_{G,\pi} \ge ES(G,L,R)\}}{\#\{\pi : ES^{null}_{G,\pi} \ge 0\}} $$

for positive ES(G, L, R), and the corresponding expression for negative
ones.
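A minimal sketch of the permutation null and the nominal p-value follows. It assumes a hypothetical callback `score_fn` that takes a permuted label vector, re-ranks the genes, and returns the ES for the gene set of interest; the helper names and the default of 1000 permutations are illustrative choices, not settings taken from the GSEA program.

```python
import random
from typing import Callable, List, Sequence


def permutation_null(labels: Sequence[int],
                     score_fn: Callable[[List[int]], float],
                     n_perm: int = 1000, seed: int = 0) -> List[float]:
    """ES values obtained after randomly permuting the phenotype labels."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)          # one random relabeling of the samples
        null.append(score_fn(shuffled))
    return null


def nominal_p_value(es_obs: float, es_null: Sequence[float]) -> float:
    """Fraction of same-sign null scores at least as extreme as the observed ES."""
    if es_obs >= 0:
        same_side = [x for x in es_null if x >= 0]
        extreme = sum(1 for x in same_side if x >= es_obs)
    else:
        same_side = [x for x in es_null if x <= 0]
        extreme = sum(1 for x in same_side if x <= es_obs)
    if not same_side:
        return float("nan")            # no null scores on the observed side
    return extreme / len(same_side)
```

Only null scores with the same sign as the observed ES enter the ratio, which is what keeps the estimate meaningful when positive and negative scores are unbalanced.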
Adjusting for Multiple Hypothesis Testing (FWER and FDR).
• Determine ES(G, L, R) for each gene set G in the collection or database $\mathcal{G}$.
• Compute a matrix $ES^{null}_{G,\pi}$ for all $G \in \mathcal{G}$ and a set of permutations $\Pi$.
• Rescaling (row) normalization. Since the distribution of each row in the
$ES^{null}_{G,\pi}$ matrix depends on gene set size, we normalize the ES for each
gene set before adjusting for multiple hypotheses testing, obtaining
normalized ES scores (NES). We do this with multiplicative rescaling: we
separately divide the positive and negative $ES^{null}_{G,\pi}$ values
for a given gene set G by their mean. The “observed” ES(G, L, R) is also
divided by the corresponding mean of the positive or negative
$ES^{null}_{G,\pi}$ values according to its sign. This normalization procedure is very
effective at making all the null distributions for different gene sets collapse
into one. In addition to its empirical effectiveness, this procedure is
theoretically motivated by the asymptotic multiplicative scaling of the KS
distribution as a function of size (von Mises, R. 1964, Mathematical
Theory of Probability and Statistics, New York Academic Press). The
matrix of normalized random NES values is $NES^{null}_{G,\pi}$.
• Family Wise Error Rate (FWER). To compute the FWER for a given set
$\hat{G}$ we create a histogram of the maximum NES value for each random
permutation and use this distribution to compute how many extreme
values are equal to or more extreme than the observed one:

$$ \text{FWER p-val}(\hat{G},L,R) = \frac{\#\{\pi : \max_{G \in \mathcal{G}} NES^{null}_{G,\pi} \ge NES(\hat{G},L,R)\}}{\#\{\pi : \max_{G \in \mathcal{G}} NES^{null}_{G,\pi} \ge 0\}} $$

for positive ES($\hat{G}$, L, R), and the corresponding expression for negative
ones.
• False Discovery Rate (FDR). To compute the FDR q-values we compute
the ratio of the null and observed (positive and negative score) CDF
distributions for a given gene set. For a gene set $\hat{G}$ with positive NES this
is given by:

$$ \text{FDR q-value}(\hat{G},L,R) = \frac{\#\{(G,\pi) : NES^{null}_{G,\pi} \ge NES(\hat{G},L,R)\} \,/\, \#\{(G,\pi) : NES^{null}_{G,\pi} \ge 0\}}{\#\{G : NES(G,L,R) \ge NES(\hat{G},L,R)\} \,/\, \#\{G : NES(G,L,R) \ge 0\}} $$

and the corresponding expression for gene sets with negative NES. Notice
that the positive and negative sides are considered as independent CDFs
and the counts are normalized accordingly. This normalization is
particularly useful when the distributions of gene correlation or NES are
skewed or unequal in terms of positive and negative entries.
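The normalization and the two corrections can be sketched as follows. This is a hedged illustration, not the GSEA implementation: the function names are hypothetical, the null scores are assumed to be stored as one row per gene set and one column per permutation, negative scores are divided by the absolute value of their mean so that signs are preserved, and the FDR ratio is returned without capping it at 1.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]   # rows: gene sets, columns: permutations


def normalize(es_obs: Sequence[float], es_null: Matrix) -> Tuple[List[float], List[List[float]]]:
    """Divide each score by the mean of the same-sign null scores of its gene set."""
    nes_obs, nes_null = [], []
    for obs, row in zip(es_obs, es_null):
        pos = [x for x in row if x >= 0]
        neg = [-x for x in row if x < 0]
        # NaN marks a side with no null scores (an assumption of this sketch).
        pos_mean = sum(pos) / len(pos) if pos else float("nan")
        neg_mean = sum(neg) / len(neg) if neg else float("nan")
        scale = lambda x: x / pos_mean if x >= 0 else x / neg_mean
        nes_obs.append(scale(obs))
        nes_null.append([scale(x) for x in row])
    return nes_obs, nes_null


def fwer_p_value(nes: float, nes_null: Matrix) -> float:
    """Share of permutations whose most extreme same-sign NES reaches the observed NES."""
    sign = 1.0 if nes >= 0 else -1.0
    n_perm = len(nes_null[0])
    maxima = [max(sign * row[k] for row in nes_null) for k in range(n_perm)]
    side = [m for m in maxima if m >= 0]
    return sum(1 for m in side if m >= sign * nes) / len(side) if side else float("nan")


def fdr_q_value(nes: float, nes_obs: Sequence[float], nes_null: Matrix) -> float:
    """Ratio of the null tail fraction to the observed tail fraction at the given NES."""
    sign = 1.0 if nes >= 0 else -1.0
    null_side = [sign * x for row in nes_null for x in row if sign * x >= 0]
    obs_side = [sign * x for x in nes_obs if sign * x >= 0]
    if not null_side or not obs_side:
        return float("nan")
    null_frac = sum(1 for x in null_side if x >= sign * nes) / len(null_side)
    obs_frac = sum(1 for x in obs_side if x >= sign * nes) / len(obs_side)
    return null_frac / obs_frac if obs_frac > 0 else float("nan")
```

Handling the negative side by mirroring the scores keeps the positive and negative tails independent, in line with the description above.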
The final report of GSEA results includes a list of the gene sets sorted by their
NES values and columns for their nominal and FWER p-values and FDR q-values.
The nominal p-values are not corrected for multiple testing and are
usually quite optimistic. In contrast, FWER p-values tend to be more stringent
and often yield no significant gene sets. For hypothesis generation and general
use the FDR q-values may be more appropriate. We assess statistical
significance using an FDR q-value threshold of 0.25 (corresponding to an expected
rate of at most one false positive in four reported results).
2.1 Description of complementary statistics.
The GSEA program computes several additional statistics that may be useful to
the sophisticated user:
Tag %: The percentage of gene tags before (for positive ES) or after (for
negative ES) the peak in the running enrichment score S. The larger the
percentage, the more tags in the gene set contribute to the final enrichment
score.
Gene %: The percentage of genes in the gene list L before (for positive ES) or
after (for negative ES) the peak in the running enrichment score. It thus gives an
indication of where in the list the enrichment is attained.
Signal strength: The enrichment signal strength that combines the two previous
statistics: (Tag %) x (1 – Gene %) x (N / (N - Nh)). The larger this quantity, the
stronger the gene set as a whole. If the genes in the gene set are in the first Nh
positions in the list the signal strength is maximal, i.e., 1. If the genes are more
spread out through the list the signal strength decreases towards 0.
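As a rough illustration of these three statistics, the sketch below derives them from a running score and the hit indicators of the ranked list. The function name `leading_edge_stats` is hypothetical, the values are returned as fractions rather than percentages, and the convention that the peak position itself counts as "before" for positive ES (and is excluded from "after" for negative ES) is an assumption of this sketch.

```python
from typing import Sequence, Tuple


def leading_edge_stats(walk: Sequence[float], hits: Sequence[bool]) -> Tuple[float, float, float]:
    """Return (tag fraction, gene fraction, signal strength) for one gene set."""
    n, n_h = len(walk), sum(hits)
    if n_h == 0 or n_h == n:
        raise ValueError("gene set must contain some, but not all, listed genes")
    s_max, s_min = max(walk), min(walk)
    if abs(s_max) >= abs(s_min):
        peak = walk.index(s_max)
        tags = sum(hits[:peak + 1])     # hits up to and including the peak
        genes = peak + 1
    else:
        peak = walk.index(s_min)
        tags = sum(hits[peak + 1:])     # hits after the trough
        genes = n - peak - 1
    tag_frac = tags / n_h
    gene_frac = genes / n
    # (Tag %) x (1 - Gene %) x (N / (N - Nh))
    signal = tag_frac * (1.0 - gene_frac) * n / (n - n_h)
    return tag_frac, gene_frac, signal
```

With four genes and a two-gene set occupying the first two positions, the tag fraction is 1, the gene fraction is 0.5, and the signal strength is 1, matching the maximal case described above.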
FDR (median): An additional FDR q-value computed by using a median null
distribution. These values are in general more optimistic than the regular FDR q-values,
as the median null is representative of the typical random permutation
null rather than extreme ones. For this reason, we don’t recommend it for
common use. However, the FDR median is sometimes useful as a binary
indicator function (zero vs. non-zero). When it is zero, it indicates that for those
extreme NES values the observed scores are larger than the values obtained by
at least half of the random permutations. One advantage of selecting gene sets
in this manner (FDR median = 0) is that a predefined threshold is not required. In
practice the gene sets selected in this way appear to be roughly the same as
those for which the regular FDR is less than 0.25. For example in the Leukemia
ALL/AML example the FDR median is zero for the top 5 sets (4 of which have
FDR < 0.25).
glob.p.val: A global nominal p-value for each gene set’s NES. This is estimated
by computing the number of sets that are more extreme than the observed value,
$\#\{G : NES^{null}_{G,\pi} \ge NES(\hat{G},L,R)\}$, in each random permutation. Notice that in this
calculation, in contrast with the FDR, the number of observed scores larger than
that value, $\#\{G : NES(G,L,R) \ge NES(\hat{G},L,R)\}$, is not used.
2.2 Description of GSEAp output.
The results of the GSEA are stored in the “output.directory” specified by the user
as part of the input parameters to the GSEAp program. The results files are:
• Two tab-separated global results text files (one for each phenotype).
These files are labeled according to the doc string prefix and the
phenotype name from the CLS (class) file:
<doc.string>.results.report.<phenotype>.txt
• One set of global plots. They include a) gene list correlation profile, b)
global observed and null densities, c) heat map for the entire sorted
dataset, and d) p-values vs. NES plot. These plots are in a single JPEG
file named <doc.string>.global.plots.<phenotype>.jpg. When the program
is run interactively these plots appear in a window in the R GUI. An
example of this set of global plots for the Leukemia S1 dataset is shown in
Fig. x.
• A variable number of tab-separated gene set results files according to how
many sets pass any of the significance thresholds (“nom.p.val.threshold,”
“fwer.p.val.threshold,” and “fdr.q.val.threshold”) and how many are
specified in the “topgs” parameter. These files are named:
<doc.string>.<gene set name>.report.txt.
• A variable number of gene set plots (one for each gene set report file).
These plots include a) gene set running enrichment “mountain” plot, b)
gene set null distribution and c) heat map for genes in the gene set. These
plots are stored in a single JPEG file named <doc.string>.<gene set
name>.jpg.
The format (columns) for the global result files is as follows.
GS : Gene set name.
SIZE : Number of genes in the set.
SOURCE : Set definition or source.
ES : Enrichment score.
NES : Normalized enrichment score (multiplicative rescaling).
NOM p-val : Nominal p-value (from the null distribution of the gene set).
FDR q-val : False discovery rate q-values.
FWER p-val : Family wise error rate p-values.
Tag %: Percent of gene set before running enrichment peak.
Gene %: Percent of gene list before running enrichment peak.
Signal : Enrichment signal strength.
FDR (median) : FDR q-values from the median of the null distributions.
glob.p.val : P-value using a global statistic (number of sets above the
given set’s NES).
The rows are sorted by the NES values (from maximum positive or negative NES
to minimum).
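As a small usage sketch, a global results file of this kind could be filtered for gene sets passing the FDR threshold with the Python standard library. The column labels used below (“GS”, “NES”, “FDR q-val”) are taken from the list above, but that the file header spells them exactly this way is an assumption, and the function name `significant_sets` is hypothetical.

```python
import csv
from typing import IO, List, Tuple


def significant_sets(handle: IO[str], fdr_threshold: float = 0.25) -> List[Tuple[str, float, float]]:
    """Return (gene set, NES, FDR q-val) for rows below the FDR threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    selected = []
    for row in reader:
        q = float(row["FDR q-val"])    # assumed header label
        if q < fdr_threshold:
            selected.append((row["GS"], float(row["NES"]), q))
    return selected
```

A file opened with `open(path, newline="")` can be passed directly as `handle`; the rows come back in file order, which for these reports is already sorted by NES.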
The format (columns) for the individual gene set result files is as follows.
# : Gene number in the (sorted) gene set
PROBE_ID : The gene name or accession number in the dataset.
SYMBOL : gene symbol from the gene annotation file.
DESC : gene description (title) from the gene annotation file.
LIST LOC : location of the gene in the sorted gene list.
S2N : signal to noise ratio (correlation) of the gene in the gene list.
RES : value of the running enrichment score at the gene location.
CORE_ENRICHMENT : Is this gene in the “core enrichment” section of the
list? Yes or No variable specifying whether the gene location is before (positive
ES) or after (negative ES) the running enrichment peak.
The rows are sorted by the gene location in the gene list.
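The CORE_ENRICHMENT rule described above can be sketched as follows. The function name and its arguments are illustrative, and treating the gene at the peak itself as part of the core section is an assumption, since the text only says "before" or "after" the peak.

```python
def core_enrichment(list_locs, es, peak_loc):
    """Return a Yes/No flag for each gene in a gene set.

    list_locs: LIST LOC values (positions in the sorted gene list)
    es:        enrichment score of the set; its sign selects the core side
    peak_loc:  list location of the running enrichment peak
    """
    if es >= 0:
        # Positive ES: the core section lies before (and here, including) the peak.
        return ["Yes" if loc <= peak_loc else "No" for loc in list_locs]
    # Negative ES: the core section lies after (and here, including) the peak.
    return ["Yes" if loc >= peak_loc else "No" for loc in list_locs]


print(core_enrichment([3, 10, 42, 250], es=0.57, peak_loc=42))
print(core_enrichment([3, 10, 42, 250], es=-0.43, peak_loc=42))
```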
The function call to GSEA returns a two element list containing the two global
result reports as data frames ($report1, $report2).
Fig. x. Global plots for the Leukemia S1 example. On the top
left side there is the gene list correlation profile. On the top right
side one can see the probability density for observed and null
distribution. On the bottom left there is the heat map for the entire
dataset sorted by signal to noise, and on the bottom right one can
see a plot of the p-values vs. NES.
Fig x. Gene set plots for the 5q31 gene set in the Leukemia S1
example. The left plot is a running enrichment “mountain” plot that
also shows the gene tags and the correlation profile (similar to the
first plot in the global set); the one at the center is the probability
density null distribution for the particular gene set. This plot shows
the ES, NES and p-values for the set at the bottom of the plot. The
right plot is a heat map for those genes in the gene set sorted by
correlation to the phenotype.
2.3 Theoretical properties of gene tag and sample label permutation.
In this section we compute some properties of the enrichment statistic under
gene tag scrambling and sample label scrambling. We provide some average
enrichment scores and give an indication of the effect of dependence between
genes on a dataset.
Tag scrambling:
Given a rank-ordered list of $N$ genes, where $N_H$ of the genes belong to a gene
set $G$, we define $F_G$ and $F_{G^C}$ as the empirical distributions of the ranks of genes
in the gene set and the ranks of genes in the complement, respectively. Note that
the complement has $N - N_H$ genes. In general $N \gg N_H$; however, we do not
make this assumption in the following computations. Our randomization
procedure consists of randomly choosing $N_H$ genes as the gene set. We refer to
this as “scrambling the tags”. The one-sided enrichment statistic for a given rank
ordering is defined as follows:

$$ES(N, N_H) = \max_{i = 1, \ldots, N} \left[ F_G(i) - F_{G^C}(i) \right]. \tag{1.1}$$

The two-sided statistic is the same except that it has a symmetric distribution
about the origin. We now compute properties of the above statistic such as its
distribution and expectation with respect to tag scrambling. The following
quantity will be important in characterizing properties of the enrichment statistic:

$$n = \frac{(N - N_H)\, N_H}{N}.$$
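The one-sided statistic in (1.1) and the tag-scrambling procedure can be sketched as follows. This is a minimal illustration rather than the GSEA implementation: the function names, the seed, and the toy sizes are assumptions, and the code requires that the gene set is neither empty nor the whole list.

```python
import random


def one_sided_es(in_set):
    """One-sided enrichment statistic of equation (1.1).

    in_set[i] is True when the gene at rank i + 1 belongs to the gene set.
    """
    n_total = len(in_set)
    n_hits = sum(in_set)
    n_miss = n_total - n_hits
    hits = misses = 0
    best = float("-inf")
    for tagged in in_set:
        if tagged:
            hits += 1
        else:
            misses += 1
        # F_G(i) - F_GC(i) evaluated at the current rank
        best = max(best, hits / n_hits - misses / n_miss)
    return best


def scrambled_scores(n_total, n_hits, n_draws, seed=0):
    """ES values obtained by repeatedly choosing n_hits random genes as the set."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_draws):
        chosen = set(rng.sample(range(n_total), n_hits))
        scores.append(one_sided_es([i in chosen for i in range(n_total)]))
    return scores


# Six ranked genes with the gene set at ranks 1, 2 and 4: the maximum gap is 2/3.
example = [True, True, False, True, False, False]
print(one_sided_es(example))

null = scrambled_scores(n_total=1000, n_hits=50, n_draws=200)
print(sum(null) / len(null))
```

The second printed value is a Monte Carlo average of the one-sided statistic under tag scrambling for one choice of $N$ and $N_H$.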
When $N \gg N_H$, $n$ is approximately $N_H$. A reasonable approximation of the
distribution function of the enrichment statistic for $n \geq 8$ is

$$\Pr\left( ES(N, N_H) < \lambda \right) = \sum_{k=-\infty}^{\infty} (-1)^k \exp\left( -2 k^2 \lambda^2 n \right). \tag{1.2}$$
The number of terms required for the above series to converge depends on $\lambda$.
As $\lambda$ approaches zero, more terms are required. From the above equation, we
can compute the following density function for the enrichment statistic:

$$p(\lambda) = 4 \sum_{k=-\infty}^{\infty} (-1)^{k+1} k^2 \lambda n \exp\left( -2 k^2 \lambda^2 n \right). \tag{1.3}$$
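The number of terms required for this series to converge increases as $\lambda$
decreases. The average enrichment score is simply the expectation with respect
to the above density:

$$es = E_{p(\lambda)}\left[ ES(N, N_H) \right] = \int_0^1 \lambda\, p(\lambda)\, d\lambda = 4 \sum_{k=-\infty,\, k \neq 0}^{\infty} (-1)^{k+1} \left[ \frac{\sqrt{2\pi}\, \operatorname{erf}\left( \sqrt{2n}\, k \right)}{16\, k \sqrt{n}} - \frac{1}{4} \exp\left( -2 k^2 n \right) \right]. \tag{1.4}$$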
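As a numerical check, the series for the average enrichment score can be evaluated directly. The sketch below is illustrative: the function names and the truncation at 100,000 terms are assumptions, and the closed form is a shortcut derived here by setting the error function to 1 and dropping the exponential term, which leaves an alternating harmonic series that sums to ln 2.

```python
import math


def effective_n(n_total, n_hits):
    """n = (N - N_H) * N_H / N."""
    return (n_total - n_hits) * n_hits / n_total


def average_es(n, terms=100_000):
    """Evaluate the series for the average enrichment score.

    The terms for k and -k are equal, so only k >= 1 is summed and doubled.
    """
    total = 0.0
    for k in range(1, terms + 1):
        sign = 1.0 if k % 2 == 1 else -1.0  # (-1)^(k+1)
        erf_part = (math.sqrt(2 * math.pi) * math.erf(math.sqrt(2 * n) * k)
                    / (16 * k * math.sqrt(n)))
        exp_part = 0.25 * math.exp(-2 * k * k * n)
        total += sign * (erf_part - exp_part)
    return 4 * 2 * total


def average_es_closed_form(n):
    """Shortcut for moderate to large n: sqrt(pi / 2) * ln(2) / sqrt(n)."""
    return math.sqrt(math.pi / 2) * math.log(2) / math.sqrt(n)


for n_total, n_hits in [(1000, 10), (7000, 100), (20000, 500)]:
    n = effective_n(n_total, n_hits)
    print(n_total, n_hits, round(n, 3),
          round(average_es(n), 4), round(average_es_closed_form(n), 4))
```

The printed values can be compared with the tabulated values of n and es for the same choices of N and N_H.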
Figure (plots append) displays the concentration of the enrichment density as
$N_H$ increases for $N = 7{,}000$. The following table lists the average enrichment
score as a function of $N$ and $N_H$.

N        N_H    n        es
1,000    10     9.9      0.2761
1,000    50     47.5     0.1260
1,000    100    90.0     0.0916
1,000    500    250.0    0.0549
7,000    10     9.986    0.2749
7,000    50     49.64    0.1233
7,000    100    98.57    0.0875
7,000    500    464.3    0.0403
20,000   10     9.995    0.2748
20,000   50     49.88    0.1230
20,000   100    99.5     0.0871
20,000   500    487.5    0.0393

Label scrambling:
As described in the body of the paper, we estimate significance by permuting the
class labels, rather than randomizing the members of the gene sets. This method
of assessing significance preserves the correlation between genes and generally
yields larger p-values since genes are dependent. We will use simulations rather
than analytic results to compute the analogous quantities from the previous
section, i.e., the density of the enrichment statistic and the average enrichment
score. Note that these quantities will be data dependent.
The enrichment score is now computed as a function of the dataset $D$, the gene
set $G$, and a ranking procedure $R$: $ES(G, D, R)$. The null distribution of the
enrichment score is computed using label permutations as described in section
(methods) of the main body of the paper. We cannot analytically compute this
distribution as we could for gene tag randomization, but we can numerically
estimate it. The output of the label permutation procedure is a set of enrichment
scores

$$S = \left\{ ES(G, D_{\pi(1)}, R), \ldots, ES(G, D_{\pi(\Pi)}, R) \right\},$$

computed over $\Pi$ label permutations. We can approximate the density of the
enrichment statistic as

$$p(\lambda) \approx \operatorname{histogram}(S). \tag{1.5}$$


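The average enrichment score can be approximated as

$$es = E_{p(\lambda)}\left[ ES(G, D, R) \right] \approx \frac{1}{|S|} \sum_{i=1}^{\#\mathrm{bins}} \mathrm{bin}_i \, |S_{\mathrm{bin}_i}|, \tag{1.6}$$

where $\#\mathrm{bins}$ is the number of bins in the histogram, $\mathrm{bin}_i$ is the average value of
the $i$th bin, and $|S_{\mathrm{bin}_i}|$ is the number of elements in the set of scores $S$ with
values in the range of the $i$th bin. The counts are divided by the total number of
scores $|S|$ so that the bin weights sum to one.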
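The label permutation procedure and the histogram average can be sketched as follows. This is a toy illustration under stated assumptions, not the GSEA implementation: the ranking uses a signal-to-noise ratio of the form (mean difference) / (sum of standard deviations), the enrichment score is the unweighted running sum's maximum deviation from zero, the bin value is taken to be the bin midpoint, and the data are synthetic Gaussian noise. All function names are illustrative.

```python
import random
import statistics


def signal_to_noise(values, labels):
    """(mean_0 - mean_1) / (sd_0 + sd_1) for one gene; labels are 0 or 1."""
    a = [v for v, c in zip(values, labels) if c == 0]
    b = [v for v, c in zip(values, labels) if c == 1]
    return ((statistics.mean(a) - statistics.mean(b))
            / (statistics.stdev(a) + statistics.stdev(b)))


def enrichment_score(data, labels, gene_set):
    """Rank genes by signal to noise, then return the maximum deviation
    from zero of the unweighted running sum."""
    ranked = sorted(data, key=lambda g: signal_to_noise(data[g], labels), reverse=True)
    n_hits = sum(1 for g in ranked if g in gene_set)
    n_miss = len(ranked) - n_hits
    running = best = 0.0
    for g in ranked:
        running += 1 / n_hits if g in gene_set else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_scores(data, labels, gene_set, n_perm, seed=0):
    """Set S of enrichment scores computed over n_perm label permutations."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        scores.append(enrichment_score(data, shuffled, gene_set))
    return scores


def histogram_average(scores, n_bins=20):
    """Sum of bin midpoint times bin count, divided by the number of scores."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return lo
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in scores:
        counts[min(int((s - lo) / width), n_bins - 1)] += 1
    return sum((lo + (i + 0.5) * width) * c for i, c in enumerate(counts)) / len(scores)


# Synthetic dataset: 100 genes, 10 + 10 samples, pure noise.
rng = random.Random(1)
labels = [0] * 10 + [1] * 10
data = {f"g{i}": [rng.gauss(0, 1) for _ in labels] for i in range(100)}
gene_set = {f"g{i}" for i in range(15)}

S = permutation_scores(data, labels, gene_set, n_perm=50)
print(histogram_average(S), sum(S) / len(S))
```

The histogram average differs from the plain mean of the scores by at most half a bin width, so with enough bins the two printed values are close.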
3 Additional applications of GSEA.
This section describes additional applications of GSEA that were omitted from
the main text because of size restrictions. The first illustrates the use of GSEA to
assess the enrichment of a single gene set to test a specific hypothesis. The
second shows the detailed results of applying GSEA to the original diabetes
dataset that was used in Mootha et al. 2003. The third example illustrates the use
of GSEA in a Downs syndrome dataset where it is appropriate to use over-weighting
(square of correlation, p=2) in the enrichment score.
3.1 Sonic hedgehog (Shh) pathway. In this example we use GSEA with a pre-selected
gene set to assess its enrichment as part of a single hypothesis. The
dataset consists of several medulloblastoma human samples. Medulloblastomas,
the most common malignant brain tumor of childhood, have two generally
accepted histological subclasses: desmoplastic and classic, whose differences
can be seen clearly under the microscope. Desmoplastic medulloblastomas
have been linked to dysregulated signaling of the Sonic hedgehog (Shh) pathway
by their occurrence in Gorlin’s syndrome, an autosomal dominant disorder due to
germline mutations of the Shh receptor PTCH or of SuFu, a downstream member of
the pathway (Johnson RL, et al. Science 1996; 272:1668-1671; Hahn H, et al.
Cell 1996; 85:841-851; Taylor et al., Nat Genet 2002; 31:306-10).
This GSEA application is based on the earlier work in (Pomeroy et al. 2002)
where a cluster of Shh-regulated genes was found to be among the most highly
expressed marker genes significantly associated with sporadic desmoplastic
medulloblastomas. The appearance of these genes implied that sporadic
desmoplastic medulloblastomas, like Gorlin’s syndrome tumors, are
characterized by activation of the Shh signaling pathway. This identification was
done by a careful manual examination of highly differentially expressed genes.
GSEA was used to evaluate the enrichment of Shh genes in the gene list ranked
by correlation with the classic vs. desmoplastic distinction using Pomeroy et al.
2002 dataset B. The Shh gene set was defined by manual curation from the
literature and from previous experimental results. This set of 21 probes was
indeed found to be significantly associated with desmoplastic tumors (GSEA p-val
= 0.018). The enrichment results are shown in figure SF3. Notice that not
all the genes in the pathway show coordinated behavior but enough of them
cluster at the top of the list to provide significant enrichment on the desmoplastic
phenotype.
Figure SF. GSEA results for the sonic hedgehog pathway (Shh) in
medulloblastoma. This set of genes is enriched and attains a nominal p-value of
0.018.
The enrichment results for SHH are shown in the table below. The full set of
result files for this example can be found in the GSEA/Examples/PTCH folder.
3.2 Diabetes.
This example presents the detailed results of applying the new GSEA method
described in this paper to the original dataset of Mootha et al. 2003.
Type 2 diabetes mellitus is a complex human disease, with both genetic and
environmental factors. Numerous pathways, such as insulin signaling, free fatty
acid metabolism, glucose transport, and ATP production, have been implicated
in both in vitro and in vivo models of the disease. However, microarray studies of
skeletal muscle, one of the major sites of insulin mediated glucose disposal,
failed to reveal any consistent, robust insights into disease mechanisms.
Skeletal muscle biopsies from diabetics and normal controls have not shown
large differences in gene expression (Mootha et al. 2003). The original GSEA
method was used to systematically interrogate the enrichment of a large
collection (approximately 150) of functional gene sets in differentially expressed
genes in 17 samples of skeletal muscle biopsies of patients with normal glucose
tolerance (NGT) and 17 samples of diabetes mellitus (DM2). Using traditional
single-gene analysis, no single gene was significantly differentially expressed
between these classes. As reported in Mootha et al. 2003 this result is consistent
with previous studies of DM2 muscle. While no single gene may show significant
expression differences, entire pathways might be different between these
disease states and GSEA may be able to detect differential pathway enrichment.
The new version of GSEA was applied to this same dataset using the
functional gene set database S2, which includes many of the 150 gene sets from
the original paper, but also about 250 additional sets. Results are shown in the
table below. There are two gene sets that pass the FDR < 0.25 threshold and are
enriched in NGTs: VOXPHOS (oxidative phosphorylation, FDR = 0.08) and
Electron Transport Chain (FDR = 0.05). This is consistent with the results from
Mootha et al. 2003 and is striking because the members of the oxidative
phosphorylation set show only a modest decrease (~15%) in DM2 vs. NGT
normal controls as individual gene markers. However, from the perspective of the
entire set, the difference is very strong. 87 out of the 112 members of the
OxPhos pathway are diminished in DM2 relative to NGT. The new GSEA method
continues to effectively detect this difference as reflected by the enrichment test.
The following table shows the top enrichment results for the Diabetes dataset.
The full set of result files for this example can be found in the
GSEA/Examples/Diabetes_S2 folder.
3.3 Downs syndrome.
Down syndrome was the first chromosomal disorder to have been clinically
identified (Down London Hosp Clin Lect Rep 3:259, 1866). It is characterized by
trisomy 21, and results in mental retardation, dysmorphic facies, and hypotonia.
In this example, GSEA was applied to gene expression profiles obtained from
bone marrow of 14 individuals with Down syndrome, as well as from 25 normal
controls. We sought to determine whether “chromosomal gene sets” showed
enrichment either in individuals with Down syndrome or in the controls.
When we tested all autosomal as well as X and Y chromosomal gene sets,
four sets were enriched in DS samples (see table below): chr21, chr21q21,
chr21q22 and chr7p21. Note that only the following bands are used to probe the data
set as we restricted to sets with at least XXX genes. These results clearly
indicate that the genes on chromosome 21 are more highly expressed in
individuals with DS, compared to controls. The results are consistent with the
gene dosage hypothesis (J Neurol. 2002 Oct; 249(10):1347-56), which suggests
that DS results from a loss of dosage compensation (i.e., high expression of
chromosome 21 genes). The enrichment of chr21 and some of its cytogenetic
bands is clearly at the top of the list, but these sets do not achieve significance at FDR
< 0.25 unless one uses the p=2 over-weighting parameter in the enrichment
score. Entire chromosomes or large cytogenetic bands (e.g. chr21q22) are not
likely to produce strong enrichment results due to the difficulty of producing
coordinated expression behavior in such a large set of genes. (Note that the Y
chromosome in the Gender example in the main body of the paper is an
exception because it is rather small and is also an all-or-nothing signal that
produces overwhelming enrichment.) In this situation, the over-weighting of the
correlations at the top or bottom of the list can expose a subtle biological signal
and increase the likelihood that such sets achieve significance. Thus, setting p=2 in the
enrichment score can be a useful tool but should be used with caution as it can
also produce undesirable false positives. This is the only example in this paper
that requires the use of p=2.
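The effect of the over-weighting exponent can be illustrated with a small sketch. It assumes the weighted running-sum form in which each gene set member increments the sum in proportion to |r|^p, where r is the gene's correlation with the phenotype, and each non-member decrements it by a constant. The function name and the toy correlations are illustrative only.

```python
def weighted_es(correlations, in_set, p=1.0):
    """Running-sum enrichment score with hits weighted by |r|^p.

    correlations: gene correlations with the phenotype, in sorted list order
    in_set:       parallel booleans marking gene set members
    """
    hit_weights = [abs(r) ** p if tagged else 0.0
                   for r, tagged in zip(correlations, in_set)]
    total_hit = sum(hit_weights)
    n_miss = len(in_set) - sum(in_set)
    running = best = 0.0
    for w, tagged in zip(hit_weights, in_set):
        running += w / total_hit if tagged else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


# Toy ranked list: one strongly correlated member at the top, weaker ones below.
r = [2.0, 1.5, 1.0, 0.5, 0.2, -0.1, -0.5, -1.0]
tags = [True, False, False, True, False, True, False, False]
for p in (0, 1, 2):
    print(p, round(weighted_es(r, tags, p), 3))
```

With p=0 every member counts equally; raising p shifts the weight toward members with the largest correlations, which in this toy list increases the score because the strongest member sits at the top.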
The following table shows the top enrichment results for the Downs dataset. The
full set of result files for this example can be found in the
GSEA/Examples/Downs_S1 folder.
4 Summary of top enrichment results for examples in paper
4.1 Gender S1
The following table shows the top enrichment results for the Gender dataset
using S1. The full set of result files for this example can be found in the
GSEA/Examples/Gender_S1 folder.
Enriched in Male
GS  SIZE  SOURCE  ES  NES  NOM p-val  FDR q-val  FWER p-val  Tag %  Gene %  Signal  FDR (median)  glob.p.val
chrY      67   Chromosome Y      -0.71693  -2.3868  0        0       0.358  0.101   0.323
chrYq11   27   Cytogenetic band  -0.82603  -2.2437  0        0       0.333  0.0074  0.331
chrYp11   27   Cytogenetic band  -0.68959  -2.1303  0        0       0.481  0.165   0.403
chr4q13   79   Cytogenetic band  -0.43339  -1.5385  0.01394  0.6026  0.266  0.122   0.234
chr13q13  35   Cytogenetic band  -0.45164  -1.5205  0.03061  0.6286  0.314  0.159   0.265
chr11q22  44   Cytogenetic band  -0.44014  -1.4724  0.05848  0.6927  0.364  0.216   0.286
chr21q22  198  Cytogenetic band  -0.34127  -1.4524  0.02464  0.7147  0.369  0.225   0.288
chr6q24   35   Cytogenetic band  -0.43521  -1.4252  0.06452  0.7477  0.286  0.204   0.228
chr21     243  Chromosome 21     -0.31274  -1.3836  0.02725  0.7808  0.346  0.225   0.271
chr9q21   68   Cytogenetic band  -0.34658  -1.3476  0.07475  0.8158  0.338  0.208   0.269
chr2q32   57   Cytogenetic band  -0.38534  -1.2976  0.1515   0.8539  0.281  0.173   0.233
chr6q23   51   Cytogenetic band  -0.3667   -1.2561  0.2096   0.8929  0.216  0.0973  0.195
chr15q14  26   Cytogenetic band  -0.36987  -1.218   0.2351   0.9009  0.615  0.344   0.404
chr7q21   97   Cytogenetic band  -0.33056  -1.1911  0.2242   0.9109  0.247  0.169   0.206
chr5p13   72   Cytogenetic band  -0.3143   -1.1796  0.2672   0.9139  0.319  0.198   0.257
chr7q11   79   Cytogenetic band  -0.32827  -1.1769  0.2429   0.9149  0.266  0.185   0.217
chr9p21   42   Cytogenetic band  -0.33074  -1.1763  0.2426   0.9159  0.333  0.171   0.277
chr10p12  39   Cytogenetic band  -0.35044  -1.1453  0.2897   0.9299  0.333  0.197   0.268
chr16q22  118  Cytogenetic band  -0.27433  -1.1296  0.2735   0.9339  0.153  0.0913  0.139
chrXq22   60   Cytogenetic band  -0.29782  -1.1127  0.3247   0.9379  0.35   0.259   0.26
Enriched in Female
GS  SIZE  SOURCE  ES  NES  NOM p-val  FDR q-val  FWER p-val  Tag %  Gene %  Signal  FDR (median)  glob.p.val
chrXq13   56   Cytogenetic band  0.57017  2.096   0        0       0.286  0.143   0.246
chr6q15   36   Cytogenetic band  0.50534  1.57    0.03373  0.5676  0.472  0.218   0.37
chrXp22   124  Cytogenetic band  0.35171  1.49    0.02107  0.6867  0.258  0.129   0.226
chr12q23  71   Cytogenetic band  0.40227  1.4586  0.06757  0.7247  0.352  0.17    0.293
chr2q14   29   Cytogenetic band  0.50669  1.4565  0.06827  0.7267  0.414  0.177   0.341
chr2p11   67   Cytogenetic band  0.40493  1.4491  0.05437  0.7327  0.224  0.104   0.201
chrXq24   35   Cytogenetic band  0.43249  1.3527  0.1235   0.8058  0.314  0.141   0.27
chr12q22  26   Cytogenetic band  0.43967  1.2993  0.18     0.8408  0.308  0.133   0.267
chr11p11  71   Cytogenetic band  0.32448  1.2677  0.1299   0.8579  0.211  0.13    0.184
chr2q31   87   Cytogenetic band  0.32082  1.2242  0.2239   0.8939  0.437  0.278   0.317
chrXq21   54   Cytogenetic band  0.34284  1.2091  0.2329   0.8999  0.352  0.219   0.275
chr13q14  90   Cytogenetic band  0.30353  1.171   0.2463   0.9179  0.3    0.196   0.242
chr1p13   120  Cytogenetic band  0.29615  1.1565  0.2871   0.9209  0.283  0.194   0.229
chr12q13  262  Cytogenetic band  0.26886  1.1522  0.2754   0.9229  0.359  0.239   0.276
chr5q15   29   Cytogenetic band  0.35382  1.1496  0.298    0.9249  0.31   0.113   0.275
chr1p21   36   Cytogenetic band  0.33477  1.1481  0.2945   0.9259  0.222  0.0554  0.21
chr3q29   53   Cytogenetic band  0.34099  1.1467  0.2925   0.9259  0.377  0.222   0.294
chr13q22  25   Cytogenetic band  0.37601  1.146   0.2979   0.9269  0.28   0.114   0.248
chr12q15  30   Cytogenetic band  0.34899  1.1352  0.2777   0.9309  0.433  0.221   0.338
chr1p32   68   Cytogenetic band  0.30976  1.1283  0.292    0.9339  0.353  0.252   0.265
4.2 Gender S2
The following table shows the top enrichment results for the Gender dataset
using S2. The full set of result files for this example can be found in the
GSEA/Examples/Gender_S2 folder.
4.3 P53 S2
The following table shows the top enrichment results for the P53 dataset using
S2. The full set of result files for this example can be found in the
GSEA/Examples/P53_S2 folder.
4.4 P53 S3
The following table shows the top enrichment results for the P53 dataset using
S3. The full set of result files for this example can be found in the
GSEA/Examples/P53_S3 folder.
4.5 Leukemia S1
The following table shows the top enrichment results for the Leukemia dataset
using S1. The full set of result files for this example can be found in the
GSEA/Examples/ALLAML_S1 folder.
4.6 Leukemia S2
The following table shows the top enrichment results for the Leukemia dataset
using S2. The full set of result files for this example can be found in the
GSEA/Examples/ALLAML_S2 folder.
4.7 Lung A S2.
The following table shows the top enrichment results for the Lung A dataset
using S2. The full set of result files for this example can be found in the
GSEA/Examples/Lung_A_S2 folder.
4.8 Lung B S2.
The following table shows the top enrichment results for the Lung B dataset
using S2. The full set of result files for this example can be found in the
GSEA/Examples/Lung_B_S2 folder.
5 Defining gene sets and gene set databases.
GSEA can easily be used in combination with any ordering technique and any
annotation or other gene set source. The selection of genes to include in a gene
set depends on the question being asked. For example, to test for the presence
of a growth factor signal transduction pathway, the gene set might include
ligands, receptors, and known intermediate molecules that transmit the signal to
the nucleus. Activation of a pathway can be assessed by including genes known
to be transcriptionally regulated by the pathway. In all cases, some genes will be
highly specific to the pathway (e.g., PTCH and SuFu in the Shh pathway) whereas
other genes will be more general (e.g., RAS and MAPK) and less likely to be
differentially expressed across samples or conditions. Both general and specific
genes can be included, although genes with low specificity for the pathway will
potentially lower the sensitivity of the test. Gene sets can be culled from Gene
Ontology ( ), from compilations of pathways such as KEGG ( ), GenMAPP ( ),
HumanCyc ( ) and CGAP ( ) or sequence databases such as TRANSFAC ( ).
Gene sets can also be identified from a group of genes clustered together (i.e.,
co-expressed) in an experiment, genes previously implicated in disease
pathophysiology, genes in the same cytogenetic band, etc.
In some studies there may be limited previously curated information about
pathways or biological processes. In other cases, one may want to build a
systematic database of gene sets that represents biological processes relevant
to a large class of biological systems (e.g., tumors of many types). In both cases,
it is very helpful to computationally define gene sets according to an analysis
algorithm that extracts relevant molecular signatures from a large gene
expression compendium. For the purposes of this paper we built a collection of
databases of gene sets that can be used to probe microarray data sets:

Database S1 (chromosomal location): This database consists of 24 sets
corresponding to the genes on each of the 24 human chromosomes, as
well as 301 sets corresponding to cytogenetic bands. This database can
be helpful in identifying effects related to epigenetic silencing, dosage
compensation, copy number polymorphisms, and aneuploidy or other
chromosomal deletions/amplifications.

Database S2 (functional): This database includes 475 metabolic and
signaling pathways gleaned from 8 publicly available manually curated
databases. In addition, there are 51 sets representing gene expression
signatures of genetic and chemical perturbations that have been culled
from experimental results in the literature.

Database S3 (motif-based): Each set contains genes that lie downstream
of a motif that is conserved across the human, mouse, rat, and dog
genomes. The motifs are catalogued in [Xiaohui Xie, et al.] and represent
known or likely regulatory elements in promoters and 3’-untranslated
regions.

Database S4 (correlated): Correlation gene sets are groups of genes
defined by computationally mining large-scale experimental datasets for
co-expressed genes.
As some versions of these databases were built at different times according to
when the analysis for each example was performed, we provide the specific
versions named after the example where they were used. This allows full
reproducibility of each example. Up-to-date, microarray-specific “canonical”
versions of these databases are also distributed with the GSEA software and
those are the ones recommended for use in new examples and applications. In
addition we are in the process of creating a web site where these databases
can be created and downloaded on a continuous basis.
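The gene set databases distributed with the examples use the .gmt extension. As a minimal sketch, assuming the usual GMT layout of one gene set per line with tab-separated fields (set name, description, then the member genes), such a file could be read as shown below. The function names, the sample set names, and the placeholder gene symbols are hypothetical.

```python
import io


def read_gmt(handle):
    """Return {set name: (description, [genes])} from GMT-formatted lines."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, description = fields[0], fields[1]
        genes = [g for g in fields[2:] if g]
        gene_sets[name] = (description, genes)
    return gene_sets


def filter_by_size(gene_sets, min_size):
    """Keep only the sets with at least min_size genes."""
    return {name: entry for name, entry in gene_sets.items()
            if len(entry[1]) >= min_size}


# Hypothetical two-set database; the gene symbols are placeholders.
sample = (
    "chrExample\tCytogenetic band\tGENE1\tGENE2\tGENE3\n"
    "setB\tExample pathway\tGENE2\tGENE4\n"
)
db = read_gmt(io.StringIO(sample))
print(sorted(db))
print(list(filter_by_size(db, min_size=3)))
```

A size filter of this kind mirrors the restriction to sets with a minimum number of genes mentioned in the examples; the threshold used here is arbitrary.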
6 Running GSEA with the GSEAPACK R package.
The GSEA program is provided in this paper’s web site in two ways: as a
standalone R package including documentation (GSEAPACK-1.0.zip), and as an
analysis module in the GenePattern environment (ref). There is also a zip file
(GSEA.Examples.zip) that contains all the data, R scripts and results of the
examples described in the paper.
Running the R package: These are the instructions to run GSEA on your
machine. You need to install release 2.0 or later of R.




Copy the GSEAPACK-1.0.zip file to your computer.

Install the GSEAPACK-1.0.zip package in your R environment by running
the Rgui and then clicking on “install package(s) from local zip files” in
the “packages” menu. Once this is completed type “library()” in the R
prompt and you should see a list of packages including an entry for
GSEAPACK. Type “library(help=GSEAPACK)” to see all the functions
included in the package. To run GSEA as a user you will typically only call
the GSEA() main function.

To load the package type “library("GSEAPACK")” and then “help("GSEA")”.
This opens the documentation page for the main GSEA function.

Now you can run a demo of the code by typing “demo(allaml.demo)”.
This will execute a short run (a few random permutations) of the ALL/AML
example. It will take a few minutes and it should produce the outputs
described in the “description of the GSEA output” earlier in this document.
This short run is intended only as a demo; to reproduce the
results reported in the Leukemia Example section of this paper one has to
run 1000 permutations, which will take over an hour of CPU time.



If the package installation fails don’t panic; you can still try to run the code
from raw source files as will be described below.
If you are ready to run the examples do the following:
Copy the file GSEA.Examples.zip to your machine. Expand the zip file in a
location of your file system of choice (check that the option to expand
subfolders is active). In that location a tree of subdirectories should be
created:
GSEA/Examples/ (R scripts and one folder for each example: ALLAML_S1 etc.)
GSEA/method/ GSEA.R (the R program)
GSEA/AnnotationFiles/ (Affymetrix annotation files, e.g. )
GSEA/GeneSetDatabases/ (gene set databases e.g. s1.allaml.genesetdb.gmt )
GSEA/GSEAPACK-1.0.zip (copy of the GSEAPACK R package)
The R scripts that run each individual example are under
GSEA/Examples. For example the script “Run.ALLAML_S1.R” runs the
Leukemia example. Before running this file (e.g. by cutting and pasting it
into the RGui window) make sure you modify the file pathnames to be
consistent with the location in your file system where you expanded the zip
file and created the GSEA examples’ subfolders. These scripts load the
GSEA program by performing a “library("GSEAPACK")” call. If for any
reason you had problems installing or loading the GSEAPACK package
you can try to run the scripts in such a way that they load the R source
program from GSEA/method/GSEA.R rather than from the installed
package. All you need to do is to comment out the “library("GSEAPACK")”
line (put a “#” in front of it) and un-comment the two lines of code below:
“GSEA.program.location…” and “source(…..)”. If you do this make sure
you modify the pathname to the GSEA.R location too.
When you run those scripts you should obtain results identical to those
reported in this document and included in the GSEA/Examples
subfolders (the random number generator seeds are explicitly set). If you
overwrite the result files when you run your version of the scripts you can
always get a copy of the originals from the zip file.
If you want to run a new dataset with GSEAPACK the easiest way is to
create a new directory under GSEA/Examples/<my dataset> and then
copy and modify for example Run.ALLAML_S1.R to point to that directory
and use the right files.
7 Running GSEA under GenePattern.