A statistical framework for genome-wide association analysis of gene sets
Qing Xiong: Institute for Genome Sciences & Policy, Duke University, Durham NC 27708.
[email protected].
Nicola Ancona: Institute of Intelligent Systems for Automation National Research Council, Bari IT 70126.
[email protected]
Elizabeth R. Hauser: Section of Medical Genetics, Department of Medicine, Center for Human Genetics,
Duke University Medical Center, Durham NC 27708. [email protected]
Sayan Mukherjee: Departments of Statistical Science, Computer Science, Mathematics, Institute for
Genome Sciences & Policy, Duke University, Durham NC 27708. [email protected] Corresponding
author
Terrence S. Furey: Institute for Genome Sciences & Policy, Duke University, Durham NC 27708.
[email protected] Corresponding author
Summary
Since single-variant/gene analyses only account for a small proportion of phenotypic variation in complex
traits and can result in a high rate of false positives, a variety of statistical or computational methods have
been developed to identify gene expression or genetic variation associated with two experimental
conditions in the pathway context. However, a computational platform for joint analysis of multiple types
of genomic data in a single statistical framework is currently not available. We propose a novel
methodology, Gene Set Association Analysis (GSAA), that integrates gene expression analysis with
genome-wide association (GWA) studies to extract biological insight. GSAA yields insights by taking a
priori defined gene sets (groups of genes that are putatively functionally related, share similar location, or
are commonly regulated) and inferring which sets of genes are both differentially expressed and contain
associated genetic markers with respect to phenotypes or traits. Simulation studies illustrate the increase
in power from a joint analysis of genomic data using GSAA with respect to gene set methods that use
only one genomic source or a regression-based method. When applied to a real data set, our method
not only confirms the association findings disclosed by other canonical methods, but also identifies a new
candidate pathway significantly altered in glioblastoma, which may suggest a potential core mechanism
leading to the disease. In addition, our method can reveal potential genetic variants associated with gene
expression variation and whether or not these functional variants are enriched in some specific pathways
underlying complex traits.
1 Introduction
Dissecting the genetic and molecular mechanisms underlying complex traits and diseases has been one of
the key scientific goals in the post-genomic era. Two major methods for probing the correlation between
genes or genetic variants and phenotype, and for defining the heritable genetic architecture of a trait, are
measuring the degree of differential expression of genes and measuring the differential enrichment of
genotypes or alleles in the genome between two phenotypic classes or two experimental conditions.
The dominant paradigm in past decades has been to identify single
genetic variants or genes most correlated with phenotypic class distinction or disease susceptibility.
However, single locus/gene analysis in general identifies only a few of the most significant single nucleotide
polymorphisms (SNPs) or genes, which account for a small proportion of phenotypic variation in complex
traits or diseases. There is an ever-increasing need for pathway/gene set-based
analyses to more accurately and systematically identify the cellular processes altered in human diseases.
Pathway-based approaches can be superior to single-factor analyses with respect to at least four aspects.
First, complex traits or diseases are characterized by intricate interactions of multiple genetic variants,
genes or even pathways. It is the joint action of a variety of multi-layer genetic/information structures
such as transcriptional modules, signaling cascades, and metabolic pathways that eventually produces the
phenotypes we observe in nature. Pathway-based analyses can capture the differential activity of an
entire structure associated with a binary trait and the interaction between distinct components, thus more
accurately measuring the impact of these genetic structures on traits. Second, they can detect modest or
weakly coordinated changes in either gene expression or sequence variation in an a priori defined gene set.
This joint analysis can elicit a significant biological effect even if changes in any individual gene have a
small effect or are not significant at all. Importantly, this setting has been considered dominant in
many pathological processes 1; 2. Third, single-factor analyses can result in a high rate of false positives
because of the inherent noise in gene expression data or population substructure and locus heterogeneity
in SNP data. Pathway-based analyses can weaken the negative effects of perturbations not associated with
the trait of interest by inferring association from sets of biologically related genes, and can therefore produce
more consistent results across different studies. Fourth, they significantly facilitate the interpretation of the
association findings by incorporating prior knowledge of biological pathways into association inference.
A variety of statistical or computational methods have been developed to identify variation in pathway
activity or function associated with a binary trait. The two types of genomic data used in these methods
have been either gene expression data or SNP genotype data and they are usually analyzed separately.
Accordingly, existing pathway-based approaches can be divided into two categories, expression-based
and SNP-based. Most approaches were designed for expression-oriented analyses. Gene set enrichment
analysis (GSEA) adopted a weighted Kolmogorov-Smirnov (K-S) -like running-sum statistic to quantify
enrichment of differential expression in a gene set, which reflects the degree to which the gene set is
associated with a particular trait, by assessing whether genes in the gene set are randomly distributed
throughout the entire ranked list of genes in the genome or are overrepresented at the top or bottom 3.
GSEA employed a sample randomization strategy to assess the statistical significance of association
findings. This strategy can preserve the complex correlation structure of the gene expression data
therefore it is more suitable for biological experiments than gene randomization because the latter is
based on an unrealistic independence assumption between genes 4. An extension of significance analysis
of microarray (SAM), SAM-GS, also follows this paradigm of randomization. It tests a null hypothesis
that the mean vectors of expressions of genes in a gene set do not differ by the phenotype of interest based
on a SAM t-like statistic 5. Like SAM-GS, several other methods also use a mean-based test statistic for
scoring gene sets, e.g. parametric analysis of gene set enrichment (PAGE) 6, generally applicable gene-set
enrichment (GAGE) 7, T-profiler 8, and random-set (RS) 9. The difference is that GAGE and T-profiler
use a two-sample t-test while PAGE and RS employ a one-sample z-test. Very different strategies were
adopted by global test (GT) and ANOVA global test (AGT) 10; 11. These two global tests measure
differential expression of a gene set by testing whether similar class labels are associated with similar
gene expression patterns in the gene set based on a logistic regression model and an ANOVA model,
respectively. Gene list analysis with prediction accuracy (GLAPA) was also developed to assess
deregulation of a gene set by using a predictive statistic based on the prediction accuracy of the phenotype
of new subjects 12. Each method has its advantages and limitations; detailed discussion of issues and
underlying assumptions can be found in several comparative studies and reviews 4; 13-16. Comparatively
fewer SNP-based methods have been proposed. In SNP data, each gene is represented by a varied number
of SNPs, so representation of a gene by a single value is essential. Wang et al. 17 first tailored the GSEA
strategy to suit SNP data. They mapped SNPs to their closest gene and represented the gene by the
maximum test statistic value among all SNPs mapped to it. Peng et al. 18 considered only those SNPs
within the gene. They combine p-values of all SNPs within the gene into an overall p-value for the gene
and then p-values of all genes in the pathway into an overall p-value for the pathway. This approach
assumes that SNPs in the gene are independent so it can be applied to haplotype-tagging SNP data. Two
computational tools, SNP ratio test (SRT) 19 and GSEA_SNP 20, are also available for SNP-oriented
pathway analyses. SRT calculates an empirical p-value for each pathway by comparing the ratio of the
proportion of significant SNPs to all SNPs within genes that are part of the pathway to a null ratio
distribution generated by phenotype randomization of the dataset. GSEA_SNP is an extension of GSEA
algorithms. It converts the gene set to a corresponding SNP set and then employs the same procedure as
GSEA to test whether SNPs in the SNP set are significantly enriched at the top of the ranked list of all
SNPs. A recent paper 21 proposed an eSNP-based approach which first identifies a group of SNPs that are
associated with changes in gene expression, called eSNPs, and then examines whether these eSNPs are
enriched in specific pathways. This method does not incorporate gene expression information into
the inference of association of pathways. It uses gene expression information only as a tool for filtering
out those SNPs not driving changes in gene expression, so it is primarily a one-source
pathway analysis.
Expression-based and SNP-based pathway analyses are becoming increasingly popular. However, an
integrative platform for combining these two forms of genomic data in a single analysis is currently not
available. This is the central motivation of this study. We argue that integrating these two types of
heterogeneous but complementary data can provide greater information on the possible molecular
pathways underlying phenotypic class distinction and make the inference of association more robust and
comprehensive. In this study, we extended GSEA to incorporate SNP data. Like GSEA, we test the null
hypothesis that genes in a gene set are randomly distributed over the list of all genes ranked by their
correlation with the phenotype. The novel idea is to combine two types of evidence to infer association
of genes/gene sets with the phenotypic categories. One source of evidence is derived from an expression-based test which gauges the enrichment of differential expression in the gene set; the other is from a SNP-based test which assesses the enrichment of significant associations in the gene set. We calculate an
overall enrichment score (ES) for each gene set by a weighted K-S-like statistic to measure the aggregated
effect of multiple correlated members in the set while filtering out or reducing the effect of pseudo signals
or noise. Our method has three advantages compared to other pathway-based approaches solely based on
gene expression data or SNP data: 1) joint analysis of gene expression variation and sequence variation
can increase the likelihood to identify genuine association signals and reduce the effects of inherent noise
in gene expression and SNP data on association inference; 2) variation in sequence can be the
fundamental cause driving the change of gene expression directly or indirectly, so the integrative analysis
assists in identifying genetic variants associated with gene expression variation and phenotypic variation
in the pathway context and whether or not these functional variants are enriched in some specific
pathways underlying complex traits or diseases; 3) our method assesses the enrichment of differentially
expressed genes and the enrichment of statistically significant SNPs in a gene set by a single test. This
greatly facilitates interpreting association findings and elucidating the regulatory mechanism in the
pathway.
We have developed Java-based software, called gene set association analysis (GSAA), based on our
algorithms and it is freely available at gsaa.igsp.duke.edu (not sure). The software includes a user-friendly
and straightforward graphical user interface and provides full support for the visualization of results. A
separate module called gene set association analysis-SNP (GSAA-SNP) is also offered for pathway-based
analysis solely based on SNP genotype data.
2 Method
The strategy we adopted for gene set association analysis is a model based on multi-layer association tests.
The advantage of this model is that it can effectively capture the association signal carried by the expression
profile of a single gene and the association signal carried by the genotypes of a single SNP, forming a chain
of evidence from SNP to gene and then from gene to gene set that makes the inference of gene set
associations more robust and comprehensive. Figure 1 is an overview of the method.
Figure 1. The overview of GSAA
2.1 Multi-layer association tests
GSAA infers association of gene sets by a series of differential expression tests of genes and differential
enrichment tests of genotypes or alleles between two groups of samples belonging to two phenotypic
classes, C1 and C2.
2.1.1 Differential gene expression test
Differential expression can be measured by any suitable metric. In this paper, the test statistic used is the
difference of means scaled by the standard deviation
$$r = \frac{\mu_1 - \mu_2}{\sigma_1 + \sigma_2}$$
where $\mu_1$ and $\sigma_1$ are the mean and standard deviation for class C1, and $\mu_2$ and $\sigma_2$ for class C2. The
larger the absolute value of the test statistic, the more strongly the gene expression is associated with the
phenotypic class distinction.
In the software, we provide the same eight statistics for differential expression as provided by GSEA.
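As a minimal sketch of this statistic, the following Python function computes the difference of class means scaled by the class standard deviations. It assumes the scaling is the sum of the two standard deviations (the signal-to-noise form used in GSEA) and that sample standard deviations are used; the function name and the toy expression values are illustrative only and are not taken from the GSAA software.

```python
from statistics import mean, stdev


def signal_to_noise(class1, class2):
    """Difference of class means scaled by the sum of class standard deviations.

    Assumption: sample standard deviations (n - 1 denominator) are used.
    """
    mu1, mu2 = mean(class1), mean(class2)
    sigma1, sigma2 = stdev(class1), stdev(class2)
    return (mu1 - mu2) / (sigma1 + sigma2)


# Toy expression values for one gene (hypothetical data).
cases = [10.6, 10.4, 10.9, 10.5]
controls = [10.0, 9.8, 10.1, 10.3]

r = signal_to_noise(cases, controls)
print(r)  # positive: the gene is more highly expressed in the first class
```

Swapping the two classes flips the sign of the statistic, which is what gives the expression-based score its direction.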
2.1.2 Single-locus association test
In our software, single-locus association analysis can be carried out by three different statistics, a
genotype-based chi-square statistic, an allele-based chi-square statistic, and the difference of major/minor
alleles between two classes. We use a chi-square two sample test which is based on binned data. The
basic idea behind the chi-square two sample test is that the observed frequency in each bin should be
similar if the two data samples come from common distributions. If the data is divided into k bins then
the test statistic is calculated as
$$\chi^2 = \sum_{i=1}^{k} \frac{(K_1 R_i - K_2 S_i)^2}{R_i + S_i},$$
where
$$K_1 = \sqrt{\frac{\sum_{i=1}^{k} S_i}{\sum_{i=1}^{k} R_i}}, \qquad K_2 = \frac{1}{K_1},$$
$R_i$ is the observed frequency for bin $i$ for class C1, and $S_i$ is the observed frequency for bin $i$ for
class C2. $K_1$ and $K_2$ are scaling constants that are used to adjust for unequal sample sizes.
Suppose that the genotypes in the SNP dataset are coded as AA, AB, and BB. For the genotype-based
chi-square two sample test, the three bins are AA, AB, and BB. For the allele-based chi-square two sample
test, the two bins are A and B. The test statistic value represents the degree to which the SNP is correlated
with phenotypic class distinction. Larger values mean more significant correlation. For all data analyses
in this paper, we used the genotype-based single-locus association test because it can better capture the
interaction between alleles at a single locus. Simulations also indicated that results from the genotype-based test were slightly better than those from the allele-based test.
Other test statistics for single-locus analysis can easily be incorporated into the analysis.
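The binned two sample test can be sketched in a few lines of Python. This sketch assumes the usual form of the scaling constants, K1 = sqrt(sum of S_i / sum of R_i) and K2 = 1/K1, and it skips bins that are empty in both classes; the function name and the genotype counts are hypothetical.

```python
from math import sqrt


def chi_square_two_sample(r_counts, s_counts):
    """Chi-square two sample statistic on binned counts.

    r_counts, s_counts: observed frequencies per bin for class C1 and C2.
    Assumption: K1 = sqrt(sum(S) / sum(R)) and K2 = 1 / K1.
    """
    k1 = sqrt(sum(s_counts) / sum(r_counts))
    k2 = 1.0 / k1
    stat = 0.0
    for r, s in zip(r_counts, s_counts):
        if r + s == 0:
            continue  # a bin that is empty in both classes contributes nothing
        stat += (k1 * r - k2 * s) ** 2 / (r + s)
    return stat


# Genotype-based test: bins are AA, AB, BB (hypothetical counts).
case_counts = [30, 20, 10]
control_counts = [10, 20, 30]
print(chi_square_two_sample(case_counts, control_counts))

# Same genotype proportions with unequal sample sizes give a statistic near 0.
print(chi_square_two_sample([10, 20, 30], [20, 40, 60]))
```

For the allele-based variant, the same function would be called with two bins (counts of allele A and allele B) instead of three.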
2.1.3 SNP set association test
In SNP data, each gene and its regulatory regions are usually covered by multiple SNPs. In order to test
the association of genes with the phenotype, the first step is to map all SNPs in the SNP dataset to all
genes in the expression dataset. In our software users can specify how many base pairs (bp) upstream
and/or downstream of a gene to include in establishing the interval for SNP-gene mapping together with
the region within the gene. Then we use all SNPs within the interval to represent the gene. In the software
we designed three statistics for SNP set association test, the maximum statistic, the mean statistic, and a
weighted K-S-like statistic, discussed in detail in 2.1.5. The SNP set association test is based on the
results from single-locus analysis. For the maximum statistic, the maximum single-locus test statistic
among all SNPs mapped to the gene is used to score evidence for the SNP set. For the mean statistic,
we take the average of all the single-locus test statistics for SNPs mapped to the gene as a score. In this
paper, we use the maximum statistic as the statistic for testing SNP set association since in the
simulations this statistic performs best among the three statistics.
Other statistics for SNP set association analysis can be used.
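The mapping and scoring steps can be illustrated with a short Python sketch. It assumes a gene on the forward strand of a single chromosome (so that upstream means lower coordinates), and all identifiers, positions, and statistic values are hypothetical rather than taken from the GSAA software.

```python
def map_snps_to_gene(gene_start, gene_end, snp_positions, upstream=0, downstream=0):
    """Return the SNPs that fall inside the gene or its flanking window.

    Assumption: forward-strand gene, all SNPs on the same chromosome.
    """
    lo, hi = gene_start - upstream, gene_end + downstream
    return [snp for snp, pos in snp_positions.items() if lo <= pos <= hi]


def snp_set_score(snp_ids, snp_stats, method="max"):
    """Summarize single-locus statistics of the mapped SNPs into one gene score."""
    values = [snp_stats[s] for s in snp_ids]
    if not values:
        return None  # no SNP mapped to this gene
    if method == "max":
        return max(values)
    if method == "mean":
        return sum(values) / len(values)
    raise ValueError("method must be 'max' or 'mean'")


# Hypothetical SNP positions (bp) and single-locus chi-square statistics.
positions = {"rs1": 900, "rs2": 1500, "rs3": 2600, "rs4": 5000}
stats = {"rs1": 1.0, "rs2": 6.0, "rs3": 2.0, "rs4": 9.0}

mapped = map_snps_to_gene(1000, 2000, positions, upstream=200, downstream=1000)
print(mapped)                               # SNPs inside [800, 3000]
print(snp_set_score(mapped, stats, "max"))  # maximum statistic
print(snp_set_score(mapped, stats, "mean")) # mean statistic
```

The weighted K-S-like variant is omitted here because it reuses the running-sum statistic described in 2.1.5.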
2.1.4 Gene association test
For each gene, we have two correlation scores both of which reflect the correlation of the gene with a
phenotype. One score is from an expression-based test and the other is from a SNP-based test. First, we
calculate a normalized correlation score. Given a dataset with N genes, the normalized correlation score
of gene i is defined by
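$$nc_i = \frac{c_i}{\sum_{j \in N} |c_j|}$$
where $c_i$ is the raw score of gene $i$ from either the expression-based association test or the SNP-based
association test, and the sum in the denominator runs only over genes $j$ whose raw score has the same sign as $c_i$
($c_j \ge 0$ if $c_i \ge 0$, and $c_j < 0$ if $c_i < 0$). Positive and negative scores are thus normalized separately, each by
the total absolute score of the genes with its own sign.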
The correlation score from the expression-based test has a sign indicating the direction of differential
expression: “+” for overexpression in class C1 , “-” for overexpression in class C2 . The SNP-based
association test does not have directionality. So we assign to the SNP-based test the same direction of
correlation to the phenotype as found from the expression-based test. (CHECK TO BE SURE THIS IS
CORRECT)


Finally, we take the sum of the two normalized correlation scores from the expression-based test and the SNP-based test as the correlation score of the gene.
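The following Python sketch illustrates one reading of this step. It assumes that each raw score is normalized by the total absolute score of the genes with the same sign, and that the unsigned SNP-based score inherits the sign of the expression-based score before normalization; the function names and toy scores are hypothetical.

```python
def normalize(scores):
    """Normalize each score by the total absolute score of same-sign genes."""
    pos_total = sum(c for c in scores if c >= 0)
    neg_total = sum(-c for c in scores if c < 0)
    normalized = []
    for c in scores:
        total = pos_total if c >= 0 else neg_total
        normalized.append(c / total if total else 0.0)
    return normalized


def gene_scores(expr_scores, snp_scores):
    """Combine expression-based and SNP-based evidence into one score per gene."""
    # Assumption: the SNP-based score takes the direction of the expression score.
    signed_snp = [s if e >= 0 else -s for e, s in zip(expr_scores, snp_scores)]
    return [a + b for a, b in zip(normalize(expr_scores), normalize(signed_snp))]


# Toy raw scores for four genes (hypothetical values).
expr = [2.0, -1.0, 1.0, -3.0]   # signed expression statistics
snp = [4.0, 2.0, 4.0, 6.0]      # unsigned SNP set statistics

combined = gene_scores(expr, snp)
print(combined)
```

Because each source is normalized to a total of one per direction, neither the expression data nor the SNP data dominates the combined score simply because of its scale.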
2.1.5 Gene set association test
We rank all of the genes in the dataset in descending order of their correlation scores. Then we evaluate
whether the genes within a gene set are enriched at the top or bottom of the ranked list of genes. The
statistic for gene set enrichment analysis is a weighted K-S-like running-sum statistic. Enrichment scores
are calculated by walking down the ranked list, increasing the statistic value when a gene is in the gene
set and decreasing it when it is not. The magnitude of the increment depends on the correlation of the
gene with the phenotype. The ES is the maximum deviation from zero encountered in walking the list.
Given a gene set $S$ with $N_H$ genes, the ES 3; 17 is defined by
$$ES(S) = \max_{1 \le i \le N} \left( \sum_{j \in S,\ j \le i} \frac{|r_j|^p}{N_R} - \sum_{j \notin S,\ j \le i} \frac{1}{N - N_H} \right), \qquad \text{where } N_R = \sum_{j \in S} |r_j|^p,$$
where $r_j$ is the correlation score of gene $j$, $N$ is the number of genes in the dataset, $N_H$ is the number
of genes in the gene set $S$, and $p$ is a parameter that gives higher weight to genes with extreme statistic
values; here we follow GSEA and set $p = 1$. If the gene set is not associated with the phenotypic class, the
genes within it should be randomly distributed over the ranked list of all genes. In this case, the ES will
be small. The ES will be high if the genes within the gene set are overrepresented at the top or bottom of
the ranked list, in which case the gene set is associated with the phenotype.
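The running-sum walk can be sketched in Python as follows. The sketch returns the signed maximum deviation from zero, so that enrichment at the bottom of the list yields a negative score; the gene names and scores are hypothetical, and the function is an illustration rather than the GSAA implementation.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Weighted K-S-like running-sum statistic over a ranked gene list.

    ranked_genes: genes sorted by descending correlation score.
    scores: dict mapping gene -> correlation score r_j.
    gene_set: set of genes S.
    """
    hits = [g in gene_set for g in ranked_genes]
    n, n_h = len(ranked_genes), sum(hits)
    n_r = sum(abs(scores[g]) ** p for g, hit in zip(ranked_genes, hits) if hit)
    if n_h == 0 or n_h == n or n_r == 0:
        raise ValueError("gene set must be a proper, non-trivial subset of the list")
    miss_step = 1.0 / (n - n_h)
    running, best = 0.0, 0.0
    for g, hit in zip(ranked_genes, hits):
        if hit:
            running += abs(scores[g]) ** p / n_r  # weighted increment
        else:
            running -= miss_step                  # uniform decrement
        if abs(running) > abs(best):
            best = running                        # keep the largest deviation from zero
    return best


# Six genes with hypothetical correlation scores.
scores = {"g1": 3.0, "g2": 2.0, "g3": 1.0, "g4": -1.0, "g5": -2.0, "g6": -3.0}
ranked = sorted(scores, key=scores.get, reverse=True)

print(enrichment_score(ranked, scores, {"g1", "g2"}))  # set at the top of the list
print(enrichment_score(ranked, scores, {"g5", "g6"}))  # set at the bottom of the list
```

A set scattered evenly through the list would keep the running sum close to zero throughout the walk.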
2.2 Assessment of statistical significance and adjustment for multiple hypothesis testing
We assess the statistical significance of the ES and adjust for multiple hypothesis testing based on a
phenotype-based permutation procedure since it can preserve linkage disequilibrium (LD) structure in
SNP data and gene-gene correlation structure in gene expression data. A nominal P-value of the ES is
calculated relative to a null distribution generated by shuffling the phenotypic class labels and
recalculating the ES many times.
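A phenotype permutation test of this kind can be sketched generically in Python. The sketch takes any statistic that is a function of the class labels, shuffles the labels, and reports the fraction of permutations whose statistic is at least as large as the observed one. It is one-sided and uses a simple mean difference as a stand-in for the ES; the function names, seed handling, and toy data are assumptions made for illustration.

```python
import random


def permutation_p_value(labels, statistic, n_perm=1000, seed=0):
    """Nominal p-value of statistic(labels) under label shuffling (one-sided)."""
    rng = random.Random(seed)
    observed = statistic(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)            # permute phenotype labels, keep data fixed
        if statistic(shuffled) >= observed:
            hits += 1
    return hits / n_perm


# Toy data: one measurement per sample, first four samples are cases.
values = [5.1, 4.9, 5.3, 5.0, 1.0, 1.2, 0.9, 1.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def mean_difference(lab):
    """Stand-in statistic: mean of cases minus mean of controls."""
    a = [v for v, l in zip(values, lab) if l == 1]
    b = [v for v, l in zip(values, lab) if l == 0]
    return sum(a) / len(a) - sum(b) / len(b)


p = permutation_p_value(labels, mean_difference, n_perm=2000)
print(p)
```

Because only the labels are shuffled, the correlation structure among genes and among SNPs is left intact in every permuted dataset.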
As in GSEA, we use the false discovery rate ( FDR ) and the family-wise error rate ( FWER ) to correct
for multiple hypothesis testing and control the proportion of false positives below a certain threshold.
The difference is that the calculation of the FDR and FWER in GSEA is based on the normalized
enrichment score ( NES ). We used the actual ES because our simulations indicated that the NES can
have problems in certain situations.
Given a gene set $S^*$ and a dataset $D$, the FDR is calculated as
$$\mathrm{FDR}(S^*) = \frac{\%\ \text{of all } (S, \pi) \text{ with } ES(S, \pi) \ge ES^*}{\%\ \text{of all } (S, D) \text{ with } ES(S, D) \ge ES^*}$$
where $\pi$ denotes a permutation, all $(S, \pi)$ denotes all gene sets against all permutations of the dataset,
all $(S, D)$ denotes all gene sets against the actual dataset $D$, and $ES^*$ is the enrichment score of gene set $S^*$.
The FWER is calculated as
$$\mathrm{FWER}(S^*) = \%\ \text{of all } \pi \text{ with } \max_{S} ES(S, \pi) \ge ES^*$$
where, as in GSEA, the maximum is taken over all gene sets $S$ within a single permutation $\pi$.
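Both quantities can be computed directly from a table of enrichment scores. The Python sketch below assumes positive enrichment scores, a "greater than or equal to" comparison, that the FWER uses the highest ES over all gene sets within each permutation (as in GSEA), and it caps the FDR at 1, which is a convenience added here; all names and numbers are hypothetical.

```python
def fdr_and_fwer(observed_es, permuted_es, target_es):
    """FDR and FWER for a gene set with enrichment score target_es.

    observed_es: ES of every gene set on the actual dataset.
    permuted_es: one list per permutation, holding the ES of every gene set.
    """
    n_null = sum(len(row) for row in permuted_es)
    null_frac = sum(es >= target_es for row in permuted_es for es in row) / n_null
    obs_frac = sum(es >= target_es for es in observed_es) / len(observed_es)
    # Assumption: cap at 1 and treat an empty denominator as FDR = 1.
    fdr = min(1.0, null_frac / obs_frac) if obs_frac > 0 else 1.0
    # FWER: share of permutations whose best gene set reaches the target ES.
    fwer = sum(max(row) >= target_es for row in permuted_es) / len(permuted_es)
    return fdr, fwer


# Four gene sets, four permutations (hypothetical scores).
observed = [0.8, 0.3, 0.2, 0.1]
permuted = [
    [0.20, 0.10, 0.30, 0.25],
    [0.90, 0.20, 0.10, 0.30],
    [0.40, 0.35, 0.20, 0.10],
    [0.30, 0.20, 0.10, 0.15],
]

fdr, fwer = fdr_and_fwer(observed, permuted, target_es=0.8)
print(fdr, fwer)
```

In this toy case 1 of 16 permuted scores and 1 of 4 observed scores reach 0.8, and 1 of 4 permutations has a maximum of at least 0.8.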
2.3 Gene set association analysis-SNP (GSAA-SNP)
In our software, we designed a separate module called GSAA-SNP which is used for gene set association
analysis solely based on SNP genotype data. In GSAA-SNP we removed the differential gene expression
test and kept the other components the same as GSAA.
2.4 Logistic regression-based gene set association analysis-SNP (LRGSAA-SNP)
Logistic regression-based approaches are widely used in association studies. In order to compare our
method with logistic regression-based approaches, we developed LRGSAA-SNP for logistic regression-based pathway analysis solely based on SNP data. The following is a brief description of the principle
behind LRGSAA-SNP:
Step1: Map all SNPs in the dataset to all genes in the genome, then calculate a chi-square statistic value
for each SNP which reflects its correlation with the phenotype;
Step2: Represent the gene by the SNP with the highest correlation statistic among all SNPs mapped to the
gene, then calculate a P-value for each gene based on a phenotype-based permutation procedure, and then
remove those genes with P-value larger than a cutoff threshold (0.05);
Step3: Logistic regression on the remaining genes. Here the response variable is the phenotypic class
label, and the explanatory variables are genotypes. We then obtained a regression coefficient for each
gene. A regression coefficient reflects the degree to which the gene is associated with the phenotypic
class.
Step4: Quantify the correlation of gene sets with phenotype based on these regression coefficients.
Specifically, we use the sum of the absolute values of regression coefficients of all genes in the gene set
as the correlation score of that gene set.
Step5: Permute the phenotypic class labels and then repeat step 3 and step 4 to generate a null distribution
of correlation score for each gene set, then calculate nominal P-value, FDR, and FWER using the same
method as GSAA.
3 Results
3.1 Simulation study
We conducted a comprehensive simulation study to evaluate the ability to identify causal gene sets for
four different pathway based approaches: 1) GSAA: joint association test based on gene expression and
SNPs; 2) GSEA: an expression based association test; 3) GSAA-SNP: a SNP based association test using
a weighted Kolmogorov-Smirnov (K-S)-like running-sum statistic; 4) LRGSAA-SNP: a SNP based
association test using logistic regression. The questions we want to address in the simulation study are 1)
whether we can increase the power of association tests by integrating expression and genotypic data into
pathway based approaches; 2) how parameters, such as sample size, disease model, magnitude of effect of
risk genes or genotypes or alleles, linkage disequilibrium (LD), and signal intensity in the gene set affect
the performance of the four approaches; 3) what are the advantages of GSAA compared with the other
three pathway based approaches.
3.1.1 Simulated dataset
We examine four sample sizes, 100, 200, 400, and 1200 with the same number of samples in two
phenotypic classes - case versus control for the simulation study. Simulated datasets of gene expression,
SNP, and gene sets were generated by the following criteria:
3.1.1.1 Gene expression dataset
Each gene expression dataset includes 1000 genes. Only the first 20 genes are causal genes. Gene expression
values were drawn from normal distributions. Three different scenarios were simulated:
1) Expression values of causal genes are drawn from N(10.5, 1) in the case group and from N(10, 1) for all other
genes in both groups;
2) Expression values of causal genes are drawn from N(10.3, 1) in the case group and N(10, 1) for all other genes in
both groups;
3) Expression values of causal genes are drawn from N(10.1, 1) in the case group and N(10, 1) for all other genes in
both groups.
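The three scenarios differ only in the mean of the causal genes in the case group, so they can be generated by one function. The Python sketch below assumes that causal genes in the control group follow N(10, 1) like all other genes, that the classes are of equal size with cases listed first, and that the second parameter of N(., 1) corresponds to a standard deviation of 1; the function name and seed are hypothetical.

```python
import random


def simulate_expression(n_samples, causal_mean, n_genes=1000, n_causal=20, seed=0):
    """Simulate a genes x samples expression matrix and its class labels.

    Labels: 1 = case, 0 = control, half of the samples in each class.
    """
    rng = random.Random(seed)
    half = n_samples // 2
    labels = [1] * half + [0] * (n_samples - half)
    data = []
    for g in range(n_genes):
        row = []
        for lab in labels:
            # Only causal genes in cases are shifted away from the baseline of 10.
            mu = causal_mean if (g < n_causal and lab == 1) else 10.0
            row.append(rng.gauss(mu, 1.0))
        data.append(row)
    return data, labels


# Scenario 1: causal genes are 0.5 units higher in cases.
data, labels = simulate_expression(n_samples=100, causal_mean=10.5)

case_vals = [v for row in data[:20] for v, l in zip(row, labels) if l == 1]
ctrl_vals = [v for row in data[:20] for v, l in zip(row, labels) if l == 0]
print(sum(case_vals) / len(case_vals) - sum(ctrl_vals) / len(ctrl_vals))
```

Scenarios 2 and 3 would be obtained by passing causal_mean=10.3 and causal_mean=10.1.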
3.1.1.2 SNP dataset
Each SNP dataset includes 1000 genes. The first 20 genes are causal genes. Each causal gene covers three
SNP markers, of which only the second is in LD with the disease variant. All other genes also have three
SNP markers, but none of them is in LD with the disease variant. Genotype data were generated by
SIMLA 22. We first generated genotype data for pedigrees and then took the proband of each pedigree to
form unrelated population samples. Parameters for SIMLA were calculated based on a susceptibility locus
rs17221417 in a published dataset of Crohn’s disease (CD) 23. This locus represents caspase recruitment
domain-containing protein 15 (CARD15, also called NOD2), which is the first confirmed CD-susceptibility gene 24; 25. Risk allele frequency is 0.287. Homozygote and heterozygote genotype relative
risk are 1.617 and 1.08, respectively. We set disease prevalence to 0.001985 according to a report 26.
(WHAT REPORT AND WHY?)
In this simulation study, we detect causal genes by indirect association, based on the LD between markers
and causal variants.
There are five different scenarios:
1) R2 between the disease variant and the second marker is 1 for all causal genes;
2) R2 between the disease variant and the second marker is 0.9 for all causal genes;
3) R2 between the disease variant and the second marker is 0.7 for all causal genes;
4) R2 between the disease variant and the second marker is 0.5 for all causal genes;
5) R2 between the disease variant and the second marker is 0.3 for all causal genes.
3.1.1.3 Gene sets
We generated 100 gene sets, each with 20 genes. Only the first gene set includes causal genes. In the
simulations, we assume that all causal genes differ from other genes in both expression and sequence
pattern. There are four different scenarios:
1) The first gene set includes 20 causal genes;
2) The first gene set includes 15 causal genes;
3) The first gene set includes 10 causal genes;
4) The first gene set includes 5 causal genes.
3.1.2 Simulation results
For each simulation, we carried out 2000 permutations of phenotype labels to compute p-values, FDR,
and FWER. In the first stage, we repeated 30 simulations for each scenario. We did not run LRGSAA-SNP for sample size 100 because in this case the number of explanatory variables exceeds the number of
samples. Usually, there are ~150 SNPs with p-value less than 0.05 in each simulated SNP dataset. The
means of the p-values, FDRs, and FWERs of the causal gene set over 30 simulations are reported in Tables S1-S8.
Power was calculated as the proportion of repetitions with a p-value for the causal gene set less than the
specified significance level (0.05). Power over 30 simulations is reported in Document S1. In the
second stage, we repeated 200 simulations only for scenario 3 of gene expression and scenarios 3 and 4
for SNPs, for a dominant model and sample sizes 200 and 400. The main goal of the second stage is to
more accurately compare the ability of different approaches to detect subtle association signals. Table 1
shows the mean of p-values, FDRs, and FWERs as well as power for causal gene sets over 200
simulations. In the second stage analysis, we discarded LRGSAA-SNP since it did not perform well in the
first stage.
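The power calculation described above reduces to counting repetitions below the significance level. A minimal Python sketch, with hypothetical p-values and a strict "less than" comparison as stated in the text:

```python
def empirical_power(p_values, alpha=0.05):
    """Proportion of repetitions whose causal gene set p-value is below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical p-values of the causal gene set over six repetitions.
p_values = [0.01, 0.20, 0.03, 0.049, 0.05, 0.60]
print(empirical_power(p_values))  # 0.05 itself does not count as significant
```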
Table 1. Comparison of results of GSAA, GSEA, and GSAA-SNP under a dominant disease model for 200
repetitions when the expression level of risk genes is 0.1 unit higher

        GSAA                              GSEA                              GSAA-SNP
PORG*   P       FDR     FWER    Power     P       FDR     FWER    Power     P       FDR     FWER    Power
Sample size = 200 & R2 = 0.7
1       0.0199  0.2218  0.2241  0.91      0.0377  0.3108  0.335   0.845     0.0148  0.2876  0.3171  0.905
0.75    0.0691  0.4053  0.4407  0.775     0.107   0.5847  0.6263  0.59      0.0422  0.478   0.5241  0.775
0.5     0.207   0.6844  0.7364  0.4       0.2983  0.7696  0.8431  0.29      0.1078  0.705   0.7597  0.535
0.25    0.4481  0.9242  0.95    0.145     0.5277  0.9135  0.9663  0.085     0.2862  0.8995  0.9409  0.175
Sample size = 200 & R2 = 0.5
1       0.0333  0.2698  0.2795  0.875     0.0377  0.3108  0.335   0.845     0.0737  0.6198  0.6825  0.69
0.75    0.0883  0.4722  0.4989  0.705     0.107   0.5847  0.6263  0.59      0.1186  0.6994  0.7939  0.57
0.5     0.2371  0.7169  0.7738  0.385     0.2983  0.7696  0.8431  0.29      0.2272  0.8642  0.9159  0.285
0.25    0.4945  0.9214  0.9514  0.115     0.5277  0.9135  0.9663  0.085     0.3482  0.9081  0.9507  0.17
Sample size = 400 & R2 = 0.7
1       0.0003  0.0106  0.0129  1         0.0033  0.0693  0.0758  0.98      0       0.0028  0.0027  1
0.75    0.0039  0.0815  0.0812  0.975     0.0322  0.3104  0.3428  0.86      0.0007  0.0416  0.043   1
0.5     0.0336  0.3233  0.3454  0.85      0.1373  0.6675  0.7179  0.49      0.0141  0.2572  0.2563  0.93
0.25    0.2486  0.734   0.779   0.375     0.4333  0.8813  0.9322  0.15      0.1011  0.7203  0.7477  0.58
Sample size = 400 & R2 = 0.5
1       0.0008  0.0248  0.0249  1         0.0033  0.0693  0.0758  0.98      0.0018  0.0745  0.0725  0.995
0.75    0.0106  0.1288  0.1386  0.955     0.0322  0.3104  0.3428  0.86      0.0075  0.2297  0.2379  0.965
0.5     0.0689  0.4324  0.4736  0.74      0.1373  0.6675  0.7179  0.49      0.0511  0.5502  0.5755  0.745
0.25    0.3007  0.8     0.8364  0.31      0.4333  0.8813  0.9322  0.15      0.1919  0.8286  0.8649  0.37

* Percentage of risk genes in the gene set.
Results from 200 repetitions and 30 repetitions show a consistent trend. Compared with GSEA and
GSAA-SNP, GSAA increases the power to detect association signals by integrating expression-based and
SNP-based association tests into a single statistical framework. The gain is most evident when association
signals are subtle. GSAA gives higher scores to genes with significant alterations in both expression
profile and sequence, which helps identify real associations between genes and the phenotype being
studied. Mutations in both coding sequences and regulatory regions can cause human genetic disease.
Variation in regulatory regions is a common primary mechanism driving changes in gene expression, so for
some important disease-related genes, aberrations in both expression level and sequence are detectable.
Some investigations have demonstrated that eSNPs, expression quantitative trait loci (eQTL), and
differentially expressed genes are more likely to be detected as disease variants in association studies 27-29.
A joint test of expression variation and genotypic variation can greatly increase the ability to pinpoint
causal genes and pathways. In addition, gene expression is a dynamic process that can be influenced by
numerous genetic and environmental factors. Sequence variation is relatively stable, so incorporating
genotypic information can reduce the effect of noise inherent in gene expression data on the inference of
associations. Conversely, differences in expression levels can help filter out spurious signals in genotypic
data. In altered pathways, some genes may harbor non-synonymous variants that alter the amino acid
sequence of the encoded protein and thus its structure and activity. Most of these variants have a
functional effect on the phenotype. Some of these causal genes may not show detectable differences in
expression level, so expression-based approaches may miss them. GSAA, however, can capture these
association signals through the SNP-based test. Although each simulated SNP dataset contains ~150 SNPs
with p-value less than 0.05, of which ~20 are located within causal genes, in most situations GSAA
accurately picks out the causal gene set from 99 random gene sets, and the p-value, FDR, and FWER of the
causal gene set are much smaller than those of the other gene sets (see Supplemental ###).
Sample size, disease model, and the magnitude of the effect of risk gene sets, genes, or genotypes are
important factors that affect the performance of GSAA. As expected, a larger sample size results in greater
statistical reliability. In simulations, every method achieves a power of 100% at sample size 1200 under
the dominant disease model when the association signal in the gene set is strong. Even for the weakest
association signals, namely when the expression level of risk genes is 0.1-unit higher than other genes and
R2 between the disease locus and the second marker of risk genes is 0.3, GSAA maintains 100% power as
long as 50% or more of the genes in the gene set are causal (Figure 2A). At sample size 100, the power of
GSAA-SNP decreases to 70% when R2 is 0.9 and all genes in the gene set are causal (Figure 2B). However,
by combining information from the expression-based and SNP-based tests, GSAA still has power to
detect real associations. In simulations, we used dominant and recessive disease models. It is more
difficult to detect associations under the recessive model. In this case GSAA has an obvious advantage
because it borrows information from gene expression, which is not directly influenced by the disease model.
Interestingly, LRGSAA-SNP lacks power to detect any level of association under the recessive model.
Three levels of differential gene expression, five levels of LD, and four levels of signal intensity in the
risk gene set were designed to evaluate the relationship between the effect size of risk gene sets, genes, or
genotypes and the performance of GSAA. Detailed results are reported in Tables S1-S8 and
Document S1. Overall, power increases with increasing effect size.
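
The power values discussed above can be read as the fraction of simulation replicates in which the causal gene set is declared significant. The following is a minimal sketch of that calculation, assuming power is computed from the nominal p-value of the causal gene set at the 0.05 level; the function name and the example p-values are hypothetical and are not taken from the study.

```python
def empirical_power(p_values, alpha=0.05):
    """Fraction of simulation replicates whose p-value falls below alpha."""
    if not p_values:
        raise ValueError("need at least one replicate")
    hits = sum(1 for p in p_values if p < alpha)
    return hits / len(p_values)


# Hypothetical p-values of the causal gene set in five replicates.
replicate_p = [0.001, 0.020, 0.049, 0.300, 0.700]
power = empirical_power(replicate_p)
print(power)
```

With 200 replicates, as in Table 1, each replicate contributes 0.005 to the estimated power, which is why the reported values are multiples of 0.005.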
[Figure 2: line plots of power (y-axis, 0 to 1) against the percentage of risk genes in the gene set
(x-axis, 0 to 1).
Panel A: sample size=1200; 0.1-unit upregulation in gene expression; dominant disease model; R2=0.3;
curves for GSAA, GSEA, GSAA_SNP, and LRGSAA_SNP.
Panel B: sample size=100; 0.3-unit upregulation in gene expression; dominant disease model; R2=0.9;
curves for GSAA, GSEA, and GSAA_SNP.]
Figure 2. Power at the 0.05 significance level for the association test of gene sets by four different
pathway based approaches. (A) power at R2=0.3, sample size=1200, 0.1-unit up-regulation in gene
expression under dominant disease model, (B) power at R2=0.9, sample size=100, 0.3-unit up-regulation
in gene expression under dominant disease model.
Logistic regression-based approaches are widely used in association studies. We developed LRGSAA-SNP
to test the power of logistic regression in pathway-based approaches and to compare it with the three
K-S based approaches. The results indicate that LRGSAA-SNP is very sensitive to sample size. It reaches a
power of 100% only at sample size 1200, with 600 cases and 600 controls, under the dominant disease
model. In simulations, the three K-S based approaches have greater power than LRGSAA-SNP in
identifying causal pathways. They can even detect very subtle signals, with 0.1-unit up-regulation in the
expression of risk genes or R2=0.3 between marker and causal variant, when the sample size is large
enough. This demonstrates that our method is valuable for pinpointing the causes of complex diseases in
which more subtle changes may dominate the disease process 1. In addition, compared with logistic
regression-based approaches, GSAA is more practical for genome-wide association analysis of gene sets
on high-density SNP arrays. In simulations, GSAA shows superior power at sample size 200 under the
dominant model in most scenarios. More importantly, the power of GSAA is not influenced by the number
of SNPs in the SNP dataset, whereas the number of SNPs can profoundly affect the performance of
logistic regression-based methods. The number of explanatory variables increases with the number of
SNPs studied, which in turn requires a larger sample size. Peduzzi et al. 30 demonstrated that the number
of events/samples per variable should be at least 10 for optimal performance of logistic regression. This
means that LRGSAA-SNP needs at least 1500 and 15000 samples to gain enough power for reliable
association tests on two SNP datasets that include 150 and 1500 possible risk SNPs with p-value < 0.05,
respectively, while GSAA needs only 200~400 samples to reach the same power for both datasets. In fact,
the number of possible risk SNPs in a real dataset from a high-density SNP array, for example the
Genome-Wide Human SNP Array 6.0, may be much larger than 1500. This makes genome-wide analysis of
gene sets using LRGSAA-SNP almost infeasible.
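
The sample-size argument above is simple arithmetic on the per-variable rule of thumb. The sketch below reproduces the two figures quoted in the text; the function name is hypothetical, and it follows the text in treating the rule as a count of samples per variable. If the rule is instead read strictly as events (cases) per variable, a balanced case-control design would need roughly twice as many samples.

```python
def min_samples(n_variables, per_variable=10):
    """Smallest sample count that satisfies the per-variable rule of thumb."""
    if n_variables < 0 or per_variable <= 0:
        raise ValueError("counts must be non-negative and the ratio positive")
    return n_variables * per_variable


# The two SNP dataset sizes discussed in the text.
for n_snps in (150, 1500):
    print(n_snps, "candidate SNPs ->", min_samples(n_snps), "samples")
```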
3.2 Application to data from The Cancer Genome Atlas (TCGA) pilot project
We applied our method to a published dataset of human glioblastoma 31, the most common type of primary
adult brain cancer, for which both gene expression data and SNP genotype data are available from the TCGA
pilot project (http://cancergenome.nih.gov/). The expression dataset includes 258 tumor samples and 11
normal samples, and the SNP dataset has 205 tumor samples and 89 normal samples. We defined the
association region of each gene as the gene region plus 1 kilobase (kb) upstream of the transcription start
site (TSS). Although 1 kb upstream of the TSS might be insufficient to cover all possible regulatory
regions for all genes, it can reasonably be expected to include both the core and proximal promoters and
at least part of the distal promoter 32. Two collections of gene sets from the Molecular Signatures Database
(MSigDB, http://www.broadinstitute.org/gsea/msigdb/index.jsp) were used in this analysis. One includes
639 canonical pathways; the other has 1454 gene sets derived from the Gene Ontology (GO) project.
Gene sets with fewer than 15 genes were excluded from our analysis. We used 10000 permutations to
assess the statistical significance of the enrichment scores of gene sets and to adjust for multiple
hypothesis testing.
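
The two procedural steps in the paragraph above, dropping small gene sets and deriving significance from permutations, can be sketched as follows. This is an illustration under stated assumptions rather than the software's implementation: the function names are hypothetical, the p-value is the plain one-sided proportion of permuted scores at least as large as the observed score, and some implementations add 1 to both numerator and denominator to avoid a p-value of exactly zero.

```python
def filter_gene_sets(gene_sets, min_size=15):
    """Keep only gene sets with at least min_size genes."""
    return {name: genes for name, genes in gene_sets.items()
            if len(genes) >= min_size}


def permutation_p_value(observed, permuted):
    """One-sided empirical p-value: share of permuted scores >= observed."""
    if not permuted:
        raise ValueError("need at least one permuted score")
    exceed = sum(1 for score in permuted if score >= observed)
    return exceed / len(permuted)


# Toy inputs (hypothetical): one set is too small and is dropped.
toy_sets = {"SET_A": list(range(20)), "SET_B": list(range(5))}
kept = filter_gene_sets(toy_sets)
p = permutation_p_value(0.8, [0.1, 0.5, 0.9, 0.3])
print(sorted(kept), p)
```

With 10000 permutations the smallest nonzero p-value this estimator can return is 0.0001.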
3.2.1 Significant pathways
In the dataset of canonical pathways, we identified 16 pathways enriched in tumor samples with
FDR≤0.25 (Table 2). The most significant pathway is P53PATHWAY, which contains 15 genes
(Document S1). Four genes, tumor protein p53 (TP53), retinoblastoma 1 (RB1), E2F transcription
factor 1 (E2F1), and mouse double minute 2 homolog (MDM2), are significant in the SNP-based test.
Six genes, TP53, RB1, cyclin-dependent kinase 2 (CDK2), cyclin-dependent kinase 4 (CDK4),
proliferating cell nuclear antigen (PCNA), and cyclin-dependent kinase inhibitor 1A (CDKN1A, also called
p21), are highly differentially expressed between tumor and normal samples. All of these genes have been
experimentally confirmed to have a role in the development and/or progression of glioblastoma 33-41.
Among them, TP53 and RB1 are well-known key players. Our analysis indicates that these two genes are
significantly associated with tumor samples in both the gene expression-based test and the SNP-based test.
Through the integrative platform, we can examine both gene expression and nucleotide sequence
aberrations in a gene set and thus better understand the mechanisms driving phenotypic change in
disease. Our method can also highlight key genes with significant changes in both gene expression
and sequence, such as TP53 and RB1.
Table 2. Most significant canonical pathways enriched in tumor samples with FDR≤0.25

Gene set name                                 Size  Nominal P  FDR
P53PATHWAY                                    15    0.021      0.130
RELAPATHWAY                                   16    0.013      0.157
CASPASEPATHWAY                                22    0.018      0.165
ATRBRCAPATHWAY                                18    0.097      0.169
HSA04115_P53_SIGNALING_PATHWAY                60    0.007      0.182
G1PATHWAY                                     23    0.063      0.190
HSA03030_DNA_POLYMERASE                       21    0.124      0.205
DNA_REPLICATION_REACTOME                      43    0.199      0.209
G2PATHWAY                                     21    0.059      0.216
CELL_CYCLE_KEGG                               78    0.069      0.223
G1_TO_S_CELL_CYCLE_REACTOME                   61    0.150      0.249
TIDPATHWAY                                    17    0.070      0.249
TNFR2PATHWAY                                  18    0.080      0.249
MITOCHONDRIAPATHWAY                           20    0.023      0.249
STATIN_PATHWAY_PHARMGKB                       16    0.026      0.250
HSA04610_COMPLEMENT_AND_COAGULATION_CASCADES  64    0.028      0.250
For detailed results, see Table S9 and Table S10 in the supplementary materials.
In the dataset of GO gene sets, 51 gene sets are associated with tumor samples at nominal P<0.05
(Table S11), but after correction for multiple hypothesis testing none of these associations reaches
significance (FDR≤0.25). In contrast, 40 gene sets are significantly enriched in normal samples at
FDR≤0.25 (Table S12). The top three gene sets are VOLTAGE_GATED_CALCIUM_CHANNEL_COMPLEX
(FDR=0.002), VOLTAGE_GATED_CALCIUM_CHANNEL_ACTIVITY (FDR=0.004), and
CALCIUM_CHANNEL_ACTIVITY (FDR=0.01), implicating a possible deregulation of voltage-gated
calcium channels in glioblastoma.
3.2.2 Genes enriched in the leading edge subsets of top 20 pathways
We noticed that some genes play roles in multiple pathways. We are particularly interested in these hub
genes for three reasons: 1) major-effect hub genes are important contributors to the enrichment scores of
multiple top-ranked pathways, and they may be the main reason why these pathways exhibit significant
association with the phenotype; 2) some hub genes may not be major-effect risk genes in any single
pathway, but their cumulative effect across multiple pathways makes them critical to the phenotype
because they connect multiple phenotype-associated pathways, and alteration of these genes may affect
the phenotype in multiple respects; 3) multi-pathway analysis can increase the likelihood of detecting
true phenotype-associated genes, because current pathway annotation is neither fully accurate nor
complete, so multi-pathway players may represent more reliable signals and most likely have real
functions in the phenotype being studied. To identify a core set of hub genes involved in glioblastoma,
we performed leading edge analyses for the top 20 pathways and then counted the number of occurrences
of each gene in the leading edge subsets of these pathways. Figures S1 and S2 (see also Tables S13-S14)
show the distribution of the number of occurrences, the correlation scores in the gene expression-based
test, and the P values in the SNP-based test for hub genes appearing in the leading edge subsets of at
least two top pathways. We identified two sets, containing 46 and 51 genes, for the datasets of canonical
pathways and GO gene sets, respectively. We then evaluated whether these two sets of hub genes include
core components of the major pathways altered in human glioblastoma by comparing them with the genes
of three core pathways, p53 signaling, RB signaling, and RTK/RAS/PI(3)K signaling, which were defined
in a recent paper based on a comprehensive analysis of DNA copy number, gene expression, DNA
methylation, and nucleotide sequence aberrations 31. We found that genes in the p53 signaling and RB
signaling pathways are indeed enriched in our core set of hub genes. Our core set covers 7 of the 10 genes
in these two pathways: cyclin-dependent kinase inhibitor 2A (CDKN2A), MDM2, TP53, cyclin-dependent
kinase inhibitor 2C (CDKN2C), CDK4, cyclin D2 (CCND2), and RB1. In addition, the activation and
sequence aberrations of epidermal growth factor receptor (EGFR) in RTK/RAS/PI(3)K signaling were
confirmed by our joint association test of gene expression profiles and SNP genotypes. Surprisingly, we
observed that genes involved in the apoptosis signaling pathway are strikingly enriched in the core set,
indicating that deregulation of the apoptosis signaling pathway could be a major factor leading to the
development of glioblastoma.
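
The hub-gene count described above amounts to tallying, for each gene, how many leading edge subsets it appears in and keeping genes that reach a threshold. A minimal sketch follows; the function name is hypothetical and the toy leading edge memberships are invented for illustration, not results from this analysis.

```python
from collections import Counter


def hub_genes(leading_edges, min_pathways=2):
    """Count pathway occurrences per gene and keep genes seen in enough pathways."""
    counts = Counter()
    for genes in leading_edges.values():
        counts.update(set(genes))  # count each gene once per pathway
    return {gene: n for gene, n in counts.items() if n >= min_pathways}


# Toy leading edge subsets (hypothetical memberships).
leading_edges = {
    "P53PATHWAY": ["TP53", "MDM2", "RB1"],
    "G1PATHWAY": ["RB1", "CDK4", "TP53"],
    "CASPASEPATHWAY": ["CASP3", "CASP8"],
}
hubs = hub_genes(leading_edges)
print(hubs)
```

The threshold of two pathways mirrors the "at least two top pathways" criterion in the text; raising it yields a smaller, more conservative core set.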
3.2.3 Suggested new core pathway in glioblastoma
Our pathway-based association analysis not only confirmed the critical roles of the three core pathways
revealed by the Cancer Genome Atlas Research Network but also uncovered a possible new core pathway
in glioblastoma, which is even more significant than those three in our results. Our analysis indicates that
this new pathway may include at least three members: coagulation factor II receptor (F2R), more
commonly called protease-activated receptor 1 (PAR1), caspase 8 (CASP8), and caspase 3 (CASP3). We
call this pathway PAR/CASP signaling. In our statistical test, PAR1 is the gene most significantly altered
in both expression and sequence in tumor samples. It obtained the highest correlation score in the
expression-based test and the lowest P value (P<10^-15) in the SNP-based test among more than 7000 genes
(Tables S15-S18). It also occurs in seven of the leading edge subsets of the top 20 GO gene sets. PAR1 is
expressed throughout the brain on neurons and astrocytes and has important roles in the central nervous
system (CNS) 42. PAR1 is also functional in human glioblastoma cells 42; 43. In addition to its well-known
roles in coagulation and hemostasis, PAR1 has been found to induce or inhibit apoptosis depending on the
dosage of its physiological agonist thrombin, and it can mediate anti-apoptotic signaling in the nervous
system through interaction with activated protein C (APC) and caspases (more likely CASP8 and
CASP3) 44-47. Junge et al. reported that human glioblastoma cells respond to PAR1 activation by increasing
intracellular Ca2+ 42. Our results indeed indicate that three calcium channel pathways are significantly
altered (Table S4). Interestingly, most genes of the calcium channel, voltage-dependent (CACN) family in
these pathways are down-regulated in tumor samples. This might be an adaptation to the increase in Ca2+
triggered by the activation of PAR1. The genotypes of these genes also differ markedly between tumor and
normal samples. Panner et al. 48 demonstrated the involvement of the T-type Ca2+ channel in the
proliferation of both glioma cells and neuroblastoma cells. In addition, several investigations observed a
link between thrombin and glioma pathogenesis 49-51.
Glioblastomas are the most malignant brain tumors and are characterized by cellular resistance to
apoptosis and a highly invasive growth pattern. Failure of apoptosis is one of the main contributors to
tumorigenesis, and caspases play essential roles in apoptosis, necrosis, and inflammation 52; 53. Georg et al. 54
found that human glioblastoma cells exhibit constitutive activation of caspases in vivo and in vitro.
The inhibition of CASP3 and CASP8 decreases the migration and invasiveness of glioma cells. In our
analysis, the expression of CASP3 and CASP8 is markedly up-regulated in tumor samples. CASP8
shows a significant change in sequence as well (P<10^-15). We also noticed the activation of other caspases:
CASP1, CASP4, CASP6, and CASP7. In addition, CASPASEPATHWAY is significantly associated with
tumor samples in the dataset of canonical pathways, and the
POSITIVE_REGULATION_OF_CASPASE_ACTIVITY pathway is enriched in tumor samples in
the dataset of GO gene sets. CASP3 and CASP8 are involved in 9 and 8 of the top 20 pathways, respectively.
Other genes in the apoptosis signaling pathway, for example Fas-associated via death domain (FADD),
tumor necrosis factor receptor superfamily, member 6 (TNFRSF6 or FAS), and BCL2-associated X protein
(BAX), are also enriched in the leading edge subsets of the top 20 pathways. These results suggest the
deregulation of signal transduction pathways involved in apoptosis in glioblastoma. PAR1 and caspases
may be important components of this deregulation. PAR1 is an upstream regulator of caspase activity, so
our analysis may have uncovered a correlation between PAR1 and caspases, as well as their collaboration
in mediating an anti-apoptotic signal leading to disease. Since the mechanism underlying the strong
resistance of glioma cells to apoptosis is only partly understood, our analysis could provide new insights
into the pathogenesis of glioblastoma and suggest new therapeutic targets for its treatment. Whether
PAR/CASP signaling has an important role in the initiation and/or progression of glioblastoma merits
further experimental investigation. The Cancer Genome Atlas Research Network did not mention the
caspase pathway or caspases in their original publication, so our pathway-based approach can complement
other canonical methods, providing a more powerful statistical framework to detect and analyze the
variation in gene expression, sequence, or both that leads to glioblastoma and other complex diseases.
4 Discussion
Genome-wide gene expression profiling and genotyping offer unparalleled opportunities to pinpoint
genomic and genetic determinants of complex traits or diseases and to elucidate the interaction network
among them at the genome level. Integrating these two distinct but complementary types of data into a
single analysis can enhance genuine association signals and increase statistical power for pathway-based
association tests. It can also suggest cis- and trans-regulatory variation associated with expression
variation of genes in the altered pathway and offer a deeper understanding of complex diseases. In this
study, we developed a novel statistical framework to integrate gene expression data and SNP data into
genome-wide association analysis of gene sets. The simulations indicated that the integrative analysis can
indeed increase the ability to detect real association signals. When applied to a real dataset, our method
not only confirmed the association findings of other canonical methods but also identified a new
candidate pathway significantly altered in glioblastoma, which may suggest a potential core mechanism
leading to the disease. Results from both simulated and real data demonstrate that integrating various
types of -omics data into a single statistical framework for gene set association analysis is a promising
direction that is well worth further study. Here we discuss some open questions concerning our
methodology and implementation, and point out where there is room for improvement.
4.1 Genotypic association test and allelic association test
In our software, three options for single-locus association analysis are available: the genotype-based
chi-square test, the allele-based chi-square test, and the frequency difference of major/minor alleles
between cases and controls. In this paper we used only the genotype-based chi-square two-sample test
statistic to determine the genotype-phenotype correlation of each SNP. The simulations show that in GSAA
the genotypic association test gives better results than the allelic association test (Table 3). A possible
explanation is that genotypes represent individuals, and traits are expressed at the individual level. The
genotypic test can capture the interaction between the two alleles and more accurately assess their joint
effect at a single locus.
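
To make the distinction between the two chi-square options concrete, the sketch below computes a Pearson chi-square statistic on a 2x3 genotype table (cases and controls by the three genotypes, 2 degrees of freedom) and on the 2x2 allele table derived from it (1 degree of freedom). The function names and the counts are hypothetical, and this is a textbook Pearson statistic, not necessarily the exact form used in the software.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def allele_table(genotype_table):
    """Collapse genotype counts (AA, Aa, aa) per group into allele counts (A, a)."""
    return [[2 * hom_a + het, het + 2 * hom_b]
            for hom_a, het, hom_b in genotype_table]


# Hypothetical genotype counts: first row cases, second row controls.
genotypes = [[30, 50, 20], [50, 40, 10]]
geno_stat = chi_square(genotypes)                   # 2 degrees of freedom
allele_stat = chi_square(allele_table(genotypes))   # 1 degree of freedom
print(round(geno_stat, 4), round(allele_stat, 4))
```

The allele table doubles the apparent sample size by counting chromosomes rather than individuals, which is one way to see why the two tests can rank SNPs differently.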
Table 3. Comparison of GSAA results based on genotypic test and allelic test under dominant disease
model when expression level of risk genes is 0.3-unit higher and sample size is 400
R2=0.9
R2=0.7
Genotypic test
Allelic test
Genotypic test
Allelic test
PORG*
P
FDR
FWER
P
FDR
FWER
PORG*
P
FDR
FWER
P
FDR
FWER
1
<10-15
<10-15
<10-15
<10-15
<10-15
<10-15
1
<10-15
<10-15
<10-15
<10-15
<10-15
<10-15
0.75
<10-15
<10-15
<10-15
<10-15
<10-15
<10-15
0.75
<10-15
<10-15
<10-15
<10-15
<10-15
<10-15
0.5
<10-15
<10-15
<10-15
<10-15
0.00006 0.00007
0.5
<10-15
<10-15
0.00022
0.00025
0.01201 0.20204 0.19237
0.25
0.23606
0.25718
0.25
0.00577 0.18854
0.17933
0.00004 0.00005
0.01212 0.24715 0.25242 0.01191
For detailed results, see Table S19 in the supplementary materials.
4.2 SNP-gene mapping
A fundamental difference between SNP data and gene expression data is that in gene expression data the
gene is the smallest unit carrying association information. In SNP data, a gene usually covers
multiple SNPs, and we need to evaluate the joint effect of the set of SNPs mapped to the gene to determine
the association. The first step is therefore to map SNPs to genes. At present it is almost infeasible to
determine exactly how many SNPs can affect a particular gene and where they are in the genome. There is
no clear-cut boundary defining the regulatory region of a gene, since some enhancers and repressors may
be far away from the target gene. In addition, the LD block surrounding each trait-associated variant
varies in pattern and length, and may range from 0 kb to 500 kb. Wang et al. mapped each SNP to the
closest gene 17. Peng et al. used all SNPs within a gene to represent that gene 18. Different projects may
have different requirements for the association region related to a gene. Some researchers may want to
use only genic variants for association analysis, since genic variants are more likely to affect disease risk
than variants located outside genes, while others prefer to extend the gene region to incorporate
regulatory regions in order to examine the contribution of cis- and trans-regulatory variants to disease
susceptibility. In our software we did not fix the mapping criteria, and instead let users determine the
mapping regions. Two options are available to specify how many base pairs upstream and/or downstream
of a gene will be included to establish the association region of that gene. We think
this method allows for flexibility across different projects.
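
The user-defined mapping region described above can be sketched as a simple window test. Everything in this example is an assumption made for illustration: the function name, the input layout, and the toy coordinates are hypothetical, and the actual option names in the software are not given in the text. The sketch treats "upstream" relative to the gene's strand, so for a minus-strand gene the upstream window extends past the larger coordinate.

```python
from collections import defaultdict


def map_snps_to_genes(snps, genes, upstream=1000, downstream=0):
    """Assign each SNP to every gene whose extended region contains it.

    snps:  {snp_id: (chrom, position)}
    genes: {gene: (chrom, start, end, strand)} with start <= end
    """
    mapping = defaultdict(list)
    for gene, (chrom, start, end, strand) in genes.items():
        if strand == "+":
            lo, hi = start - upstream, end + downstream
        else:  # minus strand: the TSS is at the larger coordinate
            lo, hi = start - downstream, end + upstream
        for snp_id, (snp_chrom, pos) in snps.items():
            if snp_chrom == chrom and lo <= pos <= hi:
                mapping[gene].append(snp_id)
    return dict(mapping)


# Toy coordinates (hypothetical).
genes = {"GENE_A": ("1", 5000, 8000, "+"), "GENE_B": ("1", 20000, 24000, "-")}
snps = {"rs1": ("1", 4200), "rs2": ("1", 6000), "rs3": ("1", 24500),
        "rs4": ("1", 19500), "rs5": ("2", 6000)}
mapping = map_snps_to_genes(snps, genes, upstream=1000, downstream=0)
print(mapping)
```

The nested loop is quadratic and only suitable for small inputs; a genome-wide run would normally sort positions per chromosome or use an interval index.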
4.3 SNP set association test
In SNP data, each gene is represented by a varying number of SNPs. There is still no consensus on the best
way to assess the joint contribution to the association of a set of SNPs mapped to the same gene. Two
association patterns may exist in the association region of a gene: 1) the region harbors only one risk
variant; 2) the region harbors multiple risk variants that contribute independently to the overall
association signal. We use a maximum statistic, which assigns the highest correlation score among all
SNPs mapped to the gene as the correlation score of the gene. Compared with test statistics that combine
correlation scores or p-values across all SNPs in the region into a single correlation score or p-value, the
maximum statistic more effectively limits the negative effects on association inference of the correlation
structure between SNPs, of SNPs not associated with the trait of interest, and of differences in SNP set
size. The maximum statistic should be the best way to measure association signals when the region
harbors one risk variant, because multiple markers may be in strong LD with the risk variant. Clearly,
this statistic cannot accurately capture the overall association information in the second situation, where
multiple independent risk variants coexist. In this case, however, GSAA can borrow information from
differential expression to compensate for the loss of information in the SNP-based test. With the help of
gene expression, the maximum statistic may therefore be a good tradeoff between the two patterns of
SNP set association in a gene.
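
The maximum statistic described above reduces each gene's SNP set to a single score by taking the largest SNP-level score. A minimal sketch follows, assuming larger scores indicate stronger association; the function name and the toy scores are hypothetical.

```python
def gene_scores(snp_scores, gene_to_snps):
    """Score each gene by the highest score among the SNPs mapped to it."""
    scores = {}
    for gene, snp_ids in gene_to_snps.items():
        values = [snp_scores[s] for s in snp_ids if s in snp_scores]
        if values:  # genes with no scored SNP are left out
            scores[gene] = max(values)
    return scores


# Toy SNP-level scores, e.g. chi-square statistics (hypothetical values).
snp_scores = {"rs1": 1.2, "rs2": 9.4, "rs3": 3.3}
gene_to_snps = {"GENE_A": ["rs1", "rs2"], "GENE_B": ["rs3"], "GENE_C": ["rs9"]}
scores = gene_scores(snp_scores, gene_to_snps)
print(scores)
```

Because only the top SNP matters, adding unassociated SNPs to a gene's region cannot lower its score, although it can raise the score by chance; permutation-based significance, as used in the paper, is one way to account for that.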
4.4 Pathway annotation
One of the merits of gene set association analysis is that it takes advantage of prior knowledge of
biological pathways. Pathway analyses measure associations of multiple genes simultaneously, so they are
well suited to situations where small coordinated changes in a pathway contribute to the overall
association signal and result in a significant biological effect. In addition, the genes in a pathway are
biologically related, which helps in interpreting the results and can produce more coherent results across
different experimental platforms or strategies. However, this dependence on a priori knowledge is also a
limitation, since the analysis is restricted by current knowledge and its biases: inaccurate or incomplete
pathway information can lead to inaccurate association inferences. Moreover, the gene sets we use are
heterogeneous, and we lack knowledge of the interaction patterns among the components of a gene set.
Gene sets can be derived from transcriptional modules, metabolic pathways, clusters of co-expressed
genes, or even groups of genes tightly linked on a chromosome. This heterogeneity makes it difficult to
establish a unified strategy for describing the knowledge structure within a gene set in order to improve
current algorithms. A method that identifies the substructure of pathways and classifies them into
categories may therefore be needed to optimize the performance of gene set association analyses.
Fortunately, pathway annotation is becoming increasingly accurate as knowledge of biological processes
accumulates, which should continue to increase the power of gene set association tests.
4.5 Normalization and multiple hypothesis testing
The normalized enrichment score (NES) is the primary statistic for examining gene set enrichment results
in GSEA. GSEA calculates the NES by dividing the actual ES by the mean of the ES values obtained over
all permutations of the dataset for a given gene set. GSEA uses the NES to account for differences in gene
set size and in correlations between gene sets and the expression dataset, and to calculate the false
discovery rate (FDR). However, in simulations we found that the NES behaves poorly under certain
conditions. We designed 10 gene sets, each with 10 genes. Gene sets 1 through 10 include 10, 9, 8, 7, 6, 5,
4, 3, 2, and 1 causal genes, respectively. We then used these 10 gene sets to test GSEA on various
simulated gene expression datasets. There are ten different scenarios:
1) Expression values of causal genes are drawn from N(10.1, 1) in the case group, others are drawn from N(10, 1);
2) Expression values of causal genes are drawn from N(10.3, 1) in the case group, others are drawn from N(10, 1);
3) Expression values of causal genes are drawn from N(10.5, 1) in the case group, others are drawn from N(10, 1);
4) Expression values of causal genes are drawn from N(11, 1) in the case group, others are drawn from N(10, 1);
5) Expression values of causal genes are drawn from N(11.5, 1) in the case group, others are drawn from N(10, 1);
6) Expression values of causal genes are drawn from N(12, 1) in the case group, others are drawn from N(10, 1);
7) Expression values of causal genes are drawn from N(12.5, 1) in the case group, others are drawn from N(10, 1);
8) Expression values of causal genes are drawn from N(13, 1) in the case group, others are drawn from N(10, 1);
9) Expression values of causal genes are drawn from N(13.5, 1) in the case group, others are drawn from N(10, 1);
10) Expression values of causal genes are drawn from N(14, 1) in the case group, others are drawn from N(10, 1).
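
The ten scenarios above differ only in the mean shift applied to causal genes in the case group. The sketch below generates one such dataset with the Python standard library. The function name, the total number of genes, the group sizes, and the seed are assumptions made for illustration; the text does not state them for this experiment. Because the second parameter of N(., 1) is 1, it makes no difference here whether it denotes the variance or the standard deviation.

```python
import random

# Mean shifts of causal genes in the case group for scenarios 1 through 10.
SHIFTS = [0.1, 0.3, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]


def simulate_expression(n_cases, n_controls, n_genes, causal, shift, seed=0):
    """Return (cases, controls) as lists of per-sample expression vectors."""
    rng = random.Random(seed)
    cases = [[rng.gauss(10 + shift if g in causal else 10, 1)
              for g in range(n_genes)]
             for _ in range(n_cases)]
    controls = [[rng.gauss(10, 1) for _ in range(n_genes)]
                for _ in range(n_controls)]
    return cases, controls


# Scenario 10 with hypothetical sizes: genes 0, 1, and 2 are causal.
cases, controls = simulate_expression(
    n_cases=200, n_controls=200, n_genes=20,
    causal={0, 1, 2}, shift=SHIFTS[-1], seed=1)
mean_causal = sum(sample[0] for sample in cases) / len(cases)
mean_null = sum(sample[5] for sample in cases) / len(cases)
print(round(mean_causal, 2), round(mean_null, 2))
```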
Table 4. Comparison of GSEA results based on NES and ES when expression level of risk genes is 2.5-unit higher
and sample size is 400

NES                                             ES
NAME       RANK  ES    NES   P  FDR    FWER     NAME       RANK  ES    P  FDR    FWER
GENESET9   1     0.86  2.11  0  0.004  0.004    GENESET1   1     1.00  0  0      0
GENESET10  2     0.82  2.06  0  0.004  0.007    GENESET2   2     0.99  0  0.001  0.002
GENESET8   3     0.91  2.04  0  0.003  0.007    GENESET3   3     0.99  0  0      0.002
GENESET7   4     0.94  2.04  0  0.002  0.007    GENESET4   4     0.98  0  0.002  0.005
GENESET6   5     0.97  1.96  0  0.005  0.024    GENESET5   5     0.98  0  0.002  0.007
GENESET5   6     0.98  1.88  0  0.014  0.078    GENESET6   6     0.97  0  0.002  0.009
GENESET4   7     0.98  1.82  0  0.032  0.186    GENESET7   7     0.94  0  0.004  0.016
GENESET3   8     0.99  1.75  0  0.059  0.35     GENESET8   8     0.91  0  0.008  0.03
GENESET2   10    0.99  1.67  0  0.094  0.577    GENESET9   9     0.86  0  0.018  0.068
GENESET1   12    1.00  1.63  0  0.111  0.709    GENESET10  10    0.82  0  0.029  0.101
For detailed results, see Table S20 in the supplementary materials.
Scenarios 1-3 show no problem. However, we found that the rankings of the gene sets are almost
completely reversed when the expression of causal genes in the case group is higher by 2 units or more.
The rankings return to the expected order if we use the actual ES instead of the NES (Table 4). In the
calculation of the ES, we weight each gene in a given gene set by its correlation with the phenotype of
interest, normalized by the sum of the correlations over all genes in the gene set. This process accounts
for differences in gene set size to some degree. In addition, FDR and FWER calculated from permutations
of phenotype labels can be used to account for the size of gene sets and to adjust for multiple hypothesis
testing. Therefore, in this study FDR and FWER were calculated based on the actual ES. However, like
GSEA, our software offers both ES-based and NES-based analysis.
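
For readers unfamiliar with the two statistics compared in this section, the sketch below shows a GSEA-style weighted running-sum ES and an NES obtained by dividing the ES by the mean of permuted ES values, as described in the text. It is an illustration under stated assumptions, not the code of GSEA or of our software: the function names and toy inputs are hypothetical, the weighting exponent is fixed at 1, and the NES here uses the plain mean of the supplied permuted scores (GSEA itself rescales positive and negative scores separately).

```python
def enrichment_score(ranked_genes, correlations, gene_set, weight=1.0):
    """Maximum deviation from zero of a weighted Kolmogorov-Smirnov running sum."""
    in_set = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    hit_weights = [abs(correlations[gene]) ** weight if hit else 0.0
                   for gene, hit in zip(ranked_genes, in_set)]
    total_hit = sum(hit_weights)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for w, hit in zip(hit_weights, in_set):
        # Step up by the normalized weight on a hit, down by 1/n_miss on a miss.
        running += w / total_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def normalized_es(es, permuted_es):
    """Divide the observed ES by the mean ES over permutations."""
    mean_perm = sum(permuted_es) / len(permuted_es)
    if mean_perm == 0:
        raise ValueError("mean permuted ES is zero")
    return es / mean_perm


# Toy ranked list (hypothetical): both set members sit at the top.
ranked = ["g1", "g2", "g3", "g4"]
corr = {"g1": 2.0, "g2": 1.0, "g3": -1.0, "g4": -2.0}
es = enrichment_score(ranked, corr, {"g1", "g2"})
nes = normalized_es(es, [0.4, 0.5, 0.6])
print(round(es, 3), round(nes, 3))
```

The sketch also hints at how ES and NES rankings can diverge: two gene sets with similar ES values receive different NES values whenever their permutation means differ.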
4.6 Future directions
In this study, we developed a novel statistical framework to integrate gene expression data and SNP data
into a genome-wide association analysis of gene sets. Our method can provide insights into biological
processes, such as signal transduction pathways, metabolic pathways, and other physiological or
pathological processes, that are associated with traits or diseases, and into the underlying mechanisms
regulating these processes. To our knowledge, this is the first computational platform for integrative
genome-wide association analysis of gene sets based on two different but complementary types of
genomic data from two of the most popular high-throughput technologies, the genome-wide expression
array and the SNP array. It thereby points future research toward more comprehensive integration of
genetic, genomic, and proteomic data in a pathway-based statistical framework. Although we use only two
types of genomic data in this study, our framework can be extended to incorporate more forms of
-omics data, such as copy number variation, methylation data, and microRNA data. We are working
towards providing a more powerful statistical and computational platform for genome-wide association
analysis of gene sets in the near future.
References
1. Carlson, C.S., Eberle, M.A., Kruglyak, L., and Nickerson, D.A. (2004). Mapping complex disease loci in
whole-genome association studies. Nature 429, 446-452.
2. Mootha, V.K., Lindgren, C.M., Eriksson, K.F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P.,
Carlsson, E., Ridderstrale, M., Laurila, E., et al. (2003). PGC-1alpha-responsive genes involved in
oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34,
267-273.
3. Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A.,
Pomeroy, S.L., Golub, T.R., Lander, E.S., et al. (2005). Gene set enrichment analysis: a
knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci
U S A 102, 15545-15550.
4. Goeman, J.J., and Buhlmann, P. (2007). Analyzing gene expression data in terms of gene sets:
methodological issues. Bioinformatics 23, 980-987.
5. Dinu, I., Potter, J.D., Mueller, T., Liu, Q., Adewale, A.J., Jhangri, G.S., Einecke, G., Famulski, K.S.,
Halloran, P., and Yasui, Y. (2007). Improving gene set analysis of microarray data by SAM-GS.
BMC Bioinformatics 8, 242.
6. Kim, S.Y., and Volsky, D.J. (2005). PAGE: parametric analysis of gene set enrichment. BMC
Bioinformatics 6, 144.
7. Luo, W., Friedman, M.S., Shedden, K., Hankenson, K.D., and Woolf, P.J. (2009). GAGE: generally
applicable gene set enrichment for pathway analysis. BMC Bioinformatics 10, 161.
8. Boorsma, A., Foat, B.C., Vis, D., Klis, F., and Bussemaker, H.J. (2005). T-profiler: scoring the activity of
predefined groups of genes using gene expression data. Nucleic Acids Res 33, W592-595.
9. Newton, M.A., Quintana, F.A., Den Boon, J.A., Sengupta, S., and Ahlquist, P. (2007). Random-set
methods identify distinct aspects of the enrichment signal in gene-set analysis. Ann Appl Stat 1,
85-106.
10. Goeman, J.J., van de Geer, S.A., de Kort, F., and van Houwelingen, H.C. (2004). A global test for
groups of genes: testing association with a clinical outcome. Bioinformatics 20, 93-99.
11. Mansmann, U., and Meister, R. (2005). Testing differential gene expression in functional groups.
Goeman's global test versus an ANCOVA approach. Methods Inf Med 44, 449-453.
12. Maglietta, R., Piepoli, A., Catalano, D., Licciulli, F., Carella, M., Liuni, S., Pesole, G., Perri, F., and
Ancona, N. (2007). Statistical assessment of functional categories of genes deregulated in
pathological conditions by using microarray data. Bioinformatics 23, 2063-2072.
13. Allison, D.B., Cui, X., Page, G.P., and Sabripour, M. (2006). Microarray data analysis: from disarray to
consolidation and consensus. Nat Rev Genet 7, 55-65.
14. Khatri, P., and Draghici, S. (2005). Ontological analysis of gene expression data: current tools,
limitations, and open problems. Bioinformatics 21, 3587-3595.
15. Abatangelo, L., Maglietta, R., Distaso, A., D'Addabbo, A., Creanza, T.M., Mukherjee, S., and Ancona,
N. (2009). Comparative study of gene set enrichment methods. BMC Bioinformatics 10, 275.
16. Liu, Q., Dinu, I., Adewale, A.J., Potter, J.D., and Yasui, Y. (2007). Comparative evaluation of gene-set
analysis methods. BMC Bioinformatics 8, 431.
17. Wang, K., Li, M., and Bucan, M. (2007). Pathway-Based Approaches for Analysis of Genomewide
Association Studies. Am J Hum Genet 81.
18. Peng, G., Luo, L., Siu, H., Zhu, Y., Hu, P., Hong, S., Zhao, J., Zhou, X., Reveille, J.D., Jin, L., et al. (2010).
Gene and pathway-based second-wave analysis of genome-wide association studies. Eur J Hum
Genet 18, 111-117.
19. O'Dushlaine, C., Kenny, E., Heron, E.A., Segurado, R., Gill, M., Morris, D.W., and Corvin, A. (2009). The
SNP ratio test: pathway analysis of genome-wide association datasets. Bioinformatics 25, 2762-2763.
20. Holden, M., Deng, S., Wojnowski, L., and Kulle, B. (2008). GSEA-SNP: applying gene set enrichment
analysis to SNP data from genome-wide association studies. Bioinformatics 24, 2784-2785.
21. Zhong, H., Yang, X., Kaplan, L.M., Molony, C., and Schadt, E.E. (2010). Integrating Pathway Analysis
and Genetics of Gene Expression for Genome-wide Association Studies. Am J Hum Genet.
22. Schmidt, M., Hauser, E.R., Martin, E.R., and Schmidt, S. (2005). Extension of the SIMLA package for
generating pedigrees with complex inheritance patterns: environmental covariates, gene-gene
and gene-environment interaction. Stat Appl Genet Mol Biol 4, Article15.
23. Wellcome Trust Case Control Consortium (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared
controls. Nature 447, 661-678.
24. Ogura, Y., Bonen, D.K., Inohara, N., Nicolae, D.L., Chen, F.F., Ramos, R., Britton, H., Moran, T.,
Karaliuskas, R., Duerr, R.H., et al. (2001). A frameshift mutation in NOD2 associated with
susceptibility to Crohn's disease. Nature 411, 603-606.
25. Rioux, J.D., Daly, M.J., Silverberg, M.S., Lindblad, K., Steinhart, H., Cohen, Z., Delmonte, T., Kocher, K.,
Miller, K., Guschwan, S., et al. (2001). Genetic variation in the 5q31 cytokine gene cluster confers
susceptibility to Crohn disease. Nat Genet 29, 223-228.
26. Loftus, E.V., Jr., Schoenfeld, P., and Sandborn, W.J. (2002). The epidemiology and natural history of
Crohn's disease in population-based patient cohorts from North America: a systematic review.
Aliment Pharmacol Ther 16, 51-60.
27. Gorlov, I.P., Gallick, G.E., Gorlova, O.Y., Amos, C., and Logothetis, C.J. (2009). GWAS meets
microarray: are the results of genome-wide association studies and gene-expression profiling
consistent? Prostate cancer as an example. PLoS One 4, e6511.
28. Nica, A.C., Montgomery, S.B., Dimas, A.S., Stranger, B.E., Beazley, C., Barroso, I., and Dermitzakis, E.T.
(2010). Candidate causal regulatory effects by integration of expression QTLs with complex trait
genetic associations. PLoS Genet 6.
29. Nicolae, D.L., Gamazon, E., Zhang, W., Duan, S., Dolan, M.E., and Cox, N.J. (2010). Trait-associated
SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet 6.
30. Peduzzi, P., Concato, J., Kemper, E., Holford, T.R., and Feinstein, A.R. (1996). A simulation study of
the number of events per variable in logistic regression analysis. J Clin Epidemiol 49, 1373-1379.
31. Cancer Genome Atlas Research Network (2008). Comprehensive genomic characterization defines human glioblastoma genes and core
pathways. Nature 455, 1061-1068.
32. Bortoluzzi, S., Coppe, A., Bisognin, A., Pizzi, C., and Danieli, G.A. (2005). A multistep bioinformatic
approach detects putative regulatory elements in gene promoters. BMC Bioinformatics 6, 121.
33. El Hallani, S., Ducray, F., Idbaih, A., Marie, Y., Boisselier, B., Colin, C., Laigle-Donadey, F., Rodero, M.,
Chinot, O., Thillet, J., et al. (2009). TP53 codon 72 polymorphism is associated with age at onset
of glioblastoma. Neurology 72, 332-336.
34. Ishii, N., Maier, D., Merlo, A., Tada, M., Sawamura, Y., Diserens, A.C., and Van Meir, E.G. (1999).
Frequent co-alterations of TP53, p16/CDKN2A, p14ARF, PTEN tumor suppressor genes in human
glioma cell lines. Brain Pathol 9, 469-479.
35. Backlund, L.M., Nilsson, B.R., Goike, H.M., Schmidt, E.E., Liu, L., Ichimura, K., and Collins, V.P. (2003).
Short postoperative survival for glioblastoma patients with a dysfunctional Rb1 pathway in
combination with no wild-type PTEN. Clin Cancer Res 9, 4151-4158.
36. Blum, R., Nakdimon, I., Goldberg, L., Elkon, R., Shamir, R., Rechavi, G., and Kloog, Y. (2006). E2F1
identified by promoter and biochemical analysis as a central target of glioblastoma cell-cycle
arrest in response to Ras inhibition. Int J Cancer 119, 527-538.
37. Khatri, R.G., Navaratne, K., and Weil, R.J. (2008). The role of a single nucleotide polymorphism of
MDM2 in glioblastoma multiforme. J Neurosurg 109, 842-848.
38. Zhang, R., Banik, N.L., and Ray, S.K. (2008). Combination of all-trans retinoic acid and
interferon-gamma upregulated p27(kip1) and down regulated CDK2 to cause cell cycle arrest leading to
differentiation and apoptosis in human glioblastoma LN18 (PTEN-proficient) and U87MG
(PTEN-deficient) cells. Cancer Chemother Pharmacol 62, 407-416.
39. Lam, P.Y., Di Tomaso, E., Ng, H.K., Pang, J.C., Roussel, M.F., and Hjelm, N.M. (2000). Expression of
p19INK4d, CDK4, CDK6 in glioblastoma multiforme. Br J Neurosurg 14, 28-32.
40. Korshunov, A., Golanov, A., Sycheva, R., and Pronin, I. (1999). Prognostic value of tumour associated
antigen immunoreactivity and apoptosis in cerebral glioblastomas: an analysis of 168 cases. J
Clin Pathol 52, 574-580.
41. Gomez-Manzano, C., Fueyo, J., Kyritsis, A.P., McDonnell, T.J., Steck, P.A., Levin, V.A., and Yung, W.K.
(1997). Characterization of p53 and p21 functional interactions in glioma cells en route to
apoptosis. J Natl Cancer Inst 89, 1036-1044.
42. Junge, C.E., Lee, C.J., Hubbard, K.B., Zhang, Z., Olson, J.J., Hepler, J.R., Brat, D.J., and Traynelis, S.F.
(2004). Protease-activated receptor-1 in human brain: localization and functional expression in
astrocytes. Exp Neurol 188, 94-103.
43. Kaufmann, R., Patt, S., Schafberg, H., Kalff, R., Neupert, G., and Nowak, G. (1998). Functional
thrombin receptor PAR1 in primary cultures of human glioblastoma cells. Neuroreport 9, 709-712.
44. Smirnova, I.V., Zhang, S.X., Citron, B.A., Arnold, P.M., and Festoff, B.W. (1998). Thrombin is an
extracellular signal that activates intracellular death protease pathways inducing apoptosis in
model motor neurons. J Neurobiol 36, 64-80.
45. Turgeon, V.L., Lloyd, E.D., Wang, S., Festoff, B.W., and Houenou, L.J. (1998). Thrombin perturbs
neurite outgrowth and induces apoptotic cell death in enriched chick spinal motoneuron
cultures through caspase activation. J Neurosci 18, 6882-6891.
46. Guo, H., Liu, D., Gelbard, H., Cheng, T., Insalaco, R., Fernandez, J.A., Griffin, J.H., and Zlokovic, B.V.
(2004). Activated protein C prevents neuronal apoptosis via protease activated receptors 1 and
3. Neuron 41, 563-572.
47. Flynn, A.N., and Buret, A.G. (2004). Proteinase-activated receptor 1 (PAR-1) and cell apoptosis.
Apoptosis 9, 729-737.
48. Panner, A., Cribbs, L.L., Zainelli, G.M., Origitano, T.C., Singh, S., and Wurster, R.D. (2005). Variation of
T-type calcium channel protein expression affects cell division of cultured tumor cells. Cell
Calcium 37, 105-119.
49. Yamahata, H., Takeshima, H., Kuratsu, J., Sarker, K.P., Tanioka, K., Wakimaru, N., Nakata, M., Kitajima,
I., and Maruyama, I. (2002). The role of thrombin in the neo-vascularization of malignant
gliomas: an intrinsic modulator for the up-regulation of vascular endothelial growth factor. Int J
Oncol 20, 921-928.
50. Hua, Y., Tang, L., Keep, R.F., Schallert, T., Fewel, M.E., Muraszko, K.M., Hoff, J.T., and Xi, G. (2005).
The role of thrombin in gliomas. J Thromb Haemost 3, 1917-1923.
51. Hua, Y., Tang, L., Keep, R.F., Hoff, J.T., Heth, J., Xi, G., and Muraszko, K.M. (2008). Thrombin enhances
glioma growth. Acta Neurochir Suppl 102, 363-366.
52. Wang, J., and Lenardo, M.J. (2000). Roles of caspases in apoptosis, development, and cytokine
maturation revealed by homozygous gene deficiencies. J Cell Sci 113 (Pt 5), 753-757.
53. Cohen, G.M. (1997). Caspases: the executioners of apoptosis. Biochem J 326 (Pt 1), 1-16.
54. Gdynia, G., Grund, K., Eckert, A., Bock, B.C., Funke, B., Macher-Goeppinger, S., Sieber, S.,
Herold-Mende, C., Wiestler, B., Wiestler, O.D., et al. (2007). Basal caspase activity promotes migration
and invasiveness in glioblastoma cells. Mol Cancer Res 5, 1232-1240.