
A statistical framework for genome-wide association analysis of gene sets

Qing Xiong: Institute for Genome Sciences & Policy, Duke University, Durham NC 27708.
Nicola Ancona: Institute of Intelligent Systems for Automation, National Research Council, Bari IT 70126.
Elizabeth R. Hauser: Section of Medical Genetics, Department of Medicine, Center for Human Genetics, Duke University Medical Center, Durham NC 27708.
Sayan Mukherjee: Departments of Statistical Science, Computer Science, Mathematics, Institute for Genome Sciences & Policy, Duke University, Durham NC 27708. Corresponding author.
Terrence S. Furey: Institute for Genome Sciences & Policy, Duke University, Durham NC 27708. Corresponding author.

Summary

Since single-variant/gene analyses account for only a small proportion of phenotypic variation in complex traits and can result in a high rate of false positives, a variety of statistical or computational methods have been developed to identify gene expression or genetic variation associated with two experimental conditions in the pathway context. However, a computational platform for joint analysis of multiple types of genomic data in a single statistical framework is currently not available. We propose a novel methodology, Gene Set Association Analysis (GSAA), that integrates gene expression analysis with genome-wide association (GWA) studies to extract biological insight. GSAA yields insights by taking a priori defined gene sets (groups of genes that are putatively functionally related, share a similar location, or are commonly regulated) and inferring which sets of genes are both differentially expressed and contain associated genetic markers with respect to phenotypes or traits. Simulation studies illustrate the increase in power from a joint analysis of genomic data using GSAA with respect to gene set methods that use only one genomic source or a regression-based method.
When applied to a real data set, our method not only confirmed the association findings disclosed by other canonical methods, but also identified a new candidate pathway significantly altered in glioblastoma, which may suggest a potential core mechanism leading to the disease. In addition, our method can reveal potential genetic variants associated with gene expression variation and whether or not these functional variants are enriched in some specific pathways underlying complex traits.

1 Introduction

Dissecting the genetic and molecular mechanisms underlying complex traits and diseases has been one of the key scientific goals in the post-genomic era. Measuring the differential expression of genes and measuring the differential enrichment of genotypes or alleles between two phenotypic classes or two experimental conditions are the two major methods for probing the correlation between genes or genetic variants and phenotype, and for defining the heritable genetic architecture of a trait. The dominant paradigm in past decades has been to identify single genetic variants or genes most correlated with phenotypic class distinction or disease susceptibility. However, single locus/gene analysis in general identifies only a few of the most significant single nucleotide polymorphisms (SNPs) or genes, which account for a small proportion of phenotypic variation in complex traits or diseases. Recently there has been a growing need for pathway/gene set-based analyses to more accurately and systematically identify the cellular processes altered in human diseases. Pathway-based approaches can be superior to single-factor analyses in at least four respects. First, complex traits or diseases are characterized by intricate interactions of multiple genetic variants, genes or even pathways.
It is the joint action of a variety of multi-layer genetic/information structures such as transcriptional modules, signaling cascades, and metabolic pathways that eventually produces the phenotypes we observe in nature. Pathway-based analyses can capture the differential activity of an entire structure associated with a binary trait and the interaction between distinct components, thus more accurately measuring the impact of these genetic structures on traits. Second, pathway-based analysis can detect modest or weakly coordinated changes in either gene expression or sequence variation in an a priori defined gene set. This joint analysis can elicit a significant biological effect even if changes in any individual gene have a small effect or are not significant at all. Importantly, this setting is considered dominant in many pathological processes 1; 2. Third, single-factor analyses can result in a high rate of false positives because of the inherent noise in gene expression data or population substructure and locus heterogeneity in SNP data. Pathway-based analyses can weaken the negative effects of perturbations not associated with the trait of interest by inferring association from sets of biologically related genes, and can therefore produce more consistent results across different studies. Fourth, pathway-based analysis significantly facilitates the interpretation of association findings by incorporating prior knowledge of biological pathways into association inference. A variety of statistical or computational methods have been developed to identify variation in pathway activity or function associated with a binary trait. The two types of genomic data used in these methods have been either gene expression data or SNP genotype data, and they are usually analyzed separately. Accordingly, existing pathway-based approaches can be divided into two categories, expression-based and SNP-based. Most approaches were designed for expression-oriented analyses.
Gene set enrichment analysis (GSEA) adopted a weighted Kolmogorov-Smirnov (K-S)-like running-sum statistic to quantify enrichment of differential expression in a gene set, which reflects the degree to which the gene set is associated with a particular trait, by assessing whether genes in the gene set are randomly distributed throughout the entire ranked list of genes in the genome or are overrepresented at the top or bottom 3. GSEA employed a sample randomization strategy to assess the statistical significance of association findings. This strategy preserves the complex correlation structure of the gene expression data and is therefore more suitable for biological experiments than gene randomization, which is based on an unrealistic assumption of independence between genes 4. An extension of significance analysis of microarray (SAM), SAM-GS, also follows this paradigm of randomization. It tests the null hypothesis that the mean vectors of expression of the genes in a gene set do not differ by the phenotype of interest, based on a SAM t-like statistic 5. Like SAM-GS, several other methods also use a mean-based test statistic for scoring gene sets, e.g., parametric analysis of gene set enrichment (PAGE) 6, generally applicable gene-set enrichment (GAGE) 7, T-profiler 8, and random-set (RS) 9. The difference is that GAGE and T-profiler use a two-sample t-test while PAGE and RS employ a one-sample z-test. Very different strategies were adopted by the global test (GT) and the ANOVA global test (AGT) 10; 11. These two global tests measure differential expression of a gene set by testing whether similar class labels are associated with similar gene expression patterns in the gene set, based on a logistic regression model and an ANOVA model, respectively. Gene list analysis with prediction accuracy (GLAPA) was also developed to assess deregulation of a gene set by using a predictive statistic based on the prediction accuracy of the phenotype of new subjects 12.
Each method has its advantages and limitations; detailed discussion of the issues and underlying assumptions can be found in several comparative studies and reviews 4; 13-16. Comparatively fewer SNP-based methods have been proposed. In SNP data, each gene is represented by a varying number of SNPs, so representation of a gene by a single value is essential. Wang et al. 17 first tailored the GSEA strategy to suit SNP data. They mapped SNPs to their closest gene and represented the gene by the maximum test statistic value among all SNPs mapped to it. Peng et al. 18 considered only those SNPs within the gene. They combined the p-values of all SNPs within the gene into an overall p-value for the gene and then the p-values of all genes in the pathway into an overall p-value for the pathway. This approach assumes that SNPs in the gene are independent, so it can be applied to haplotype-tagging SNP data. Two computational tools, SNP ratio test (SRT) 19 and GSEA_SNP 20, are also available for SNP-oriented pathway analyses. SRT calculates an empirical p-value for each pathway by comparing the ratio of significant SNPs to all SNPs within genes that are part of the pathway to a null ratio distribution generated by phenotype randomization of the dataset. GSEA_SNP is an extension of the GSEA algorithm. It converts the gene set to a corresponding SNP set and then employs the same procedure as GSEA to test whether SNPs in the SNP set are significantly enriched at the top of the ranked list of all SNPs. A recent paper 21 proposed an eSNP-based approach, which first identifies a group of SNPs that are associated with the change of gene expression, called eSNPs, and then examines whether these eSNPs are enriched in some specific pathways. This method does not incorporate gene expression information into the inference of association of pathways.
It uses gene expression information only as a tool for filtering out those SNPs not driving the change of gene expression, so it is essentially a one-source pathway analysis. Expression-based and SNP-based pathway analyses are becoming increasingly popular. However, an integrative platform for combining these two forms of genomic data in a single analysis is currently not available. This is the central motivation of this study. We argue that integrating these two types of heterogeneous but complementary data can provide greater information on the possible molecular pathways underlying phenotypic class distinction and make the inference of association more robust and comprehensive. In this study, we extended GSEA to incorporate SNP data. Like GSEA, we test the null hypothesis that genes in a gene set are randomly distributed over the list of all genes ranked by their correlation with the phenotype. The novel idea is to combine two types of evidence to infer association of genes/gene sets with the phenotypic categories. One source of evidence is derived from an expression-based test, which gauges the enrichment of differential expression in the gene set; the other is from a SNP-based test, which assesses the enrichment of significant associations in the gene set. We calculate an overall enrichment score (ES) for each gene set by a weighted K-S-like statistic to measure the aggregated effect of multiple correlated members in the set while filtering out or reducing the effect of pseudo signals or noise.
Our method has three advantages compared to other pathway-based approaches based solely on gene expression data or SNP data: 1) joint analysis of gene expression variation and sequence variation can increase the likelihood of identifying genuine association signals and reduce the effects of inherent noise in gene expression and SNP data on association inference; 2) variation in sequence can be the fundamental cause driving the change of gene expression directly or indirectly, so the integrative analysis assists in identifying genetic variants associated with gene expression variation and phenotypic variation in the pathway context, and whether or not these functional variants are enriched in some specific pathways underlying complex traits or diseases; 3) our method assesses the enrichment of differentially expressed genes and the enrichment of statistically significant SNPs in a gene set by a single test. This greatly facilitates interpreting association findings and elucidating the regulatory mechanism in the pathway. We have developed Java-based software, called gene set association analysis (GSAA), based on our algorithms; it is freely available at gsaa.igsp.duke.edu. The software includes a user-friendly and straightforward graphical user interface and provides full support for the visualization of results. A separate module called gene set association analysis-SNP (GSAA-SNP) is also offered for pathway-based analysis based solely on SNP genotype data.

2 Method

The strategy we adopted for gene set association analysis is a model based on multi-layer association tests. The advantage of this model is that it can effectively capture both the association signal carried by the expression profile of a single gene and the association signal carried by the genotypes of a single SNP. It forms a chain of evidence from SNP to gene and then from gene to gene set, so that the inferred associations of gene sets are more robust and comprehensive.
Figure 1 is an overview of the method.

Figure 1. The overview of GSAA

2.1 Multi-layer association tests

GSAA infers association of gene sets by a series of differential expression tests of genes and differential enrichment tests of genotypes or alleles between two groups of samples belonging to two phenotypic classes, $C_1$ and $C_2$.

2.1.1 Differential gene expression test

Differential expression can be measured by any suitable metric. In this paper, the test statistic used is the difference of means scaled by the standard deviation,

$$r = \frac{\mu_1 - \mu_2}{\sigma_1 + \sigma_2},$$

where $\mu_1$ and $\sigma_1$ are the mean and standard deviation for class $C_1$, and $\mu_2$ and $\sigma_2$ those for class $C_2$. The larger the absolute value of the test statistic, the more strongly the gene expression is associated with the phenotypic class distinction. In the software, we provide the same eight statistics for differential expression as provided by GSEA.

2.1.2 Single-locus association test

In our software, single-locus association analysis can be carried out by three different statistics: a genotype-based chi-square statistic, an allele-based chi-square statistic, and the difference of major/minor alleles between two classes. We use a chi-square two sample test, which is based on binned data. The basic idea behind the chi-square two sample test is that the observed frequency in each bin should be similar if the two data samples come from a common distribution. If the data are divided into $k$ bins, then the test statistic is calculated as

$$\chi^2 = \sum_{i=1}^{k} \frac{(K_1 R_i - K_2 S_i)^2}{R_i + S_i}, \quad \text{where } K_1 = \sqrt{\frac{\sum_{i=1}^{k} S_i}{\sum_{i=1}^{k} R_i}}, \quad K_2 = \frac{1}{K_1},$$

where $R_i$ is the observed frequency for bin $i$ for class $C_1$, and $S_i$ is the observed frequency for bin $i$ for class $C_2$. $K_1$ and $K_2$ are scaling constants that are used to adjust for unequal sample sizes. Suppose that the genotypes in the SNP dataset are coded as AA, AB, and BB. For the genotype-based chi-square two sample test, the three bins are AA, AB, and BB.
For the allele-based chi-square two sample test, the two bins are A and B. The test statistic value represents the degree to which the SNP is correlated with phenotypic class distinction. Larger values mean more significant correlation. For all data analyses in this paper, we used the genotype-based single-locus association test because it can better capture the interaction between alleles at a single locus. Simulations also indicated that results from the genotype-based test were slightly better than those from the allele-based test. Other test statistics for single-locus analysis can easily be incorporated into the analysis.

2.1.3 SNP set association test

In SNP data, each gene and its regulatory regions are usually covered by multiple SNPs. In order to test the association of genes with the phenotype, the first step is to map all SNPs in the SNP dataset to all genes in the expression dataset. In our software, users can specify how many base pairs (bp) upstream and/or downstream of a gene to include, together with the region within the gene, in establishing the interval for SNP-gene mapping. We then use all SNPs within the interval to represent the gene. In the software we designed three statistics for the SNP set association test: the maximum statistic, the mean statistic, and a weighted K-S-like statistic, discussed in detail in 2.1.5. The SNP set association test is based on the results from single-locus analysis. For the maximum statistic, the maximum single-locus test statistic among all SNPs mapped to the gene is used to score evidence for the SNP set. For the mean statistic, we take the average of all the single-locus test statistics for SNPs mapped to the gene as a score. In this paper, we use the maximum statistic for testing SNP set association, since in the simulations this statistic performed best among the three statistics. Other statistics for SNP set association analysis can be used.
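The single-locus and SNP set tests described above can be sketched in a few lines of Python. This is a minimal illustration, not the GSAA software's implementation: the function names are hypothetical, the scaling constant $K_1$ is assumed to take the standard form of the chi-square two sample test (the square root of the ratio of class totals), and empty bins are simply skipped.

```python
from math import sqrt


def chi_square_two_sample(r, s):
    """Chi-square two sample statistic for binned counts.

    r: observed counts per bin for class C1
    s: observed counts per bin for class C2
    Assumes K1 = sqrt(sum(s) / sum(r)) and K2 = 1 / K1.
    """
    total_r, total_s = sum(r), sum(s)
    if total_r == 0 or total_s == 0:
        raise ValueError("each class needs at least one observation")
    k1 = sqrt(total_s / total_r)
    k2 = 1.0 / k1
    stat = 0.0
    for ri, si in zip(r, s):
        if ri + si == 0:
            continue  # an empty bin contributes nothing (assumption)
        stat += (k1 * ri - k2 * si) ** 2 / (ri + si)
    return stat


def genotype_counts(genotypes, bins=("AA", "AB", "BB")):
    """Count genotypes into the three genotype-based bins."""
    return [sum(1 for g in genotypes if g == b) for b in bins]


def gene_score_max(snp_stats_by_gene):
    """Maximum statistic: score each gene by its best single-locus statistic."""
    return {gene: max(stats) for gene, stats in snp_stats_by_gene.items() if stats}


# Usage: one SNP genotyped in a few cases and controls.
cases = ["AA", "AB", "BB", "BB", "AB", "BB"]
controls = ["AA", "AA", "AB", "AA", "AB", "AA"]
stat = chi_square_two_sample(genotype_counts(cases), genotype_counts(controls))
```

Swapping `max` for an average in `gene_score_max` would give the mean statistic mentioned in the text.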
2.1.4 Gene association test

For each gene, we have two correlation scores, both of which reflect the correlation of the gene with a phenotype. One score is from an expression-based test and the other is from a SNP-based test. First, we calculate a normalized correlation score. Given a dataset with $N$ genes, the normalized correlation score of gene $i$ is defined by

$$nc_i = \frac{c_i}{\sum_{j \in N} |c_j|},$$

where $c_i$ is the raw score of gene $i$ from either the expression-based association test or the SNP-based association test, and the sum in the denominator runs only over those genes $j$ whose raw score has the same sign as $c_i$ (that is, $c_j \ge 0$ if $c_i \ge 0$ and $c_j < 0$ if $c_i < 0$). The correlation score from the expression-based test has a sign indicating the direction of differential expression: "+" for overexpression in class $C_1$, "-" for overexpression in class $C_2$. The SNP-based association test does not have directionality, so we assign to the SNP-based score the same direction of correlation with the phenotype as found from the expression-based test. Finally, we take the sum of the two normalized correlation scores from the expression-based test and the SNP-based test as the correlation score of the gene.

2.1.5 Gene set association test

We rank all of the genes in the dataset in descending order of their correlation scores. We then evaluate whether the genes within a gene set are enriched at the top or bottom of the ranked list of genes. The statistic for gene set enrichment analysis is a weighted K-S-like running-sum statistic. Enrichment scores are calculated by walking down the ranked list, increasing the statistic value when a gene is in the gene set and decreasing it when it is not. The magnitude of the increment depends on the correlation of the gene with the phenotype. The ES is the maximum deviation from zero encountered in walking the list.
Given a gene set $S$ with $N_H$ genes, the ES 3; 17 is defined by

$$ES(S) = \max_{1 \le i \le N} \left\{ \sum_{j \in S,\, j \le i} \frac{|r_j|^p}{N_R} - \sum_{j \notin S,\, j \le i} \frac{1}{N - N_H} \right\}, \quad \text{where } N_R = \sum_{j \in S} |r_j|^p,$$

where $r_j$ is the correlation score of gene $j$, $N$ is the number of genes in the dataset, $N_H$ is the number of genes in the gene set $S$, and $p$ is a parameter that gives higher weight to genes with extreme statistic values; here we follow GSEA and set $p = 1$. If the gene set is not associated with the phenotypic class, the genes within it should be randomly distributed over the ranked list of all genes. In this case, the ES will be small. The ES will be high if the genes within the gene set are overrepresented at the top or bottom of the ranked list, in which case the gene set is associated with the phenotype.

2.2 Assessment of statistical significance and adjustment for multiple hypothesis testing

We assess the statistical significance of the ES and adjust for multiple hypothesis testing based on a phenotype-based permutation procedure, since it preserves the linkage disequilibrium (LD) structure in SNP data and the gene-gene correlation structure in gene expression data. A nominal P-value of the ES is calculated relative to a null distribution generated by shuffling the phenotypic class labels and recalculating the ES many times. As in GSEA, we use the false discovery rate (FDR) and the family-wise error rate (FWER) to correct for multiple hypothesis testing and control the proportion of false positives below a certain threshold. The difference is that the calculation of the FDR and FWER in GSEA is based on the normalized enrichment score (NES). We used the actual ES because our simulations indicated that the NES can have problems in certain situations.
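The gene-level scoring of Section 2.1.4, the running-sum enrichment score of Section 2.1.5, and the nominal P-value from label permutation can be sketched in Python as follows. This is an illustrative reading of the text, not the GSAA code: all function names are hypothetical, the same-sign normalization is one interpretation of the normalized correlation score, the ES is returned as the signed maximum deviation from zero, and the P-value is a simple one-sided proportion.

```python
import random


def normalize(raw):
    """Divide each raw score by the summed magnitude of the same-sign scores."""
    pos = sum(c for c in raw.values() if c >= 0)
    neg = sum(-c for c in raw.values() if c < 0)
    out = {}
    for gene, c in raw.items():
        denom = pos if c >= 0 else neg
        out[gene] = c / denom if denom > 0 else 0.0
    return out


def combined_scores(expr_raw, snp_raw):
    """Sum of normalized expression and SNP scores per gene.

    SNP statistics are unsigned, so each borrows the sign of the gene's
    expression score. Both dicts are assumed to share the same genes.
    """
    signed_snp = {g: (s if expr_raw[g] >= 0 else -s) for g, s in snp_raw.items()}
    ne, ns = normalize(expr_raw), normalize(signed_snp)
    return {g: ne[g] + ns[g] for g in expr_raw}


def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Weighted K-S-like running sum; returns the maximum deviation from zero."""
    hits = [g in gene_set for g in ranked_genes]
    n, n_h = len(ranked_genes), sum(hits)
    if n_h == 0 or n_h == n:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    n_r = sum(abs(scores[g]) ** p for g, hit in zip(ranked_genes, hits) if hit)
    if n_r == 0:
        raise ValueError("all genes in the set have zero score")
    miss_step = 1.0 / (n - n_h)
    running, best = 0.0, 0.0
    for g, hit in zip(ranked_genes, hits):
        running += abs(scores[g]) ** p / n_r if hit else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


def nominal_p_value(es_fn, labels, n_perm=1000, seed=0):
    """Proportion of label permutations whose ES is at least the observed ES.

    es_fn is a caller-supplied function mapping a label list to an ES.
    """
    rng = random.Random(seed)
    observed = es_fn(labels)
    shuffled = list(labels)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if es_fn(shuffled) >= observed:
            count += 1
    return count / n_perm


# Usage: rank genes by combined score, then score one gene set.
scores = combined_scores({"a": 2.0, "b": -1.0, "c": 2.0}, {"a": 3.0, "b": 1.0, "c": 1.0})
ranked = sorted(scores, key=scores.get, reverse=True)
es = enrichment_score(ranked, scores, {"a"})
```

In a full analysis, `es_fn` would recompute the single-gene statistics, the combined scores, and the ES from the permuted labels on every call, which is what keeps the LD and gene-gene correlation structure intact.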
Given a gene set $S^*$ and a dataset $D$, the FDR is calculated as

$$\mathrm{FDR}(S^*) = \frac{\%\ \text{of all } (S, \pi) \text{ with } ES(S, \pi) \ge ES^*}{\%\ \text{of all } (S, D) \text{ with } ES(S, D) \ge ES^*},$$

where $\pi$ denotes a permutation, all $(S, \pi)$ denotes all gene sets against all permutations of the dataset, all $(S, D)$ denotes all gene sets against the actual dataset $D$, and $ES^*$ is the enrichment score of gene set $S^*$. The FWER is calculated as

$$\mathrm{FWER}(S^*) = \%\ \text{of all } \pi \text{ with highest } ES(S, \pi) \ge ES^*,$$

where, for each permutation $\pi$, the highest ES is taken over all gene sets $S$.

2.3 Gene set association analysis-SNP (GSAA-SNP)

In our software, we designed a separate module called GSAA-SNP, which is used for gene set association analysis based solely on SNP genotype data. In GSAA-SNP we removed the differential gene expression test and kept the other components the same as in GSAA.

2.4 Logistic regression-based gene set association analysis-SNP (LRGSAA-SNP)

Logistic regression-based approaches are widely used in association studies. In order to compare our method with logistic regression-based approaches, we developed LRGSAA-SNP for logistic regression-based pathway analysis based solely on SNP data. The following is a brief description of the principle behind LRGSAA-SNP:

Step 1: Map all SNPs in the dataset to all genes in the genome, then calculate a chi-square statistic value for each SNP, which reflects its correlation with the phenotype.
Step 2: Represent each gene by the SNP with the highest correlation statistic among all SNPs mapped to the gene, then calculate a P-value for each gene based on a phenotype-based permutation procedure, and then remove those genes with a P-value larger than a cutoff threshold (0.05).
Step 3: Perform logistic regression on the remaining genes. Here the response variable is the phenotypic class label, and the explanatory variables are genotypes. We thereby obtain a regression coefficient for each gene. A regression coefficient reflects the degree to which the gene is associated with the phenotypic class.
Step 4: Quantify the correlation of gene sets with phenotype based on these regression coefficients. Specifically, we use the sum of the absolute values of the regression coefficients of all genes in the gene set as the correlation score of that gene set.
Step 5: Permute the phenotypic class labels and then repeat step 3 and step 4 to generate a null distribution of the correlation score for each gene set, then calculate the nominal P-value, FDR, and FWER using the same method as GSAA.

3 Results

3.1 Simulation study

We conducted a comprehensive simulation study to evaluate the ability of four different pathway-based approaches to identify causal gene sets: 1) GSAA: a joint association test based on gene expression and SNPs; 2) GSEA: an expression-based association test; 3) GSAA-SNP: a SNP-based association test using a weighted Kolmogorov-Smirnov (K-S)-like running-sum statistic; 4) LRGSAA-SNP: a SNP-based association test using logistic regression. The questions we want to address in the simulation study are 1) whether we can increase the power of association tests by integrating expression and genotypic data into pathway-based approaches; 2) how parameters such as sample size, disease model, magnitude of effect of risk genes or genotypes or alleles, linkage disequilibrium (LD), and signal intensity in the gene set affect the performance of the four approaches; 3) what the advantages of GSAA are compared with the other three pathway-based approaches.

3.1.1 Simulated dataset

We examine four sample sizes, 100, 200, 400, and 1200, with the same number of samples in the two phenotypic classes (case versus control). Simulated datasets of gene expression, SNPs, and gene sets were generated by the following criteria:

3.1.1.1 Gene expression dataset

Each gene expression dataset includes 1000 genes. Only the first 20 genes are causal genes. Gene expression values were drawn from normal distributions.
Three different scenarios were simulated: 1) Expression values of causal genes are drawn from N(10.5, 1) in the case group and from N(10, 1) for all other genes in both groups; 2) Expression values of causal genes are drawn from N(10.3, 1) in the case group and from N(10, 1) for all other genes in both groups; 3) Expression values of causal genes are drawn from N(10.1, 1) in the case group and from N(10, 1) for all other genes in both groups.

3.1.1.2 SNP dataset

Each SNP dataset includes 1000 genes. The first 20 genes are causal genes. Each causal gene covers three SNP markers, of which only the second marker is in LD with the disease variant. All other genes also have three SNP markers, but none of them is in LD with the disease variant. Genotype data were generated by SIMLA 22. We first generated genotype data for pedigrees and then took the proband of each pedigree to form unrelated population samples. Parameters for SIMLA were calculated based on the susceptibility locus rs17221417 in a published dataset of Crohn's disease (CD) 23. This locus represents caspase recruitment domain-containing protein 15 (CARD15, also called NOD2), which is the first confirmed CD-susceptibility gene 24; 25. The risk allele frequency is 0.287. The homozygote and heterozygote genotype relative risks are 1.617 and 1.08, respectively. We set disease prevalence to 0.001985 according to a report 26. In this simulation study, we detect causal genes by indirect association, based on the LD between markers and causal variants. There are five different scenarios: 1) R2 between the disease variant and the second marker is 1 for all causal genes; 2) R2 between the disease variant and the second marker is 0.9 for all causal genes; 3) R2 between the disease variant and the second marker is 0.7 for all causal genes; 4) R2 between the disease variant and the second marker is 0.5 for all causal genes; 5) R2 between the disease variant and the second marker is 0.3 for all causal genes.
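The gene expression scenarios of Section 3.1.1.1 are simple enough to sketch directly. The snippet below is a hypothetical illustration using only the Python standard library; the function name, the label coding (1 = case, 0 = control), and the seed handling are assumptions, and the genotype data, which the study generated with SIMLA, are not reproduced here.

```python
import random


def simulate_expression(n_samples, shift, n_genes=1000, n_causal=20, seed=0):
    """Simulate one balanced case/control expression dataset.

    Causal genes (the first n_causal rows) are drawn from N(10 + shift, 1)
    in cases; every other value is drawn from N(10, 1).
    Returns (data, labels) with data[g][s] the value of gene g in sample s.
    """
    rng = random.Random(seed)
    half = n_samples // 2
    labels = [1] * half + [0] * (n_samples - half)  # 1 = case, 0 = control
    data = []
    for g in range(n_genes):
        row = []
        for lab in labels:
            mean = 10.0 + shift if (g < n_causal and lab == 1) else 10.0
            row.append(rng.gauss(mean, 1.0))
        data.append(row)
    return data, labels


# Usage: scenario 3 (0.1-unit shift) at sample size 200.
data, labels = simulate_expression(n_samples=200, shift=0.1)
```

With a shift of 0.1 and unit variance, no single causal gene is expected to stand out, which is why this scenario is useful for testing whether a gene set method can aggregate weak signals.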
3.1.1.3 Gene sets

We generated 100 gene sets, each with 20 genes. Only the first gene set includes causal genes. In the simulations, we assume that all causal genes differ from other genes in both expression and sequence pattern. There are four different scenarios: 1) The first gene set includes 20 causal genes; 2) The first gene set includes 15 causal genes; 3) The first gene set includes 10 causal genes; 4) The first gene set includes 5 causal genes.

3.1.2 Simulation results

For each simulation, we carried out 2000 permutations of phenotype labels to compute p-values, FDR, and FWER. In the first stage, we repeated 30 simulations for each scenario. We did not run LRGSAA-SNP for sample size 100 because in this case the number of explanatory variables exceeds the number of samples. Usually, there are ~150 SNPs with p-value less than 0.05 in each simulated SNP dataset. The means of the p-values, FDRs, and FWERs of the causal gene set over 30 simulations are reported in Tables S1-S8. Power was calculated as the proportion of repetitions with a p-value for the causal gene set less than the specified significance level (0.05). Power over 30 simulations is reported in Document S1. In the second stage, we repeated 200 simulations only for scenario 3 of gene expression and scenarios 3 and 4 for SNPs, under a dominant model and with sample sizes 200 and 400. The main goal of the second stage is to compare more accurately the ability of the different approaches to detect subtle association signals. Table 1 shows the means of the p-values, FDRs, and FWERs as well as the power for causal gene sets over 200 simulations. In the second stage analysis, we discarded LRGSAA-SNP since it did not perform well in the first stage.

Table 1.
Comparison of results of GSAA, GSEA, and GSAA-SNP under the dominant disease model for 200 repetitions when the expression level of risk genes is 0.1 unit higher

| Sample size | R2 | PORG* | GSAA P | GSAA FDR | GSAA FWER | GSAA Power | GSEA P | GSEA FDR | GSEA FWER | GSEA Power | GSAA-SNP P | GSAA-SNP FDR | GSAA-SNP FWER | GSAA-SNP Power |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 | 0.7 | 1 | 0.0199 | 0.2218 | 0.2241 | 0.91 | 0.0377 | 0.3108 | 0.335 | 0.845 | 0.0148 | 0.2876 | 0.3171 | 0.905 |
| 200 | 0.7 | 0.75 | 0.0691 | 0.4053 | 0.4407 | 0.775 | 0.107 | 0.5847 | 0.6263 | 0.59 | 0.0422 | 0.478 | 0.5241 | 0.775 |
| 200 | 0.7 | 0.5 | 0.207 | 0.6844 | 0.7364 | 0.4 | 0.2983 | 0.7696 | 0.8431 | 0.29 | 0.1078 | 0.705 | 0.7597 | 0.535 |
| 200 | 0.7 | 0.25 | 0.4481 | 0.9242 | 0.95 | 0.145 | 0.5277 | 0.9135 | 0.9663 | 0.085 | 0.2862 | 0.8995 | 0.9409 | 0.175 |
| 200 | 0.5 | 1 | 0.0333 | 0.2698 | 0.2795 | 0.875 | 0.0377 | 0.3108 | 0.335 | 0.845 | 0.0737 | 0.6198 | 0.6825 | 0.69 |
| 200 | 0.5 | 0.75 | 0.0883 | 0.4722 | 0.4989 | 0.705 | 0.107 | 0.5847 | 0.6263 | 0.59 | 0.1186 | 0.6994 | 0.7939 | 0.57 |
| 200 | 0.5 | 0.5 | 0.2371 | 0.7169 | 0.7738 | 0.385 | 0.2983 | 0.7696 | 0.8431 | 0.29 | 0.2272 | 0.8642 | 0.9159 | 0.285 |
| 200 | 0.5 | 0.25 | 0.4945 | 0.9214 | 0.9514 | 0.115 | 0.5277 | 0.9135 | 0.9663 | 0.085 | 0.3482 | 0.9081 | 0.9507 | 0.17 |
| 400 | 0.7 | 1 | 0.0003 | 0.0106 | 0.0129 | 1 | 0.0033 | 0.0693 | 0.0758 | 0.98 | 0 | 0.0028 | 0.0027 | 1 |
| 400 | 0.7 | 0.75 | 0.0039 | 0.0815 | 0.0812 | 0.975 | 0.0322 | 0.3104 | 0.3428 | 0.86 | 0.0007 | 0.0416 | 0.043 | 1 |
| 400 | 0.7 | 0.5 | 0.0336 | 0.3233 | 0.3454 | 0.85 | 0.1373 | 0.6675 | 0.7179 | 0.49 | 0.0141 | 0.2572 | 0.2563 | 0.93 |
| 400 | 0.7 | 0.25 | 0.2486 | 0.734 | 0.779 | 0.375 | 0.4333 | 0.8813 | 0.9322 | 0.15 | 0.1011 | 0.7203 | 0.7477 | 0.58 |
| 400 | 0.5 | 1 | 0.0008 | 0.0248 | 0.0249 | 1 | 0.0033 | 0.0693 | 0.0758 | 0.98 | 0.0018 | 0.0745 | 0.0725 | 0.995 |
| 400 | 0.5 | 0.75 | 0.0106 | 0.1288 | 0.1386 | 0.955 | 0.0322 | 0.3104 | 0.3428 | 0.86 | 0.0075 | 0.2297 | 0.2379 | 0.965 |
| 400 | 0.5 | 0.5 | 0.0689 | 0.4324 | 0.4736 | 0.74 | 0.1373 | 0.6675 | 0.7179 | 0.49 | 0.0511 | 0.5502 | 0.5755 | 0.745 |
| 400 | 0.5 | 0.25 | 0.3007 | 0.8 | 0.8364 | 0.31 | 0.4333 | 0.8813 | 0.9322 | 0.15 | 0.1919 | 0.8286 | 0.8649 | 0.37 |

* Percentage of risk genes in the gene set.

Results from 200 repetitions and 30 repetitions show a consistent trend. GSAA can indeed increase the power to detect association signals by integrating expression-based association tests and SNP-based association tests into a single statistical framework, as compared to GSEA and GSAA-SNP.
This is more obvious when association signals are subtle. GSAA gives higher scores to those genes with significant alterations in both expression profile and sequence, which helps identify real associations between genes and the particular phenotype being studied. Mutations both in coding sequences and in regulatory regions can cause human genetic disease. Variation in regulatory regions is a common primary mechanism driving changes in gene expression, so for some important disease-related genes, aberrations in both expression level and sequence are detectable. Some investigations have demonstrated that eSNPs, expression quantitative trait loci (eQTL), or differentially expressed genes are more likely to be detected as disease variants in association studies 27-29. A joint test of expression variation and genotypic variation can greatly increase the ability to pinpoint causal genes and pathways. In addition, gene expression is a dynamic process, which can be influenced by numerous genetic and environmental factors. Sequence variation is relatively stable, so incorporating genotypic information can reduce the effect of noise inherent in gene expression data on the inference of associations. Conversely, differences in expression levels can also help filter out pseudo signals in genotypic data effectively. In altered pathways, some genes may harbor non-synonymous variants that alter the amino acid sequence of the encoded protein and thus its structure and activity. Most of these variants have a functional effect on the phenotype. Some of these causal genes may not show detectable differences in expression levels, and expression-based approaches may therefore miss them. However, GSAA can capture these kinds of association signals through the SNP-based test.
Although each of the simulated SNP datasets contains ~150 SNPs with p-value less than 0.05, of which ~20 are located within causal genes, in most situations GSAA can very accurately pick out the causal gene set from 99 random gene sets, and the p-value, FDR, and FWER of the causal gene set are much smaller than those of the other gene sets (see Supplemental ###). Sample size, disease model, and the magnitude of effect of risk gene sets, genes, or genotypes are important factors that affect the performance of GSAA. As expected, a larger sample size results in greater statistical reliability. In simulations, every method achieves a power of 100% at sample size=1200 under the dominant disease model when the association signal in the gene set is strong. Even for the weakest association signals, namely when the expression level of risk genes is 0.1-unit higher than that of other genes and R2 between the disease locus and the second marker of risk genes is 0.3, GSAA maintains 100% power when 50% or more of the genes in the gene set are causal (Figure 2A). At sample size 100, the power of GSAA-SNP decreases to 70% when R2 is 0.9 and all genes in the gene set are causal (Figure 2B). However, by combining the information from expression-based and SNP-based tests, GSAA still has power to detect real associations. In simulations, we used dominant and recessive disease models. It is more difficult to detect associations under the recessive model. In this case GSAA has an obvious advantage because it borrows information from gene expression, which is not directly influenced by the disease model. Interestingly, LRGSAA-SNP lacks power to detect any level of association under the recessive model. Three levels of differential gene expression, five levels of LD, and four levels of signal intensity in the risk gene set were designed to evaluate the relationship between the effect size of risk gene sets, genes, or genotypes and the performance of GSAA.
Detailed results are reported in Table S1-S8 and Document S1. Overall, power increases with increasing effect size.

[Figure 2. Panels A and B plot power (y-axis, 0 to 1) against the percentage of risk genes in the gene set (x-axis, 0 to 1). Panel A shows GSAA, GSEA, GSAA-SNP, and LRGSAA-SNP; panel B shows GSAA, GSEA, and GSAA-SNP.]

Figure 2. Power at the 0.05 significance level for the association test of gene sets by four different pathway-based approaches. (A) Power at R2=0.3, sample size=1200, 0.1-unit up-regulation in gene expression under the dominant disease model. (B) Power at R2=0.9, sample size=100, 0.3-unit up-regulation in gene expression under the dominant disease model.

Logistic regression-based approaches are widely used in association studies. We developed LRGSAA-SNP to test the power of logistic regression in pathway-based approaches and to compare it with the three K-S-based approaches. The results indicate that LRGSAA-SNP is very sensitive to sample size. It reaches a power of 100% only at sample size = 1200, with 600 cases and 600 controls, under the dominant disease model. In simulations, the three K-S-based approaches have greater power than LRGSAA-SNP in identifying causal pathways. They can even detect very subtle signals, with 0.1-unit upregulation in the expression of risk genes or R2=0.3 between marker and causal variant, when the sample size is large enough. This demonstrates that our method is valuable for pinpointing the causes of complex diseases for which more subtle changes may dominate the disease process 1. In addition, compared with logistic regression-based approaches, GSAA is more practical for genome-wide association analysis of gene sets on high-density SNP arrays.
In simulations, GSAA shows superior power at sample size 200 under the dominant model in most scenarios. More importantly, the power of GSAA is not influenced by the number of SNPs in the SNP dataset, whereas the number of SNPs can profoundly affect the performance of logistic regression-based methods. The number of explanatory variables grows with the number of SNPs studied, which in turn requires a larger sample size. Peduzzi et al. 30 demonstrated that the number of events/samples per variable should be at least 10 for optimal performance of logistic regression. This means that LRGSAA-SNP needs at least 1500 and 15000 samples to gain enough power for reliable association tests on two SNP datasets that include 150 and 1500 possible risk SNPs with p-value < 0.05, respectively, whereas GSAA needs only 200 to 400 samples to reach the same power for both datasets. In practice, the number of possible risk SNPs in a real dataset from a high-density SNP array, for example the Genome-Wide Human SNP Array 6.0, may be much larger than 1500. This makes genome-wide analysis of gene sets using LRGSAA-SNP nearly infeasible.

3.2 Application to data from The Cancer Genome Atlas (TCGA) pilot project

We applied our method to a published dataset of human glioblastoma 31, the most common type of primary adult brain cancer, for which both gene expression data and SNP genotype data are available from the TCGA pilot project (http://cancergenome.nih.gov/). The expression dataset includes 258 tumor samples and 11 normal samples, and the SNP dataset has 205 tumor samples and 89 normal samples. We used the gene region plus 1 kilobase pair (kb) upstream of the transcription start site (TSS) to establish the association region of the gene.
Although 1 kb upstream of the TSS might be insufficient to cover all possible regulatory regions for all genes, it could reasonably include both the core and proximal promoters and at least part of the distal promoter 32. Two collections of gene sets from the Molecular Signatures Database (MSigDB http://www.broadinstitute.org/gsea/msigdb/index.jsp) were used in this analysis. One includes 639 canonical pathways; the other has 1454 gene sets derived from the Gene Ontology (GO) project. Gene sets with fewer than 15 genes were excluded from our analysis. We used 10000 permutations to assess the statistical significance of the enrichment scores of gene sets and to adjust for multiple hypothesis testing.

3.2.1 Significant pathways

In the dataset of canonical pathways, we identified 16 pathways enriched in tumor samples with FDR≤0.25 (Table 2). The most significant pathway is P53PATHWAY. There are 15 genes in this pathway (Document S1). Four genes, tumor protein p53 (TP53), retinoblastoma (RB1), E2F transcription factor 1 (E2F1), and mouse double minute 2 homolog (MDM2), show significance in the SNP-based test. Six genes, TP53, RB1, cyclin-dependent kinase 2 (CDK2), cyclin-dependent kinase 4 (CDK4), proliferating cell nuclear antigen (PCNA), and cyclin-dependent kinase inhibitor 1A (CDKN1A, also called p21), are highly differentially expressed between tumor and normal samples. All of these genes have been experimentally confirmed to have a role in the development and/or progression of glioblastoma 33-41. Among them, TP53 and RB1 are well-known key players. Our analysis indicated that these two genes are significantly associated with tumor samples in both the gene expression-based test and the SNP-based test. Through the integrative platform, we can examine both gene expression and nucleotide sequence aberrations in a gene set, which helps us better understand the mechanisms driving phenotypic change in diseases.
Our method can also highlight key genes with significant changes in both gene expression and sequence, such as TP53 and RB1.

Table 2. Most significant canonical pathways enriched in tumor samples with FDR≤0.25

Gene set name                                   Size   Nominal P   FDR
P53PATHWAY                                      15     0.021       0.130
RELAPATHWAY                                     16     0.013       0.157
CASPASEPATHWAY                                  22     0.018       0.165
ATRBRCAPATHWAY                                  18     0.097       0.169
HSA04115_P53_SIGNALING_PATHWAY                  60     0.007       0.182
G1PATHWAY                                       23     0.063       0.190
HSA03030_DNA_POLYMERASE                         21     0.124       0.205
DNA_REPLICATION_REACTOME                        43     0.199       0.209
G2PATHWAY                                       21     0.059       0.216
CELL_CYCLE_KEGG                                 78     0.069       0.223
G1_TO_S_CELL_CYCLE_REACTOME                     61     0.150       0.249
TIDPATHWAY                                      17     0.070       0.249
TNFR2PATHWAY                                    18     0.080       0.249
MITOCHONDRIAPATHWAY                             20     0.023       0.249
STATIN_PATHWAY_PHARMGKB                         16     0.026       0.250
HSA04610_COMPLEMENT_AND_COAGULATION_CASCADES    64     0.028       0.250

For detailed results, see Table S9 and Table S10 in the supplementary materials.

In the dataset of GO gene sets, 51 gene sets are associated with tumor samples at nominal P<0.05 (Table S11), but after correction for multiple hypothesis testing, none of these associations reaches significance (FDR≤0.25). In contrast, 40 gene sets are significantly enriched in normal samples at FDR≤0.25 (Table S12). The top three gene sets are VOLTAGE_GATED_CALCIUM_CHANNEL_COMPLEX (FDR=0.002), VOLTAGE_GATED_CALCIUM_CHANNEL_ACTIVITY (FDR=0.004), and CALCIUM_CHANNEL_ACTIVITY (FDR=0.01), implicating a possible deregulation of voltage-gated calcium channels in glioblastoma.

3.2.2 Genes enriched in the leading edge subsets of top 20 pathways

We noticed that some genes play roles in multiple pathways.
We are particularly interested in these hub genes for three reasons: 1) major-effect hub genes are important contributors to the enrichment scores of multiple top-ranked pathways, and they may be the main reason why these pathways exhibit significant association with the phenotype; 2) some hub genes may not be major-effect risk genes in any single pathway, but their cumulative effect across multiple pathways makes them critical to the phenotype, because they connect multiple phenotype-associated pathways and their alteration may affect the phenotype in multiple ways; 3) multi-pathway analysis can increase the likelihood of detecting true phenotype-associated genes, because current pathway annotation is neither fully accurate nor complete, so genes that recur across pathways may represent more reliable signals and are more likely to have real functions in the phenotype being studied. To identify a core set of hub genes involved in glioblastoma, we performed leading edge analyses for the top 20 pathways and then counted the number of occurrences of each gene in the leading edge subsets of these pathways. Figures S1 and S2 (see also Table S13-S14) show the distribution of the number of occurrences, the correlation scores in the gene expression-based test, and the P values in the SNP-based test for those hub genes appearing in the leading edge subsets of at least two top pathways. We identified two sets, of 46 and 51 genes, for the datasets of canonical pathways and GO gene sets, respectively. We then evaluated whether these two sets of hub genes include core components of major pathways altered in human glioblastoma by comparing them with the genes from three core pathways, p53 signaling, RB signaling, and RTK/RAS/PI(3)K signaling, which were defined by a recent paper based on a comprehensive analysis of DNA copy number, gene expression, DNA methylation, and nucleotide sequence aberrations 31. We found that genes in the p53 signaling and RB signaling pathways are indeed enriched in our core set of hub genes.
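The hub-gene counting step described above (tallying how often each gene appears in the leading edge subsets of the top pathways and keeping those seen in at least two) can be illustrated with a minimal Python sketch. The function name `hub_genes`, the `min_pathways` parameter, and the toy pathway contents are assumptions made for illustration; they are not taken from the GSAA software or from the paper's results.

```python
from collections import Counter


def hub_genes(leading_edges, min_pathways=2):
    """Count how many leading-edge subsets each gene appears in.

    leading_edges: dict mapping pathway name -> iterable of gene symbols
    Returns {gene: count} for genes found in at least `min_pathways` subsets.
    """
    counts = Counter()
    for genes in leading_edges.values():
        # set() ensures a gene is counted at most once per pathway
        counts.update(set(genes))
    return {gene: n for gene, n in counts.items() if n >= min_pathways}


# Hypothetical leading-edge subsets (illustrative only, not the paper's data)
toy = {
    "PATHWAY_A": ["TP53", "RB1", "CDK4"],
    "PATHWAY_B": ["TP53", "MDM2"],
    "PATHWAY_C": ["RB1", "TP53", "E2F1"],
}
hubs = hub_genes(toy)
```

In this toy input, only the genes shared by two or more pathways are retained; raising `min_pathways` gives a stricter core set.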
Our core set covers 7 out of 10 genes in these two pathways, including cyclin-dependent kinase inhibitor 2A (CDKN2A), MDM2, TP53, cyclin-dependent kinase inhibitor 2C (CDKN2C), CDK4, cyclin D2 (CCND2), and RB1. In addition, the activation and sequence aberrations of epidermal growth factor receptor (EGFR) in RTK/RAS/PI(3)K signaling were also confirmed by our joint association test of gene expression profiles and SNP genotypes. Surprisingly, we observed that genes involved in the apoptosis signaling pathway are strikingly enriched in the core set, indicating that deregulation of apoptosis signaling could be a major factor leading to the development of glioblastoma.

3.2.3 Suggested new core pathway in glioblastoma

Our pathway-based association analysis not only confirmed the critical roles of the three core pathways revealed by the Cancer Genome Atlas Research Network but also uncovered a possible new core pathway in glioblastoma. It is even more significant than those three in our results. Our analysis indicated that this new pathway may include at least three members: coagulation factor II receptor (F2R), which is more commonly called protease-activated receptor 1 (PAR1), caspase 8 (CASP8), and caspase 3 (CASP3). We call this pathway PAR/CASP signaling. In our statistical test, PAR1 is the most significantly altered gene in both expression and sequence in tumor samples. It obtained the highest correlation score in the expression-based test and the lowest P value (P<10^-15) in the SNP-based test among more than 7000 genes (Table S15-S18). It also occurs in seven of the leading edge subsets of the top 20 GO gene sets. PAR1 is expressed throughout the brain on neurons and astrocytes and has important roles in the central nervous system (CNS) 42. PAR1 is also functional in human glioblastoma cells 42; 43.
In addition to its well-known roles in coagulation and hemostasis, PAR1 has been found to induce or inhibit apoptosis depending on the dosage of its physiological agonist thrombin, and it can mediate anti-apoptotic signaling in the nervous system through interaction with activated protein C (APC) and caspases (more likely CASP8 and CASP3) 44-47. Junge et al. reported that human glioblastoma cells respond to PAR1 activation by increasing intracellular Ca2+ 42. Our results indeed indicate that three calcium channel pathways are significantly altered (Table S4). Interestingly, most genes of the calcium channel, voltage-dependent (CACN) family in these pathways are down-regulated in tumor samples. This might be an adaptation to the increase in Ca2+ triggered by the activation of PAR1. The genotypes of these genes also differ substantially between tumor and normal samples. Panner et al. 48 demonstrated the involvement of the T-type Ca2+ channel in the proliferation of both glioma cells and neuroblastoma cells. In addition, several investigations observed a link between thrombin and glioma pathogenesis 49-51. Glioblastomas are the most malignant brain tumors and are characterized by cellular resistance to apoptosis and a highly invasive growth pattern. Failure of apoptosis is one of the main contributors to tumorigenesis, and caspases play essential roles in apoptosis, necrosis, and inflammation 52; 53. Georg et al. 54 found that human glioblastoma cells exhibit constitutive activation of caspases in vivo and in vitro. The inhibition of CASP3 and CASP8 decreases the migration and invasiveness of glioma cells. In our analysis, the expression of CASP3 and CASP8 is markedly up-regulated in tumor samples. CASP8 shows a significant change in sequence as well (P<10^-15). We also noticed the activation of other caspases: CASP1, CASP4, CASP6, and CASP7. In addition, CASPASEPATHWAY is significantly associated with tumor samples in the dataset of canonical pathways.
The POSITIVE_REGULATION_OF_CASPASE_ACTIVITY pathway is also enriched in tumor samples in the dataset of GO gene sets. CASP3 and CASP8 are involved in 9 and 8 of the top 20 pathways, respectively. Other genes in the apoptosis signaling pathway, for example Fas-associated via death domain (FADD), tumor necrosis factor receptor superfamily, member 6 (TNFRSF6 or FAS), and BCL2-associated X protein (BAX), are also enriched in the leading edge subsets of the top 20 pathways. These results suggest the deregulation of signal transduction pathways involved in apoptosis in glioblastoma. PAR1 and caspases may be important components of this deregulation. PAR1 is an upstream regulator of caspase activity, so our analysis may uncover a correlation between PAR1 and caspases as well as their collaboration in mediating the anti-apoptotic signal leading to disease. Since the mechanism underlying the strong resistance of glioma cells to apoptosis is still only partly understood, our analysis could provide new insights into the pathogenesis of glioblastoma and new therapeutic targets for its treatment. Whether PAR/CASP signaling has an important role in the initiation and/or progression of glioblastoma merits further experimental investigation. The Cancer Genome Atlas Research Network did not mention the caspase pathway or caspases in their original publication, so our pathway-based approach can complement other canonical methods and provide a more powerful statistical framework to detect and analyze the variation in gene expression, sequence, or both that leads to glioblastoma and other complex diseases.

4 Discussion

Genome-wide gene expression profiling and genotyping offer unparalleled opportunities to pinpoint genomic and genetic determinants of complex traits or diseases and to elucidate the interaction network among them at the genome level.
Integrating these two distinct but complementary types of data into a single analysis can enhance genuine association signals and increase statistical power for pathway-based association tests. It can also suggest cis- and trans-regulatory variation associated with expression variation of genes in the altered pathway and offer a deeper understanding of complex diseases. In this study, we developed a novel statistical framework to integrate gene expression data and SNP data into genome-wide association analysis of gene sets. The simulations indicated that the integrative analysis can indeed increase the ability to detect real association signals. When applied to a real dataset, our method not only confirmed the association findings obtained by other canonical methods but also identified a new candidate pathway significantly altered in glioblastoma, which may suggest a potential core mechanism leading to the disease. Results from both simulated and real data demonstrate that integrating various types of -omics data into a single statistical framework for gene set association analysis is a promising direction that merits further study. Here we discuss some open questions concerning our methodology and implementation and point out potential room for improvement.

4.1 Genotypic association test and allelic association test

In our software, three options for single-locus association analysis are available: a genotype-based chi-square test, an allele-based chi-square test, and the frequency difference of major/minor alleles between cases and controls. In this paper we used only the genotype-based chi-square two-sample test statistic to determine the genotype-phenotype correlation of each SNP. The simulations show that, in GSAA, the genotypic association test obtains better results than the allelic association test (Table 3). A possible explanation is that genotypes represent individuals.
Traits are expressed at the individual level. A genotypic test can capture the interaction between the two alleles and more accurately assess their joint effect at a single locus.

Table 3. Comparison of GSAA results based on genotypic test and allelic test under dominant disease model when expression level of risk genes is 0.3-unit higher and sample size is 400
R2=0.9 R2=0.7 Genotypic test Allelic test Genotypic test Allelic test PORG* P FDR FWER P FDR FWER PORG* P FDR FWER P FDR FWER 1 <10-15 <10-15 <10-15 <10-15 <10-15 <10-15 1 <10-15 <10-15 <10-15 <10-15 <10-15 <10-15 0.75 <10-15 <10-15 <10-15 <10-15 <10-15 <10-15 0.75 <10-15 <10-15 <10-15 <10-15 <10-15 <10-15 0.5 <10-15 <10-15 <10-15 <10-15 0.00006 0.00007 0.5 <10-15 <10-15 0.00022 0.00025 0.01201 0.20204 0.19237 0.25 0.23606 0.25718 0.25 0.00577 0.18854 0.17933 0.00004 0.00005 0.01212 0.24715 0.25242 0.01191
For detailed results, see Table S19 in the supplementary materials.

4.2 SNP-gene mapping

A fundamental difference between SNP data and gene expression data is that in gene expression data the gene is the smallest unit carrying association information. In SNP data, a gene usually covers multiple SNPs, and we need to evaluate the joint effect of the set of SNPs mapped to the gene to determine the association. The first step is therefore to map SNPs to genes. At present it is nearly infeasible to determine exactly how many SNPs can affect a particular gene and where they are in the genome. There is no clear-cut boundary defining the regulatory region of a gene, since some enhancers and repressors may be far away from the target gene. Also, the LD block surrounding each trait-associated variant varies in pattern and length, and may span from 0 kb to 500 kb. Wang et al. mapped the SNPs to the closest gene 17. Peng et al. used all SNPs within a gene to represent that gene 18. Different projects may have different requirements for the association region related to a gene.
Some investigators may want to use only genic variants for association analysis, since genic variants are more likely to affect disease risk than variants located outside genes, while others prefer to extend the gene region to incorporate regulatory regions in order to examine the contribution of cis- and trans-regulatory variants to disease susceptibility. In our software we did not fix the mapping criteria and instead let users determine the mapping regions. Two options are available to specify how many base pairs upstream and/or downstream of a gene are included to establish the association region of that gene. We think this allows for flexibility across different projects.

4.3 SNP set association test

In SNP data, each gene is represented by a varying number of SNPs. There is still no consensus on the best way to assess the joint contribution to the association of a set of SNPs mapped to the same gene. Two association patterns may exist in the association region of a gene: 1) the region harbors only one risk variant; 2) the region harbors multiple risk variants independently contributing to the overall association signal. We use a maximum statistic, which assigns the highest correlation score among all SNPs mapped to the gene as the correlation score of the gene. Compared with test statistics that combine correlation scores or p-values across all SNPs in the region into a single correlation score or p-value, the maximum statistic can more effectively eliminate the negative effects on association inference of the correlation structure between SNPs, of SNPs not associated with the trait of interest, and of differences in SNP set size. The maximum statistic is expected to work best when the region harbors one risk variant, because multiple markers may be in strong LD with the risk variant.
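The two steps described above, mapping SNPs to a gene's association region with user-chosen upstream/downstream extensions and then scoring the gene by the maximum statistic, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the GSAA implementation: the function names, the tuple layouts, the strand-aware handling of "upstream", and the toy coordinates and scores are all hypothetical.

```python
def map_snps_to_genes(snps, genes, upstream=1000, downstream=0):
    """Map SNPs to genes whose association region contains them.

    snps:  list of (snp_id, chrom, position)
    genes: list of (gene, chrom, start, end, strand), with start <= end
    The region is the gene body extended by `upstream` bp before the
    transcription start and `downstream` bp after the gene end
    (strand-aware; this handling is an assumption of the sketch).
    """
    mapping = {}
    for gene, chrom, start, end, strand in genes:
        if strand == "+":
            lo, hi = start - upstream, end + downstream
        else:
            lo, hi = start - downstream, end + upstream
        mapping[gene] = [
            snp_id for snp_id, c, pos in snps if c == chrom and lo <= pos <= hi
        ]
    return mapping


def gene_scores_max(mapping, snp_scores):
    """Maximum statistic: a gene takes the highest score among its SNPs."""
    return {
        gene: max(snp_scores[s] for s in snp_ids)
        for gene, snp_ids in mapping.items()
        if snp_ids  # genes with no mapped SNP receive no score
    }


# Hypothetical genes, SNP positions, and single-SNP association scores
genes = [("GENE_A", "1", 5000, 8000, "+"), ("GENE_B", "1", 20000, 25000, "-")]
snps = [("rs1", "1", 4500), ("rs2", "1", 6000), ("rs3", "1", 25500),
        ("rs4", "1", 19500), ("rs5", "1", 3000)]
snp_scores = {"rs1": 2.0, "rs2": 7.5, "rs3": 4.1, "rs4": 9.9, "rs5": 1.0}

mapping = map_snps_to_genes(snps, genes, upstream=1000, downstream=0)
gene_scores = gene_scores_max(mapping, snp_scores)
```

With a 1 kb upstream window and no downstream extension, `rs4` falls outside the region of `GENE_B` even though it has the largest score, so the choice of mapping window directly determines which SNPs can contribute to a gene's score.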
Clearly, this statistic cannot accurately capture the overall association information in the second situation, where multiple independent risk variants coexist. Fortunately, in this case GSAA can borrow information from differential expression to compensate for the loss of information in the SNP-based test. Supported by the expression-based test, the maximum statistic may therefore be a reasonable compromise between the two patterns of SNP set association in a gene.

4.4 Pathway annotation

One of the merits of gene set association analysis is that it can take advantage of prior knowledge of biological pathways. Pathway analyses measure associations of multiple genes simultaneously, so they are well suited to situations where small coordinated changes in a pathway contribute to the overall association signal and result in a significant biological effect. In addition, the genes in a pathway are biologically related. This helps interpret the results and can also produce more coherent results across different experimental platforms or strategies. However, this dependence on a priori knowledge can also be considered a limitation, since the analysis is restricted by current knowledge and its biases. Inaccurate or incomplete information about pathways may result in inaccurate association inferences. Also, the gene sets/pathways we use are heterogeneous, and we lack knowledge of the interaction patterns between components within a gene set. Gene sets can be derived from transcriptional modules, metabolic pathways, a cluster of co-expressed genes, or even a group of genes tightly linked on the chromosome. This heterogeneity makes it difficult to establish a unified strategy for describing the knowledge structure of a gene set in order to improve current algorithms.
There may therefore be a need for a method that identifies the substructure of pathways and classifies them into different categories in order to optimize the performance of gene set association analyses. Fortunately, pathway annotation is becoming increasingly accurate as knowledge of biological processes accumulates, which should continue to increase the power of gene set association tests.

4.5 Normalization and multiple hypothesis testing

The normalized enrichment score (NES) is the primary statistic for examining gene set enrichment results in GSEA. GSEA calculates NES by dividing the actual ES by the mean of the ES values obtained across all permutations of the dataset for a given gene set. GSEA uses NES to account for differences in gene set size and in correlations between gene sets and the expression dataset, and to calculate the false discovery rate (FDR). However, in simulations we found that NES can behave poorly under certain conditions. We designed 10 gene sets, each with 10 genes. These gene sets, numbered 1 through 10, include 10, 9, 8, 7, 6, 5, 4, 3, 2, and 1 causal genes, respectively. We then used these 10 gene sets to test GSEA on various simulated gene expression datasets.
There are ten different scenarios. In each scenario, expression values of causal genes in the case group are drawn from N(m, 1), and all other expression values are drawn from N(10, 1). The mean m is 10.1, 10.3, 10.5, 11, 11.5, 12, 12.5, 13, 13.5, and 14 in scenarios 1 through 10, respectively.

Table 4. Comparison of GSEA results based on NES and ES when expression level of risk genes is 2.5-unit higher and sample size is 400

Based on NES
NAME        RANK   ES     NES    P   FDR     FWER
GENESET9    1      0.86   2.11   0   0.004   0.004
GENESET10   2      0.82   2.06   0   0.004   0.007
GENESET8    3      0.91   2.04   0   0.003   0.007
GENESET7    4      0.94   2.04   0   0.002   0.007
GENESET6    5      0.97   1.96   0   0.005   0.024
GENESET5    6      0.98   1.88   0   0.014   0.078
GENESET4    7      0.98   1.82   0   0.032   0.186
GENESET3    8      0.99   1.75   0   0.059   0.35
GENESET2    10     0.99   1.67   0   0.094   0.577
GENESET1    12     1.00   1.63   0   0.111   0.709

Based on ES
NAME        RANK   ES     P   FDR     FWER
GENESET1    1      1.00   0   0       0
GENESET2    2      0.99   0   0.001   0.002
GENESET3    3      0.99   0   0       0.002
GENESET4    4      0.98   0   0.002   0.005
GENESET5    5      0.98   0   0.002   0.007
GENESET6    6      0.97   0   0.002   0.009
GENESET7    7      0.94   0   0.004   0.016
GENESET8    8      0.91   0   0.008   0.03
GENESET9    9      0.86   0   0.018   0.068
GENESET10   10     0.82   0   0.029   0.101

For detailed results, see Table S20 in the supplementary materials.

There is no problem in scenarios 1-3. However, we found that the rankings of gene sets are almost completely reversed when causal gene expression is 2 or more units higher in the case group. The rankings return to normal if we use the actual ES instead of NES (Table 4). In the calculation of ES, we weight the genes in a given gene set by their correlation with the phenotype of interest, normalized by the sum of the correlations over all genes in the gene set. This process accounts for differences in gene set size to some degree. Also, FDR and FWER calculated from permutations of phenotype labels can be used to account for the size of gene sets and to adjust for multiple hypothesis testing. Therefore, in this study FDR and FWER were calculated based on the actual ES. However, like GSEA, our software offers both ES-based and NES-based analysis.
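The ES weighting and the NES normalization discussed in this section can be made concrete with a small Python sketch. It follows the weighted Kolmogorov-Smirnov running sum of GSEA (reference 3) as we understand it; the function names, the exponent parameter `p`, and the toy ranked list are assumptions for illustration. One detail not stated above is also assumed: the NES function averages only permutation ES values with the same sign as the observed ES, as in GSEA.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Weighted Kolmogorov-Smirnov-like enrichment score.

    ranked_genes: genes sorted by decreasing correlation with the phenotype
    scores:       dict gene -> correlation score
    gene_set:     set of genes being tested
    Hits step up by |score|^p divided by the sum over the set; misses step
    down by 1 / (number of genes outside the set). The ES is the running-sum
    value with the largest absolute deviation from zero.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    hit_total = sum(abs(scores[g]) ** p
                    for g, hit in zip(ranked_genes, in_set) if hit)
    running, best = 0.0, 0.0
    for g, hit in zip(ranked_genes, in_set):
        if hit:
            running += abs(scores[g]) ** p / hit_total
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def normalized_es(es, null_es):
    """Divide ES by the mean magnitude of same-sign permutation ES values."""
    same_sign = [x for x in null_es if (x >= 0) == (es >= 0)]
    if not same_sign:
        raise ValueError("no permutation ES with the same sign as the observed ES")
    return es / (sum(abs(x) for x in same_sign) / len(same_sign))


# Toy ranked list (hypothetical scores): a set made of the top-ranked genes
ranked = ["g1", "g2", "g3", "g4"]
scores = {"g1": 4.0, "g2": 3.0, "g3": 2.0, "g4": 1.0}
es_top = enrichment_score(ranked, scores, {"g1", "g2"})
es_bottom = enrichment_score(ranked, scores, {"g3", "g4"})
nes_example = normalized_es(0.8, [0.4, 0.6, -0.5])
```

Because NES divides by a permutation-based mean that is specific to each gene set, two sets with similar ES can receive quite different NES values, which is one way rankings based on the two statistics can diverge.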
4.6 Future directions

In this study, we developed a novel statistical framework to integrate gene expression data and SNP data into a genome-wide association analysis of gene sets. Our method can provide insights into the biological processes associated with traits or diseases, such as signal transduction pathways, metabolic pathways, and other physiological or pathological processes, and into the underlying mechanisms regulating these processes. To our knowledge, this is the first computational platform for integrative genome-wide association analysis of gene sets based on two different but complementary types of genomic data from two of the most popular high-throughput technologies, genome-wide expression arrays and SNP arrays. It thereby points future research toward more comprehensive integration of genetic, genomic, and proteomic data in a pathway-based statistical framework. Although we use only two types of genomic data in this study, our framework can be extended to incorporate other forms of -omics data, such as copy number variation, methylation data, and microRNA data. We are working towards providing a more powerful statistical and computational platform for genome-wide association analysis of gene sets in the near future.

References

1. Carlson, C.S., Eberle, M.A., Kruglyak, L., and Nickerson, D.A. (2004). Mapping complex disease loci in whole-genome association studies. Nature 429, 446-452.
2. Mootha, V.K., Lindgren, C.M., Eriksson, K.F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., et al. (2003). PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34, 267-273.
3. Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., et al. (2005).
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102, 15545-15550.
4. Goeman, J.J., and Buhlmann, P. (2007). Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23, 980-987.
5. Dinu, I., Potter, J.D., Mueller, T., Liu, Q., Adewale, A.J., Jhangri, G.S., Einecke, G., Famulski, K.S., Halloran, P., and Yasui, Y. (2007). Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics 8, 242.
6. Kim, S.Y., and Volsky, D.J. (2005). PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics 6, 144.
7. Luo, W., Friedman, M.S., Shedden, K., Hankenson, K.D., and Woolf, P.J. (2009). GAGE: generally applicable gene set enrichment for pathway analysis. BMC Bioinformatics 10, 161.
8. Boorsma, A., Foat, B.C., Vis, D., Klis, F., and Bussemaker, H.J. (2005). T-profiler: scoring the activity of predefined groups of genes using gene expression data. Nucleic Acids Res 33, W592-595.
9. Newton, M.A., Quintana, F.A., Den Boon, J.A., Sengupta, S., and Ahlquist, P. (2007). Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. Ann Appl Stat 1, 85-106.
10. Goeman, J.J., van de Geer, S.A., de Kort, F., and van Houwelingen, H.C. (2004). A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20, 93-99.
11. Mansmann, U., and Meister, R. (2005). Testing differential gene expression in functional groups. Goeman's global test versus an ANCOVA approach. Methods Inf Med 44, 449-453.
12. Maglietta, R., Piepoli, A., Catalano, D., Licciulli, F., Carella, M., Liuni, S., Pesole, G., Perri, F., and Ancona, N. (2007). Statistical assessment of functional categories of genes deregulated in pathological conditions by using microarray data. Bioinformatics 23, 2063-2072.
13. Allison, D.B., Cui, X., Page, G.P., and Sabripour, M. (2006).
Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 7, 55-65. 14. Khatri, P., and Draghici, S. (2005). Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21, 3587-3595. 15. Abatangelo, L., Maglietta, R., Distaso, A., D'Addabbo, A., Creanza, T.M., Mukherjee, S., and Ancona, N. (2009). Comparative study of gene set enrichment methods. BMC Bioinformatics 10, 275. 16. Liu, Q., Dinu, I., Adewale, A.J., Potter, J.D., and Yasui, Y. (2007). Comparative evaluation of gene-set analysis methods. BMC Bioinformatics 8, 431. 17. Wang, K., Li, M., and Bucan, M. (2007). Pathway-Based Approaches for Analysis of Genomewide Association Studies. Am J Hum Genet 81. 18. Peng, G., Luo, L., Siu, H., Zhu, Y., Hu, P., Hong, S., Zhao, J., Zhou, X., Reveille, J.D., Jin, L., et al. (2010). Gene and pathway-based second-wave analysis of genome-wide association studies. Eur J Hum Genet 18, 111-117. 19. O'Dushlaine, C., Kenny, E., Heron, E.A., Segurado, R., Gill, M., Morris, D.W., and Corvin, A. (2009). The SNP ratio test: pathway analysis of genome-wide association datasets. Bioinformatics 25, 27622763. 20. Holden, M., Deng, S., Wojnowski, L., and Kulle, B. (2008). GSEA-SNP: applying gene set enrichment analysis to SNP data from genome-wide association studies. Bioinformatics 24, 2784-2785. 21. Zhong, H., Yang, X., Kaplan, L.M., Molony, C., and Schadt, E.E. (2010). Integrating Pathway Analysis and Genetics of Gene Expression for Genome-wide Association Studies. Am J Hum Genet. 22. Schmidt, M., Hauser, E.R., Martin, E.R., and Schmidt, S. (2005). Extension of the SIMLA package for generating pedigrees with complex inheritance patterns: environmental covariates, gene-gene and gene-environment interaction. Stat Appl Genet Mol Biol 4, Article15. 23. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661-678. 24. 
Ogura, Y., Bonen, D.K., Inohara, N., Nicolae, D.L., Chen, F.F., Ramos, R., Britton, H., Moran, T., Karaliuskas, R., Duerr, R.H., et al. (2001). A frameshift mutation in NOD2 associated with susceptibility to Crohn's disease. Nature 411, 603-606. 25. Rioux, J.D., Daly, M.J., Silverberg, M.S., Lindblad, K., Steinhart, H., Cohen, Z., Delmonte, T., Kocher, K., Miller, K., Guschwan, S., et al. (2001). Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn disease. Nat Genet 29, 223-228. 26. Loftus, E.V., Jr., Schoenfeld, P., and Sandborn, W.J. (2002). The epidemiology and natural history of Crohn's disease in population-based patient cohorts from North America: a systematic review. Aliment Pharmacol Ther 16, 51-60. 27. Gorlov, I.P., Gallick, G.E., Gorlova, O.Y., Amos, C., and Logothetis, C.J. (2009). GWAS meets microarray: are the results of genome-wide association studies and gene-expression profiling consistent? Prostate cancer as an example. PLoS One 4, e6511. 28. Nica, A.C., Montgomery, S.B., Dimas, A.S., Stranger, B.E., Beazley, C., Barroso, I., and Dermitzakis, E.T. (2010). Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations. PLoS Genet 6. 29. Nicolae, D.L., Gamazon, E., Zhang, W., Duan, S., Dolan, M.E., and Cox, N.J. (2010). Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet 6. 30. Peduzzi, P., Concato, J., Kemper, E., Holford, T.R., and Feinstein, A.R. (1996). A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 49, 1373-1379. 31. (2008). Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061-1068. 32. Bortoluzzi, S., Coppe, A., Bisognin, A., Pizzi, C., and Danieli, G.A. (2005). A multistep bioinformatic approach detects putative regulatory elements in gene promoters. BMC Bioinformatics 6, 121. 33. 
El Hallani, S., Ducray, F., Idbaih, A., Marie, Y., Boisselier, B., Colin, C., Laigle-Donadey, F., Rodero, M., Chinot, O., Thillet, J., et al. (2009). TP53 codon 72 polymorphism is associated with age at onset of glioblastoma. Neurology 72, 332-336. 34. Ishii, N., Maier, D., Merlo, A., Tada, M., Sawamura, Y., Diserens, A.C., and Van Meir, E.G. (1999). Frequent co-alterations of TP53, p16/CDKN2A, p14ARF, PTEN tumor suppressor genes in human glioma cell lines. Brain Pathol 9, 469-479. 35. Backlund, L.M., Nilsson, B.R., Goike, H.M., Schmidt, E.E., Liu, L., Ichimura, K., and Collins, V.P. (2003). Short postoperative survival for glioblastoma patients with a dysfunctional Rb1 pathway in combination with no wild-type PTEN. Clin Cancer Res 9, 4151-4158. 36. Blum, R., Nakdimon, I., Goldberg, L., Elkon, R., Shamir, R., Rechavi, G., and Kloog, Y. (2006). E2F1 identified by promoter and biochemical analysis as a central target of glioblastoma cell-cycle arrest in response to Ras inhibition. Int J Cancer 119, 527-538. 37. Khatri, R.G., Navaratne, K., and Weil, R.J. (2008). The role of a single nucleotide polymorphism of MDM2 in glioblastoma multiforme. J Neurosurg 109, 842-848. 38. Zhang, R., Banik, N.L., and Ray, S.K. (2008). Combination of all-trans retinoic acid and interferongamma upregulated p27(kip1) and down regulated CDK2 to cause cell cycle arrest leading to differentiation and apoptosis in human glioblastoma LN18 (PTEN-proficient) and U87MG (PTENdeficient) cells. Cancer Chemother Pharmacol 62, 407-416. 39. Lam, P.Y., Di Tomaso, E., Ng, H.K., Pang, J.C., Roussel, M.F., and Hjelm, N.M. (2000). Expression of p19INK4d, CDK4, CDK6 in glioblastoma multiforme. Br J Neurosurg 14, 28-32. 40. Korshunov, A., Golanov, A., Sycheva, R., and Pronin, I. (1999). Prognostic value of tumour associated antigen immunoreactivity and apoptosis in cerebral glioblastomas: an analysis of 168 cases. J Clin Pathol 52, 574-580. 41. 
Gomez-Manzano, C., Fueyo, J., Kyritsis, A.P., McDonnell, T.J., Steck, P.A., Levin, V.A., and Yung, W.K. (1997). Characterization of p53 and p21 functional interactions in glioma cells en route to apoptosis. J Natl Cancer Inst 89, 1036-1044. 42. Junge, C.E., Lee, C.J., Hubbard, K.B., Zhang, Z., Olson, J.J., Hepler, J.R., Brat, D.J., and Traynelis, S.F. (2004). Protease-activated receptor-1 in human brain: localization and functional expression in astrocytes. Exp Neurol 188, 94-103. 43. Kaufmann, R., Patt, S., Schafberg, H., Kalff, R., Neupert, G., and Nowak, G. (1998). Functional thrombin receptor PAR1 in primary cultures of human glioblastoma cells. Neuroreport 9, 709712. 44. Smirnova, I.V., Zhang, S.X., Citron, B.A., Arnold, P.M., and Festoff, B.W. (1998). Thrombin is an extracellular signal that activates intracellular death protease pathways inducing apoptosis in model motor neurons. J Neurobiol 36, 64-80. 45. Turgeon, V.L., Lloyd, E.D., Wang, S., Festoff, B.W., and Houenou, L.J. (1998). Thrombin perturbs neurite outgrowth and induces apoptotic cell death in enriched chick spinal motoneuron cultures through caspase activation. J Neurosci 18, 6882-6891. 46. Guo, H., Liu, D., Gelbard, H., Cheng, T., Insalaco, R., Fernandez, J.A., Griffin, J.H., and Zlokovic, B.V. (2004). Activated protein C prevents neuronal apoptosis via protease activated receptors 1 and 3. Neuron 41, 563-572. 47. Flynn, A.N., and Buret, A.G. (2004). Proteinase-activated receptor 1 (PAR-1) and cell apoptosis. Apoptosis 9, 729-737. 48. Panner, A., Cribbs, L.L., Zainelli, G.M., Origitano, T.C., Singh, S., and Wurster, R.D. (2005). Variation of T-type calcium channel protein expression affects cell division of cultured tumor cells. Cell Calcium 37, 105-119. 49. Yamahata, H., Takeshima, H., Kuratsu, J., Sarker, K.P., Tanioka, K., Wakimaru, N., Nakata, M., Kitajima, I., and Maruyama, I. (2002). 
The role of thrombin in the neo-vascularization of malignant gliomas: an intrinsic modulator for the up-regulation of vascular endothelial growth factor. Int J Oncol 20, 921-928. 50. Hua, Y., Tang, L., Keep, R.F., Schallert, T., Fewel, M.E., Muraszko, K.M., Hoff, J.T., and Xi, G. (2005). The role of thrombin in gliomas. J Thromb Haemost 3, 1917-1923. 51. Hua, Y., Tang, L., Keep, R.F., Hoff, J.T., Heth, J., Xi, G., and Muraszko, K.M. (2008). Thrombin enhances glioma growth. Acta Neurochir Suppl 102, 363-366. 52. Wang, J., and Lenardo, M.J. (2000). Roles of caspases in apoptosis, development, and cytokine maturation revealed by homozygous gene deficiencies. J Cell Sci 113 ( Pt 5), 753-757. 53. Cohen, G.M. (1997). Caspases: the executioners of apoptosis. Biochem J 326 ( Pt 1), 1-16. 54. Gdynia, G., Grund, K., Eckert, A., Bock, B.C., Funke, B., Macher-Goeppinger, S., Sieber, S., HeroldMende, C., Wiestler, B., Wiestler, O.D., et al. (2007). Basal caspase activity promotes migration and invasiveness in glioblastoma cells. Mol Cancer Res 5, 1232-1240.
