Download Supplementary Methods - Molecular Cancer Research

Supplementary Methods
Transcript Abundance Data Meta-analysis
The term “meta-analysis” can be used to describe a number of approaches, broadly split
into methods that examine each dataset independently and then compare or combine the
results, and methods that combine all data with appropriate normalisation before analysis.
As detailed in recent reviews (1,2), there are 12 techniques to obtain p-values when
comparing/combining results, further classified into three types: combining p-values, combining
effect sizes, and combining ranks. These methods can also be categorized by the purpose of
their hypothesis setting (HS): HSA identifies genes that are differentially expressed in all
studies; HSB detects genes that are differentially expressed in at least one study; and HSr
identifies genes that are differentially expressed in the majority of studies. Here, we have
applied the ‘Product of Ranks’ (PR) metric, which has been shown to have better performance
characteristics (biological association, stability and robustness) under HSA (1). Similarly,
several methods have been developed for normalisation or batch correction of integrated
data prior to analysis (3-11). For this study we used ComBat, an empirical Bayesian
framework useful for removing known batch effects within data (8), and SVA, which can
remove unknown batch effects by using factor analysis to estimate surrogate variables from
the expression data (11).
Quality control of the microarray datasets
Data quality was assessed using methods provided within the R/Bioconductor package
Simpleaffy (12) for the Affymetrix platform data and using unsupervised methods for other
platforms; all low-quality samples were excluded from the analysis. Quality control methods
included hierarchical clustering, principal component analysis (PCA) and/or non-metric
multidimensional scaling (NMDS) plots, RNA degradation plots, correlation matrices,
boxplots, and plots of the GAPDH 3’:5’ ratio and β-actin 3’:5’ ratio. Using unsupervised
techniques (PCA, NMDS and hierarchical clustering) TGFβ-treated samples were separated
from control samples. RNA degradation plots showed very similar slopes for all arrays
examined, and based upon the array correlation matrices, samples from the same
conditions had higher correlations compared to those from different conditions.
Obtaining TGFβ-EMT signature using two meta-analysis techniques
We took raw data from each study (Table 1) and performed RMA normalization using the R
packages affy (13) and lumi (14). Annotation files were manually downloaded from
manufacturer websites (Agilent in April 2015, Affymetrix in May 2015, and Illumina in
October 2015). For genes with multiple probes, the probe with the lowest adjusted p-value
for differential transcript abundance was used, while probes that mapped to multiple genes
were removed.
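The probe-filtering rule above can be illustrated with a minimal Python sketch. It is not the authors' code (the analysis was carried out in R); the input layout (a list of dictionaries with the hypothetical keys `probe`, `genes` and `adj_p`) is an assumption made only for illustration.

```python
def select_probes(probes):
    """Pick one probe per gene.

    probes: list of dicts with hypothetical keys
        'probe' (probe ID), 'genes' (list of gene IDs), 'adj_p' (adjusted p-value).
    Probes that map to more than one gene are dropped; for each remaining gene
    the probe with the lowest adjusted p-value is kept.
    """
    best = {}
    for p in probes:
        if len(p["genes"]) != 1:
            continue  # ambiguous (or unannotated) probe: removed
        gene = p["genes"][0]
        if gene not in best or p["adj_p"] < best[gene]["adj_p"]:
            best[gene] = p
    return {gene: p["probe"] for gene, p in best.items()}


example = [
    {"probe": "p1", "genes": ["A"], "adj_p": 0.04},
    {"probe": "p2", "genes": ["A"], "adj_p": 0.001},
    {"probe": "p3", "genes": ["B", "C"], "adj_p": 0.0001},
    {"probe": "p4", "genes": ["B"], "adj_p": 0.2},
]
selected = select_probes(example)
```

Here `p3` is discarded because it maps to two genes, and gene A is represented by `p2`, its probe with the smaller adjusted p-value.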
First, we obtained separate lists of DEGs from each study using the R/Bioconductor package
limma (15,16). Genes with an adjusted p-value < 0.05 and |log2 FC| > 1 were considered as
differentially expressed. Next, we applied the Product of Ranks (PR) method to score DEGs
across all studies. For each gene, g, the PR method ranks the specified statistic (e.g. adjusted
p-values or t-statistic) from each study, k, and then multiplies these values together across
all K studies:
\[ PR_g = \prod_{k=1}^{K} R_{gk} \]
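As a minimal sketch of the product-of-ranks calculation, the Python below ranks a statistic within each study and multiplies the ranks across studies. The function names, the tie-breaking by input order, and the use of a sign flip to reverse the ranking for down-regulated genes are assumptions of this illustration, not details taken from the original analysis.

```python
def rank_descending(values):
    """Return ranks where rank 1 is the largest value.

    Ties are broken by input order (an assumption of this sketch).
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks


def product_of_ranks(t_stats_by_study, up=True):
    """Compute PR_g for every gene.

    t_stats_by_study: list of K lists, one t-statistic per gene,
        with genes in the same order in every study.
    up=True ranks t-statistics in decreasing order; up=False negates them
        first, which reverses the ranking for down-regulated genes.
    """
    n_genes = len(t_stats_by_study[0])
    pr = [1] * n_genes
    for t in t_stats_by_study:
        values = t if up else [-x for x in t]
        for g, r in enumerate(rank_descending(values)):
            pr[g] *= r
    return pr


studies = [[3.0, 1.0, 2.0], [2.5, 0.5, 3.5]]  # K = 2 studies, 3 genes
pr_up = product_of_ranks(studies, up=True)
pr_down = product_of_ranks(studies, up=False)
```

Genes with a small product are consistently near the top of the ranking in every study, which is why candidates are taken from the lower tail of the PR distribution.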
For up-regulated genes, the t-statistics were ranked in decreasing order, while we reversed
the ranks for down-regulated genes. A null distribution for the PR metric was obtained by
permutation testing of all gene ranks to calculate random PRs (permuted PRs), which were
then log2 transformed. Thresholds were specified from the 99.999% confidence interval of
the null distribution: the lower bound of the 99.999% confidence interval of the permuted
PRs was 92.1 for the up-regulated gene set and 92.2 for the down-regulated gene set. This resulted in the
identification of 186 up-regulated and 82 down-regulated genes that fell below the 99.999%
confidence interval of the permuted null distribution. Only 0.0005% of permuted PRs fell
below this interval; given that 11900 genes were analysed (0.000005 × 11900 ≈ 0.06 genes
expected by chance, relative to 186 and 82 selected genes), we estimate a false positive rate
for up- and down-regulated genes of 0.0003 and 0.0007, respectively.
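The permutation step and the false-positive arithmetic can be sketched as follows. This is an illustration under stated assumptions rather than the original procedure: ranks are shuffled independently within each study, the threshold is read off as an empirical lower-tail quantile, and the false-positive estimate is taken to be (tail probability × number of genes) / number of selected genes, a reading that reproduces the reported 0.0003 and 0.0007. All function names are hypothetical.

```python
import math
import random


def permuted_log2_pr(n_genes, n_studies, n_perm, seed=0):
    """Build a null distribution of log2(PR) values.

    For each permutation, the ranks 1..n_genes are shuffled independently
    in every study and the log2 ranks are summed per gene
    (the sum of log2 ranks equals log2 of the product of ranks).
    """
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        log_pr = [0.0] * n_genes
        for _ in range(n_studies):
            ranks = list(range(1, n_genes + 1))
            rng.shuffle(ranks)
            for g, r in enumerate(ranks):
                log_pr[g] += math.log2(r)
        null.extend(log_pr)
    return null


def lower_threshold(null, tail_prob):
    """Empirical lower-tail quantile of the null distribution."""
    s = sorted(null)
    idx = max(0, math.ceil(tail_prob * len(s)) - 1)
    return s[idx]


def expected_false_fraction(tail_prob, n_genes, n_selected):
    """Expected chance hits (tail_prob * n_genes) per selected gene."""
    return tail_prob * n_genes / n_selected


# Small toy run; the study used 11900 genes and ten studies.
null = permuted_log2_pr(n_genes=50, n_studies=3, n_perm=4, seed=1)
threshold = lower_threshold(null, 0.05)

fp_up = expected_false_fraction(0.000005, 11900, 186)
fp_down = expected_false_fraction(0.000005, 11900, 82)
```

With the values reported above, `fp_up` is approximately 0.0003 and `fp_down` approximately 0.0007. Resolving a 0.0005% tail empirically requires a very large number of permuted values, so the toy sizes here only demonstrate the mechanics.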
We obtained our final list of DEGs by integrating RMA (Robust Multichip Average)
normalised data from all ten studies (Table 1) (13) mapped through Entrez gene IDs (nGenes =
11900). SVA (11) and ComBat (8) were applied to the integrated data. SVA estimated 14
sources of variation within the data, and we then removed the “different datasets” batch
effects using ComBat within the SVA package. The resulting data showed no apparent batch
effects, suggesting that these sources of unwanted variation had been removed. Finally, we
applied the R/Bioconductor package limma (15,16) to obtain test statistics for differential
transcript abundance and extracted our list of DEGs.
Scoring methods
The GSVA package in R/Bioconductor (17) has options for performing four pathway/gene-set
enrichment methods: PLAGE (Pathway Level Analysis of Gene Expression), combined Z-score,
ssGSEA (single-sample Gene Set Enrichment Analysis), and GSVA (Gene Set Variation
Analysis) (17). It has been shown that the GSVA method outperforms the other three
methods. Nevertheless, we applied both the ssGSEA and GSVA methods throughout this study;
ssGSEA results are presented within the main text, and corresponding GSVA results
are included within Additional file 5. Two enrichment statistics (SE) can be obtained using
the GSVA method: SEmax, which is the maximum deviation approach introduced by
Subramanian et al. (18), and SEdiff, which is a normalised GSVA score calculated as the
difference between the absolute values of the maximum and minimum random walk
deviations from zero (17). For GSVA, we applied SEdiff to all the data, as the normalised
unimodal statistic can remove potential noise within the data and applies a penalty for
deviations towards the extreme tails of the distribution (17). SEmax was also applied to
some data and the results were very similar (data not shown).
For the up-regulated gene set within our TGFβ-EMT signature, we ranked gene expression
values in decreasing order. Conversely, for the down-regulated gene set we ranked genes in
an increasing order, such that these scores could be combined to reflect the sample
enrichment for both up- and down-regulated gene sets. Finally, the scores obtained for upand down-regulated gene sets were summed up to provide a summed-ssGSEA or -GSVA
value for each sample, referred to as the TGFβ-EMT scores (TES). Consequently, samples
with a high TES show a strong concordance with our signature, and are likely to have
undergone some form of TGFβ-induced EMT.
Although both methods mentioned above have been proposed as single-sample scoring
methods, they include information from other samples in some steps: GSVA starts with
kernel estimation of the cumulative distribution function for each gene across all the
samples, while scores obtained by ssGSEA are normalised against other samples at the end
of the analysis. Therefore, in addition to these methods, we also applied a simple
rank-based scoring method which uses no information from other samples. Scores for the up-
and down-regulated gene sets were obtained separately by ranking gene expression values in each sample
(in decreasing order for the up-regulated gene set, and increasing order for the down-regulated gene set);
we then calculated the mean rank of the up- and down-regulated gene sets within each
sample and summed them to obtain a single-sample score:
\[ \mathrm{Score}_i = \frac{\sum DR_{up,i}}{N_{up}} + \frac{\sum AR_{down,i}}{N_{down}} \]
where Score_i is the simple single-sample score for sample i; DR_{up,i} is the rank (by
descending transcript abundance) for the up-regulated gene set in sample i; AR_{down,i} is the
rank (by ascending transcript abundance) for the down-regulated gene set in sample i; and
N_{up} and N_{down} are the number of genes in the up- and down-regulated gene sets, respectively.
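A minimal Python sketch of this single-sample score is given below. It assumes that rank 1 is assigned to the highest abundance when ranking in descending order (and to the lowest when ranking in ascending order), that there are no ties, and that every signature gene is measured in the sample; the function name and input layout are hypothetical. The direction of the resulting score depends on this rank convention.

```python
def simple_rank_score(expression, up_genes, down_genes):
    """Single-sample score using only the sample's own ranks.

    expression: dict mapping gene -> transcript abundance for ONE sample.
    up_genes / down_genes: gene lists, all assumed present in `expression`.
    Returns mean descending rank of the up set plus mean ascending rank
    of the down set.
    """
    ordered = sorted(expression, key=lambda g: expression[g])  # ascending
    n = len(ordered)
    asc_rank = {g: r for r, g in enumerate(ordered, start=1)}
    desc_rank = {g: n + 1 - r for g, r in asc_rank.items()}

    mean_up = sum(desc_rank[g] for g in up_genes) / len(up_genes)
    mean_down = sum(asc_rank[g] for g in down_genes) / len(down_genes)
    return mean_up + mean_down


sample = {"A": 10.0, "B": 5.0, "C": 1.0, "D": 7.0}
score_small = simple_rank_score(sample, up_genes=["A"], down_genes=["C"])
score_pairs = simple_rank_score(sample, up_genes=["A", "D"], down_genes=["C", "B"])
```

Because only within-sample ranks are used, adding or removing other samples from a dataset cannot change a sample's score.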
It should be noted that we assumed the scores were not likely to be substantially affected
by batch effects. Additional files 14 and 11 show a batch analysis of the CCLE data, and the
TCGA breast cancer and pan-cancer data. The biology of the TCGA pan-cancer data was
confounded with apparent batch effects; however, this may have only a slight impact upon
the resultant TES (as discussed in Supplementary file 6).
Datasets used in this study
Cell line data used in this study were derived from the NCI-60 (19,20), CCLE (Cancer Cell Line
Encyclopedia (21)), and COSMIC (22) pan-cancer cell lines, as well as the Neve et al. (2006)
(23) and Heiser et al. (2012) (24) breast cancer cell lines. For NCI-60 cell lines, RMA
normalized transcript abundance data for 59 cell lines (excluding the LC-NCIH23 cell line)
obtained using the Affy HG-U133 (A_B) platform were downloaded on 18th of November
2015 from the CellMiner database (20). CCLE data were downloaded on 24th of November
2015. The Neve data were downloaded from the Supplementary file of the original paper on
15th of November 2014. For probe IDs that mapped to more than one Entrez ID within the
NCI-60 and Neve data, median values were used. To assign clinical breast cancer
subtypes (TN, HR, HER2) to the Neve breast cancer cell lines, all ER-/PR-/HER2- cell
lines were considered as TN, PR+ or ER+ cell lines were assigned to HR, and HER2+ cell
lines were assigned to the HER2 group.
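The subtype rule can be written as a small function. The text does not state how cell lines that are both HER2+ and hormone-receptor positive were handled; the sketch below assumes that HER2 status takes precedence, which is an assumption of this illustration only. The function name is hypothetical.

```python
def assign_subtype(er_positive, pr_positive, her2_positive):
    """Assign a clinical subtype label from receptor status.

    ASSUMPTION: HER2+ takes precedence over ER/PR status; the source text
    does not specify the order in which the rules were applied.
    """
    if her2_positive:
        return "HER2"
    if er_positive or pr_positive:
        return "HR"
    return "TN"  # ER-/PR-/HER2-


labels = [
    assign_subtype(False, False, False),
    assign_subtype(True, False, False),
    assign_subtype(False, True, False),
    assign_subtype(False, False, True),
]
```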
To assess the drug sensitivity of cancer cell lines with high- or low-TES, we examined drug
data from NCI-60, COSMIC and Heiser et al. We examined approximately 20,000 drug
compounds from NCI-60 which are considered as high quality
(http://discover.nci.nih.gov/cellminer/datasetMetadata.do), downloaded from the CellMiner
database on 22nd of Feb 2016. The COSMIC transcript (Affymetrix Human Genome U219
Array) and drug data were obtained on 20th of June 2016 from the COSMIC website
(http://cancer.sanger.ac.uk/cosmic). The SRA files of the Heiser RNA-seq data were
downloaded from NCBI (SRP026537) on 18th of July 2016, and were analysed using the Rsubread
package in R to obtain FPKM values, which were then used for scoring cell lines. The Heiser
drug data were collected on 31st of July 2016 (25). When analysing each of the cancer
types from NCI60 separately, we included only drugs that had no missing values for that
specific cancer. As there were many more cell lines for each cancer type within the COSMIC
data, we included any drug that had information available for at least 10% of cell lines.
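The two drug-inclusion rules differ only in the required fraction of cell lines with data, which the following minimal sketch makes explicit. The input layout (a dictionary of drug name to per-cell-line values, with `None` marking a missing value) and the function name are assumptions for illustration.

```python
def drugs_with_coverage(response, min_fraction):
    """Return drugs with data for at least `min_fraction` of cell lines.

    response: dict mapping drug -> list of responses, one per cell line,
        with None marking a missing value (hypothetical layout).
    min_fraction=1.0 keeps only drugs with no missing values;
    min_fraction=0.10 keeps drugs measured in at least 10% of cell lines.
    """
    kept = []
    for drug, values in response.items():
        if not values:
            continue
        observed = sum(v is not None for v in values)
        if observed / len(values) >= min_fraction:
            kept.append(drug)
    return kept


toy = {
    "drugA": [0.1, 0.4, 0.9, 0.2],
    "drugB": [None, 0.3, None, None],
    "drugC": [None, None, None, None],
}
complete_only = drugs_with_coverage(toy, 1.0)    # rule used per cancer type in NCI-60
at_least_10pct = drugs_with_coverage(toy, 0.10)  # rule used for COSMIC
```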
This study also analyses TCGA breast cancer mRNA expression data (RMA normalised values
from the Agilent4502A_07_03 microarray platform, downloaded from the UCSC Cancer
Genomics Browser “https://genome-cancer.ucsc.edu” in Oct 2015) and TCGA pan-cancer
data (RSEM normalised mRNA expression data, downloaded from the UCSC Cancer
Genomics Browser “https://genome-cancer.ucsc.edu” on 23rd of Feb 2016). Batch
information for the TCGA pan-cancer data was obtained from the Biotab files downloaded on
30th of May 2016 from the TCGA website.
Mutation data for the CCLE and the TCGA data were downloaded on 18th of Dec 2015 and
10th of Jan 2016, respectively.
References
1. Chang LC, Lin HM, Sibille E, Tseng GC. Meta-analysis methods for combining multiple
expression profiles: comparisons, statistical characterization and an application
guideline. BMC bioinformatics 2013;14:368.
2. Tseng GC, Ghosh D, Feingold E. Comprehensive literature review and statistical
considerations for microarray meta-analysis. Nucleic Acids Res 2012;40(9):3785-99.
3. Gagnon-Bartsch J, Jacob L, Speed TP. Removing unwanted variation from high
dimensional data with negative controls. Technical Report, UC Berkeley 2013.
4. Gagnon-Bartsch JA, Speed TP. Using control genes to correct for unwanted variation
in microarray data. Biostatistics (Oxford, England) 2012;13(3):539-52.
5. Jacob L, Gagnon-Bartsch JA, Speed TP. Correcting gene expression data when neither
the unwanted variation nor the factor of interest are observed. Biostatistics (Oxford,
England) 2016;17(1):16-28.
6. Shabalin AA, Tjelmeland H, Fan C, Perou CM, Nobel AB. Merging two gene-expression
studies via cross-platform normalization. Bioinformatics (Oxford,
England) 2008;24(9):1154-60.
7. Benito M, Parker J, Du Q, Wu J, Xiang D, Perou CM, et al. Adjustment of systematic
microarray data biases. Bioinformatics (Oxford, England) 2004;20(1):105-14.
8. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data
using empirical Bayes methods. Biostatistics (Oxford, England) 2007;8(1):118-27.
9. Rudy J, Valafar F. Empirical comparison of cross-platform normalization methods for
gene expression data. BMC bioinformatics 2011;12:467.
10. Walker WL, Liao IH, Gilbert DL, Wong B, Pollard KS, McCulloch CE, et al. Empirical
Bayes accomodation of batch-effects in microarray data using identical replicate
reference samples: application to RNA expression profiling of blood from Duchenne
muscular dystrophy patients. BMC genomics 2008;9:494.
11. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing
batch effects and other unwanted variation in high-throughput experiments.
Bioinformatics (Oxford, England) 2012;28(6):882-3.
12. Wilson CL, Miller CJ. Simpleaffy: a BioConductor package for Affymetrix Quality
Control and data analysis. Bioinformatics (Oxford, England) 2005;21(18):3683-5.
13. Gautier L, Cope L, Bolstad BM, Irizarry RA. affy--analysis of Affymetrix GeneChip data
at the probe level. Bioinformatics (Oxford, England) 2004;20(3):307-15.
14. Du P, Kibbe WA, Lin SM. lumi: a pipeline for processing Illumina microarray.
Bioinformatics (Oxford, England) 2008;24(13):1547-8.
15. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential
expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res
2015.
16. Smyth GK. Limma: linear models for microarray data. Bioinformatics and
computational biology solutions using R and Bioconductor: Springer; 2005. p 397-420.
17. Hanzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray
and RNA-seq data. BMC bioinformatics 2013;14:7.
18. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al.
Gene set enrichment analysis: a knowledge-based approach for interpreting
genome-wide expression profiles. Proceedings of the National Academy of Sciences
of the United States of America 2005;102(43):15545-50.
19. Shankavaram UT, Varma S, Kane D, Sunshine M, Chary KK, Reinhold WC, et al.
CellMiner: a relational database and query tool for the NCI-60 cancer cell lines. BMC
genomics 2009;10:277.
20. Reinhold WC, Sunshine M, Liu H, Varma S, Kohn KW, Morris J, et al. CellMiner: a
web-based suite of genomic and pharmacologic tools to explore transcript and drug
patterns in the NCI-60 cell line set. Cancer research 2012;72(14):3499-511.
21. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, et al. The
Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug
sensitivity. Nature 2012;483(7391):603-7.
22. Garnett MJ, Edelman EJ, Heidorn SJ, Greenman CD, Dastur A, Lau KW, et al.
Systematic identification of genomic markers of drug sensitivity in cancer cells.
Nature 2012;483(7391):570-5.
23. Neve RM, Chin K, Fridlyand J, Yeh J, Baehner FL, Fevr T, et al. A collection of breast
cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell
2006;10(6):515-27.
24. Heiser LM, Sadanandam A, Kuo WL, Benz SC, Goldstein TC, Ng S, et al. Subtype and
pathway specific responses to anticancer compounds in breast cancer. Proceedings
of the National Academy of Sciences of the United States of America
2012;109(8):2724-9.
25. Daemen A, Griffith OL, Heiser LM, Wang NJ, Enache OM, Sanborn Z, et al. Modeling
precision treatment of breast cancer. Genome biology 2013;14(10):R110.