Download Supplementary Methods - Clinical Cancer Research

Supplementary Methods
Patient material - discovery set
Included patients were operated at the Skåne University Hospital in Lund, Sweden.
No patient included in the study received neoadjuvant therapy prior to surgery. For
patients in the discovery cohort, smoking history was obtained from the patients’ medical
records and categorized into three groups: current, former, or never-smoker. Follow-up
data were obtained from the Swedish Cause of Death Register. For all cases, all
relevant pathological slides were reviewed for re-evaluation and updating of the
histological diagnoses and stages to be in adherence with recent international criteria
and guidelines (1-4).
The study was approved by the Regional Ethical Review Board in Lund,
Sweden (Registration no. 2004/762 and 2008/702). Written informed consent was
obtained from all patients diagnosed after 2004. For patients diagnosed earlier than
2004, study inclusion was approved by the Regional Ethical Review Board in Lund,
Sweden, if patients (or their family members/survivors) had not stated otherwise when
they were informed about the study in 2006.
EGFR and KRAS mutation analyses in the discovery cohort
EGFR and KRAS mutations were analyzed by the Therascreen® EGFR or KRAS
RGQ PCR Kit (Qiagen, Hilden, Germany) according to the manufacturer's protocol.
Validation tumor cohorts
In addition to the discovery cohort we analyzed 444 NSCLC tumors from Sandoval et
al. (5), 373 adenocarcinomas from The Cancer Genome Atlas consortium (6), and 69
NSCLC cell lines from (7) (GSE36216), all profiled by the same Illumina 450K
methylation platform. These cohorts were processed similarly to the discovery cohort
unless stated otherwise below. Data for the TCGA cohort were accessed on September 11,
2013.
Global methylation analysis
DNA and total RNA were extracted from a single tissue piece using the AllPrep DNA/RNA
Mini Kit (Qiagen) according to the manufacturer’s instructions. 500 ng of DNA was
subjected to bisulfite conversion using the EZ-96 DNA Methylation Kit (Zymo
Research), with a modification to the manufacturer’s instructions: 16 cycles
of 95 °C for 30 s followed by 50 °C for 1 hour, according to recommendations from
Illumina (8), in two 96-well plates balanced for sample histology, stage, sex, and
smoking status. The entire amount of bisulfite-converted DNA was subjected to the
“Illumina Infinium HD Methylation Assay” resulting in hybridization to the Human
Methylation 450K v1.0 BeadChip according to the manufacturer’s instructions
(Illumina) at SCIBLU Genomics, Lund University, Sweden.
Peak normalization of Infinium I and II data
Prior to correction of Infinium I and II probe intensity bias, CpGs with detection
p-value < 0.01 were set as NA (missing value). Adjustment of bias between Infinium I
and II CpG probes was performed by a peak normalization algorithm. Briefly, for
each sample we performed a peak-based correction of the Infinium I and II chemical
assays similar to Dedeurwaerder et al. (9). For both assays we smoothed the beta-values
(Epanechnikov smoothing kernel) to estimate unmethylated and methylated
peaks, respectively; and the unmethylated peak was moved to 0 and the methylated
peak to 1 using linear scaling, with beta-values in between stretched accordingly.
Beta-values below 0 were set back to 0 and values above 1 were set to 1. After
correction, CpGs located on sex chromosomes were removed.
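The peak-based scaling described above can be sketched as follows for one sample and one probe type. This is a minimal illustration, not the authors' implementation: the function names, the bandwidth, the grid size, and the 0.5 split used to separate the unmethylated and methylated modes are all assumptions that are not stated in the text.

```python
import numpy as np

def epanechnikov_density(values, grid, bandwidth):
    """Kernel density estimate on `grid` using an Epanechnikov kernel."""
    u = (grid[:, None] - values[None, :]) / bandwidth
    kernel = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    return kernel.sum(axis=1) / (values.size * bandwidth)

def peak_normalize(beta, bandwidth=0.05, n_grid=201):
    """Linearly rescale beta-values so the unmethylated peak sits at 0
    and the methylated peak at 1, then trim to [0, 1]. NaN stays NaN."""
    beta = np.asarray(beta, dtype=float)
    observed = beta[~np.isnan(beta)]
    grid = np.linspace(0.0, 1.0, n_grid)
    density = epanechnikov_density(observed, grid, bandwidth)
    low = grid < 0.5  # assumed split between the two modes
    unmeth_peak = grid[low][np.argmax(density[low])]
    meth_peak = grid[~low][np.argmax(density[~low])]
    scaled = (beta - unmeth_peak) / (meth_peak - unmeth_peak)
    return np.clip(scaled, 0.0, 1.0)
```

The function would be applied separately to the Infinium I and Infinium II probes of each sample. The dense grid-by-value matrix is kept for readability; for a full 450K array a chunked or binned density estimate would be more memory-friendly.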
Bisulfite plate adjustment of methylation data
To remove any bias due to the processing of samples in different 96-well plates in the
bisulfite conversion step we normalized beta-values for plate association. The
experimental design included balancing the two 96-well plates used in the bisulfite
conversion and subsequent labeling for tumor histology, stage, patient smoking status,
and patient sex. Bisulfite plate 1 was selected as the arbitrary reference. Mean beta-values
for each probe on plate 2 were set to the mean of corresponding probes on
plate 1. Probes on plate 2 were then adjusted correspondingly to fit the new mean, and
trimmed so that no probes extended outside [0,1] in beta-value. Principal component
analysis was performed to verify that no technical artifacts caused systematic bias in
the final data (10).
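As a rough sketch of the plate adjustment, the snippet below shifts each probe on the non-reference plate so that its mean matches the reference plate and then trims to [0,1]. The additive shift is an assumption (the text only says probes were "adjusted correspondingly"), and the function and argument names are hypothetical.

```python
import numpy as np

def adjust_plate(beta, plate, reference=1):
    """Match per-probe means of non-reference plates to the reference plate.

    beta:  probes x samples array of beta-values (NaN = missing)
    plate: plate label for each sample (column)
    """
    beta = np.array(beta, dtype=float)  # copy, so the input is left untouched
    plate = np.asarray(plate)
    ref_mean = np.nanmean(beta[:, plate == reference], axis=1)
    for p in np.unique(plate):
        if p == reference:
            continue
        cols = plate == p
        # Assumed additive shift: moves the plate mean onto the reference mean
        shift = ref_mean - np.nanmean(beta[:, cols], axis=1)
        beta[:, cols] = np.clip(beta[:, cols] + shift[:, None], 0.0, 1.0)
    return beta
```

Note that trimming after the shift can move the adjusted plate mean slightly away from the reference mean for probes close to 0 or 1.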
Generation of copy number estimates from Illumina methylation beadchips
In the discovery cohort, log2 copy number estimates for CpG probes were generated
from unmethylated and methylated signals obtained from GenomeStudio (Illumina)
for each tumor sample by: 1) quantile normalization of Cy3 and Cy5 Infinium I
probes, 2) calculation of a summarized total intensity for each probe, and 3) dividing
a tumor’s total intensity by the corresponding average total intensity from the 12
normal tissues for each probe. Genomic profiles were partitioned using GLAD (11)
and centralized similarly as described (12, 13). Calls of copy number gain and loss
were made against fixed log2ratio thresholds of ±0.05. The fraction of the genome
altered by copy number alterations (CN-FGA) was defined as described (14).
Genomic profiles were screened for amplifications and verified, when possible, against
matching cases analyzed on BAC aCGH from GSE29066 (13) to assure correctness.
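Step 3 and the threshold-based calling can be illustrated with the sketch below. It assumes the summarized total intensities are already available as arrays and deliberately skips quantile normalization, GLAD segmentation, and centralization, so in practice the calls would be made on segmented and centralized values rather than raw probe ratios. Function names are hypothetical.

```python
import numpy as np

def log2_copy_number(tumor_total, normal_totals):
    """Per-probe log2 ratio of tumor total intensity to the mean
    total intensity of the normal samples (probes x normals)."""
    normal_mean = np.mean(np.asarray(normal_totals, dtype=float), axis=1)
    return np.log2(np.asarray(tumor_total, dtype=float) / normal_mean)

def call_copy_number(log2ratio, threshold=0.05):
    """Return +1 for gain, -1 for loss, and 0 otherwise."""
    log2ratio = np.asarray(log2ratio, dtype=float)
    calls = np.zeros(log2ratio.shape, dtype=int)
    calls[log2ratio > threshold] = 1
    calls[log2ratio < -threshold] = -1
    return calls
```

The default threshold of 0.05 is the value given for the discovery cohort.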
Copy number estimates for the Sandoval et al. cohort were generated as above
with one exception. Instead of dividing each tumor’s total intensity by the
corresponding average total intensity from the 12 normal tissues from the discovery
cohort for each probe (point 3 above), the total intensity was divided by the average
intensity of all Sandoval et al. tumor cases. The reason for this was that a large
difference was observed in intensity between Sandoval et al. cases and the matched
normal samples from the discovery cohort, and that no normal samples from the
Sandoval et al. study were publicly available. Calls of copy number gain and loss
were made against fixed log2ratio thresholds of ±0.1. The fraction of the genome
altered by copy number alterations (CN-FGA) was defined as described (14).
Identification of CpG probes with aberrant methylation compared to normal lung
tissue
To identify CpG probes with aberrant methylation in tumors compared to normal lung
tissue we first identified 218821 CpGs being either methylated (beta-values >0.9) or
demethylated (beta-values <0.1) across all 12 normal lung tissues included in the
discovery cohort. From this CpG set we selected 4136 probes that displayed a
difference in methylation in ≥13 tumors compared to the normal tissues (either beta-values
<0.5 in ≥13 tumors compared to >0.9 in normals, referred to as
hypomethylated in tumors, or >0.5 in ≥13 tumors compared to <0.1 in normals,
referred to as hypermethylated in tumors). In addition, we used different cut-offs,
ranging from 5 to 20 tumors to assess robustness of bootstrap analysis and centroid
prediction. Notably, the 12 matched normal samples comprised a mix of males
(n=3) and females (n=9), never-smokers (n=6) and smokers (n=6), and with a spread
in patient age (range 57-82 years). All matched normal specimens came from patients
with adenocarcinoma. In subsequent analyses, a CpG is termed promoter-related if
it has an Illumina annotation of TSS1500 or TSS200, and a CpG island annotation.
CpGs in repetitive elements were identified through the “repeats_rmsk_hg19” table
from the UCSC Genome Browser.
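The selection rule for aberrantly methylated CpGs can be written compactly as below. This is a sketch under the thresholds stated above; the function name is hypothetical, and treating a missing value in any normal sample as disqualifying the probe is an assumption.

```python
import numpy as np

def select_aberrant_cpgs(tumor_beta, normal_beta, min_tumors=13):
    """Return boolean masks (hypomethylated, hypermethylated) per CpG.

    tumor_beta:  CpGs x tumors array of beta-values
    normal_beta: CpGs x normals array of beta-values
    """
    tumor_beta = np.asarray(tumor_beta, dtype=float)
    normal_beta = np.asarray(normal_beta, dtype=float)
    # CpGs consistently methylated or unmethylated across all normal samples
    meth_in_normal = np.all(normal_beta > 0.9, axis=1)
    unmeth_in_normal = np.all(normal_beta < 0.1, axis=1)
    # Number of tumors on the other side of the 0.5 cut-off
    n_low = np.sum(tumor_beta < 0.5, axis=1)
    n_high = np.sum(tumor_beta > 0.5, axis=1)
    hypo = meth_in_normal & (n_low >= min_tumors)
    hyper = unmeth_in_normal & (n_high >= min_tumors)
    return hypo, hyper
```

Changing `min_tumors` over the range 5 to 20 reproduces the kind of robustness check mentioned in the text.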
Bootstrap clustering of genome-wide methylation data
Class-analysis was performed using bootstrap clustering as described (15, 16) using
2000 permutations, Euclidean distance, and Ward linkage. Briefly, for each bootstrap
the hierarchical cluster dendrogram is cut into the specified number of groups and the
assignment of samples to these groups is recorded. Then, for each pair of samples, the
frequency with which the two samples have fallen into the same groups is calculated.
The co-clustering frequency matrix is then reordered by hierarchical clustering to
identify the methylation subtypes, i.e. subsets of samples that repeatedly cluster
together. The clustering was performed using different numbers of clusters
and different sets of CpGs to investigate robustness. The final solution was based on
4136 CpGs and a five-group cluster solution. DNA methylation centroids representing
bootstrap subgroups were created from the average beta-value for each CpG probe in
respective subgroup. Samples in independent cohorts were assigned to the centroid
with the smallest Euclidean distance for matching CpGs, representing a single-sample
predictor.
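Two pieces of this procedure lend themselves to a short sketch: the co-clustering frequency matrix and the nearest-centroid single-sample predictor. The code below assumes the per-bootstrap group assignments have already been produced by the hierarchical clustering step and that every sample receives a label in every bootstrap. All function names are hypothetical.

```python
import numpy as np

def coclustering_frequency(assignments):
    """assignments: bootstraps x samples array of group labels.
    Returns a samples x samples matrix with the fraction of bootstraps
    in which each pair of samples fell into the same group."""
    assignments = np.asarray(assignments)
    same = assignments[:, :, None] == assignments[:, None, :]
    return same.mean(axis=0)

def build_centroids(beta, subtype):
    """Average beta-value per CpG within each subtype (beta: CpGs x samples)."""
    beta = np.asarray(beta, dtype=float)
    subtype = np.asarray(subtype)
    labels = np.unique(subtype)
    centroids = np.column_stack(
        [np.nanmean(beta[:, subtype == s], axis=1) for s in labels]
    )
    return labels, centroids

def assign_to_centroid(sample, labels, centroids):
    """Label of the centroid with the smallest Euclidean distance,
    computed over CpGs that are not missing in the sample."""
    sample = np.asarray(sample, dtype=float)
    ok = ~np.isnan(sample)
    dist = np.sqrt(((centroids[ok] - sample[ok, None]) ** 2).sum(axis=0))
    return labels[np.argmin(dist)]
```

The frequency matrix would then be reordered by hierarchical clustering to read off the subtypes, and `assign_to_centroid` applied to each sample of an independent cohort after restricting to matching CpGs.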
Copy number analysis
The amount of copy number alterations, CN-FGA, was calculated as the number of
CpGs/probes with copy number gain or loss divided by the total number of
CpGs/probes for the platform (SNP6 or Beadchip). CAAI-scores were calculated for a
tumor as described by Russnes et al. (12). A case was classified as CAAI positive if
one or more chromosome arms were affected by complex alterations with a CAAI
score >1 for samples in the discovery cohort, or >2 in the TCGA cohort. The
difference in cut-off between the cohorts is due to the different platforms from
which the copy number data was generated (Affymetrix SNP6 for TCGA, 450K
methylation beadchips for the discovery cohort). The different platforms have
different responses (platform-related characteristics) to copy number change
(amplitude). This produces large systematic differences in the amplitude of copy
number change (SNP6 higher, 450K lower) between the cohorts. The amplitude of
copy number change is one important variable in the CAAI calculation.
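A minimal sketch of the two summary measures, assuming per-probe calls coded as +1 (gain), -1 (loss), and 0 (no change), and CAAI scores already computed per chromosome arm as in Russnes et al. (12). The platform keys and function names are hypothetical labels; the cut-offs are the ones stated above.

```python
import numpy as np

# Cut-offs from the text: >1 for 450K beadchip data, >2 for SNP6 data
CAAI_CUTOFF = {"450K": 1.0, "SNP6": 2.0}

def cn_fga(calls):
    """Fraction of probes called as copy number gain or loss."""
    calls = np.asarray(calls)
    return np.count_nonzero(calls) / calls.size

def is_caai_positive(arm_scores, platform):
    """True if at least one chromosome arm exceeds the platform cut-off."""
    return bool(np.max(arm_scores) > CAAI_CUTOFF[platform])
```

For example, the same arm scores `[0.2, 1.5]` would be CAAI positive on the 450K platform and negative on SNP6.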
Global gene expression analysis
Total RNA was obtained from the same tumor piece used for DNA extraction. Total
RNA from 117 tumors in the discovery cohort was labeled in a 96-well format using
the Total Prep-96 RNA amplification kit, hybridized to Illumina Human HT-12 V4
microarrays, and scanned according to the manufacturer’s instructions (Illumina). Gene
expression data were quantile normalized and mean-centered for each probe across all
samples. Probe sets not having signal intensity above the median of negative control
intensity signals in at least 80% of samples were excluded from analysis.
TCGA adenocarcinoma expression data were obtained as level 3 RNASeq V2
data (6) and processed as described (17).
Classification into adenocarcinoma and SqCC molecular subtypes (18, 19),
and calculation of a CIN70 proliferation metagene (20), and terminal respiratory unit
(TRU) metagene (21) were performed as described (22).
Consensus clustering of adenocarcinomas in the discovery cohort was
performed as recently described (17) using ConsensusClusterPlus (23), after filtering
out probe sets with <0.5 in log2ratio standard deviation across all tumors, and probe
sets without a single gene annotation.
Correlated gene expression modules representing different tumor associated
processes were derived as originally described by Fredlund et al. (24) in GSE29016
(25). Briefly, in the normalized expression data we first removed probes without a
gene symbol, or probes with a LOCXXX gene symbol, then probes with a log2ratio
standard deviation < 0.8 across the entire sample set. For the remaining probes the
Pearson correlation between each pair of probes was calculated. Only probes with a positive
correlation >0.8 to at least four other probes were kept and entered into Cytoscape
(www.cytoscape.org) for gene network analysis as described in Fredlund et al. (24).
Six networks of genes were identified and labeled according to results from gene
ontology analysis of participating genes (see Table S1). For identified gene networks,
metagene scores were calculated in the gene expression data from the discovery and
TCGA cohorts similar to the CIN70 score above.
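The probe filtering that precedes the network analysis can be sketched as follows, assuming probes without a usable gene symbol have already been removed. The function name is hypothetical; the thresholds are those given in the text.

```python
import numpy as np

def filter_correlated_probes(expr, sd_min=0.8, r_min=0.8, min_partners=4):
    """Return row indices of probes kept for network analysis.

    expr: probes x samples matrix of normalized log2 ratios
    """
    expr = np.asarray(expr, dtype=float)
    # Keep probes that vary enough across the sample set
    variable = np.flatnonzero(expr.std(axis=1, ddof=1) >= sd_min)
    # Pairwise Pearson correlation between the remaining probes
    corr = np.atleast_2d(np.corrcoef(expr[variable]))
    np.fill_diagonal(corr, 0.0)  # ignore self-correlation
    partners = (corr > r_min).sum(axis=1)
    return variable[partners >= min_partners]
```

The surviving probes and their pairwise correlations would then be exported as an edge list for Cytoscape.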
Differential gene expression between epitypes
In the discovery cohort, genes differentially expressed between epitypes (only using
adenocarcinomas with matched gene expression) were detected using the Kruskal-Wallis
test with false discovery rate adjustment. Prior to the Kruskal-Wallis test, probes with
a log2ratio standard deviation <0.5 across all tumors were removed. Only gene
expression probes with false discovery rate < 0.05 were kept (n=1824).
In the TCGA RNAseq data, genes with log2ratio standard deviation <1 across
all tumors were removed, and genes differentially expressed between epitypes were detected
as in the discovery cohort (n=5726 genes).
Functional classification
Gene Ontology enrichment analysis was performed using the DAVID Functional Annotation
Tool (26) with the default human population background and a Bonferroni-adjusted
p-value <0.05 as significance threshold. For gene expression analyses the discovery
cohort was matched with Illumina identifiers in DAVID, while for the TCGA cohort
Entrez gene ids were used to map differentially expressed genes. For methylation
data, CpG annotation data from Illumina was used to map CpGs with aberrant
methylation to genes.
DNA methylation and gene expression correlation analysis
Correlation of DNA methylation and gene expression data was performed for the
largest histology in the discovery cohort, adenocarcinoma (n=77 samples profiled by
gene expression microarrays) using Spearman correlation. Prior to analysis gene
expression probes with a log2ratio standard deviation < 0.5 across all samples were
removed, together with CpG probes with a beta standard deviation <0.1 across all
samples. If multiple gene expression probes existed for one gene, the probe with the
highest standard deviation was chosen to represent the gene. CpGs were matched to
gene expression data based on gene annotation, allowing multiple CpGs to be
associated with one expression value. This yielded two matched matrices, one
methylation matrix (m CpGs x n samples) and one gene expression matrix (m genes x
n samples).
A false discovery rate for correlations was calculated through permutation of
sample labels (n=1000 permutations). CpG – gene expression correlations with a false
discovery rate < 0.05 were kept.
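One way to realize this analysis on the two matched matrices is sketched below. The text does not spell out how the permutation-based false discovery rate was computed, so the estimator used here (mean number of permuted correlations at or above a cut-off, divided by the observed number) is an assumption, as are the function names. Ranking ignores ties, which is acceptable for continuous data.

```python
import numpy as np

def rank_rows(x):
    """Rank the values within each row (0-based; ties broken by position)."""
    return np.argsort(np.argsort(x, axis=1), axis=1).astype(float)

def spearman_rows(meth, expr):
    """Row-wise Spearman correlation between two matched m x n matrices."""
    a = rank_rows(np.asarray(meth, dtype=float))
    b = rank_rows(np.asarray(expr, dtype=float))
    a -= a.mean(axis=1, keepdims=True)
    b -= b.mean(axis=1, keepdims=True)
    return (a * b).sum(axis=1) / np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))

def permutation_fdr(meth, expr, cutoff, n_perm=1000, seed=0):
    """Assumed estimator: expected null hits / observed hits at |rho| >= cutoff."""
    meth = np.asarray(meth, dtype=float)
    expr = np.asarray(expr, dtype=float)
    rng = np.random.default_rng(seed)
    observed = np.sum(np.abs(spearman_rows(meth, expr)) >= cutoff)
    null_hits = []
    for _ in range(n_perm):
        shuffled = expr[:, rng.permutation(expr.shape[1])]  # permute sample labels
        null_hits.append(np.sum(np.abs(spearman_rows(meth, shuffled)) >= cutoff))
    return float(np.mean(null_hits) / observed) if observed else float("nan")
```

Scanning `cutoff` over a range of values and keeping the smallest one with an estimated rate below 0.05 would give the set of retained CpG and gene pairs.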
Mutation analyses in TCGA samples
The MAF file of somatic mutations used for TCGA samples was accessed on September
11, 2013.
Calculation of transversion frequencies
Frequencies of the different mutation transversions (e.g., C>A) were estimated
using all mutations with a valid transversion included in the maf file.
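A minimal sketch of the frequency calculation, taking (reference allele, tumor allele) pairs as they might be read from the `Reference_Allele` and `Tumor_Seq_Allele2` columns of a MAF file. Collapsing complementary changes onto the pyrimidine strand (so that G>T is counted as C>A) is a common convention but an assumption here, and the function name is hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def substitution_frequencies(snvs):
    """snvs: iterable of (reference_allele, tumor_allele) pairs.
    Returns the fraction of each single-base change, e.g. {'C>A': 0.4, ...}."""
    counts = Counter()
    for ref, alt in snvs:
        # Skip indels, multi-base records, and records without a change
        if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
            continue
        if ref in "AG":  # assumed convention: report relative to the pyrimidine
            ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        counts[f"{ref}>{alt}"] += 1
    total = sum(counts.values())
    return {change: n / total for change, n in counts.items()} if total else {}
```

Applied per sample, this gives the per-tumor frequency of each change.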
Calculation of total number of mutations
The total number of mutations for a given sample was calculated as the sum of all
mutations in the maf-file for that sample.
MutSigCV analysis
MutSigCV (27) analysis was performed on the maf file using default settings. Prior to
analysis, we updated the gene.covariates.txt file with new values for the expression of
a gene. The new value for each gene was set as the mean of log2(expression count+1)
for all cases (row-means). 174 genes showed a q-value<0.05 and were used in further
analyses.
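The replacement expression covariate is a row mean of log-transformed counts. A sketch, assuming a genes x cases count matrix and a hypothetical function name:

```python
import numpy as np

def expression_covariate(counts):
    """Mean of log2(expression count + 1) across all cases, per gene.

    counts: genes x cases array of expression counts
    """
    counts = np.asarray(counts, dtype=float)
    return np.log2(counts + 1.0).mean(axis=1)
```

The resulting vector would replace the expression column of gene.covariates.txt for the matching genes.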
Permutation analysis to find mutated genes associated with groups
To investigate association of mutations in the 174 significant genes from MutSigCV
with epitypes we used a permutation-based approach to estimate a false discovery rate
as outlined below.
1. All cases with at least 1 mutation in the maf file were selected. For these cases
the total number of non-silent mutations per sample was calculated for the 174
genes.
2. For each of the 174 genes we calculated a Fisher’s exact p-value from the
contingency table in question (analytical p-value).
3. For each of the 174 genes we created a permuted distribution of p-values by
first randomizing mutations across samples. Specifically, we used probability
weights equaling the vector of total number of non-silent mutations per sample
(see 1 above) to mimic the fact that different samples (and groups) have different
mutation rates from the start. We performed 10000 randomizations per gene,
calculating a Fisher’s exact p-value for each. This created a distribution of
10000 permuted p-values for each gene.
4. All permuted p-values for the 174 genes (174*10000 values) were collected
into a single distribution.
5. For each of the 174 genes we determined the number of ‘expected’ genes from
the permuted distribution for the specific analytical p-value.
6. A false discovery rate was determined as the number of expected / observed
genes at a given p-value.
7. Only genes with ≤1 expected gene from permutation analysis were analyzed
further.
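Steps 4 to 6 above can be sketched as follows, assuming the analytical and permuted Fisher p-values from steps 2 and 3 are already available. Reading the 'expected' number of genes as the count of pooled permuted p-values at or below the analytical p-value, divided by the number of randomizations, is an interpretation of the description rather than something the text states explicitly. The function name is hypothetical.

```python
import numpy as np

def expected_and_fdr(analytical_p, permuted_p):
    """Per-gene expected count and false discovery rate.

    analytical_p: length-G array, one analytical p-value per gene
    permuted_p:   G x R array, R permuted p-values per gene
    """
    analytical_p = np.asarray(analytical_p, dtype=float)
    permuted_p = np.asarray(permuted_p, dtype=float)
    n_rand = permuted_p.shape[1]
    pooled = np.sort(permuted_p.ravel())  # step 4: single pooled distribution
    # Step 5: pooled null p-values <= p, averaged per randomization round
    expected = np.searchsorted(pooled, analytical_p, side="right") / n_rand
    # Observed: genes whose analytical p-value is <= p (always at least 1)
    observed = np.array([(analytical_p <= p).sum() for p in analytical_p])
    return expected, expected / observed  # step 6
```

Step 7 then corresponds to keeping genes with `expected <= 1`.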
References
1.
Travis W.D. BE, Muller-Hermelink H.K., Harris C.C. (Eds.). World Health
Organization Classification of Tumours. Pathology and Genetics of Tumours of
the Lung, Pleura, Thymus and Heart. Lyon: IARC Press; 2004.
2.
Travis WD, Brambilla E, Noguchi M, Nicholson AG, Geisinger K, Yatabe Y,
et al. International Association for the Study of Lung Cancer/American Thoracic
Society/European Respiratory Society: international multidisciplinary
classification of lung adenocarcinoma: executive summary. Proc Am Thorac Soc
2011;8: 381-5.
3.
Goldstraw P. International Association for the Study of Lung Cancer
(IASLC). Staging manual in thoracic oncology. Orange Park: Editorial RxPress;
2009.
4.
Sobin L GM, Wittekind C. International Union Against Cancer (UICC). TNM
classification of malignant tumours. 7th ed. Chichester: Wiley-Blackwell;
2009.
5.
Sandoval J, Mendez-Gonzalez J, Nadal E, Chen G, Carmona FJ, Sayols S, et
al. A Prognostic DNA Methylation Signature for Stage I Non-Small-Cell Lung
Cancer. J Clin Oncol 2013.
6.
The Cancer Genome Atlas. [cited; Available from: http://cancergenome.nih.gov/
7.
Walter K, Holcomb T, Januario T, Du P, Evangelista M, Kartha N, et al. DNA
methylation profiling defines clinically relevant biological subsets of non-small
cell lung cancer. Clin Cancer Res 2012;18: 2360-73.
8.
Illumina. [cited; Available from: http://www.illumina.com
9.
Dedeurwaerder S, Defrance M, Calonne E, Denis H, Sotiriou C, Fuks F.
Evaluation of the Infinium Methylation 450K technology. Epigenomics 2011;3:
771-84.
10.
Lauss M, Visne I, Kriegner A, Ringner M, Jonsson G, Hoglund M.
Monitoring of technical variation in quantitative high-throughput datasets.
Cancer Inform 2013;12: 193-201.
11.
Hupe P, Stransky N, Thiery JP, Radvanyi F, Barillot E. Analysis of array
CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics
2004;20: 3413-22.
12.
Russnes HG, Vollan HK, Lingjaerde OC, Krasnitz A, Lundin P, Naume B, et
al. Genomic architecture characterizes tumor progression paths and fate in
breast cancer patients. Sci Transl Med 2010;2: 38ra47.
13.
Staaf J, Isaksson S, Karlsson A, Jonsson M, Johansson L, Jonsson P, et al.
Landscape of somatic allelic imbalances and copy number alterations in human
lung carcinoma. International journal of cancer 2012;1: 2020-31.
14.
Staaf J, Jonsson G, Ringner M, Baldetorp B, Borg A. Landscape of somatic
allelic imbalances and copy number alterations in HER2-amplified breast cancer.
Breast Cancer Res 2011;13: R129.
15.
Lindgren D, Frigyesi A, Gudjonsson S, Sjodahl G, Hallden C, Chebil G, et al.
Combined gene expression and genomic profiling define two intrinsic molecular
subtypes of urothelial carcinoma and gene signatures for molecular grading and
outcome. Cancer research 2010;70: 3463-72.
16.
Lauss M, Aine M, Sjodahl G, Veerla S, Patschan O, Gudjonsson S, et al. DNA
methylation analyses of urothelial carcinoma reveal distinct epigenetic subtypes
and an association between gene copy number and methylation status.
Epigenetics 2012;7: 858-67.
17.
Karlsson A, Ringner M, Lauss M, Botling J, Micke P, Planck M, et al.
Genomic and transcriptional alterations in lung adenocarcinoma in relation to
smoking history. Clin Cancer Res 2014;DOI:10.1158/1078-0432.CCR-14-0246.
18.
Wilkerson MD, Yin X, Walter V, Zhao N, Cabanski CR, Hayward MC, et al.
Differential pathogenesis of lung adenocarcinoma subtypes involving sequence
mutations, copy number, chromosomal instability, and methylation. PLoS ONE
2012;7: e36530.
19.
Wilkerson MD, Yin X, Hoadley KA, Liu Y, Hayward MC, Cabanski CR, et al.
Lung squamous cell carcinoma mRNA expression subtypes are reproducible,
clinically important, and correspond to normal cell types. Clin Cancer Res
2010;16: 4864-75.
20.
Carter SL, Eklund AC, Kohane IS, Harris LN, Szallasi Z. A signature of
chromosomal instability inferred from gene expression profiles predicts clinical
outcome in multiple human cancers. Nature genetics 2006;38: 1043-8.
21.
Takeuchi T, Tomida S, Yatabe Y, Kosaka T, Osada H, Yanagisawa K, et al.
Expression profile-defined classification of lung adenocarcinoma shows close
relationship with underlying major genetic changes and clinicopathologic
behaviors. J Clin Oncol 2006;24: 1679-88.
22.
Planck M, Edlund K, Botling J, Micke P, Isaksson S, Staaf J. Genomic and
Transcriptional Alterations in Lung Adenocarcinoma in Relation to EGFR and
KRAS Mutation Status. PLoS ONE 2013;8: e78614.
23.
Wilkerson MD, Hayes DN. ConsensusClusterPlus: a class discovery tool
with confidence assessments and item tracking. Bioinformatics 2010;26: 1572-3.
24.
Fredlund E, Staaf J, Rantala JK, Kallioniemi O, Borg A, Ringner M. The gene
expression landscape of breast cancer is shaped by tumor protein p53 status and
epithelial-mesenchymal transition. Breast Cancer Res 2012;14: R113.
25.
Staaf J, Jonsson G, Jonsson M, Karlsson A, Isaksson S, Salomonsson A, et al.
Relation between smoking history and gene expression profiles in lung
adenocarcinomas. BMC Med Genomics 2012;5: 22.
26.
Huang da W, Sherman BT, Lempicki RA. Systematic and integrative
analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc
2009;4: 44-57.
27.
Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A,
et al. Mutational heterogeneity in cancer and the search for new cancer-associated
genes. Nature 2013;499: 214-8.