Andreas Mock
Cancer Research UK Cambridge Institute, University of Cambridge
2017-04-24

Contents

1 Assembly and preprocessing of TCGA RNAseq data
2 Construction of co-expression network
3 Identification of co-expression modules
4 Relation of co-expression modules to sample traits
5 Exploration of individual genes within co-expression module
6 Session information
7 References

The following tutorial describes the generation of a weighted co-expression network from TCGA (The Cancer Genome Atlas) RNAseq data using the WGCNA R package by Langfelder and Horvath [1]. In addition, individual genes and modules will be related to sample traits. As an example, a co-expression network for skin cutaneous melanomas (SKCM) will be generated. However, the following weighted gene co-expression analysis (WGCNA) framework is applicable to any TCGA tumour entity.

The code of this vignette is a proof-of-principle example that cannot be run as listed without first assembling the RNAseq data as described below.

1 Assembly and preprocessing of TCGA RNAseq data

Melanoma RNAseq data for the CVE extension were downloaded as expression estimates per gene (RNAseq2 level 3 data) from the TCGA data portal. Please note that the TCGA data portal is no longer operational and all TCGA data now reside at the Genomic Data Commons. For WGCNA, the individual TCGA RNAseq2 level 3 files were concatenated to a matrix RNAseq with gene symbols as row names and TCGA patient barcodes as column names. Further preprocessing included the removal of control samples (for more information see the TCGA Wiki) and of genes with non-zero expression estimates in at most 20% of cases.

    #keep genes with a zero estimate in fewer than 80% of samples
    RNAseq = RNAseq[apply(RNAseq,1,function(x) sum(x==0))<ncol(RNAseq)*0.8,]

To relate co-expression modules to disease phenotypes, clinical metadata are needed. For the melanoma TCGA data, the clinical data were published as a curated spreadsheet in the supplement of the most recent publication (suppl_table_S1D.txt) [2].
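
As a small illustration of the zero-count filter in this section, the following sketch applies the same rule to a made-up matrix. The object `toy` and its gene and sample names are hypothetical and only serve to show which rows survive the filter.

```r
# Hypothetical toy count matrix: 3 genes (rows) x 5 samples (columns)
toy = matrix(c(0, 0, 0, 0, 0,   # geneA: zero in all 5 samples
               0, 3, 0, 5, 2,   # geneB: zero in 2 of 5 samples
               4, 6, 1, 9, 8),  # geneC: never zero
             nrow = 3, byrow = TRUE,
             dimnames = list(c('geneA', 'geneB', 'geneC'), paste0('S', 1:5)))

# number of samples with a zero estimate, per gene
zeros_per_gene = rowSums(toy == 0)

# same rule as in the vignette: keep genes with zeros in fewer than 80% of samples
keep = zeros_per_gene < ncol(toy) * 0.8
toy_filtered = toy[keep, , drop = FALSE]
rownames(toy_filtered)
```

Here `rowSums(toy == 0)` counts the same zeros as the `apply()` call in the vignette code, and `drop = FALSE` keeps the result a matrix even if only one gene remains.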

As read counts follow a negative binomial distribution, which has a mathematical theory less tractable than that of the normal distribution, the RNAseq data were normalised with the voom methodology [3]. The voom method estimates the mean-variance relationship of the log-counts and generates a precision weight for each observation. This way, a comparative analysis can be performed with all bioinformatic workflows originally developed for microarray analyses.

    library(limma)
    #log-transformed expression values from voom
    RNAseq_voom = voom(RNAseq)$E

A large fraction of genes are not differentially expressed between samples. These have to be excluded from WGCNA, as two genes without notable variance in expression between patients will be highly correlated. As a heuristic cutoff, the top 5000 most variant genes have been used in most WGCNA studies. Here, the median absolute deviation (MAD) was used as a robust measure of variability.

    #select the 5000 genes with the highest MAD and
    #transpose the matrix (samples in rows) to correlate genes in the following
    WGCNA_matrix = t(RNAseq_voom[order(apply(RNAseq_voom,1,mad), decreasing = TRUE)[1:5000],])

2 Construction of co-expression network

The connections within a network can be fully described by its adjacency matrix $a_{ij}$, an $N \times N$ matrix whose component $a_{ij}$ denotes the connection strength between nodes $i$ and $j$. The connection strength is defined by the co-expression similarity $s_{ij}$. The most widely used method defines $s_{ij}$ as the absolute value of the correlation coefficient between the profiles of nodes $i$ and $j$:

$$s_{ij} = |\mathrm{cor}(x_i, x_j)|$$

However, we employed the biweight midcorrelation to define $s_{ij}$, as it is more robust to outliers [4]. This feature is pivotal, as we do not expect genes to be co-expressed in all patients.

    #similarity measure between gene profiles: biweight midcorrelation
    library(WGCNA)
    s = abs(bicor(WGCNA_matrix))

Originally, the co-expression similarity matrix was transformed into the adjacency matrix using a 'hard' threshold.
In these unweighted co-expression networks, two genes were considered linked ($a_{ij} = 1$) if the absolute correlation between their expression profiles was higher than a 'hard' threshold $\tau$. However, this hard threshold does not reflect the underlying continuous co-expression measure and leads to a significant loss of information. As a consequence, Horvath and colleagues introduced a new framework for weighted gene co-expression analysis (WGCNA) [5]. At its core, a weighted adjacency is defined by raising the co-expression similarity to a power ('soft' threshold):

$$a_{ij} = s_{ij}^{\beta} \quad \text{with } \beta \geq 1$$

To choose an appropriate $\beta$ value, the authors present a methodology that assesses the scale-free topology of the network. For the detailed rationale of this approach, please see Zhang and Horvath [6].

    #candidate soft-threshold powers
    powers = c(c(1:10), seq(from = 12, to = 20, by = 2))
    sft = pickSoftThreshold(WGCNA_matrix, powerVector = powers, verbose = 5)
    #plot the scale-free topology fit index as a function of the power
    plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
         xlab = 'Soft Threshold (power)',
         ylab = 'Scale Free Topology Model Fit, signed R^2',
         type = 'n', main = 'Scale independence')
    text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
         labels = powers, cex = 1, col = 'red')
    abline(h = 0.90, col = 'red')

For the melanoma network, a $\beta$ value of 3 was the lowest power for which the scale-free topology fit index curve flattens out upon reaching a high value ($R^2$ of 0.9, as suggested by Langfelder and Horvath).

    #calculation of adjacency matrix
    beta = 3
    a = s^beta

Lastly, the dissimilarity measure is defined by

$$w_{ij} = 1 - a_{ij}$$

    #dissimilarity measure
    w = 1-a

Please note that the TOM-based (topological overlap matrix) dissimilarity proposed by Horvath and colleagues did not result in distinct gene modules for the analysed melanoma network.

3 Identification of co-expression modules

To identify co-expression modules, genes are next clustered based on the dissimilarity measure, where branches of the dendrogram correspond to modules.
The gene dendrogram obtained by average linkage hierarchical clustering is depicted in figure 2. Ultimately, gene co-expression modules are detected by applying a branch cutting method. We employed the dynamic branch cut method developed by Langfelder and colleagues [7], as constant height cutoffs exhibit suboptimal performance on complicated dendrograms. WGCNA of the 472 TCGA melanoma samples revealed 41 co-expression modules. All genes that are not significantly co-expressed within a module are summarized in an additional module 0 for further analysis.

    #create gene tree by average linkage hierarchical clustering
    geneTree = hclust(as.dist(w), method = 'average')
    #module identification using dynamic tree cut algorithm
    modules = cutreeDynamic(dendro = geneTree, distM = w, deepSplit = 4,
                            pamRespectsDendro = FALSE, minClusterSize = 30)
    #assign module colours
    module.colours = labels2colors(modules)
    #plot the dendrogram and corresponding colour bars underneath
    plotDendroAndColors(geneTree, module.colours, 'Module colours',
                        dendroLabels = FALSE, hang = 0.03, addGuide = TRUE,
                        guideHang = 0.05, main = '')

The relation between the identified co-expression modules can be visualized by a dendrogram of their eigengenes (fig. 3). The module eigengene is defined as the first principal component of its expression matrix. It has been shown that the module eigengene is highly correlated with the gene that has the highest intramodular connectivity [8].
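
The eigengene definition above can be sketched with base R alone. The example below simulates a small module (the objects `signal` and `toy_module` are made up) and takes the first principal component of the centred and scaled expression matrix as its eigengene. It is only an illustration of the concept; WGCNA's moduleEigengenes() does more than this, for example aligning the sign of the eigengene with the average expression.

```r
set.seed(1)
n_samples = 50
# hypothetical shared expression signal driving all genes of the module
signal = rnorm(n_samples)
# simulated module: 5 genes = shared signal + gene-specific noise (samples in rows)
toy_module = sapply(1:5, function(i) signal + rnorm(n_samples, sd = 0.2))
colnames(toy_module) = paste0('gene', 1:5)

# first principal component of the centred and scaled expression matrix
pca = prcomp(toy_module, center = TRUE, scale. = TRUE)
eigengene = pca$x[, 1]

# the sign of a principal component is arbitrary, hence the absolute value
abs(cor(eigengene, signal))
# correlation of each simulated gene with the eigengene
abs(cor(toy_module, eigengene))
```

Because all simulated genes follow the same signal, the eigengene tracks that signal closely and can stand in for the whole module in downstream comparisons.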

    library(ape)
    #calculate eigengenes
    MEs = moduleEigengenes(WGCNA_matrix, colors = module.colours, excludeGrey = FALSE)$eigengenes
    #calculate dissimilarity of module eigengenes
    MEDiss = 1-cor(MEs)
    #cluster module eigengenes
    METree = hclust(as.dist(MEDiss), method = 'average')
    #plot the result as a fan dendrogram with the ape package
    par(mar = c(2,2,2,2))
    plot.phylo(as.phylo(METree), type = 'fan', show.tip.label = FALSE, main = '')
    tiplabels(frame = 'circle', col = 'black', text = rep('', length(unique(modules))),
              bg = levels(as.factor(module.colours)))

4 Relation of co-expression modules to sample traits

An advantage of co-expression network analysis is the possibility to integrate external information. At the lowest hierarchical level, gene significance (GS) measures can be defined as the statistical significance (i.e. p-value, $p_i$) of the association between the $i$-th node profile (gene) $x_i$ and the sample trait $T$:

$$GS_i = -\log p_i$$

Module significance in turn can be determined as the average absolute gene significance measure. This conceptual framework can be adapted to any research question.

The clinical metadata used in the following were obtained from the recent TCGA melanoma publication [9] (Supplemental Table S1D: Patient Centric Table).

    #load clinical metadata (not shown) into a data frame 'clinical'.
    #Make sure that patient barcodes are in the same format
    #create second expression matrix for which the detailed clinical data is available
    WGCNA_matrix2 = WGCNA_matrix[match(clinical$Name, rownames(WGCNA_matrix)),]
    #Caution: 1 sample of the detailed clinical metadata is not in the downloaded data ('TCGA-GN-A269-01')
    not.available = which(is.na(rownames(WGCNA_matrix2))==TRUE)
    WGCNA_matrix2 = WGCNA_matrix2[-not.available,]
    str(WGCNA_matrix2)
    #hence it needs to be removed from the clinical table for further analysis
    clinical = clinical[-not.available,]

As a representative example, co-expression modules will be related to the so-called lymphocyte score, which summarises the lymphocyte distribution and density in the pathological review.
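
The gene significance measure defined in this section can be sketched on simulated data before it is applied to the melanoma samples. All objects (`trait`, `expr`) are made up; the sketch assumes a binary trait, Welch's t-test as the significance test and a base-10 logarithm.

```r
set.seed(42)
# hypothetical binary sample trait: 20 'low' (0) and 20 'high' (1) samples
trait = rep(c(0, 1), each = 20)
# simulated expression matrix (samples in rows, genes in columns)
expr = matrix(rnorm(40 * 3), nrow = 40,
              dimnames = list(NULL, c('gene1', 'gene2', 'gene3')))
# gene1 is shifted upwards in the 'high' group, gene2 and gene3 are pure noise
expr[trait == 1, 'gene1'] = expr[trait == 1, 'gene1'] + 3

# gene significance GS_i = -log10(p_i), with p_i from Welch's t-test
p = apply(expr, 2, function(x) t.test(x ~ trait, var.equal = FALSE)$p.value)
GS = -log10(p)

# module significance: mean gene significance of the genes in a module
# (here all three genes are treated as one hypothetical module)
MS = mean(GS)
```

Because `gene1` is the only gene with a simulated group difference, it receives by far the largest GS value.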

    #grouping in high (1) and low (0) lymphocyte score (lscore)
    lscore = as.numeric(clinical$LYMPHOCYTE.SCORE)
    lscore[lscore<3] = 0
    lscore[lscore>0] = 1
    #calculate gene significance measure for lymphocyte score (lscore) - Welch's t-Test
    GS_lscore = t(sapply(1:ncol(WGCNA_matrix2), function(x) {
      tt = t.test(WGCNA_matrix2[,x]~lscore, var.equal = FALSE)
      #estimate[1] is the mean of group 0 (low), estimate[2] the mean of group 1 (high)
      c(tt$p.value, tt$estimate[1], tt$estimate[2])
    }))
    GS_lscore = cbind(GS_lscore, abs(GS_lscore[,2] - GS_lscore[,3]))
    colnames(GS_lscore) = c('p_value','mean_low_lscore','mean_high_lscore',
                            'effect_size(high-low score)')
    rownames(GS_lscore) = colnames(WGCNA_matrix2)

To enable a high-level interpretation of the dendrogram of module eigengenes, gene ontology (GO) enrichment analysis was performed for the module genes using the GOstats R package [10]. Modules were named according to the most significant GO enrichment given a cutoff for the ontology size. The smaller the ontology size, the more specific the term. In this analysis a cutoff of 100 terms per ontology was chosen.
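
The enrichment test behind this kind of GO analysis is a hypergeometric over-representation test. A minimal base R sketch with made-up counts (not taken from the melanoma data) shows the calculation for a single GO term and a single module; `n_term` plays the role of the ontology size.

```r
# hypothetical counts, for illustration only
n_universe = 5000   # reference genes (e.g. the top 5000 MAD genes)
n_term     = 100    # reference genes annotated with the GO term
n_module   = 200    # genes in the module of interest
n_overlap  = 15     # module genes annotated with the GO term

# P(X >= n_overlap) when drawing n_module genes at random from the universe
p_over = phyper(n_overlap - 1, n_term, n_universe - n_term, n_module,
                lower.tail = FALSE)

# the same test expressed as a one-sided Fisher's exact test on a 2x2 table
# (rows: in module / not in module, columns: annotated / not annotated)
tab = matrix(c(n_overlap,            n_term - n_overlap,
               n_module - n_overlap, n_universe - n_term - n_module + n_overlap),
             nrow = 2)
p_fisher = fisher.test(tab, alternative = 'greater')$p.value
```

With these counts only 4 annotated genes would be expected in the module by chance, so an overlap of 15 yields a small p-value. The choice of reference genes matters: restricting the universe to the genes that entered the network avoids inflating the enrichment.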

    #reference genes = all 5000 top mad genes
    ref_genes = colnames(WGCNA_matrix2)
    #create data frame for GO analysis
    library(org.Hs.eg.db)
    GO = toTable(org.Hs.egGO)
    SYMBOL = toTable(org.Hs.egSYMBOL)
    GO_data_frame = data.frame(GO$go_id, GO$Evidence,
                               SYMBOL$symbol[match(GO$gene_id,SYMBOL$gene_id)])
    #create GOAllFrame object
    library(AnnotationDbi)
    GO_ALLFrame = GOAllFrame(GOFrame(GO_data_frame, organism = 'Homo sapiens'))
    #create gene set
    library(GSEABase)
    gsc <- GeneSetCollection(GO_ALLFrame, setType = GOCollection())
    #perform GO enrichment analysis and save results to list - this may take several minutes
    library(GOstats)
    GSEAGO = vector('list',length(unique(modules)))
    for(i in 0:(length(unique(modules))-1)){
      #module genes are the columns of the expression matrix used for the network
      GSEAGO[[i+1]] = summary(hyperGTest(GSEAGOHyperGParams(name = 'Homo sapiens GO',
        geneSetCollection = gsc, geneIds = colnames(WGCNA_matrix)[modules==i],
        universeGeneIds = ref_genes, ontology = 'BP', pvalueCutoff = 0.05,
        conditional = FALSE, testDirection = 'over')))
      print(i)
    }
    #name each module after the GO term with the highest gene count
    #among the terms below the size cutoff
    cutoff_size = 100
    GO_module_name = rep(NA,length(unique(modules)))
    for (i in 1:length(unique(modules))){
      small = GSEAGO[[i]][GSEAGO[[i]]$Size<cutoff_size,]
      #column 7 of the summary table holds the GO term; the first term is taken on ties
      GO_module_name[i] = small[which.max(small$Count),7]
    }
    GO_module_name[1] = 'module 0'
    #calculate module significance
    library(plyr)
    MS.lscore = as.data.frame(cbind(GS_lscore,modules))
    MS.lscore$log_p_value = -log10(as.numeric(MS.lscore$p_value))
    MS.lscore = ddply(MS.lscore, .(modules), summarize, mean(log_p_value), sd(log_p_value))
    colnames(MS.lscore) = c('modules','pval','sd')
    MS.lscore.bar = as.numeric(MS.lscore[,2])
    MS.lscore.bar[MS.lscore.bar<(-log10(0.05))] = 0
    names(MS.lscore.bar) = GO_module_name
    METree.GO = METree
    label.order = match(METree$labels,paste0('ME',labels2colors(0:(length(unique(modules))-1))))
    METree.GO$labels = GO_module_name[label.order]
    #plot the eigengene dendrogram with module significance bars (phytools package)
    library(phytools)
    plotTree.wBars(as.phylo(METree.GO), MS.lscore.bar, tip.labels = TRUE, scale = 0.2)

5 Exploration of individual genes within co-expression module

Assessing the module significance for different sample traits facilitates an understanding of the role of individual co-expression modules in melanoma biology. For the prioritisation of variants, we are next interested in the role of the variant gene within a co-expression module. To this end, Langfelder and Horvath suggest a 'fuzzy' measure of module membership defined as

$$K_q = |\mathrm{cor}(x_i, E_q)|$$

where $x_i$ is the profile of gene $i$ and $E_q$ is the eigengene of module $q$. Based on this definition, $K$ describes how closely related gene $i$ is to module $q$. A meaningful visualization is therefore a plot of the module membership against the p-value of the respective GS measure. As a third dimension, the dot size is weighted according to the effect size.

    #Calculate module membership (genes in columns of WGCNA_matrix, samples in rows)
    MM = abs(bicor(WGCNA_matrix, MEs))
    #plot individual module of interest (MOI)
    MOI = 3 #T cell differentiation co-expression module
    #eigengene columns are named 'ME' + module colour
    MOI_column = paste0('ME', labels2colors(MOI))
    plot(-log10(GS_lscore[modules==MOI,1]), MM[modules==MOI,MOI_column], pch = 20,
         cex = (GS_lscore[modules==MOI,4]/max(GS_lscore[,4],na.rm=TRUE))*4,
         xlab = 'p-value (-log10) lymphocyte score', ylab = 'membership to module 3')
    abline(v = -log10(0.05), lty = 2, lwd = 2)

6 Session information

    sessionInfo()

    ## R version 3.4.0 (2017-04-21)
    ## Platform: x86_64-pc-linux-gnu (64-bit)
    ## Running under: Ubuntu 16.04.2 LTS
    ##
    ## Matrix products: default
    ## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
    ## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
    ##
    ## locale:
    ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
    ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
    ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
    ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
    ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
    ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
    ##
    ## attached base packages:
    ## [1] stats graphics grDevices utils datasets methods base
    ##
    ## other attached packages:
    ## [1] RTCGAToolbox_2.7.0 BiocStyle_2.5.0
    ##
    ## loaded via a namespace (and not attached):
    ##  [1] Rcpp_0.12.10      knitr_1.15.1      magrittr_1.5
    ##  [4] splines_3.4.0     lattice_0.20-35   stringr_1.2.0
    ##  [7] tools_3.4.0       grid_3.4.0        data.table_1.10.4
    ## [10] htmltools_0.3.5   yaml_2.1.14       survival_2.41-3
    ## [13] rprojroot_1.2     digest_0.6.12     RJSONIO_1.3-0
    ## [16] Matrix_1.2-9      bitops_1.0-6      RCurl_1.95-4.8
    ## [19] evaluate_0.10     rmarkdown_1.4     limma_3.33.0
    ## [22] stringi_1.1.5     compiler_3.4.0
    ## [25] backports_1.0.5   XML_3.98-1.6      RCircos_1.2.0

7 References

1. Peter Langfelder and Steve Horvath. WGCNA: an R package for weighted correlation network analysis. In: BMC Bioinformatics 9 (Jan. 2008), pp. 559–559.
2. Cancer Genome Atlas Network. Genomic Classification of Cutaneous Melanoma. In: Cell 161.7 (June 2015), pp. 1681–1696.
3. Charity W Law et al. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. In: Genome Biology 15.2 (Jan. 2014), R29–R29.
4. Chun-Hou Zheng et al. Gene differential coexpression analysis based on biweight correlation and maximum clique. In: BMC Bioinformatics 15 Suppl 15 (2014), S3.
5. Bin Zhang and Steve Horvath. A general framework for weighted gene co-expression network analysis. In: Statistical Applications in Genetics and Molecular Biology 4 (2005), Article17.
6. Bin Zhang and Steve Horvath. A general framework for weighted gene co-expression network analysis. In: Statistical Applications in Genetics and Molecular Biology 4 (2005), Article17.
7. Steve Horvath and Jun Dong. Geometric Interpretation of Gene Coexpression Network Analysis. In: PLoS Computational Biology 4.8 (2008), e1000117–e1000117.
8. Steve Horvath and Jun Dong. Geometric Interpretation of Gene Coexpression Network Analysis. In: PLoS Computational Biology 4.8 (2008), e1000117–e1000117.
9. Cancer Genome Atlas Network. Genomic Classification of Cutaneous Melanoma. In: Cell 161.7 (June 2015), pp. 1681–1696.
10. S Falcon and R Gentleman. Using GOstats to test gene lists for GO term association. In: Bioinformatics 23.2 (Jan. 2007), pp. 257–258.
