Download Andreas Mock Cancer Research UK Cambridge Institute, University

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Vectors in gene therapy wikipedia , lookup

NEDD9 wikipedia , lookup

History of genetic engineering wikipedia , lookup

Epigenetics of diabetes Type 2 wikipedia , lookup

Quantitative comparative linguistics wikipedia , lookup

Oncogenomics wikipedia , lookup

Metagenomics wikipedia , lookup

Minimal genome wikipedia , lookup

Ridge (biology) wikipedia , lookup

Genomic imprinting wikipedia , lookup

Gene therapy wikipedia , lookup

Public health genomics wikipedia , lookup

Biology and consumer behaviour wikipedia , lookup

Epigenetics of human development wikipedia , lookup

Pathogenomics wikipedia , lookup

Gene wikipedia , lookup

Gene desert wikipedia , lookup

Nutriepigenomics wikipedia , lookup

Gene nomenclature wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Genome (book) wikipedia , lookup

Genome evolution wikipedia , lookup

Microevolution wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Gene expression programming wikipedia , lookup

Gene expression profiling wikipedia , lookup

Designer baby wikipedia , lookup

RNA-Seq wikipedia , lookup

Transcript
Andreas Mock Cancer Research UK Cambridge Institute, University of
Cambridge
2017-04-24
Contents

1 Assembly and preprocessing of TCGA RNAseq data

2 Construction of co-expression network

3 Identification of co-expression modules

4 Relation of co-expression modules to sample traits

5 Exploration of individual genes within co-expression module

6 Session information

7 References
The following tutorial describes the generation of a weighted co-expression network from
TCGA (The Cancer Genome Atlas) RNAseq data using the WGCNA R package by Langfelder
and Horvarth1. In addition, individual genes and modules will be related to sample traits.
Exemplarly, a co-expression network for skin cutaneous melanomas (SKCM) will be
generated. However, the following weighted gene co-expression analysis (WGCNA)
framework is applicable to any TCGA tumour entity.
The code of this vignette is a proof of principial example that can’t be run as listed without
assembling the RNAseq data as described in the following beforehand.
1 Assembly and preprocessing of TCGA RNAseq data
Melanoma RNAseq data for the CVE extension were downloaded as expression estimates per
gene (RNAseq2 level 3 data) from the TCGA data portal. Please note that the TCGA Data
portal is no longer operational and all TCGA data now resides at the Genomic Data Commons.
For WGCNA, the individual TCGA RNAseq2 level 3 files were concatenated to a
matrix RNAseq with gene symbols as row and TCGA patient barcodes as column names.
Further preprocessing included the removal of control samples (for more information see
the TCGA Wiki) and expression estimates with counts in less than 20% of cases.
RNAseq = RNAseq[apply(RNAseq,1,function(x) sum(x==0))<ncol(RNAseq)*0.8,]
To relate co-expression modules to disease phenotypes, clinical metadata is needed. As for
the melanoma TCGA data, the clinical data was published as a curated spreadsheet in the
supplements of the latest publication (suppl_table_S1D.txt)2.
As read counts follow a negative binomial distribution, which has a mathematical theory less
tractable than that of the normal distribution, RNAseq data was normalised with
the voom methodology3. The voom method estimates the mean-variance of the log-counts and
generates a precision weight for each observation. This way, a comparative analysis can be
performed with all bioinformatic workflows originally developed for microarray analyses.
library(limma)
RNAseq_voom = voom(RNAseq)$E
A large fraction of genes are not differentially expressed between samples. These have to be
excluded from WGCNA, as two genes without notable variance in expression between
patients will be highly correlated. As a heuristic cutoff, the top 5000 most variant genes have
been used in most WGCNA studies. In detail the median absolute devision (MAD) was used
as a robust measure of variability.
#transpose matrix to correlate genes in the following
WGCNA_matrix = t(RNAseq_voom[order(apply(RNAseq_voom,1,mad), decreasing =
T)[1:5000],])
2 Construction of co-expression network
The connections within a network can be fully described by its adjacency matrix aijaij,
aNx
NN x N matrix whose component aijaijdenotes the connection strength between
node ii and jj. The connection strength is defined by the co-expression similarity sijsij. The
most widely used method defines sijsij as the absolute value of the correlation coefficient
between the profiles of node ii and jj: sij=|cor(xi,xj)|sij=|cor(xi,xj)|. However, we employed the
biweight midcorrelation to define sijsij, as it is more robust to outliers4. This feature is pivotal,
as we do not expect genes to be co-expressed in all patients.
#similarity measure between gene profiles: biweight midcorrelation
library(WGCNA)
s = abs(bicor(WGCNA_matrix))
Originally, the co-expression similarity matrix was transformed into the adjacency matrix using
a ‘hard’ threshold. In these unweighted co-expression networks, two genes were identified to
be linked (aij=1aij=1), if the absolute correlation between their expression profiles were higher
than a ‘hard’ threshold ττ. However, this hard threshold does not reflect the underlying
continuous co-expression measure and leads to a significant loss of information. As a
consequence, Horvath and colleagues introduced a new framework for weighted gene
co-expression analysis (WGCNA)5. At its core, a weighted adjacency is defined by raising the
co-expression similarity to a power (‘soft’ threshold):
aij=sβijaij=sijβ
with β≥1β≥1. To choose an appropriate ββ-value, the authors present a methodology that
assesses the scale free topology of the network. For detailed rational of this approach, please
see Zhang and Horvath6.
powers = c(c(1:10), seq(from = 12, to=20, by=2))
sft = pickSoftThreshold(WGCNA_matrix, powerVector = powers, verbose = 5)
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab='Soft Threshold (power)',ylab='Scale Free Topology Model
Fit,signed R^2',
type='n', main = paste('Scale independence'));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=1,col='red'); abline(h=0.90,col='red')
As for the melanoma network, a beta value of 3 was the lowest power for which the scale-free
topology fit index curve flattens out upon reaching a high value ( R2R2 0.9 as suggested by
Langfelder and Horvarth).
#calculation of adjacency matrix
beta = 3
a = s^beta
Lastly, the dissimilarity measure is defined by
wij=1−aijwij=1−aij
#dissimilarity measure
w = 1-a
Please note that TOM-based (topological overlap matrix) dissimilarity proposed by Horvarth
and colleagues did not result in distinct gene modules for the analysed melanoma network.
3 Identification of co-expression modules
To identify co-expression modules, genes are next clustered based on the dissimilarity
measure, where branches of the dendrogram correspond to modules. The gene dendrogram
obtained by average linkage hierarchical clustering is depicted in figure 2. Ultimately, gene
co-expression modules are detected by applying a branch cutting method. We employed the
dynamic branch cut method developed by Langfelder and colleagues 7, as constant height
cutoffs exhibit suboptimal performance on complicated dendrograms. WGCNA of the 472
TCGA melanoma samples revealed 41 co-expression modules. All genes that are not
significantly co-expressed within a module are summarized in an additional module 0 for
further analysis.
#create gene tree by average linkage hierarchical clustering
geneTree = hclust(as.dist(w), method = 'average')
#module identification using dynamic tree cut algorithm
modules = cutreeDynamic(dendro = geneTree, distM = w, deepSplit = 4,
pamRespectsDendro = FALSE,
minClusterSize = 30)
#assign module colours
module.colours = labels2colors(modules)
#plot the dendrogram and corresponding colour bars underneath
plotDendroAndColors(geneTree, module.colours, 'Module colours',
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05, main='')
The relation between the identified co-expression modules can be visualized by a dendrogram
of their eigengenes (fig. 3). The module eigengene is defined as the first principal component
of its expression matrix. It could be shown that the module= eigengene is highly correlated
with the gene that has the highest intramodular connectivity8.
library(ape)
#calculate eigengenes
MEs = moduleEigengenes(WGCNA_matrix, colors = module.colours, excludeGrey
= FALSE)$eigengenes
#calculate dissimilarity of module eigengenes
MEDiss = 1-cor(MEs);
#cluster module eigengenes
METree = hclust(as.dist(MEDiss), method = 'average');
#plot the result with phytools package
par(mar=c(2,2,2,2))
plot.phylo(as.phylo(METree),type = 'fan',show.tip.label = FALSE, main='')
tiplabels(frame = 'circle',col='black',
text=rep('',length(unique(modules))), bg =
levels(as.factor(module.colours)))
4 Relation of co-expression modules to sample traits
An advantage of co-expression network analysis is the possibility to integrate external
information. At the lowest hierarchical level, gene significance (GS) measures can be defined
as the statistical significance (i.e. p-value, pipi) between the ii-th node profile (gene) xixi and
the sample trait TT
GSi=−log piGSi=−log pi
Module significance in turn can be determined as the average absolute gene significance
measure. This conceptual framework can be adapted to any research question. The clinical
metadata used in the following was obtained from the recent TCGA melanoma
publication9 (Supplemental Table S1D: Patient Centric Table).
#load clinical metadata. Make sure that patient barcodes are in the same
format
#create second expression matrix for which the detailed clinical data is
available
WGCNA_matrix2 = WGCNA_matrix[match(clinical$Name,
rownames(WGCNA_matrix)),]
#CAVE: 1 sample of detailed clinical metadata is not in downloaded data
(TCGA-GN-A269-01')
not.available = which(is.na(rownames(WGCNA_matrix2))==TRUE)
WGCNA_matrix2 = WGCNA_matrix2[-not.available,]
str(WGCNA_matrix2)
#hence it needs to be removed from clinical table for further analysis
clinical = clinical[-not.available,]
Representatively, co-expression modules will be related to the so called lymphocyte score,
which summarises the lymphocyte distribution and density in the pathological review.
#grouping in high and low lymphocyte score (lscore)
lscore = as.numeric(clinical$LYMPHOCYTE.SCORE)
lscore[lscore<3] = 0
lscore[lscore>0] = 1
#calculate gene significance measure for lymphocyte score (lscore) - Welch's
t-Test
GS_lscore =
t(sapply(1:ncol(WGCNA_matrix2),function(x)c(t.test(WGCNA_matrix2[,x]~ls
core,var.equal=F)$p.value,
t.test(WGCNA_matrix2[,x]~lscore,var.equal=F)$estimate[1],
t.test(WGCNA_matrix2[,x]~lscore,var.equal=F)$estimate[2])))
GS_lscore = cbind(GS.lscore, abs(GS_lscore[,2] - GS_lscore[,3]))
colnames(GS_lscore) = c('p_value','mean_high_lscore','mean_low_lscore',
'effect_size(high-low score)'); rownames(GS_lscore) =
colnames(WGCNA_matrix2)
To enable a high-level interpretation of the dendrogram of module eigengenes, gene ontology
(GO)
enrichment
analysis
the GOstats R package
10.
was
performed
for
the
module
genes
using
Modules were named according to the most significant GO
einrichment given a cutoff for the ontology size. The smaller the ontology size, the more
specific the term. In this analysis a cutoff of 100 terms per ontology was chosen.
#reference genes = all 5000 top mad genes
ref_genes = colnames(WGCNA_matrix2)
#create data frame for GO analysis
library(org.Hs.eg.db)
GO = toTable(org.Hs.egGO); SYMBOL = toTable(org.Hs.egSYMBOL)
GO_data_frame = data.frame(GO$go_id,
GO$Evidence,SYMBOL$symbol[match(GO$gene_id,SYMBOL$gene_id)])
#create GOAllFrame object
library(AnnotationDbi)
GO_ALLFrame = GOAllFrame(GOFrame(GO_data_frame, organism = 'Homo sapiens'))
#create gene set
library(GSEABase)
gsc <- GeneSetCollection(GO_ALLFrame, setType = GOCollection())
#perform GO enrichment analysis and save results to list - this make take
several minutes
library(GEOstats)
GSEAGO = vector('list',length(unique(modules)))
for(i in 0:(length(unique(modules))-1)){
GSEAGO[[i+1]] = summary(hyperGTest(GSEAGOHyperGParams(name = 'Homo
sapiens GO',
geneSetCollection = gsc, geneIds =
colnames(RNAseq)[modules==i],
universeGeneIds = ref.genes, ontology = 'BP', pvalueCutoff =
0.05,
conditional = FALSE, testDirection = 'over')))
print(i)
}
cutoff_size = 100
GO_module_name = rep(NA,length(unique(modules)))
for (i in 1:length(unique(modules))){
GO.module.name[i] =
GSEAGO[[i]][GSEAGO[[i]]$Size<cutoff_size,
][which(GSEAGO[[i]][GSEAGO[[i]]$Size<cutoff_size,]$Count==max(GSEAGO
[[i]][GSEAGO[[i]]$
Size<cutoff.size,]$Count)),7]
}
GO.module.name[1] = 'module 0'
#calculate module significance
MS.lscore = as.data.frame(cbind(GS.lscore,modules))
MS.lscore$log_p_value = -log10(as.numeric(MS.lscore$p_value))
MS.lscore = ddply(MS.lscore, .(modules), summarize, mean(log_p_value),
sd(log_p_value))
colnames(MS.lscore) = c('modules','pval','sd')
MS.lscore.bar = as.numeric(MS.lscore[,2])
MS.lscore.bar[MS.lscore.bar<(-log10(0.05))] = 0
names(MS.lscore.bar) = GO.module.name
METree.GO = METree
label.order =
match(METree$labels,paste0('ME',labels2colors(0:(length(unique(modules)
)-1))))
METree.GO$labels = GO.module.name[label.order]
plotTree.wBars(as.phylo(METree.GO), MS.lscore.bar, tip.labels = TRUE,
scale = 0.2)
5 Exploration of individual genes within co-expression
module
Assessing the module significance for different sample traits facilitates an understanding of
individual co-expression modules for melanoma biology. As for the prioritisation of variants we
are next interested in the role of the variant gene within a co-expression module. To this end,
Langfelder and Horvath suggest a ‘fuzzy’ measure of module membership defined as
Kq=|cor(xi,Eq)|Kq=|cor(xi,Eq)|
where xixi is the profile of gene ii and EqEq is the eigengene of module qq. Based on this
definition, KK describes how closely related gene ii is to module qq. A meaningful
visualization is consequently plotting the module membership over the p-value of the
respective GS measure. As a third dimension, the dot-size is weighted according to the effect
size.
#Calculate module membership
MM = abs(bicor(RNAseq, MEs))
#plot individual module of interest (MOI)
MOI = 3 #T cell differentiation co-expression module
plot(-log10(GS.lscore[modules==MOI,1]), MM[modules==MOI,MOI], pch=20,
cex=(GS.lscore[modules==MOI,4]/max(GS.lscore[,4],na.rm=TRUE))*4,
xlab='p-value (-log10) lymphocyte score', ylab='membership to module
3')
abline(v=-log10(0.05), lty=2, lwd=2)
6 Session information
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8
LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8
LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8
## [9] LC_ADDRESS=C
LC_MESSAGES=en_US.UTF-8
LC_NAME=C
LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats
graphics grDevices utils
datasets methods
base
##
## other attached packages:
## [1] RTCGAToolbox_2.7.0 BiocStyle_2.5.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.10
knitr_1.15.1
## [4] splines_3.4.0
lattice_0.20-35
## [7] tools_3.4.0
grid_3.4.0
magrittr_1.5
stringr_1.2.0
data.table_1.10.4
## [10] htmltools_0.3.5
yaml_2.1.14
survival_2.41-3
## [13] rprojroot_1.2
digest_0.6.12
RJSONIO_1.3-0
## [16] Matrix_1.2-9
bitops_1.0-6
RCurl_1.95-4.8
## [19] evaluate_0.10
rmarkdown_1.4
limma_3.33.0
## [22] stringi_1.1.5
compiler_3.4.0
## [25] backports_1.0.5
XML_3.98-1.6
RCircos_1.2.0
7 References
1. Peter Langfelder and Steve Horvath. WGCNA: an R package for weighted correlation
network analysis. In: BMC Bioinformatics 9 (Jan. 2008), pp. 559–559.↩
2. Cancer Genome Atlas Network. Genomic Classification of Cutaneous Melanoma. In:
Cell 161.7 (June 2015), pp. 1681–1696.↩
3. Charity W Law et al. voom: Precision weights unlock linear model analysis tools for
RNA-seq read counts. In: Genome biology 15.2 (Jan. 2014), R29–R29.↩
4. Chun-Hou Zheng et al. Gene differential coexpression analysis based on biweight
correlation and maximum clique. In: BMC bioinformatics 15 Suppl 15 (2014), S3.↩
5. Bin Zhang and Steve Horvath. A general framework for weighted gene co-expression
network analysis. In: Statistical applications in genetics and molecular biology 4
(2005), Article17.↩
6. Bin Zhang and Steve Horvath. A general framework for weighted gene co-expression
network analysis. In: Statistical applications in genetics and molecular biology 4
(2005), Article17.↩
7. Steve Horvath and Jun Dong. Geometric Interpretation of Gene Coexpression
Network Analysis. In: PLoS Computational Biology (PLOSCB) 4(8) 4.8 (2008),
e1000117–e1000117.↩
8. Steve Horvath and Jun Dong. Geometric Interpretation of Gene Coexpression
Network Analysis. In: PLoS Computational Biology (PLOSCB) 4(8) 4.8 (2008),
e1000117–e1000117.↩
9. Cancer Genome Atlas Network. Genomic Classification of Cutaneous Melanoma. In:
Cell 161.7 (June 2015), pp. 1681–1696.↩
10. S Falcon and R Gentleman. Using GOstats to test gene lists for GO term association.
In: Bioinformatics 23.2 (Jan. 2007), pp. 257–258.↩