Application of R in High-Throughput Transcriptomic Data Analysis

Li-yu D Liu (劉力瑜), Ph.D., Associate Professor
Dept. of Agronomy, Biometry Division
National Taiwan University

R and BioConductor
• R: free software (under restriction) for statistical analysis
– Consists of packages and functions
• BioConductor: R software project for the analysis of biomedical and genomic data
• At the R prompt, enter: …
• When the prompt reappears, enter: biocLite()
Note: Installation takes a while. When the prompt reappears, the installation has finished; only then may the window be closed.

HT Transcriptomic Data
• Microarray
• RNA-seq

Workflow
Data import → Preprocessing* → Visualization → DE analysis* → Adjust p-values for multiple comparisons → Cluster analysis
* Different methods are used for microarray and RNA-seq data

Data Import
• Data preparation: rows – genes; columns – samples

perou.tab
Gene      A10.BE        A100.BE      A101.BE       A102.BE       A104.BE       A109.BE
ZFX       -0.635136257  0.05944527   0.295229309   0.044984367   -0.287152458  -0.37941922
ZNF133    -0.467818713  0.799491038  0.976087902   0.357995583   0.130832052   -0.551488048
MLL       -1.053529634  2.082635494  0.533484608   0.619339759   -0.12865468   -0.531352043
DSCR1L1   0.036382357   1.97379493   0.778467649   0.884597921   0.521095977   -0.291980555
WNT5A     0.639715605   1.301422591  1.638604466   1.614453552   1.489951333   0.404389806
VHL       0.578985072   0.495695163  0.745920541   0.284848118   -0.095009627  0.027397548
UCP3      -0.466207433  0.14974712   -0.479759813  -0.143049357  -1.168184358  -0.428626079
UNG       0.948059548   2.277515274  2.583034494   1.040525663   1.168658721   0.111174239
UGT2B15   0.486819386   1.407375595  0.79481675    0.705755013   0.385154897   -0.095643077
CDC34     0.756940439   0.52679799   0.430403039   0.245687928   0.517045852   0.007931734
UQCRH     0.708313508   0.480945459  0.637726158   0.630143563   0.589374242   -0.397081394
TCF3      0.854586173   0.925208757  0.719675866   0.2769754     0.359690955   -0.262695701
…         …

• Data import to R:
> d = read.delim(file.choose(), colClasses = rep(c("character","numeric"), c(1,40)))

Pre-processing
• Pre-processing includes steps that extract or enhance meaningful data characteristics (e.g. taking the logarithm of the raw values).

Two-channel microarray data:
• Background correction.
• Eliminate the spots flagged in the image processing stage.
• Calculate the ratio of the two channels (cy5/cy3).
• Apply a logarithmic transformation.

Affymetrix:
• Apply a floor function to bring all negative values to a small positive value.
• Apply a logarithmic transformation.
• Normalize overall array intensity.
• Combine replicate probes to calculate an overall expression value.

Normalization
• Normalization is a particular type of preprocessing done to eliminate systematic differences across data sets.
[Figure: arrays before normalization and after normalization]

Location Normalization
• $X_j^{\mathrm{norm}} = X_j - l$
– $l$ can be identical for all values of $X_j$:
  • Global median normalization: $l = \mathrm{median}(X_j)$
– $l$ can vary:
  • Global Lowess/Loess: $l$ = the expected value after locally fitting a smooth regression curve on the global MA plot.
  • Print-tip Lowess/Loess: $l$ = the loess/lowess fit to the MA plot for each print-tip.
[Figure: loess normalization result, global loess vs. print-tip lowess/loess]

Quantile Normalization
• Assumption: The measurements from different arrays share the same underlying distribution.
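The procedure implied by this assumption can be sketched in a few lines of base R. This is an illustrative sketch only, not the limma implementation: the function name quantile_normalize is hypothetical, ties are broken by order of appearance, and missing values are not handled. The input values are the three-array, five-gene numbers from the worked example in this section.

```r
# Minimal quantile normalization sketch (hypothetical helper, base R only).
quantile_normalize <- function(m) {
  m <- as.matrix(m)
  # Rank of every value within its own array (column).
  ranks <- apply(m, 2, rank, ties.method = "first")
  # Sort each column, then average across arrays at each rank position.
  rank_means <- unname(rowMeans(apply(m, 2, sort)))
  # Replace each value by the mean belonging to its rank.
  out <- apply(ranks, 2, function(r) rank_means[r])
  dimnames(out) <- dimnames(m)
  out
}

# Rows are genes, columns are arrays (values taken from the slide example).
raw <- matrix(c( 0.974,  0.341,  0.411,
                 1.857, -0.036,  0.634,
                -0.386, -1.461,  0.885,
                -0.539,  0.368, -0.530,
                -0.862,  2.634,  2.340),
              ncol = 3, byrow = TRUE,
              dimnames = list(paste0("gene", 1:5), paste0("array", 1:3)))

qn <- quantile_normalize(raw)
print(round(qn, 3))

# Every array now holds the same set of values, only in a different order.
print(round(apply(qn, 2, sort), 3))
```

For real data, limma's normalizeQuantiles(), used in the R practice for this section, is the function to prefer; the sketch is only meant to make the rank-and-average step visible.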
• Example with three arrays (columns) and five genes (rows); the rank of each value within its array is given in brackets:

Array 1     Array 2     Array 3
 0.974[4]    0.341[3]    0.411[2]
 1.857[5]   -0.036[2]    0.634[3]
-0.386[3]   -1.461[1]    0.885[4]
-0.539[2]    0.368[4]   -0.530[1]
-0.862[1]    2.634[5]    2.340[5]

• Each value is replaced by the mean of the values that share its rank, e.g. for rank 1: -0.951 = (-0.862 – 1.461 – 0.530)/3
• It forces the arrays to have an identical distribution:

Array 1     Array 2     Array 3
 0.742[4]    0.196[3]   -0.055[2]
 2.277[5]   -0.055[2]    0.196[3]
 0.196[3]   -0.951[1]    0.742[4]
-0.055[2]    0.742[4]   -0.951[1]
-0.951[1]    2.277[5]    2.277[5]

R practice (normalization)
> x=read.delim(file.choose(),colClasses=rep(c("character","numeric"),c(1,40)))
> xm = x[,-1]
> library(limma)
> xq = normalizeQuantiles(xm) # quantile normalization for log ratios
> plot(density(xm[,1],na.rm=TRUE),ylim=c(0,0.7))
> for(i in 2:40) lines(density(xm[,i],na.rm=TRUE))
> xn = normalizeCyclicLoess(xm) # loess normalization for log ratios
> plot(density(xn[,1],na.rm=TRUE),ylim=c(0,0.7))
> for(i in 2:40) lines(density(xn[,i],na.rm=TRUE))

Data Visualization
• Showing expression level of one gene (sample):
– histogram
– box plot
• Comparing expression levels of two genes (samples):
– side-by-side box plot
– scatter plot and/or MA plot
• Presenting the similarity among multiple genes (samples):
– side-by-side box plot
– pairwise scatter plot and/or MA plot
– heatmap

Histogram
• Histogram: the graph shows the frequency distribution of the values in a given data set.
Step 1: Divide the entire range of values encountered in the data set into several intervals (bins).
Step 2: Draw a bar for each bin; the height of the bar equals the number of values falling in the interval represented by the bin.
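The two histogram steps can be carried out by hand to see what hist() computes. The sketch below uses a small made-up vector of expression values and an arbitrary choice of four equal-width bins; both are assumptions for illustration only.

```r
# Hypothetical expression values for one gene across 12 samples.
values <- c(-1.2, -0.7, -0.4, -0.3, 0.0, 0.1, 0.2, 0.5, 0.6, 0.9, 1.4, 2.1)

# Step 1: divide the observed range into equal-width bins.
n_bins <- 4
breaks <- seq(min(values), max(values), length.out = n_bins + 1)

# Step 2: count the values in each bin; these counts are the bar heights.
bins   <- cut(values, breaks = breaks, include.lowest = TRUE)
counts <- table(bins)
print(counts)

# hist() performs the same binning when given the same breaks.
h <- hist(values, breaks = breaks, plot = FALSE)
print(h$counts)
```

Changing n_bins shows how strongly the shape of a histogram depends on the binning, which is worth checking before reading too much into a single plot.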
Box Plots
[Figure: annotated box plot with the following elements]
• Outlier(s): observations that are greater than UQ+1.5*IQD or less than LQ-1.5*IQD
• Upper whisker: the largest observation that is smaller than UQ+1.5*IQD
• Upper quartile (UQ)
• Median
• Lower quartile (LQ)
• Lower whisker: the smallest observation that is greater than LQ-1.5*IQD
• IQD = UQ-LQ

Side-by-side Box Plot
• For more than two genes (samples)

Scatter Plots
• For example, suppose a gene G has an expression level of e1 in the 1st sample and e2 in the 2nd sample; the point representing G is plotted at coordinates (e1, e2) in the scatter plot.
• Scatter plots allow us to observe certain important features of the data.
Note: Genes with similar expression levels in two experiments will appear around the first diagonal of the coordinate system.
• Example: Dye swap. The banana-shaped blob indicates a nonlinear dye effect.
[Figure: scatter plot of Cy5 against Cy3 showing a banana shape (Cy3 > Cy5)]

Scatter Plots vs. MA Plots
• The MA plot is a variant of the scatter plot. Let e1 be the expression in the 1st sample and e2 be the expression in the 2nd sample:
M = log(e1) – log(e2) = log(e1 / e2)
A = (log(e1) + log(e2))/2
The MA plot is the scatter plot of M (y-axis) against A (x-axis).
[Figure: scatter plot and the corresponding MA plot, with M = log(y) – log(x) and A = (log(y) + log(x))/2 for scatter-plot axes x and y]
Note: Genes with similar expression levels in two experiments will appear around the horizontal line M = 0. The sign of M depends only on which sample is subtracted from which.
• Limitation of scatter and MA plots: they can only be plotted in two or three dimensions.

Heatmaps
• A heatmap is a two-dimensional, rectangular, colored grid. It displays data that themselves come in the form of a rectangular matrix:
– The color of each rectangle is determined by the value of the corresponding entry in the matrix.
– The rows and columns of the matrix are rearranged independently so that similar rows are placed next to each other and similar columns are placed next to each other.
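The independent rearrangement of rows and columns can be sketched with hierarchical clustering, which is also what R's heatmap() builds on by default. The matrix below is made up for illustration, and the sketch omits the extra dendrogram reordering that heatmap() applies, so the exact order may differ from a real heatmap.

```r
# Hypothetical 4 x 4 expression matrix: rows g1/g3 are similar, as are g2/g4;
# columns s1/s2 are similar, as are s3/s4.
m <- matrix(c(1.0, 1.0, 9.0, 9.0,
              9.0, 9.0, 1.0, 1.0,
              1.2, 0.8, 9.1, 8.9,
              8.8, 9.2, 1.1, 0.9),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("g", 1:4), paste0("s", 1:4)))

# Cluster rows and columns separately on Euclidean distances.
row_order <- hclust(dist(m))$order
col_order <- hclust(dist(t(m)))$order

# Reordered matrix: similar rows and similar columns are now adjacent.
m_reordered <- m[row_order, col_order]
print(m_reordered)
```

Coloring each cell of m_reordered by its value gives the heatmap; the clustering only decides where each row and column is drawn.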
R practice (Visualization)
> library(gplots)
> xq = as.matrix(xq)
> hist(unlist(xq[1,])) # histogram for the 1st gene
> boxplot(xq[1,]) # boxplot for the 1st gene
> xn = as.matrix(xn)
> boxplot(xn) # side-by-side box plots, one per sample
> boxplot(xn,las=2) # same, with vertical axis labels
> pairs(xn[,1:3]) # pairwise scatter plots of the first three samples

R practice (Visualization)
> xnd = data.frame(xn)
> plot(xnd$A10.BE, xnd$A10.AF)
> plot(xnd$A10.BE, xnd$A10.AF, col=densCols(xnd$A10.BE, xnd$A10.AF))
> abline(0,1,col="red",lwd=2) # add reference line
> ### MA plot
> M = xnd$A10.AF - xnd$A10.BE
> A = (xnd$A10.AF + xnd$A10.BE)/2
> plot(A, M, col=densCols(A,M))

R practice (Visualization)
> library(gplots)
> heatmap(xn[1:100,])
> heatmap(xn[1:100,],col=greenred(256))
> heatmap(xn[1:100,],col=bluered(256)) # colorpanel
> heatmap.2(xn[1:100,],col=bluered(256)) # no scaling; with color key
> heatmap.2(xn[1:100,],col=bluered(256),trace="none")
> heatmap.2(xn[1:100,],col=bluered(256),trace="none",labRow=x[1:100,1])

Normalization for RNAseq
There are two main sources of systematic variability that require normalization.
(1) RNA fragmentation during library construction causes longer transcripts to generate more reads compared to shorter transcripts present at the same abundance in the sample (3&4).
(2) The variability in the number of reads produced for each run causes fluctuations in the number of fragments mapped across samples (1&2).

Normalization for RNAseq
• Single-end reads: use the reads per kilobase of transcript per million mapped reads (RPKM) metric:
$\mathrm{RPKM} = 10^9 \times R / (N \times L)$
where, following the name of the metric, $R$ is the number of reads mapped to the transcript, $N$ is the total number of mapped reads, and $L$ is the transcript length in bases.
• Paired-end reads: use the analogous fragments per kilobase of transcript per million mapped reads (FPKM) metric.

Scaling Method in DESeq

DE Analysis
• In many cases, the purpose of a microarray experiment is to compare the gene expression levels in two or several predetermined classes.
– The comparison is often performed on a gene-by-gene basis.
– For ease of interpretation, comparisons usually ignore the dependencies between genes.
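Returning to the RPKM metric for single-end reads given above, the arithmetic can be checked on a toy count matrix. All numbers below are made up, and the library size N is taken here as the column sum of the counted reads, which is an assumption; in practice N is the total number of mapped reads reported by the aligner.

```r
# Hypothetical read counts: 4 genes (rows) in 2 samples (columns).
# Sample S2 was sequenced exactly twice as deeply as S1.
counts <- matrix(c(100, 200, 300, 400,
                   200, 400, 600, 800),
                 ncol = 2,
                 dimnames = list(paste0("gene", 1:4), c("S1", "S2")))

# Hypothetical transcript lengths in bases (L).
gene_length <- c(gene1 = 1000, gene2 = 2000, gene3 = 500, gene4 = 4000)

# N: total mapped reads per sample (assumed equal to the column sums here).
lib_size <- colSums(counts)

# RPKM = 10^9 * R / (N * L): divide columns by N, then rows by L.
rpkm <- 1e9 * sweep(counts, 2, lib_size, "/") / gene_length
print(rpkm)
```

Because S2 differs from S1 only in sequencing depth, both columns of rpkm come out identical, which is the effect of dividing by N; dividing by L makes genes of different length comparable within a sample.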
Fold Change
• Fold change is an important and intuitive approach to finding differentially regulated genes:
Fold change (FC) = (Expression of Experimental Sample) / (Expression of Reference Sample)
• It may be the only possibility in cases where no replicates are available.
• The fold-change threshold is chosen arbitrarily, and the method cannot assess the level of significance.

DE Analysis for Microarray
Data type             Paired data                  Unpaired data            Complex data (more than two groups)
Parametric tests      Paired t-test                Two-sample t-test        Analysis of Variance (ANOVA)
Nonparametric tests   Wilcoxon signed-rank test    Wilcoxon rank-sum test   Kruskal-Wallis test
Numerical methods     Permutation tests; Monte-Carlo permutation tests
Assumptions of the parametric tests: normality; equal variance

R practice (DE analysis)
> trt = gl(2,20,labels=c("BE","AF"))
> summary(aov(xn[1,]~trt))[[1]]$Pr[1]
> library(lmPerm)
> summary(aovp(xn[1,]~trt))[[1]]$Pr[1] # permutation p-value
# anova (t-test) for each gene:
> p = c()
> for (i in 1:nrow(xn)){
+ p[i] = summary(aov(xn[i,]~trt))[[1]]$Pr[1] # i-th gene, not always the 1st
+ }

DE Analysis for RNAseq
• DESeq (DESeq2) is a Bioconductor package:
– Assumes the read counts follow a negative binomial (NB) distribution.
1. Estimate the variance for the NB distribution
2. Hypothesis testing under the NB distribution

DESeq2
• Input from count matrix: ctData.tab (18761 genes, 6 samples)

gene         T1a   T1b   T2    T3    N1    N2
Gene_00001   0     0     2     0     0     1
Gene_00002   20    8     12    5     19    26
Gene_00003   3     0     2     0     0     0
Gene_00004   75    84    241   149   271   257
Gene_00005   10    16    4     0     4     10
Gene_00006   129   126   451   223   243   149
Gene_00007   13    4     21    19    31    4
Gene_00008   0     3     0     0     0     0
Gene_00009   202   122   256   43    287   357
Gene_00010   10    8     56    145   14    15
Gene_00011   2     3     5     0     3     0
Gene_00012   104   60    218   213   111   121
Gene_00013   6     6     22    13    15    6

DESeq2
> library('DESeq2')
> sampleCountData = read.delim("data/ctData.tab",
+ colClasses=rep(c("character","integer"),c(1,6)),row.names=1)
> sampleCondition =
+ c("treated","treated","treated","treated","untreated","untreated")
> sampleColData = DataFrame(condition=as.factor(sampleCondition),
+ row.names=colnames(sampleCountData))
> dds = DESeqDataSetFromMatrix(countData = sampleCountData,
+ colData = sampleColData, design = ~ condition)
> colData(dds)$condition = relevel(colData(dds)$condition, "untreated")

DESeq2
> dds = DESeq(dds)
> res = results(dds)
> res = res[order(res$padj),]
> plotMA(dds)
> write.csv(as.data.frame(res), file="condition_treated_results.csv")
# LRT for multiple levels
> colData(dds)$condition = as.factor(c("t1","t1","t2","t2","ctrl","ctrl"))
> colData(dds)$condition = relevel(colData(dds)$condition, "ctrl")
> ddsLRT = DESeq(dds,test="LRT", reduced= ~ 1)
> resLRT=results(ddsLRT)
> mcols(ddsLRT,use.names=TRUE)[1:3,]
# when there is no replicate (recent DESeq2 releases may refuse such designs)
> trt = c("T1a","T1b")
> dds.short = DESeqDataSetFromMatrix(countData = sampleCountData[,1:2],
+ colData = DataFrame(condition=as.factor(trt), row.names=trt),
+ design = ~ condition)
> dds.short = DESeq(dds.short)
> plotMA(dds.short)

Hypothesis Testing in Microarray Study
• In all microarray datasets, we are interested in identifying differentially expressed genes.
• The method is then applied to every gene (one gene at a time) on the microarray in order to identify the genes that are differentially expressed.
Control of FDR!

FDR
Microarray:
DESeq2: