Download R and BioConductor

Application of R in High-Throughput Transcriptomic Data Analysis
Li-yu D Liu (劉力瑜), Ph.D., Associate Professor
Dept. of Agronomy, Biometry Division
National Taiwan University

R and BioConductor
• R: free software (under restriction) for statistical analysis
– Consists of packages and functions
• BioConductor: R software project for the analysis of biomedical and genomic data

Download R and BioConductor
• At the R prompt, enter:
• When the prompt appears again, enter: biocLite()
Note: Installation takes a while. It is finished only when the prompt appears again; do not close the window before then. (On current Bioconductor releases, BiocManager::install() replaces biocLite().)
HT Transcriptomic Data
• Microarray
• RNA-seq

Workflow
• Data import
• Preprocessing*
• Visualization
• DE analysis*
• Adjust p-values for multiple comparisons
• Cluster analysis
* Different methods are used for microarray and RNA-seq data

Data Import
• Data preparation: rows – genes; columns – samples
perou.tab:
Gene      A10.BE        A100.BE      A101.BE      A102.BE       A104.BE       A109.BE
ZFX       -0.635136257  0.05944527   0.295229309  0.044984367   -0.287152458  -0.37941922
ZNF133    -0.467818713  0.799491038  0.976087902  0.357995583   0.130832052   -0.551488048
MLL       -1.053529634  2.082635494  0.533484608  0.619339759   -0.12865468   -0.531352043
DSCR1L1   0.036382357   1.97379493   0.778467649  0.884597921   0.521095977   -0.291980555
WNT5A     0.639715605   1.301422591  1.638604466  1.614453552   1.489951333   0.404389806
VHL       0.578985072   0.495695163  0.745920541  0.284848118   -0.095009627  0.027397548
UCP3      -0.466207433  0.14974712   -0.479759813 -0.143049357  -1.168184358  -0.428626079
UNG       0.948059548   2.277515274  2.583034494  1.040525663   1.168658721   0.111174239
UGT2B15   0.486819386   1.407375595  0.79481675   0.705755013   0.385154897   -0.095643077
CDC34     0.756940439   0.52679799   0.430403039  0.245687928   0.517045852   0.007931734
UQCRH     0.708313508   0.480945459  0.637726158  0.630143563   0.589374242   -0.397081394
TCF3      0.854586173   0.925208757  0.719675866  0.2769754     0.359690955   -0.262695701
…         …
• Data import to R:
> d = read.delim(file.choose(), colClasses = rep(c("character","numeric"), c(1,40)))
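As a rough sketch of the import step, the same `read.delim()` call can be run without the interactive `file.choose()` dialog. The file written below is a made-up three-gene stand-in for perou.tab (names and values are illustrative only), and the number of numeric columns is read from the header instead of being hard-coded as 40.

```r
# Toy stand-in for perou.tab: first column gene names, other columns numeric.
# File contents are invented for illustration.
toy <- data.frame(Gene   = c("ZFX", "ZNF133", "MLL"),
                  A10.BE = c(-0.64, -0.47, -1.05),
                  A10.AF = c(0.06, 0.80, 2.08))
path <- tempfile(fileext = ".tab")
write.table(toy, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Count the sample columns from the header line (total columns minus the gene column)
n_samples <- length(strsplit(readLines(path, n = 1), "\t")[[1]]) - 1

# Same call pattern as on the slide, with an explicit path
d <- read.delim(path,
                colClasses = rep(c("character", "numeric"), c(1, n_samples)))

# Numeric matrix with rows = genes and columns = samples
expr <- as.matrix(d[, -1])
rownames(expr) <- d[[1]]
str(d)
```

Keeping the gene names as row names of a numeric matrix is one convenient layout for the later normalization and plotting steps; the slides themselves keep the data frame and drop the first column when needed.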
Pre-processing
• Pre-processing includes steps that extract or enhance meaningful data characteristics (e.g. taking the logarithm of the raw values).
Steps listed for two-channel microarray data and for Affymetrix data:
• Background correction.
• Eliminate the spots flagged in the image processing stage.
• Apply a floor function to bring all negative values to a small positive value.
• Calculate the ratio of the two channels (cy5/cy3).
• Apply a logarithmic transformation (listed for both platforms).
• Normalize overall array intensity.
• Combine replicate probes to calculate an overall expression value.
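A minimal sketch of three of the steps above (floor, channel ratio, logarithm), combined in one place purely for illustration. The intensities are invented, the floor value of 1 and the choice of base 2 are assumptions, since the slides give neither.

```r
# Hypothetical background-corrected intensities for 5 spots (made-up numbers)
cy5 <- c(1200, 350, -15, 80, 4000)
cy3 <- c(600, 700, 40, 80, 1000)

floor_value <- 1   # assumed small positive floor; not specified in the slides

# Floor: bring zero or negative values up to a small positive value
cy5_f <- pmax(cy5, floor_value)
cy3_f <- pmax(cy3, floor_value)

# Ratio of the two channels (cy5/cy3), then logarithmic transformation
ratio     <- cy5_f / cy3_f
log_ratio <- log2(ratio)   # base 2 is an assumption

print(round(log_ratio, 3))
```

Without the floor, the third spot would produce a negative ratio and `log2()` would return `NaN`, which is the practical reason for applying the floor before the logarithm.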
Introduction
• Normalization is a particular type of preprocessing done to eliminate systematic differences across data sets.
[Figure: Before normalization / After normalization]

Location Normalization
• Xj(norm) = Xj – l
– l can be identical for all values of Xj:
• Global median normalization: l = median(Xj)
– l can vary:
• Global Lowess/Loess: l = the expected value after locally fitting a smooth regression curve on the global MA plot.
• Print-tip Lowess/Loess: l = the loess/lowess fit to the MA plot for each print-tip.
[Figure: Loess Normalization Result; panels: Global Loess, Print-tip Lowess/Loess]
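The two cases of l described above can be sketched on simulated data. Everything below is an assumption for illustration: the log ratios M and average intensities A are simulated with an intensity-dependent bias, and the `span` and `degree` settings of `loess()` are arbitrary choices, not values from the slides.

```r
set.seed(1)
n <- 500
A <- runif(n, 6, 14)                                # simulated average log intensity
M <- 0.4 + 0.05 * (A - 10)^2 + rnorm(n, sd = 0.3)   # log ratios with a curved bias

# Global median normalization: l is the same constant for every spot
l_median <- median(M)
M_median <- M - l_median

# Global loess normalization: l varies with A
# (fitted value of a smooth curve through the MA plot)
fit     <- loess(M ~ A, span = 0.4, degree = 1)
l_loess <- fitted(fit)
M_loess <- M - l_loess

# The median shift centers the data but leaves the curvature;
# the loess fit removes most of it.
print(c(median_only = cor(M_median, (A - 10)^2),
        loess       = cor(M_loess,  (A - 10)^2)))
```

Print-tip normalization would repeat the loess step separately within each print-tip group, for example by splitting the spots on a print-tip factor.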
Quantile Normalization
• Assumption: The measurements from different arrays share the same underlying distribution.
Example (each column is one array; the rank within the array is given in brackets):
Before normalization:
 0.974[4]    0.341[3]    0.411[2]
 1.857[5]   -0.036[2]    0.634[3]
-0.386[3]   -1.461[1]    0.885[4]
-0.539[2]    0.368[4]   -0.530[1]
-0.862[1]    2.634[5]    2.340[5]
Mean of the values sharing the same rank:
[1] -0.951 = (-0.862 – 1.461 – 0.530)/3
[2] -0.055
[3]  0.196
[4]  0.742
[5]  2.277
→ It forces the arrays to have an identical distribution:
After normalization:
 0.742[4]    0.196[3]   -0.055[2]
 2.277[5]   -0.055[2]    0.196[3]
 0.196[3]   -0.951[1]    0.742[4]
-0.055[2]    0.742[4]   -0.951[1]
-0.951[1]    2.277[5]    2.277[5]
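The worked example above can be reproduced with a few lines of base R. This is a bare-bones sketch of the rank-mean idea (the helper name `quantile_normalize` is made up here, and ties and missing values are not handled); it is not the implementation used by `normalizeQuantiles()` in limma.

```r
# The 5 x 3 matrix from the example: rows = genes, columns = arrays
x <- matrix(c( 0.974,  0.341,  0.411,
               1.857, -0.036,  0.634,
              -0.386, -1.461,  0.885,
              -0.539,  0.368, -0.530,
              -0.862,  2.634,  2.340),
            nrow = 5, byrow = TRUE)

quantile_normalize <- function(x) {
  ranks      <- apply(x, 2, rank, ties.method = "first")  # rank within each array
  sorted     <- apply(x, 2, sort)                         # each array sorted ascending
  rank_means <- rowMeans(sorted)                          # mean of values sharing a rank
  xn <- apply(ranks, 2, function(r) rank_means[r])        # replace value by its rank mean
  dimnames(xn) <- dimnames(x)
  xn
}

xn <- quantile_normalize(x)
print(round(xn, 3))
```

After this step every column contains the same five values, only in a different order, which is what "identical distribution" means here.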
R practice (normalization)
> x=read.delim(file.choose(),colClasses=rep(c("character","numeric"),c(1,40)))
> xm = x[,-1] # drop the gene-name column; the 40 sample columns remain
> library(limma)
> xq = normalizeQuantiles(xm) # quantile normalization for log ratios
> # densities of the 40 arrays before normalization
> plot(density(xm[,1],na.rm=TRUE),ylim=c(0,0.7))
> for(i in 2:40) lines(density(xm[,i],na.rm=TRUE))
> xn = normalizeCyclicLoess(xm) # loess normalization for log ratios
> # densities of the 40 arrays after loess normalization
> plot(density(xn[,1],na.rm=TRUE),ylim=c(0,0.7))
> for(i in 2:40) lines(density(xn[,i],na.rm=TRUE))
Data Visualization
• Showing expression level of one gene (sample):
– histogram
– box plot
• Comparing expression levels of two genes (samples):
– side-by-side box plot
– scatter plot and/or MA plot
• Presenting the similarity among multiple genes (samples):
– side-by-side box plot
– pairwise scatter plot and/or MA plot
– heatmap
Histogram
• Histogram: a graph that shows the frequency distribution of the values in a given data set.
Step 1: Divide the entire range of values encountered in the data set into several intervals (bins).
Step 2: Draw a bar for each bin; the height of the bar equals the number of values falling in the interval represented by the bin.
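The two steps can be checked by hand on a small vector. The values and bin edges below are invented for illustration; `hist(..., plot = FALSE)` is used only to confirm that it counts the same way.

```r
x <- c(-1.2, -0.4, -0.3, 0.1, 0.2, 0.25, 0.6, 0.9, 1.4, 1.9)   # made-up values

# Step 1: divide the range into bins
breaks <- seq(-2, 2, by = 1)
bin    <- cut(x, breaks = breaks, right = TRUE, include.lowest = TRUE)

# Step 2: bar height = number of values falling in each bin
counts <- table(bin)
print(counts)

# hist() performs both steps; plot = FALSE returns the counts without drawing
h <- hist(x, breaks = breaks, plot = FALSE)
print(h$counts)
```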
Box plots
• Box: upper quartile (UQ), median, lower quartile (LQ); IQD = UQ - LQ
• Upper whisker: the largest observation that is smaller than UQ + 1.5*IQD
• Lower whisker: the smallest observation that is greater than LQ - 1.5*IQD
• Outlier(s): observations that are greater than UQ + 1.5*IQD or less than LQ - 1.5*IQD
Side-by-side Box Plot
• For more than two genes (samples)
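The quantities on the box plot can be computed directly from the definitions above. This is a sketch on invented data using `quantile()` for the quartiles; R's `boxplot()` uses hinges from `fivenum()`, so its box edges can differ slightly from these values on some data sets.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)   # made-up expression values, one extreme

LQ  <- quantile(x, 0.25, names = FALSE)  # lower quartile
UQ  <- quantile(x, 0.75, names = FALSE)  # upper quartile
IQD <- UQ - LQ

upper_fence <- UQ + 1.5 * IQD
lower_fence <- LQ - 1.5 * IQD

outliers      <- x[x > upper_fence | x < lower_fence]
upper_whisker <- max(x[x < upper_fence])   # largest observation below the upper fence
lower_whisker <- min(x[x > lower_fence])   # smallest observation above the lower fence

print(c(LQ = LQ, UQ = UQ, IQD = IQD,
        lower_whisker = lower_whisker, upper_whisker = upper_whisker))
print(outliers)
```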
Scatter Plots
• For example, suppose a gene G has an expression level of e1 in the 1st sample and e2 in the 2nd sample; the point representing G will be plotted at coordinates (e1, e2) in the scatter plot.
• Scatter plots allow us to observe certain important features of the data:
– Example: Dye swap: the banana-shaped blob indicates a nonlinear dye effect.
[Figure: scatter plot with axes Cy3 and Cy5; banana shape (Cy3 > Cy5)]
Note: Genes with similar expression levels in two experiments will appear around the first diagonal of the coordinate system.
Scatter Plots vs. MA Plots
• The MA plot is a variant of the scatter plot. Let e1 be the expression in the 1st sample and e2 be the expression in the 2nd sample:
M = log(e1) – log(e2) = log(e1 / e2)
A = (log(e1) + log(e2))/2
The MA plot is the scatter plot of M (y-axis) against A (x-axis).
• In terms of the scatter-plot axes x and y: M = log(y) – log(x), A = (log(y) + log(x))/2
Note: Genes with similar expression levels in two experiments will appear around the horizontal line M = 0.
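A small numeric check of the M and A definitions, using four invented genes. Base 2 is an assumption; the slides write log without a base.

```r
# Made-up expression levels of four genes in two samples
e1 <- c(100, 400,  50, 800)
e2 <- c(100, 100, 200, 800)

M <- log2(e1) - log2(e2)          # log ratio
A <- (log2(e1) + log2(e2)) / 2    # average log expression

print(data.frame(e1, e2, M, A))
# plot(A, M); abline(h = 0)       # uncomment to draw the MA plot
```

Genes 1 and 4 have equal expression in both samples and land on M = 0 at different A values, while genes 2 and 3 show a fourfold change in opposite directions (M = 2 and M = -2).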
• Limitation of scatter and MA plots: they can only be plotted in two or three dimensions.
Heatmaps
• A heatmap is a two-dimensional, rectangular, colored grid. It displays data that themselves come in the form of a rectangular matrix:
– The color of each rectangle is determined by the value of the corresponding entry in the matrix.
– The rows and columns of the matrix are rearranged independently so that similar rows and similar columns are placed next to each other.
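The rearrangement of rows and columns can be sketched with hierarchical clustering in base R. The matrix is simulated, and Euclidean distance with the default linkage of `hclust()` is an assumed choice; `heatmap()` additionally reorders its dendrograms, so its final ordering may differ from the one computed here.

```r
set.seed(2)
m <- matrix(rnorm(6 * 4), nrow = 6,
            dimnames = list(paste0("gene", 1:6), paste0("s", 1:4)))

# Cluster rows and columns independently
row_order <- hclust(dist(m))$order      # similar genes end up adjacent
col_order <- hclust(dist(t(m)))$order   # similar samples end up adjacent

m_reordered <- m[row_order, col_order]
print(round(m_reordered, 2))
# image(t(m_reordered))                 # uncomment for a bare-bones colored grid
```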
R practice (Visualization)
> library(gplots)
> xq = as.matrix(xq)
> hist(unlist(xq[1,])) # histogram for the 1st gene
> boxplot(xq[1,]) # boxplot for the 1st gene
> xn = as.matrix(xn)
> boxplot(xn)
> boxplot(xn,las=2)
> pairs(xn[,1:3])
R practice (Visualization)
> xnd = data.frame(xn)
> plot(xnd$A10.BE, xnd$A10.AF)
> plot(xnd$A10.BE, xnd$A10.AF, col=densCols(xnd$A10.BE, xnd$A10.AF))
> abline(0,1,col="red",lwd=2) # add reference line
> ### MA plot
> M = xnd$A10.AF - xnd$A10.BE
> A = (xnd$A10.AF + xnd$A10.BE)/2
> plot(A, M, col=densCols(A,M))
R practice (Visualization)
> library(gplots)
> heatmap(xn[1:100,])
> heatmap(xn[1:100,],col=greenred(256))
> heatmap(xn[1:100,],col=bluered(256)) # colorpanel
> heatmap.2(xn[1:100,],col=bluered(256)) # no scaling; with color key
> heatmap.2(xn[1:100,],col=bluered(256),trace="none")
> heatmap.2(xn[1:100,],col=bluered(256),trace="none",labRow=x[1:100,1])

Normalization for RNAseq
There are two main sources of systematic variability that require normalization.
(1) RNA fragmentation during library construction causes longer transcripts to generate more reads than shorter transcripts present at the same abundance in the sample (3 & 4).
(2) Variability in the number of reads produced in each run causes fluctuations in the number of fragments mapped across samples (1 & 2).
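Normalization for RNAseq
• Single-end reads: use the reads per kilobase of transcript per million mapped reads (RPKM) metric:
RPKM = 10^9 x R / (N x L)
where R is the number of reads mapped to the transcript, N is the total number of mapped reads, and L is the transcript length in bases.
• Paired-end reads: use the analogous fragments per kilobase of transcript per million mapped reads (FPKM) metric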
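The RPKM formula above translates directly into a one-line function. The helper name `rpkm` and the numbers are made up for illustration; R is the read count of a transcript, N the total number of mapped reads, and L the transcript length in bases.

```r
# RPKM = 10^9 * R / (N * L)
rpkm <- function(R, N, L) 1e9 * R / (N * L)

# Two hypothetical transcripts at the same abundance: the longer one
# collects twice as many reads, yet both get the same RPKM.
reads   <- c(short = 500, long = 1000)
lengths <- c(short = 2000, long = 4000)
N       <- 1e7                      # total mapped reads in the run (assumed)

values <- rpkm(reads, N, lengths)
print(values)
```

Dividing by L removes the length effect named in (1), and dividing by N removes the sequencing-depth effect named in (2).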
Scaling Method in DESeq
DE Analysis
• In many cases, the purpose of a microarray experiment is to compare the gene expression levels in two or several predetermined classes.
– The comparison is often performed on a gene-by-gene basis.
– For ease of interpretation, comparisons usually ignore the dependencies between genes.
Fold Change
• Fold change is an important and intuitive approach to finding differentially regulated genes:
Fold change (FC) = Expression of Experimental Sample / Expression of Reference Sample
• It may be the only possibility in cases where no replicates are available.
• The fold-change cutoff is chosen arbitrarily, and the approach cannot assess the level of significance.

DE Analysis for Microarray
Data Type              Paired Data                 Unpaired Data            Complex Data (more than two groups)
Parametric Tests*      Paired t-test               Two-sample t-test        Analysis of Variance (ANOVA)
Nonparametric Tests    Wilcoxon signed-rank test   Wilcoxon rank-sum test   Kruskal-Wallis test
Numerical methods      Permutation tests; Monte-Carlo permutation tests
* Assumptions: normality; equal variance
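A short sketch of the fold change and of one test from each row of the table, for a single hypothetical gene. All expression values are invented, and treating the same vectors once as paired and once as unpaired is done only to show the function calls.

```r
# Made-up expression values of one gene (raw scale)
ref   <- c(4.1, 3.7, 4.3, 4.0, 3.9)   # reference samples
exper <- c(8.2, 7.8, 8.4, 7.6, 8.0)   # experimental samples
third <- c(6.1, 5.9, 6.3, 5.8, 6.0)   # a third group, for the multi-group case

# Fold change: experimental over reference
fc <- mean(exper) / mean(ref)

# Parametric tests
p_paired   <- t.test(exper, ref, paired = TRUE)$p.value      # paired data
p_unpaired <- t.test(exper, ref, var.equal = TRUE)$p.value   # unpaired, equal variance

# Nonparametric tests
p_ranksum <- wilcox.test(exper, ref)$p.value                 # Wilcoxon rank-sum
p_kruskal <- kruskal.test(list(ref, third, exper))$p.value   # more than two groups

print(c(FC = fc, paired_t = p_paired, two_sample_t = p_unpaired,
        rank_sum = p_ranksum, kruskal = p_kruskal))
```

The fold change alone gives no p-value, which is the limitation noted in the bullets above; the tests supply one but need replicates.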
R practice (DE analysis)
> trt = gl(2,20,labels=c("BE","AF")) # assumes 20 BE columns followed by 20 AF columns
> summary(aov(xn[1,]~trt))[[1]]$Pr[1] # ANOVA p-value for the 1st gene
> library(lmPerm)
> summary(aovp(xn[1,]~trt))[[1]]$Pr[1] # permutation p-value
# anova (t-test) for each gene:
> p = numeric(nrow(xn))
> for (i in 1:nrow(xn)){
+ p[i] = summary(aov(xn[i,]~trt))[[1]]$Pr[1] # row i, not row 1
+ }

DE Analysis for RNAseq
• DESeq (DESeq2) is a BioC package:
– Assumes the read counts follow a negative binomial (NB) distribution.
1. Estimate the variance for the NB distribution
2. Hypothesis testing under the NB distribution

DESeq2
• Input from count matrix (ctData.tab):
gene         T1a   T1b   T2    T3    N1    N2
Gene_00001   0     0     2     0     0     1
Gene_00002   20    8     12    5     19    26
Gene_00003   3     0     2     0     0     0
Gene_00004   75    84    241   149   271   257
Gene_00005   10    16    4     0     4     10
Gene_00006   129   126   451   223   243   149
Gene_00007   13    4     21    19    31    4
Gene_00008   0     3     0     0     0     0
Gene_00009   202   122   256   43    287   357
Gene_00010   10    8     56    145   14    15
Gene_00011   2     3     5     0     3     0
Gene_00012   104   60    218   213   111   121
Gene_00013   6     6     22    13    15    6
(18761 genes, 6 samples)

DESeq2
> library('DESeq2')
> sampleCountData = read.delim("data/ctData.tab",
+     colClasses=rep(c("character","integer"),c(1,6)),row.names=1)
> sampleCondition =
+     c("treated","treated","treated","treated","untreated","untreated")
> sampleColData = DataFrame(condition=as.factor(sampleCondition),
+     row.names=colnames(sampleCountData))
> dds = DESeqDataSetFromMatrix(countData = sampleCountData,
+     colData = sampleColData,
+     design = ~ condition)
> colData(dds)$condition = relevel(colData(dds)$condition, "untreated")
DESeq2
> dds = DESeq(dds)
> res = results(dds)
> res = res[order(res$padj),] # sort genes by adjusted p-value
> plotMA(dds)
> write.csv(as.data.frame(res), file="condition_treated_results.csv")
# LRT for multiple levels
> colData(dds)$condition = as.factor(c("t1","t1","t2","t2","ctrl","ctrl"))
> colData(dds)$condition = relevel(colData(dds)$condition, "ctrl")
> ddsLRT = DESeq(dds,test="LRT", reduced= ~ 1)
> resLRT=results(ddsLRT)
> mcols(ddsLRT,use.names=TRUE)[1:3,]
# when there is no replicate (newer DESeq2 releases may reject designs without replicates)
> trt = c("T1a","T1b")
> dds.short = DESeqDataSetFromMatrix(countData = sampleCountData[,1:2],
+     colData = DataFrame(condition=as.factor(trt), row.names=trt),
+     design = ~ condition)
> dds.short = DESeq(dds.short)
> plotMA(dds.short)
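The DESeq2 calls above rest on the negative binomial assumption for read counts. As a small illustration, the NB variance in the mean-dispersion form used by DESeq2 is mu + alpha * mu^2, so it exceeds the mean whenever the dispersion alpha is positive. The helper name `nb_variance` and the values of mu and alpha are made up; the check sums the probability mass function from base R's `dnbinom()`.

```r
# NB variance in mean/dispersion form: var = mu + alpha * mu^2
nb_variance <- function(mu, alpha) mu + alpha * mu^2

mu    <- 100    # hypothetical mean count
alpha <- 0.1    # hypothetical dispersion

# Numerical check against the NB probability mass function (size = 1/alpha)
k   <- 0:5000
pmf <- dnbinom(k, size = 1 / alpha, mu = mu)
mean_num <- sum(k * pmf)
var_num  <- sum((k - mean_num)^2 * pmf)

print(c(formula = nb_variance(mu, alpha), numeric = var_num, poisson = mu))
```

A Poisson model would force the variance to equal the mean (100 here); the NB model allows the much larger spread (1100) typical of replicate RNA-seq counts, which is why step 1 of the method is to estimate that variance.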
Hypothesis Testing in Microarray Study
• In all microarray datasets, we are interested in identifying differentially expressed genes.
• The method would then be applied to every gene (one gene at a time) on the microarray in order to identify those genes that are differentially expressed.
→ Control of FDR!
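One common way to control the FDR after testing every gene is the Benjamini-Hochberg adjustment; the slides do not say which procedure they use, so this choice is an assumption. The sketch below writes the adjustment out by hand (the helper name `bh_adjust` and the p-values are made up) and compares it with base R's `p.adjust()`.

```r
# Benjamini-Hochberg adjusted p-values, written out step by step
bh_adjust <- function(p) {
  n   <- length(p)
  o   <- order(p, decreasing = TRUE)              # largest p-value first
  adj <- pmin(1, cummin(p[o] * n / (n:1)))        # p * n / rank, made monotone
  out <- numeric(n)
  out[o] <- adj                                   # back to the original order
  out
}

p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205)  # made-up gene p-values

padj <- bh_adjust(p)
print(round(padj, 4))
print(sum(padj < 0.05))   # genes called at an FDR of 5%
```

In practice `p.adjust(p, method = "BH")` gives the same result in one call, and it can be applied to the vector `p` of per-gene p-values computed in the ANOVA loop.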
FDR
Microarray:
DESeq2: