EMBO 2005. Short Tutorial in MADE4 – Ordination of Microarray Data
The files you will need for this tutorial are in
http://bioinf.ucd.ie/people/aedin/EMBO2005.
This document is on that web site too. It’s a good idea to open a copy of this so you
can copy (and paste) commands between this document and R.
For the first part of this tutorial we will use a subset of the primate fibroblast gene
expression data from Karaman et al., Genome Research 2004. This study examines
expression profiles from 3 groups (human, bonobo and gorilla) on Affymetrix HG_U95Av2
chips. The dataset contains 46 chips and is available in the Bioconductor library
fibroEset (MAS5.0 data), and on the web site http:// (raw cel files).
For this tutorial, however, we will look at 11 chips which have been normalised using vsn.
For information, I have included details of how I normalised these at the end of the
tutorial.
Task 1: Initial data Exploration
Download the 11 vsn-normalised gene expression profiles from the web site and read
them into R. The data are stored as a comma-separated file, which is readable by Excel.
To load it in R:
data.vsn <- read.csv("data.vsn.csv")
library(affy)
library(made4)
overview(data.vsn)
The function ord simplifies the running of ordination methods such as principal
component, correspondence or non-symmetric correspondence analysis. It provides a
wrapper which can call each of these methods in ade4.
data.coa <- ord(data.vsn, type="coa")
Have a look at data.coa. The ordination results are in $ord. The row and column
coordinates are in $li and $co, and the eigenvalues are in $eig.
plot(data.coa)
heatplot(data.coa$ord$li, dend=FALSE, lowcol="blue")
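As a rough illustration of what a correspondence analysis computes, the sketch below carries one out by hand in base R on a small invented matrix. It is not the code that ord or ade4 runs, and the toy values and the names N, r.mass, c.mass, row.coord and col.coord are assumptions made for this example only.

```r
# Toy non-negative "expression" matrix: 6 genes (rows) x 4 arrays (columns).
# The numbers are invented for illustration.
N <- matrix(c(10, 12,  2,  3,
               9, 11,  3,  2,
               2,  3, 10, 12,
               3,  2, 11,  9,
               6,  6,  6,  6,
               5,  7,  6,  6),
            nrow = 6, byrow = TRUE,
            dimnames = list(paste0("gene", 1:6), paste0("array", 1:4)))

P      <- N / sum(N)      # correspondence matrix (sums to 1)
r.mass <- rowSums(P)      # row masses
c.mass <- colSums(P)      # column masses

# Standardised residuals from the independence model r.mass %o% c.mass
expected <- r.mass %o% c.mass
S <- (P - expected) / sqrt(expected)

dec <- svd(S)
eig <- dec$d^2            # eigenvalues: inertia carried by each axis

# Principal coordinates of rows and columns on each axis
row.coord <- (dec$u / sqrt(r.mass)) %*% diag(dec$d)
col.coord <- (dec$v / sqrt(c.mass)) %*% diag(dec$d)
rownames(row.coord) <- rownames(N)
rownames(col.coord) <- colnames(N)

round(eig / sum(eig), 3)  # proportion of total inertia per axis
```

The total inertia sum(eig) equals the chi-square statistic of the table divided by its grand total, which is why COA is often described as a decomposition of the chi-square association between rows and columns.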
We have run a correspondence analysis (COA). Compare these results to a PCA:
data.pca <- ord(data.vsn, type="pca")
Compare the different plots from PCA and COA.
plotarrays(data.pca)
plotarrays(data.pca, graph="s.var")
plotgenes(data.pca)
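For comparison, a PCA of the arrays can be sketched in base R with prcomp. This is only an illustration on invented values; the centring and scaling used here are assumptions and need not match what ord(type="pca") does internally.

```r
# Toy matrix: 5 genes (rows) x 4 arrays (columns); values are invented.
x <- matrix(c(8.1, 8.3, 6.0, 6.2,
              7.9, 8.0, 6.1, 5.9,
              6.0, 6.1, 8.2, 8.0,
              6.2, 5.9, 7.9, 8.1,
              7.0, 7.1, 7.0, 6.9),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("gene", 1:5), paste0("array", 1:4)))

# Arrays are the observations here, so transpose before calling prcomp.
pca <- prcomp(t(x), center = TRUE, scale. = FALSE)

array.coord <- pca$x          # array (sample) coordinates on each axis
gene.load   <- pca$rotation   # gene loadings on each axis
eig         <- pca$sdev^2     # eigenvalues (variance per axis)

round(eig / sum(eig), 3)      # proportion of variance per axis
array.coord[, 1]              # position of each array on the first axis
```

In this toy example arrays 1 and 2 fall on one side of the first axis and arrays 3 and 4 on the other, mirroring the two groups built into the data.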
At this stage, it would be useful to colour the plots to help our interpretation.
Task 2: Interpretation – labelling with covariates
Although the overview shows that we have 2 major groups within the data, it is very
difficult to know what we are looking at. Let's read a tab-delimited text file with
some sample information into R.
annt<-read.table("annt.txt", header=TRUE)
annt
read.table reads in a table as a data.frame. This is just a data matrix with labels. To
view the data in the Age column, use the $ symbol before the column label.
annt$Age
This file contains short names for the samples, information about the Donor (Gorilla,
Bonobo, Human), Age (years), Gender (male/female), doubling time of the cell lines (DT),
and information about whether cells were established from the same cell lines (estb.same). The
columns are:
Cels                    short.names  Donor  Age  Gender  DT
AG_04659_AS.cel         AG_04659     Hsa    65   M       1.7
AG_05283_AS.CEL         AG_05283     Hsa    69   M       1.7
AG_05414_AS.cel         AG_05414     Hsa    73   M       2.3
AG_11745_AS.cel         AG_11745     Hsa    43   F       1.8
AG_13927_AS.cel         AG_13927     Hsa    45   F       2.8
KB_5047_2070_2_AS.CEL   KB_5047      Ggo    19   F       2
KB_5275_2_AS.CEL        KB_5275      Ppa    2    M       2.4
KB_5828_AS.cel          KB_5828      Ppa    12   M       2.7
KB_6268_2_AS.cel        KB_6268      Ggo    19   F       2
KB_8025_AS.cel          KB_8025      Ppa    19   M       2
KB_8840_AS.cel          KB_8840      Ggo    2    F       2.5
estb.same (remaining column; only these five entries are listed): F, F, D, D, -
Let's review the overview graph, but this time label it with information about the
Donor.
overview(data.vsn, label=annt$Donor)
This looks better: human samples are distinct from the other primates. But BEFORE we go ahead
and search for genes distinguishing these, CHECK the other covariates (sample
info).
Do the samples also group by Age, Gender, DT, estb.same ?
What do you think of the experimental design?
How could it be improved?
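One way to approach these questions is to cross-tabulate the covariates against each other. The sketch below re-enters the sample table from this tutorial by hand (the data frame name annt.demo is made up for the example; in the practical you would use annt directly).

```r
# Sample annotation copied from the table in this tutorial.
annt.demo <- data.frame(
  Donor  = c("Hsa", "Hsa", "Hsa", "Hsa", "Hsa", "Ggo",
             "Ppa", "Ppa", "Ggo", "Ppa", "Ggo"),
  Age    = c(65, 69, 73, 43, 45, 19, 2, 12, 19, 19, 2),
  Gender = c("M", "M", "M", "F", "F", "F", "M", "M", "F", "M", "F"),
  DT     = c(1.7, 1.7, 2.3, 1.8, 2.8, 2, 2.4, 2.7, 2, 2, 2.5)
)

# Species against gender: are any cells of the table empty?
donor.by.gender <- table(annt.demo$Donor, annt.demo$Gender)
donor.by.gender

# Age range per species: do the ranges overlap?
age.range <- tapply(annt.demo$Age, annt.demo$Donor, range)
age.range
```

In this table all three gorilla samples are female and all three bonobo samples are male, and the human donors (43 to 73 years) are older than every ape donor (2 to 19 years), so species differences cannot be cleanly separated from gender or age effects.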
Also plot the array projections from the COA.
plotarrays(data.coa, classvec=annt$Donor)
plotarrays(data.coa, classvec=annt$Gender)
Have a look at the different plots.
Task 3: Getting out the Gene Information
Make a heatmap of the top genes from axis 1.
If you look at data.coa$ord, you will see the gene co-ordinates are in $co. To get the top
genes from axis 1 in the ordination:
ax1<- topgenes(data.coa$ord$co, ends="both", axis=1)
ax1
You will see that R sometimes has the habit of putting an X in front of names if they
start with a number. Hence the names in ax1 don't agree with those in data.vsn. It's easiest to fix
this by removing the leading "X" from the names:
ax1 <- sub("^X", "", ax1)
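The X prefix comes from R's name checking (make.names), which data.frame and read.csv apply by default. A small sketch of the behaviour, using example probe-set IDs in the Affymetrix style, shows why it is safer to anchor the pattern at the start of the name:

```r
ids <- c("1000_at", "1001_at", "AFFX-BioB-5_at")   # example probe-set IDs

# make.names() is applied to column names when check.names = TRUE
mangled <- make.names(ids)
mangled
# "X1000_at" "X1001_at" "AFFX.BioB.5_at"

# Anchoring the pattern with ^ removes only a leading X
anchored <- sub("^X", "", mangled)
anchored
# "1000_at" "1001_at" "AFFX.BioB.5_at"

# Without the anchor, the first X anywhere in the name is removed
unanchored <- sub("X", "", mangled)
unanchored
# "1000_at" "1001_at" "AFF.BioB.5_at"
```

Note that names containing a hyphen are also altered (the hyphen becomes a dot), so those IDs still will not match the originals after the X is removed. Reading the file with check.names=FALSE avoids the renaming in read.csv itself.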
To make a heatmap and dendrogram of these:
heatplot(data.vsn[ax1,])
savePlot("heatplot_COA")
Task 4: Annotating the plots with gene information
GO Annotation: Get the annotation for the data. This can be done using either the annaffy
or biomaRt packages, which will both be demonstrated during the course. For this
practical we will use annaffy.
To get the gene symbols for all genes, and use these in plots:
library(annaffy)
If annaffy is not installed, download annaffy, GO and KEGG from Bioconductor or
use install.packages2() if reposTools is installed.
affy.id <- rownames(data.vsn)
affy.symbols <- aafSymbol(affy.id, "hgu95av2")
affy.symbols <- getText(affy.symbols)
plotgenes(data.coa, varlabels=affy.symbols)
To get lots of gene information for a subset of genes:
anncols<-aaf.handler()
anncols
anntable <- aafTableAnn(ax1, "hgu95av2", anncols)
saveHTML(anntable, "example1.html", title = "Example")
A copy of this output is available on the web page
http://bioinf.ucd.ie/people/aedin/EMBO2005.
Have a look. Are there any genes of interest to you?
Advanced Tasks
If you have time, there are extra tasks. If not, the code may be useful to you in your
own data analysis.
Advanced Task 1: 3D Plots
Using the scatterplot3d library:
do3d(data.coa$ord$li, classvec=annt$Donor, cex.symbols=3)
rotate3d(data.coa$ord$li, classvec=annt$Donor)
html3D(data.coa$ord$li, classvec=annt$Donor)
html3D produces a plot which can be rotated using chime or jmol. For an example see
the course website.
Advanced Task 2: Use SAM to select genes significant with a 1% FDR and
examine these using ordination and clustering
library(siggenes)
# SAM of human versus non-human primate
sam.c1 <- as.character(annt$Donor)
sam.c1
sam.c1[sam.c1 != "Hsa"] <- "other primate"
sam.c1
# Perform a SAM analysis for the two class unpaired case assuming unequal
# variances. Perform 100 permutations (B).
sam.out<-sam(data.vsn, sam.c1, B=100)
sam.out
plot(sam.out)
savePlot("Sam.plot1")
# Select a delta to give a 1% FDR and a reasonable number of genes. For
# example, to obtain the SAM plot for Delta = 2:
plot(sam.out, 2)
# To get information about the significant genes at Delta = 4, use:
plot(sam.out, 4)
sam.res <-summary(sam.out, 4)
names(sam.res)
length(sam.res$row.sig.genes)
sam.genes<-data.vsn[sam.res$row.sig.genes,]
heatplot(t(sam.genes))
plotarrays(ord(sam.genes), classvec=sam.c1)
Advanced Task 3: Supervised Ordination – Between Group Analysis
For this example, we will look at the Khan et al. (2001) dataset, which measured the
gene expression of four types of small round blue cell tumours:
EWS, RMS, BL and NB. This example was described by Culhane et al. (2002) in their
paper, in which they apply BGA to the class prediction of these tumours.
data(khan)
names(khan)
?khan
khan.bga<-bga(khan$train, classvec=khan$train.classes)
khan.bga
plot(khan.bga, genelabels=khan$annotation$Symbol)
To view the principal components (axes) of the BGA:
heatplot(khan.bga$bet$ls, dend=FALSE,lowcol="blue")
In a supervised analysis such as this, it is essential to test the discrimination using
cross-validation. To perform jack-knife (leave-one-out) cross-validation, use:
bga.jackknife(khan$train, khan$train.classes)
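The idea behind leave-one-out cross-validation can be sketched in base R. The example below is not how bga.jackknife is implemented: it swaps in a simple nearest-centroid classifier on simulated data, and the function name nearest.centroid and all values are assumptions made for illustration.

```r
set.seed(1)
# Toy data: 12 samples (rows) x 5 genes (columns), two well-separated classes.
# Values are simulated for illustration only.
classes <- factor(rep(c("A", "B"), each = 6))
x <- matrix(rnorm(12 * 5, sd = 0.5), nrow = 12)
x[classes == "B", ] <- x[classes == "B", ] + 5   # shift class B

# Assign a new sample to the class whose mean profile is closest
# (squared Euclidean distance to each class centroid).
nearest.centroid <- function(train, train.cl, new) {
  centres <- sapply(levels(train.cl),
                    function(k) colMeans(train[train.cl == k, , drop = FALSE]))
  d <- colSums((centres - new)^2)
  names(which.min(d))
}

# Leave-one-out: hold out each sample in turn, train on the rest, predict it.
loo.pred <- sapply(seq_len(nrow(x)), function(i)
  nearest.centroid(x[-i, , drop = FALSE], classes[-i], x[i, ]))

table(predicted = loo.pred, true = classes)
accuracy <- mean(loo.pred == as.character(classes))
accuracy   # proportion of held-out samples classified correctly
```

The key point is that the held-out sample plays no part in computing the centroids used to classify it, so the accuracy is an estimate of performance on unseen samples rather than a measure of fit.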
To project new test data, use:
khan.supp<- suppl(khan.bga, supdata=khan$test, supvec=khan$test.classes)
It will output something like
        projected.Axis1 projected.Axis2 closest.centre predicted true.class
TEST.9           0.1400          -0.038            RMS       RMS     Normal
TEST.11         -0.1400          -0.170            EWS       EWS     Normal
TEST.5          -0.0600          -0.120            EWS       EWS     Normal
TEST.8           0.0420          -0.094             NB        NB         NB
TEST.10          0.4600           0.160            RMS       RMS        RMS
This gives the class predicted by BGA (predicted) and the closest group centroid in
Euclidean space (closest.centre). If you supplied a supvec, which gives the true
classes of the test data, this will be attached to the output (last column, true.class).
To perform training and testing in one step, use:
khan.bga<-bga.suppl(khan$train, supdata=khan$test,
classvec=khan$train.classes)
Advanced Task 4: Comparing datasets (meta-analysis) using Coinertia Analysis
Coinertia analysis (CIA) has been applied to the cross-platform comparison (meta-analysis)
of microarray gene expression datasets (Culhane et al., 2003). CIA is a multivariate
method that identifies trends or co-relationships in multiple datasets which contain the
same samples. That is, either the rows or the columns of the matrices must be
"matchable". CIA can be applied to datasets where the number of variables (genes) far
exceeds the number of samples (arrays), as is the case with microarray analyses. The function cia
calls coinertia in the ade4 package. See the CIA vignette for more details on this
method.
To run this, let's look at two gene expression datasets of the same 60 cell lines. The
NCI60 cell lines are a set of 60 cell lines with different tumour phenotypes (e.g.
breast, colon, leukemia, prostate, CNS, lung, ovarian and renal cancer). The
gene expression of these cell lines has been examined by a number of groups. We
will use CIA to compare one dataset from Affymetrix arrays (Staunton et al., 2001) with one that was obtained
using Stanford spotted cDNA arrays (Ross et al., 2000).
data(NCI60)
summary(NCI60)
names(NCI60)
NCI60$classes
coin <- cia(NCI60$Ross, NCI60$Affy)
names(coin)
coin$coinertia$RV
The RV coefficient ($RV), which is 0.78 in this instance, is a measure of "global"
similarity between the datasets. It ranges from 0 to 1; the closer to 1, the greater the
correlation between the two datasets.
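To make the RV coefficient more concrete, here is a minimal base R sketch of its plain, unweighted form for two matrices that share the same samples in rows. The function name rv.coef and the simulated data are assumptions for this example; the value reported by coinertia depends on the row and column weights of the underlying ordinations, so it need not equal this simple version.

```r
# RV coefficient between two matrices with the same samples in rows:
# RV = tr(XX'YY') / sqrt(tr(XX'XX') * tr(YY'YY')), after column centring.
rv.coef <- function(X, Y) {
  X <- scale(X, center = TRUE, scale = FALSE)
  Y <- scale(Y, center = TRUE, scale = FALSE)
  Sx <- tcrossprod(X)    # n x n matrix X X'
  Sy <- tcrossprod(Y)    # n x n matrix Y Y'
  # For symmetric matrices, tr(Sx %*% Sy) equals sum(Sx * Sy)
  sum(Sx * Sy) / sqrt(sum(Sx * Sx) * sum(Sy * Sy))
}

set.seed(42)
X <- matrix(rnorm(10 * 6), nrow = 10)                        # 10 samples, 6 variables
Y <- X[, 1:3] + matrix(rnorm(10 * 3, sd = 0.1), nrow = 10)   # noisy subset of X

rv.same <- rv.coef(X, X)   # identical datasets: RV is 1 (up to rounding)
rv.xy   <- rv.coef(X, Y)   # related datasets: somewhere between 0 and 1
rv.same
rv.xy
```

Note that the two matrices may have different numbers of variables; only the samples have to match, which is what makes the measure usable across array platforms.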
To visually examine the cell lines that have similar or different gene expression
profiles in these datasets:
plotarrays(coin, classvec=NCI60$classes[,2])
plotarrays(coin, classvec=NCI60$classes[,2], lab=NULL,
cpoint=3)
plot(coin, classvec=NCI60$classes[,2])
For each of the 60 cell lines, there are two coordinates ($coinertia$mX and
$coinertia$mY). On the plot, these are shown as a closed circle and an
arrow, joined by a line. If the profiles are similar, they will be projected
close together in the new space (i.e. joined by a short line). For more information see
Culhane et al., BMC Bioinformatics 2003.
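The length of each joining line can also be computed directly, which gives a simple ranking of the most discordant cell lines. The sketch below uses invented coordinates standing in for the two coordinate sets; the matrices, sample names and the name arrow.len are assumptions for illustration.

```r
# Invented coordinates for 4 samples on 2 axes, standing in for the
# two sets of coordinates (one per dataset).
mX <- matrix(c( 0.5,  0.2,
               -0.4,  0.1,
                0.0, -0.6,
                0.3,  0.3),
             ncol = 2, byrow = TRUE,
             dimnames = list(paste0("line", 1:4), c("Axis1", "Axis2")))
mY <- matrix(c( 0.5,  0.1,
               -0.1,  0.5,
                0.0, -0.6,
                0.9,  1.1),
             ncol = 2, byrow = TRUE,
             dimnames = dimnames(mX))

# Euclidean distance between the paired points of each sample
arrow.len <- sqrt(rowSums((mX - mY)^2))

# Longest lines first: these samples differ most between the datasets
sort(arrow.len, decreasing = TRUE)
```

With real output, the same calculation could be applied to the first two columns of the two coordinate tables, provided their rows are in the same sample order.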
For information only: please don't repeat this today, this is ONLY
background information. Vsn normalisation of data.
To produce the normalised data, I downloaded the cel files, started R, and used File -> Change
directory to select the directory with the cel files in it. Then:
getwd()
dir()
library(affy)
library(vsn)
cels <- list.celfiles()
data <- ReadAffy(filenames= cels)
normalize.AffyBatch.methods <- c(normalize.AffyBatch.methods, "vsn")
data.vsn <- expresso(data, bg.correct = FALSE, normalize.method = "vsn",
pmcorrect.method = "pmonly", summary.method = "medianpolish")
exprs2excel(data.vsn, file="data.vsn.csv")
A tab-delimited text file could also be saved using:
write.exprs(data.vsn, file="data.vsn.txt")