Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
EMBO 2005. Short Tutorial in MADE4 – Ordination of Microarray Data The files you will need for this tutorial are in http://bioinf.ucd.ie/people/aedin/EMBO2005. This document is on that web site too. It’s a good idea to open a copy of this so you can copy (and paste) commands between this document and R. For the first part of this tutorial we will use a subset of the primate fibroblast gene expression from Karaman et al., Genome Research 2004. This study examines 3 groups, human, bonobo and gorilla expression profiles on Affymetrix HG_U95Av2 chips. This dataset contains 46 chips and is available in the Bioconducor library fibroEset (MAS5.0 data), and the web site http:// (raw cel files). But for this tutorial we will look at 11 chips which have been normalised using vsn. For information I have included details of how I normalised these, at the end of the tutorial. Task 1. Initial data Exploration Download the vsn normalised 11 gene expression profiles from the web site. Read this into R. The data are stored as comma separated files, which is readable by Excel. To load in R: data.vsn<- read.csv(“data.vsn.csv”) library(affy) library(made4) overview(data.vsn) The function ord simplifies the running of ordination methods such as principal component, correspondence or non-symmetric correspondence analysis. It provides a wrapper which can call each of these methods in ade4. data.coa <-ord(data.vsn, type=”coa”) Have a look at data.coa. The ordination results are in $ord. The row, column coordinates are in $li and $co. The eigenvalues are in $eig. plot(data.coa) heatplot(data.coa$ord$li, dend=FALSE, lowcol="blue") We have run a Correspondence Analysis, Compare these results to a PCA. data.pca <-ord(data.vsn, type=”pca”) Compare the difference plots from PCA and COA. plotarrays(data.pca) plotarrays(data.pca, graph=”s.var”) plotgenes(data.pca) At this stage.. it would be useful to colour the plots to help our interpretation Task 2: Interpretation – labelling with covariates Although the overview shows that we have 2 major groups within the data, it is very difficult to know what we are looking at. Lets read a text file (tab delimited) with some sample information into R. annt<-read.table("annt.txt", header=TRUE) annt read.table reads in a table as a data.frame. This is just a data matrix with labels. To view the data in the Age column, use the $ symbol before the column label. annt$Age This file contains short names for the samples, information about the Donor (Gorilla, Bonobo, Human), Age (years), gender (male/female), doubling time of the cell lines, and information about whether cells where established from the same cell lines. The column heading are; Cels AG_04659_AS.cel AG_05283_AS.CEL AG_05414_AS.cel AG_11745_AS.cel AG_13927_AS.cel KB_5047_2070_2_AS.CEL KB_5275_2_AS.CEL KB_5828_AS.cel KB_6268_2_AS.cel KB_8025_AS.cel KB_8840_AS.cel short.names AG_04659 AG_05283 AG_05414 AG_11745 AG_13927 KB_5047 KB_5275 KB_5828 KB_6268 KB_8025 KB_8840 Donor Hsa Hsa Hsa Hsa Hsa Ggo Ppa Ppa Ggo Ppa Ggo Age 65 69 73 43 45 19 2 12 19 19 2 Gender M M M F F F M M F M F DT 1.7 1.7 2.3 1.8 2.8 2 2.4 2.7 2 2 2.5 estb.same F F D D - Lets review the overview graph, but this time label it with information about the Donor. overview(data.vsn, label=annt$Donor) This looks better, human are distinct from other primates. But BEFORE we go ahead and search for genes distinguishing these… .CHECK the other co variants (sample info). Do the samples also group by Age, Gender, DT, estb.same ? What do you think of the experimental design? How could it be improved? Also plot the arrays projections from the COA. plotarrays(data.coa, classvec=annt$Donor) plotarrays(data.coa, classvec=annt$Gender) Have a look at the different plots, Task 3: Getting out the Gene Information Do heatmap of top genes from axis 1. If you look at data.ord, you will see the gene co-ordinates are in $co. To get the top genes from axis 1 in the ordination. ax1<- topgenes(data.coa$ord$co, ends="both", axis=1) ax1 You will see that R sometimes has the habit of putting X in front of row names if they start with a number. Hence the names in ax1 don’t agree with data. Its easier to sort this an remove the “X” in the names, ax1<-sub("X", "", ax1) To make a heatmap and dendrogram of these: heatplot(data.vsn[ax1,]) savePlot("heatplot_COA") Task 4: Annotating the plots with gene information GO Annotation: Get the annotation for data (this can be done using either the annaffy or biomaRt packages which will both be demonstrated during the course. For this practical we will use annaffy) To get the Gene Symbol for all genes, and use these in plots library(annaffy) If annaffy is not installed, download annaffy, GO and KEGG from Bioconductor or use nstall.packages2() if reposTools is installed. affy.id <-rownames(data) affy.symbols<-aafSymbol(affy.id, “hgu95av2”) affy.symbols <-getText(affy.symbols) plotgenes(data.coa, varlabels= affy.symbols) To get lots Gene Information for a subset of genes anncols<-aaf.handler() anncols anntable <- aafTableAnn(ax1, "hgu95av2", anncols) saveHTML(anntable, "example1.html", title = "Example") A copy of this output is available on the web page http://bioinf.ucd.ie/people/aedin/EMBO2005. Have a look. Are there any genes of interest to you? Advanced Tasks If you have time, there are extra tasks. If not the code may be useful to you in your own data analysis. Advanced Task 1: 3D Plots Using scatterplot3D library do3d(data.coa$ord$li, classvec=annt$Donor, cex.symbols=3) rotate3d(data.coa$ord$li, classvec=annt$Donor) html3D(data.coa$ord$li, classvec=annt$Donor) html3D produces a plot which can be rotated using chime or jmol. For an example see the course website. Advanced Task 2: Use SAM to select genes significant with a 1% FDA and examine these using ordination and clustering library(siggenes) # SAM of human versus non-humanoid primate sam.c1<-as.character(annt$Donor) sam.c1 sam.c1[!sam.c1==”Hsa”]<-“other primate” sam.c1 # Perform a SAM analysis for the two class unpaired case assuming unequal variances. Perform 100 permutations (B). sam.out<-sam(data.vsn, sam.c1, B=100) sam.out plot(sam.out) savePlot("Sam.plot1") # Select a delta to give a good a 1 %FDR and reasonable number of genes. For example to obtain the SAM plot for Delta = 2 plot(sam.out,2) # Get information about the significant genes at delta =4, use: plot(sam.out, 4) sam.res <-summary(sam.out, 4) names(sam.res) length(sam.res$row.sig.genes) sam.genes<-data.vsn[sam.res$row.sig.genes,] heatplot(t(sam.genes)) plotarrays(ord(sam.genes), classvec=sam.c1) Advanced Task 3: Supervised Ordination – Between Group Analysis For this example, we will look at the Khan et al., (2001) datasets which measured the genes expression of four types of small round blue cell tumours dataset, these are EWS, RMS, BL and NB. This example was described by Culhane et al., 2002 in their paper in which they apply BGA to the class prediction of these tumours. data(khan) names(khan) ?khan khan.bga<-bga(khan$train, classvec=khan$train.classes) khan.bga plot(khan.bga, genelabels=khan$annotation$Symbol) Provide a view of the principal components (axes) of the bga heatplot(khan.bga$bet$ls, dend=FALSE,lowcol="blue") In supervised an analysis such as this it is essential to test the discrimination using cross-validation. To perform jack-knife one leave out cross –validation use bga.jackknife(khan$train, khan$train.classes) To project new test data use khan.supp<- suppl(khan.bga, supdata=khan$test, supvec=khan$test.classes) It will output something like TEST.9 TEST.11 TEST.5 TEST.8 TEST.10 projected.Axis1 projected.Axis2 closest.centre predicted true.class 0.1400 -0.038 RMS RMS Normal -0.1400 -0.170 EWS EWS Normal -0.0600 -0.120 EWS EWS Normal 0.0420 -0.094 NB NB NB 0.4600 0.160 RMS RMS RMS Which gives the class predicted by BGA (predicted), and the closest group centroid in Elucidean space (closest centre). If you supplied a supvec, which gives the true classes of the test data, this will be attached to the output (last column true.class). To perform training and test in one use: khan.bga<-bga.suppl(khan$train, supdata=khan$test, classvec=khan$train.classes) Advanced Task 3: Comparing datasets (meta-analysis) using Coinertia Analysis Coinertia analysis has been applied to the cross-platform comparison (meta-analysis) of microarray gene expression datasets (Culhane et al., 2003). CIA is a multivariate method that identifies trends or co-relationships in multiple datasets which contain the same samples. That is either the rows or the columns of a matrix must be "matchable". CIA can be applied to datasets where the number of variables (genes) far exceeds the number of samples (arrays) such is the case with microarray analyses. cia calls coinertia in the ade4 package. See the CIA vignette for more details on this method. To run this. Lets look at two gene expression datasets of the same 60 cell lines. The NCI60 cells lines are a set of 60 cell lines with different tumour phenotypes (eg Breast, Colon, Leukemia, Prostate, CNS, lung cancer, ovarian, renal cancer etc). The gene expression of these cell lines have been examined by a number of groups. We will compare one from Affymetrix (Staunton et al., 2001) and one that was obtained using Stanford spotted cDNA arrays (Ross et al., 2000) using CIA. data(NCI60) summary(NCI60) names(NCI60) NCI60$classes coin <- cia(NCI60$Ross, NCI60$Affy) names(coin) coin$coinertia$RV The RV coefficient $RV which is 0.78 in this instance, is a measure of “global” similarity between the datasets. The closer to 1, in the scale 0-1 the greater the correlation between the two datasets. To visually examine the cell lines that have similar or different gene expression profiles in these datasets. plotarrays(coin, classvec=NCI60$classes[,2]) plotarrays(coin, classvec=NCI60$classes[,2], lab=NULL, cpoint=3) plot(coin, classvec=NCI60$classes[,2]) For each of the 60 cell lines, there are two coordinates ($coinertia$mX and $coinertia$mY). On the plot, these are visually shown as a closed circle and an arrow. These are joined by a line. If the profiles are similar they will be projected close together in the new space (ie joined by a short line). For more information see Culhane et al., BMC bioinformatics 2003. For information only, please don’t repeat in this today, this is ONLY background information. Vsn normalisation of data. To produce this. I downloaded the cel files, Started R, Used File -> Change directory to select the directory with the cel files in it. Then getwd() dir() library(affy) library(vsn) cels <- list.celfiles() data <- ReadAffy(filenames= cels) normalize.AffyBatch.methods <- c(normalize.AffyBatch.methods, "vsn") data.vsn <- es1 = expresso(data, bg.correct = FALSE, normalize.method = "vsn", pmcorrect.method = "pmonly", summary.method = "medianpolish") exprs2excel(data.vsn, file="data.vsn.csv") A tab delimited text file could also be saved using write.exprs(data.vsn, file="data.rma.txt")