Clustering

In these exercises, you will work with an RNAseq dataset containing read counts for nearly 40,000 genes, measured in corn (Zea mays) seeds 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22 and 24 days after pollination (see clustering.zip on the website). On each day, two or three replicates have been measured, yielding a total of 32 data vectors. First, you will select a subset of the data and inspect it; then cluster the data using hierarchical clustering and K-means clustering; and finally inspect one of the clusters you found for enriched functional annotations.

1. Data selection and inspection

• Start RStudio.
• Load the data file WholeSeeds.rdata:

    load("WholeSeeds.rdata")

  and inspect the variables (Y, genes, dap, repl) in your environment window. What variable do you think contains what information?
• Select a number of interesting genes, e.g. 1,000:

    # Calculate MAD for each gene
    m = apply(Y,1,mad)
    # Sort genes by decreasing MAD
    ind = order(m,decreasing = TRUE,na.last = NA)
    data = Y[ind[1:1000],]
    genes = genes[ind[1:1000]]

  Can you find out what definition of “interesting” is used here? (hint: help(mad)).
• Inspect the data using image(data), image(log(data)), image(cor(t(data))) and image(cor(data)). Note that t() transposes the data matrix, allowing you to select either genes or experiments as columns. What do you notice in the figures?
• Inspect some gene profiles, e.g. using (for gene g):

    plot(dap[order(dap)],data[g,order(dap)],type='p')

• Optional: inspect the data (both genes and samples) using principal component analysis (PCA, function prcomp(…)). Have a look at the help to learn how to plot your data projected on the first two principal components.
• Optional: perform a linear discriminant analysis (LDA, function lda(…) in the MASS library) and inspect the outcome. Use the transposed data and dap as grouping variable.

2. Hierarchical clustering

• Calculate the Euclidean distance matrix between the genes in the data:

    D <- dist(data)

• Use this to calculate a hierarchical clustering with complete linkage:

    hc <- hclust(D)

• Visualise the clustering and highlight 5 clusters:

    plot(hc,labels = FALSE)
    rect.hclust(hc,k=5,border = "red")

  What do you think of the resulting clustering?
• Replace the Euclidean distance by a correlation-based distance, using D <- as.dist(1-cor(t(data))), and repeat the hierarchical clustering above. Do the clusterings improve?
• Vary the linkage (try “single” and “average”) and see whether the dendrograms change:

    hc <- hclust(D,method = "single")
    plot(hc,labels = FALSE)
    hc <- hclust(D,method = "average")
    plot(hc,labels = FALSE)

  Try to explain the results you get.
• Repeat the above to cluster experiments rather than genes:

    D <- dist(t(data))
    hc <- hclust(D)
    plot(hc)

• Cut the gene dendrogram into k (e.g. 5) clusters, select one of these (of a reasonable size), say group m, and write the genes it contains to a file as follows:

    groups <- cutree(hc,k)
    write.table(genes[groups==m], file = "hc.txt",quote = FALSE, row.names = FALSE,col.names = FALSE)

  where you replace k and m by your chosen values. Note that hc must hold the gene clustering at this point, so recompute it if you have just clustered the experiments. Please save the file “hc.txt” in a safe place (such as your desktop), as you will need it for further analysis tomorrow.
• Optional: visualise data and clusterings as a heatmap, using heatmap().

3. K-means clustering

• Calculate the total within-scatter when using k=2…15 clusters:

    wss = c()
    for (i in 2:15) wss[i] <- kmeans(data,centers=i)$tot.withinss

  and plot it as a function of k:

    plot(1:15,wss)

  What do you think the optimal number of clusters is?
• Rerun the above commands a few times. Do you get the same results each time? Why (not)?
• Cluster the data into k (e.g. 5) clusters, select one of these (of a reasonable size), say m, and write the genes it contains to a file (also in a safe place):

    km <- kmeans(data,k)
    write.table(genes[km$cluster==m], file = "km.txt",quote = FALSE, row.names = FALSE,col.names = FALSE)

  where you replace k and m by the numbers you chose.
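
The within-scatter curve from the K-means exercise can also be tried on made-up data before running it on the corn dataset. The following is a minimal sketch, not part of the original exercise. The matrix toy and the vector ks are hypothetical names, the data are simulated rather than taken from WholeSeeds.rdata, and the nstart argument and the call to table() are additions assumed here for convenience.

```r
# Synthetic stand-in for the expression matrix (NOT the corn data):
# three groups of 50 "genes" with different mean levels across 32 "samples".
set.seed(1)  # fixes the simulated numbers so the sketch can be repeated
toy <- rbind(
  matrix(rnorm(50 * 32, mean = 0), nrow = 50),
  matrix(rnorm(50 * 32, mean = 4), nrow = 50),
  matrix(rnorm(50 * 32, mean = 8), nrow = 50)
)

# Total within-cluster sum of squares for k = 2..15.
# nstart = 10 (an assumption, not in the exercise) runs K-means from
# 10 initialisations per k and keeps the best of them.
ks  <- 2:15
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# Plot the curve against the actual k values.
plot(ks, wss, type = "b",
     xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")

# Cluster sizes for one choice of k, to help pick a cluster of reasonable size.
km <- kmeans(toy, centers = 3, nstart = 10)
table(km$cluster)
```

On the real data, toy would be replaced by data. Storing the results in a vector aligned with ks avoids the empty first entry that appears when wss is filled from index 2 onwards, and table(km$cluster) gives the size of each cluster before one is written to a file.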