Clustering
In these exercises, you will work with an RNA-seq dataset containing read counts
for nearly 40,000 genes, measured in corn (Zea mays) seeds 2, 4, 6, 8, 10, 12, 14,
16, 18, 20, 22 and 24 days after pollination (see clustering.zip on the
website). On each day, two or three replicates have been measured, yielding a
total of 32 data vectors. First, you will select a subset of the data and inspect it;
then cluster the data using hierarchical clustering and K-means clustering; and
finally inspect one of the clusters you found for enriched functional annotations.
1. Data selection and inspection
• Start RStudio.
• Load the data file WholeSeeds.rdata:
load("WholeSeeds.rdata")
and inspect the variables (Y, genes, dap, repl) in your environment
window. Which variable do you think contains which information?
• Select a number of interesting genes, e.g. 1,000:
# Calculate MAD for each gene
m = apply(Y,1,mad)
# Sort genes by decreasing MAD
ind = order(m,decreasing = TRUE,na.last = NA)
data = Y[ind[1:1000],]
genes = genes[ind[1:1000]]
Can you find out what definition of “interesting” is used here? (hint:
help(mad)).
• Inspect the data using image(data), image(log(data)),
image(cor(t(data))) and image(cor(data)). Note that
t() transposes the data matrix, allowing you to select genes or
experiments, respectively, as columns. What do you notice in the figures?
• Inspect some gene profiles, e.g. using (for gene g):
plot(dap[order(dap)],data[g,order(dap)],type='p')
• Optional: inspect the data (both genes and samples) using principal
component analysis (PCA, function prcomp(…)). Have a look at the help to
learn how to plot your data projected on the first two principal components.
• Optional: perform a linear discriminant analysis (LDA, function lda(…) in
the MASS library) and inspect the outcome. Use the transposed data and dap
as grouping variable.
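The selection and PCA steps above can be tried out on a small artificial dataset before running them on the real data. The sketch below is only an illustration under stated assumptions: the names Y_toy, dap_toy and data_toy, the matrix dimensions, the Poisson counts and the log(x + 1) transform are all made up for this example and are not part of WholeSeeds.rdata.

```r
# Toy stand-in for the lab data (an assumption, NOT the real WholeSeeds data):
# 200 genes x 12 samples, 4 time points with 3 replicates each.
set.seed(1)
dap_toy <- rep(c(2, 4, 6, 8), each = 3)
Y_toy <- matrix(rpois(200 * 12, lambda = 50), nrow = 200)
# Let the first 20 genes increase over time so there is some structure to find
Y_toy[1:20, ] <- Y_toy[1:20, ] + matrix(rep(40 * dap_toy, each = 20), nrow = 20)

# Keep the 50 genes with the largest MAD, mirroring the selection step
m_toy <- apply(Y_toy, 1, mad)
ind_toy <- order(m_toy, decreasing = TRUE, na.last = NA)
data_toy <- Y_toy[ind_toy[1:50], ]

# PCA on the samples: prcomp() expects observations in rows, hence t().
# The log(x + 1) transform is an illustrative choice to avoid log(0).
pca <- prcomp(t(log(data_toy + 1)))

# Plot the samples on the first two principal components, coloured by time point
plot(pca$x[, 1], pca$x[, 2], col = as.integer(factor(dap_toy)), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(factor(dap_toy)),
       col = seq_along(levels(factor(dap_toy))), pch = 19, title = "dap")
```

To look at genes instead of samples, drop the t() so that genes are the rows passed to prcomp().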
2. Hierarchical clustering
• Calculate the Euclidean distance matrix between the genes in the data:
D <- dist(data)
• Use this to calculate a hierarchical clustering with complete linkage:
hc <- hclust(D)
• Visualise the clustering and highlight 5 clusters:
plot(hc,labels = FALSE)
rect.hclust(hc,k=5,border = "red")
What do you think of the resulting clustering?
• Replace the Euclidean distance by a correlation-based distance, using
D <- as.dist(1-cor(t(data))) and repeat the hierarchical clustering
above. Do the clusterings improve?
• Vary the linkage (try “single” and “average”) and see whether the
dendrograms change:
hc <- hclust(D,method = "single")
plot(hc,labels = FALSE)
hc <- hclust(D,method = "average")
plot(hc,labels = FALSE)
Try to explain the results you get.
• Repeat the above to cluster experiments rather than genes:
D <- dist(t(data))
hc <- hclust(D)
plot(hc)
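• Cut the gene dendrogram into k (e.g. 5) clusters, select one of these (of a
reasonable size), say group m, and write the genes it contains to a file as
follows (hc must hold the clustering of the genes here, so recompute it first
if you have just clustered the experiments):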
groups <- cutree(hc,k)
write.table(genes[groups==m],
file = "hc.txt",quote = FALSE,
row.names = FALSE,col.names = FALSE)
where you replace k and m by your chosen values. Please save the file “hc.txt”
in a safe place (such as your desktop), as you will need it for further analysis
tomorrow.
• Optional: visualize data and clusterings as a heatmap, using heatmap().
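The full sequence for genes (distance matrix, linkage, cutting the tree, picking one cluster) can be rehearsed on artificial data where the expected grouping is known. This is a minimal sketch under assumptions: the matrix toy, the three expression patterns, the gene names and the choice of average linkage with k = 3 are all invented for illustration and say nothing about the real seed data.

```r
# Toy data (an assumption): 60 genes x 12 samples with three expression patterns
set.seed(2)
tp   <- rep(1:4, each = 3)                                    # hypothetical time points
up   <- matrix(rep(tp, each = 20), nrow = 20)                 # rising profile
down <- matrix(rep(5 - tp, each = 20), nrow = 20)             # falling profile
peak <- matrix(rep(c(1, 4, 4, 1)[tp], each = 20), nrow = 20)  # transient peak
toy  <- rbind(up, down, peak) + matrix(rnorm(60 * 12, sd = 0.3), nrow = 60)
rownames(toy) <- paste0("gene", 1:60)

# Correlation-based distance between genes (rows), then average linkage
D_toy  <- as.dist(1 - cor(t(toy)))
hc_toy <- hclust(D_toy, method = "average")

# Cut the tree into k clusters and check the cluster sizes before picking one
k <- 3
groups_toy <- cutree(hc_toy, k = k)
print(table(groups_toy))

# List the genes that share a cluster with gene1
m <- groups_toy["gene1"]
print(names(groups_toy)[groups_toy == m])
```

Printing table(groups_toy) first is a quick way to judge which clusters are "of a reasonable size" before writing one of them to a file.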
3. K-means clustering
• Calculate the total within-cluster scatter when using k=2…15 clusters:
wss = c()
for (i in 2:15)
  wss[i] <- kmeans(data,centers=i)$tot.withinss
and plot it as a function of k:
plot(1:15,wss)
What do you think the optimal number of clusters is?
• Rerun the above commands a few times. Do you get the same results each
time? Why (not)?
• Cluster the data into k (e.g. 5) clusters, select one of these (of a reasonable
size), say m, and write the genes it contains to a file (also in a safe place):
km <- kmeans(data,k)
write.table(genes[km$cluster==m],
file = "km.txt",quote = FALSE,
row.names = FALSE,col.names = FALSE)
where you replace k and m by the numbers you chose.
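The within-scatter curve is easier to interpret once you have seen it on data with a known number of groups. The following sketch assumes an artificial two-dimensional dataset with three groups; the names toy, centres and ks, the fixed seed and the use of nstart = 10 are illustrative choices, not part of the exercise.

```r
# Toy data (an assumption): three well-separated groups of 50 points in 2-D
set.seed(3)
centres <- rbind(c(0, 0), c(6, 0), c(0, 6))
toy <- do.call(rbind, lapply(1:3, function(j)
  matrix(rnorm(100, sd = 0.5), ncol = 2) +
    matrix(centres[j, ], nrow = 50, ncol = 2, byrow = TRUE)))

# Total within-cluster sum of squares for k = 1..8.
# nstart = 10 runs k-means from 10 random starts and keeps the best solution.
ks  <- 1:8
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# The curve is expected to flatten after the true number of groups
plot(ks, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
```

Fixing the seed with set.seed() makes a run repeatable, and a larger nstart makes the result less dependent on any single starting configuration.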