Genomic Data Manipulation
BIO508 Spring 2016
Problems 07
Gene Expression
1.
We're going to begin a set of gene expression analysis tasks with something unprecedented - actually using a
problem from the textbook! Well, almost; perhaps a problem inspired by the textbook. Their terse version,
which I find to be a bit incomplete, is problem 11-2, which you can look at for guidance, but we're going to go
into a little more depth here so as to actually produce some useful results.
NB: Yet again, please submit a .zip or .tar.gz file named either problems07.zip or
problems07.tar.gz, containing each of the *files_starred_like_this.txt* and one
problems07.doc, problems07.docx, problems07.txt, or similar file answering any written questions
*starred like this*.
What I like most about the problem is its dataset, so let's get it and do some analysis on it by ourselves. Head
to NCBI GEO at:
http://www.ncbi.nlm.nih.gov/geo/
Find and download the "DataSet SOFT file" (not the "full SOFT file") for dataset ID GDS1761. While you're at
it, also download and install MeV from:
http://sourceforge.net/projects/mev-tm4/files/latest/download
a.
(1) *How many conditions (individual microarray samples) are in this dataset?*
b.
(1) *What platform was used for this microarray dataset? How many channels does it have?*
c.
(0) Unzip the SOFT file and launch MeV. Go to File/Load Data. If you take a look at the "Select File
Loader" menu and GEO submenu, you'll see that MeV offers to directly load GDS files for us. Great!
Select that, click Browse, and open your GDS1761.soft file. Check "Spotted DNA/cDNA Array", then
click Load.
d. (0) Congratulations - you can now see a bunch of red and green spots labeled with inscrutable numbers.
To start with, I'd suggest selecting Display/"Set Element Size"/"10 x 10" for sanity's sake. Wouldn't it be
nice if MeV told us what the names of the rows and columns were? Well, we can fix the former - from the
Display menu, select "Gene/Row Labels" and "Label by IDENTIFIER". Ok, we've got gene names... and if
you try to do the same thing using Display/"Sample/Column Labels"/"Select Sample Label", you'll see
you're out of luck. SOFT files by default contain a bunch of useful information about every condition's
metadata, which MeV sadly discards when it opens them. Tell John Quackenbush to support this feature
directly!
e.
(3) Fret not - in the meantime, we can take advantage of MeV's ability to annotate samples with metadata
using a variety of methods, including a user-provided tab-delimited text file. Let's generate this using a
bit of Python magic. We're going to create a file with two columns, the first containing the sample IDs and
the second the cell line types (which are the only metadata included in this particular SOFT). Let's write
(and submit) a script named *soft2annotation.py* that will read the SUBSET definitions from the
SOFT and create our tab-delimited annotation file from them. As usual, connect the dots:
#!/usr/bin/env python
import __
import ___

strDesc = astrSamples = None
hashhashAnnotations = __
for strLine in ___._____:
    mtch = re.______( r'^!subset_(\S+)\s*=\s*(.+)$', _______ )
    if ____:
        strType, strValue = ____.groups( )
        if _______ == "description":
            strDesc = ________
        elif _______ == "sample_id":
            astrSamples = ________.split( "," )
        elif _______ == "type":
            hashType = hashhashAnnotations.setdefault( strValue, __ )
            for strSample in astrSamples:
                hashType[strSample] = strDesc
    ____ = re.______( r'^ID_REF', _______ )
    if ____:
        # list( ) keeps this working on both Python 2 and Python 3
        astrLine = ["Sample"] + list( hashhashAnnotations.keys( ) )
        print( ____.join( astrLine ) )
        astrLine = strLine._____( )._____( "\t" )
        for strSample in astrLine[2:]:
            astrLine = [strSample]
            for strType, hashType in hashhashAnnotations._____( ):
                astrLine.______( hashType.___( strSample, "" ) )
            print( "\t".____( ________ ) )
        break
f.
(2) Generate and submit your *GDS1761_annotations.txt* file by running the command
soft2annotation.py < GDS1761.soft > GDS1761_annotations.txt. Attach these to your
data in MeV using Utilities/"Append Sample Annotation" and selecting this file. You can now do two
nifty things; first, choose Display/"Sample/Column Labels"/"Select Sample Label"/"Label by cell line" to
get some meaningful IDs on these columns. Then, go to Utilities/"Cluster Utilities"/"Automatic Cluster
Import"/"By Sample Annotation" and add "cell line" to the right panel before clicking OK. You'll see why
in just a second...
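Before handing the annotation file to MeV, it can help to confirm that it really is rectangular. Here is a minimal sketch of such a check, assuming the layout described in 1e (a header row beginning with "Sample", then one tab-delimited row per sample). The function name check_annotations and the two-sample text are made up for illustration and are not part of MeV or of the assignment.

```python
import csv
import io


def check_annotations(handle):
    """Return (header, rows) after verifying every row matches the header width."""
    rows = list(csv.reader(handle, delimiter="\t"))
    if not rows:
        raise ValueError("annotation file is empty")
    header, body = rows[0], rows[1:]
    for number, row in enumerate(body, start=2):
        if len(row) != len(header):
            raise ValueError("line %d has %d columns, expected %d"
                             % (number, len(row), len(header)))
    return header, body


# Made-up two-sample example; real sample IDs come from your own SOFT file.
example = "Sample\tcell line\nGSM_A\tbreast\nGSM_B\tcolon\n"
header, body = check_annotations(io.StringIO(example))
print(header, len(body))
```

To run it on a real file, pass an open file handle in place of the io.StringIO object.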
g.
(0) Let's perform some unsupervised analyses using these data. First, to make everything a bit speedier,
let's filter out most of the genes. Select "Adjust Data"/"Data Filters"/"Variance Filter". Tell MeV to retain
only the 1,000 highest variance genes. Note that this isn't really good microarray analysis practice, but it'll
make your homework go a lot faster, so bear with me. After clicking OK, find "Data Filter - Variance
Filter" under "Analysis Results" in the left panel, click the little hook to expand it, right click "Expression
Image", and select "Set as Data Source".
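For intuition about what a variance filter does, here is a rough sketch in Python. It assumes a complete matrix with no missing values and uses population variance; MeV's exact variance calculation and handling of missing data may differ. The gene names and numbers are hypothetical.

```python
from statistics import pvariance


def top_variance_rows(matrix, keep):
    """Return the `keep` (name, values) rows with the largest variance."""
    ranked = sorted(matrix.items(),
                    key=lambda item: pvariance(item[1]),
                    reverse=True)
    return ranked[:keep]


# Hypothetical log-ratio rows keyed by gene name.
toy = {
    "geneA": [0.1, 0.0, -0.1, 0.0],
    "geneB": [2.0, -2.0, 1.5, -1.5],
    "geneC": [0.5, -0.5, 0.5, -0.5],
}
kept = [name for name, _ in top_variance_rows(toy, 2)]
print(kept)
```

The nearly flat geneA is the row dropped here, which is the whole point of the filter: rows that barely change across conditions contribute little to clustering.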
h. (1) The first thing you should do is look at your data! Let's do that by a simple clustering. From
"Clustering", select "Hierarchical Clustering". Leave both "Gene Tree" and "Sample Tree" checked, also
check both "Optimize ... Order" boxes, and choose a distance metric and linkage method. *Which metric
and method did you choose? Why?*
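To see how two common distance metrics can disagree, here is a small sketch comparing Euclidean distance with a Pearson-correlation distance on two made-up profiles. The function names are illustrative, and MeV's own metric definitions may differ in detail (for example, in scaling).

```python
import math


def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


low = [1.0, 2.0, 3.0, 4.0]
high = [11.0, 12.0, 13.0, 14.0]      # same shape, shifted up by 10
print(euclidean(low, high))          # large: the magnitudes differ
print(pearson_distance(low, high))   # about zero: the shapes match
```

The two profiles are far apart by one measure and essentially identical by the other, so the choice of metric determines what "similar" means in the resulting tree.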
i.
(2) Click OK to run the clustering. Wait a few seconds, then expand the HCL results in the left pane.
Click on the tree. Nifty! Those pretty colors are one of the reasons we attached those sample IDs, by the
way, and we'll see more later. For now, save and submit the expression image using File/"Save Image";
make sure to select type PNG and call the file *gds1761_hierarchical.png*.
j.
(1) Now scroll around inside of MeV and *tell me something interesting about these data.* This will
likely be easier if you zoom in and out using "Set Element Size" as needed.
k.
(1) The figure of merit is one of several measures of cluster quality that use intra-cluster versus
inter-cluster variation to decide where the cost/benefit ratio of more clusters stops helping you organize your
data. Run FOM on these samples in order to perform class discovery using Clustering/"Figure of Merit".
Submit this image as *fom.png*.
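One common formulation of the figure of merit leaves out one condition, clusters on the remaining conditions, and then asks how tightly the left-out values group within those clusters. The sketch below covers only that final scoring step, given cluster labels that are assumed to have been computed elsewhere. It illustrates the general idea and is not a description of MeV's implementation; the function name and numbers are made up.

```python
import math
from collections import defaultdict


def fom_left_out(values, labels):
    """Root-mean-square deviation of left-out values from their cluster means.

    values[i] is item i's measurement in the condition that was left out;
    labels[i] is the cluster item i was assigned to using the other conditions.
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    total = 0.0
    for members in groups.values():
        mean = sum(members) / len(members)
        total += sum((v - mean) ** 2 for v in members)
    return math.sqrt(total / len(values))


held_out = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
one_cluster = fom_left_out(held_out, [0, 0, 0, 0, 0, 0])
two_clusters = fom_left_out(held_out, [0, 0, 0, 1, 1, 1])
print(one_cluster, two_clusters)
```

A lower score means the clusters predict the left-out condition better. Plotting the score against the number of clusters gives the kind of curve you are asked to interpret.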
l.
(3) You want sample clusters using medians, and the rest of the default settings are fine. Looking at the
resulting FOM graph in the left panel, *how many clusters of samples appear to be supported by these
data? Make sure to explain your reasoning.*
m. (2) Let's actually make this many clusters using k-means, from Clustering/"k-Means/Medians
Clustering". Cluster the samples using the k you just chose, your favorite distance metric, and medians
rather than means. *Which distance metric did you choose and why? Why are we using medians rather
than means?*
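For reference, the k-medians procedure itself is short. Here is a bare-bones sketch, assuming Manhattan distance and a naive "first k points" initialization chosen purely for brevity; MeV's initialization, metric options, and stopping rule are not reproduced here, and the points are made up.

```python
from statistics import median


def k_medians(points, k, iterations=20):
    """Alternate between assigning points and recomputing per-column medians."""
    centers = [list(p) for p in points[:k]]   # naive deterministic start
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to the nearest center (Manhattan distance).
        labels = [
            min(range(k),
                key=lambda c: sum(abs(a - b) for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the column-wise median of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [median(col) for col in zip(*members)]
    return labels, centers


toy = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [0.1, 0.2], [4.9, 5.2]]
labels, centers = k_medians(toy, 2)
print(labels)
```

With real data the starting centers matter, so results can change from run to run when the initialization is random.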
n. (2) Take a look at the results in the left panel using the "Expression Images". *Which tumor types do
these clusters appear to correspond with? Does k-means do a particularly good or bad job organizing
these samples into clusters that correspond to known tumor types?*
o.
(2) Under "Expression Graphs", select "All Clusters" and File/"Save Image" as
*gds1761_kmeans.png*.
p. (4) Use Classification/"KNN" to classify the single unknown tumor sample in these data. You
should use eight classes and bin the ovarian and prostate tumors together (default settings are fine for
classifying these samples). *Into which tumor type class does the unknown sample fall using this
method?*
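The idea behind k-nearest-neighbor classification fits in a few lines. The following is a minimal sketch, assuming Euclidean distance and a simple majority vote; the function name, class names, and two-gene profiles are hypothetical, and MeV's KNN module has additional options not shown here.

```python
import math
from collections import Counter


def knn_classify(query, labeled, k=3):
    """Label `query` by majority vote among its k nearest labeled profiles."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = sorted(labeled, key=lambda pair: distance(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-gene profiles with made-up class names.
training = [
    ([0.0, 0.1], "classA"), ([0.2, 0.0], "classA"), ([0.1, 0.2], "classA"),
    ([3.0, 3.1], "classB"), ([3.2, 2.9], "classB"), ([2.9, 3.0], "classB"),
]
print(knn_classify([0.3, 0.3], training))
```

Note that this sketch does nothing special about tied votes, which a real analysis would need to handle deliberately.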
2.
A great deal of microarray bioinformatics is performed using R and Bioconductor, which also provide a
wide array of excellent methods for genomic data manipulation in general. I won't ask you to learn R just for
this problem, however, and will instead provide a pre-processed file appropriate for use with MeV,
gcrma.pcl, in the data at:
http://huttenhower.sph.harvard.edu/moodle/mod/resource/view.php?id=1106
PCL files containing well-normalized sets of diverse microarray conditions can be created without too many
tears from raw CEL (and SOFT/other) files from GEO or other databases using R. In particular, see the book's
problem 11-1 for a guide to how this file was generated from the human cerebellum expression data at
http://bioinfbook.org/php/C11E3k (GEO dataset GSE9762).
a.
(0) First, I've been utterly unable to get MeV to correctly load annotations for this file using "Append
Sample Annotation" like we did before (another one to tell John Q!), so we're going to edit it very slightly
by hand first. Open it up in your favorite text editor (jEdit?) and, just below the first row of headers,
copy/paste in the four lines of text from the gcrma_annotations.txt file provided with the data
above.
b.
(0) Now, open this up in MeV (you can close the existing window and create a new one using File/"New
Multiple Array Viewer" on the main MeV window bar at the top). When you File/"Load Data" this time,
use the standard tab-delimited expression file loader. Browse to your PCL file, uncheck "Load
Annotation" (which is for genes whose IDs haven't been converted yet - R took care of that for this file),
and click OK. NB: Since you've left "Single-color Array" checked, MeV will automatically do some magic
to normalize the rows of this matrix. You don't want this to happen for any data other than a legitimate
log-transformed single-channel array. In most other cases, you'd want to select "Two-color array" instead,
even if your data isn't a two-channel array per se (it just keeps MeV from messing with it). In this case...
carry on.
c.
(0) Ok, pretty colors again. Before proceeding and changing the data source, also go to "Adjust
Data"/"Gene/Row Adjustments"/"Normalize Genes/Rows" since we're processing a single channel array.
Rinse and repeat essentially as above: filter to the top 2,000 variance genes this time, select the data
source, and normalize again if necessary. Wait - why did everything just change colors? Normalizing the
rows z-scores them, so we've taken values that were all positive and made them range in a normal
distribution around zero. To view this appropriately, go to Display/"Set Color Scale Limits" and make the
lower, midpoint, and upper values -3, 0, and 3, respectively. Aha - much prettier!
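Row z-scoring is simple enough to sketch directly. The example below assumes population standard deviation; whether MeV divides by n or n-1 is not something this sketch claims to match. The input row is made up.

```python
from statistics import mean, pstdev


def zscore_row(values):
    """Center a row on zero and scale it to unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]   # a flat row carries no signal
    return [(v - mu) / sigma for v in values]


row = [6.0, 8.0, 10.0, 12.0]           # all positive, like log intensities
scaled = zscore_row(row)
print(scaled)
```

The all-positive input comes out centered on zero with both negative and positive values, which is why a symmetric color scale such as -3, 0, 3 suits the normalized data.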
d. (3) To show you've gotten this far, first import clusters again as in problem 1f using the "Class" sample
annotation. Then hierarchically cluster and save the image as *gcrma_hierarchical.png*.
e.
(2) We have a clear two-factor design in this experiment - two tissue types, two phenotypes - which will
allow us to do a nice supervised analysis to see which genes distinguish tissues and phenotypes. It should
be pretty clear from glancing at the hierarchical clustering which of the two factors has the greater effect
size, but let's quickly confirm this first using PCA. Select "Data Reduction"/"Principal Component
Analysis", cluster the samples using the default settings, and look at the projections on the 2D
components 1,2. First, right-click the 1,2 view, enable "Show sample names" and "Larger point size", and
save this image as *gcrma_pc12.png*.
f.
(2) Looking at the labels and keeping in mind that the horizontal axis is the first principal component,
that is, the axis of greatest variation, *which of the two experimental factors has by far the biggest effect
on gene expression?*
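To make "the axis of greatest variation" concrete, here is a small pure-Python sketch that finds the first principal component by power iteration on the covariance matrix. It is an illustration under simplifying assumptions (tiny dense data, a fixed arbitrary starting vector, no convergence check) and is not how MeV computes PCA. The four two-gene samples are made up.

```python
import math


def first_principal_component(samples, iterations=200):
    """Return (direction, scores) for the leading principal component.

    samples is a list of equal-length rows, one row per sample.
    """
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    # Covariance matrix of the columns (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary non-zero start; it must not be orthogonal to the answer.
    vec = [float(j + 1) for j in range(d)]
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break                      # no variation at all in the data
        vec = [v / norm for v in nxt]
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
    return vec, scores


# Hypothetical samples: two groups that differ mostly in the first gene.
toy = [[1.0, 0.1], [1.2, -0.1], [5.0, 0.0], [5.2, 0.1]]
direction, scores = first_principal_component(toy)
print(direction, scores)
```

The scores are the horizontal coordinates in a 1,2 projection plot: samples from the same group land on the same side of zero.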
g.
(2) *Do you notice anything unusual about one sample? If so, what do you think might be causing
it?*
h. (0) PCA is unsupervised - it just tells us which samples are varying, not why in terms of significant genes.
Let's use our prior knowledge about sample types. We'll run two different tests that both take the two
factors into account; first, a simple ANOVA. Choose Statistics/"Two-factor ANOVA", call factor A
"Tissue" with two levels, factor B "Phenotype" with two levels, click Next, and under "Group
Assignment", use your "Class" groups to bin the samples into tissues on the left (1/2/1/2) and phenotypes
on the right (1/1/2/2). Never trust parametric distributions when you have a choice, so select "p-values
based on permutation", and check "Construct Hierarchical Trees for" so we can get pretty pictures. Leave
the rest of the settings on their defaults and click OK.
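The idea behind "p-values based on permutation" is easiest to see on a much simpler case than the two-factor design used here. The sketch below runs an exact permutation test for a difference in means between two groups: it tries every possible relabeling and counts how often the gap is at least as large as the observed one. This is a deliberately simplified, one-factor illustration, not MeV's two-factor ANOVA procedure, and the values are hypothetical.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = group_a + group_b
    n_a = len(group_a)

    def mean_gap(indices):
        chosen = [pooled[i] for i in indices]
        rest = [pooled[i] for i in range(len(pooled)) if i not in indices]
        return abs(sum(chosen) / len(chosen) - sum(rest) / len(rest))

    observed = mean_gap(tuple(range(n_a)))
    splits = list(combinations(range(len(pooled)), n_a))
    # Small tolerance so floating-point noise does not hide exact ties.
    extreme = sum(1 for split in splits if mean_gap(split) >= observed - 1e-12)
    return extreme / len(splits)


# Hypothetical expression values for one gene in two groups of four samples.
p = permutation_p_value([1.0, 2.0, 3.0, 4.0], [10.0, 11.0, 12.0, 13.0])
print(p)
```

No distribution is assumed anywhere: the null distribution is built from the data themselves, at the price of more computation (which is why the real analysis takes a minute or so).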
i.
(6) Grab some coffee - this takes a minute or so - and then look at the three "Hierarchical Trees" results.
Save these three images as *gcrma_anova-tissue.png*, *gcrma_anova-phenotype.png*, and
*gcrma_anova-interaction.png*.
j.
(2) *Explain in plain English what ANOVA tests to determine which genes are significant for each of
these three sets, particularly the interaction set.*
k.
(3) *Tell me something cool about the genes in the "phenotype" or "interaction" sets.* This one's
repeatable, three points per legitimate factoid.
l.
( ) If you happen to have R installed alongside MeV, you can use a slightly fancier model for
differential expression called limma, launched using Statistics/"Linear Models for Microarray Data". This
is again a two-factor design, with factor A called "Tissue" with two levels and factor B called "Phenotype"
with two levels, and you should enable "Construct Hierarchical Trees for" and hit Continue. Assign your
groups again (same deal, 1/2/1/2 and 1/1/2/2) and hit OK. limma's much speedier, so check out the
hierarchical trees again. Wait, where'd they go!? Check out the "Expression Images": limma only provides
hierarchical trees for genes significant in exactly one test, and all of ours show an interaction. This means
they appear in the "Significant Genes" image, which isn't hierarchically clustered by default. So...
m. (4) ...right click on it, choose "Set as data source", and manually run hierarchical clustering from
Clustering. Select the resulting image and save it as *gcrma_limma.png*. Note that it contains
essentially all the same types of tissue/phenotype interactions as we found using ANOVA, just organized
differently because limma's, well, different.
n. (2) *Why shouldn't we use SAM (as implemented in MeV) for these data?*
o.
(8) GSEA is a great method for digesting these long lists of genes that are differential by tissue or
phenotype down into pathways with specific names. For the life of me, I can't get it to work in MeV. If
you can, or if you take the time to run it either through the GenePattern web site or through the
downloadable GSEA tool, *let me know and submit the output files.* I'm very curious what it says
about these data!