Genomic Data Manipulation BIO508 Spring 2016
Problems 07 Gene Expression

1. We're going to begin a set of gene expression analysis tasks with something unprecedented - actually using a problem from the textbook! Well, almost; perhaps a problem inspired by the textbook. Their terse version, which I find to be a bit incomplete, is problem 11-2, which you can look at for guidance, but we're going to go into a little more depth here so as to actually produce some useful results.

NB: Yet again, please submit a .zip or .tar.gz file named either problems07.zip or problems07.tar.gz, containing each of the *files_starred_like_this.txt* and one problems07.doc, problems07.docx, problems07.txt, or similar file answering any written questions *starred like this*.

What I like most about the problem is its dataset, so let's get it and do some analysis on it by ourselves. Head to NCBI GEO at:

http://www.ncbi.nlm.nih.gov/geo/

Find and download the "DataSet SOFT file" (not the "full SOFT file") for dataset ID GDS1761. While you're at it, also download and install MeV from:

http://sourceforge.net/projects/mev-tm4/files/latest/download

a. (1) *How many conditions (individual microarray samples) are in this dataset?*

b. (1) *What platform was used for this microarray dataset? How many channels does it have?*

c. (0) Unzip the SOFT file and launch MeV. Go to File/Load Data. If you take a look at the "Select File Loader" menu and GEO submenu, you'll see that MeV offers to directly load GDS files for us. Great! Select that, click Browse, and open your GDS1761.soft file. Check "Spotted DNA/cDNA Array", then click Load.

d. (0) Congratulations - you can now see a bunch of red and green spots labeled with inscrutable numbers. To start with, I'd suggest selecting Display/"Set Element Size"/"10 x 10" for sanity's sake. Wouldn't it be nice if MeV told us what the names of the rows and columns were?
Well, we can fix the former - from the Display menu, select "Gene/Row Labels" and "Label by IDENTIFIER". Ok, we've got gene names... and if you try to do the same thing using Display/"Sample/Column Labels"/"Select Sample Label", you'll see you're out of luck. SOFT files by default contain a bunch of useful information about every condition's metadata, which MeV sadly discards when it opens them. Tell John Quackenbush to support this feature directly!

e. (3) Fret not - in the meantime, we can take advantage of MeV's ability to annotate samples with metadata using a variety of methods, including a user-provided tab-delimited text file. Let's generate this using a bit of Python magic. We're going to create a file with two columns, the first containing the sample IDs and the second the cell line types (which are the only metadata included in this particular SOFT). Let's write (and submit) a script named *soft2annotation.py* that will read the SUBSET definitions from the SOFT and create our tab-delimited annotation file from them. As usual, connect the dots:

    #!/usr/bin/env python

    import __
    import ___

    strDesc = astrSamples = None
    hashhashAnnotations = __
    for strLine in ___._____:
        # Subset definition lines look like "!subset_<key> = <value>"
        mtch = re.______( r'^!subset_(\S+)\s*=\s*(.+)$', _______ )
        if ____:
            strType, strValue = ____.groups( )
            if _______ == "description":
                strDesc = ________
            elif _______ == "sample_id":
                astrSamples = ________.split( "," )
            elif _______ == "type":
                hashType = hashhashAnnotations.setdefault( strValue, __ )
                for strSample in astrSamples:
                    hashType[strSample] = strDesc
        # The ID_REF line is the data table header listing the sample IDs
        ____ = re.______( r'^ID_REF', _______ )
        if ____:
            # list( ) so that the concatenation also works in Python 3
            astrLine = ["Sample"] + list( hashhashAnnotations.keys( ) )
            print( ____.join( astrLine ) )
            astrLine = strLine._____( )._____( "\t" )
            for strSample in astrLine[2:]:
                astrLine = [strSample]
                for strType, hashType in hashhashAnnotations._____( ):
                    astrLine.______( hashType.___( strSample, "" ) )
                print( "\t".____( ________ ) )
            break

f. (2) Generate and submit your *GDS1761_annotations.txt* file by running the command:

    soft2annotation.py < GDS1761.soft > GDS1761_annotations.txt

Attach these annotations to your data in MeV using Utilities/"Append Sample Annotation" and selecting this file. You can now do two nifty things; first, choose Display/"Sample/Column Labels"/"Select Sample Label"/"Label by cell line" to get some meaningful IDs on these columns. Then, go to Utilities/"Cluster Utilities"/"Automatic Cluster Import"/"By Sample Annotation" and add "cell line" to the right panel before clicking OK. You'll see why in just a second...

g. (0) Let's perform some unsupervised analyses using these data. First, to make everything a bit speedier, let's filter out most of the genes. Select "Adjust Data"/"Data Filters"/"Variance Filter". Tell MeV to retain only the 1,000 highest variance genes. Note that this isn't really good microarray analysis practice, but it'll make your homework go a lot faster, so bear with me. After clicking OK, find "Data Filter - Variance Filter" under "Analysis Results" in the left panel, click the little hook to expand it, right click "Expression Image", and select "Set as Data Source".

h. (1) The first thing you should do is look at your data! Let's do that by a simple clustering. From "Clustering", select "Hierarchical Clustering". Leave both "Gene Tree" and "Sample Tree" checked, also check both "Optimize ... Order" boxes, and choose a distance metric and linkage method. *Which metric and method did you choose? Why?*

i. (2) Click OK to run the clustering. Wait a few seconds, then expand the HCL results in the left pane. Click on the tree. Nifty! Those pretty colors are one of the reasons we attached those sample IDs, by the way, and we'll see more later. For now, save and submit the expression image using File/"Save Image"; make sure to select type PNG and call the file *gds1761_hierarchical.png*.

j. (1) Now scroll around inside of MeV and *tell me something interesting about these data.* This will likely be easier if you zoom in and out using "Set Element Size" as needed.

k. (1) The figure of merit is one of several measures of cluster quality that use intra-cluster versus inter-cluster variation to decide when adding more clusters stops helping you organize your data. Run FOM on these samples in order to perform class discovery using Clustering/"Figure of Merit". Submit this image as *fom.png*.

l. (3) You want sample clusters using medians, and the rest of the default settings are fine. Looking at the resulting FOM graph in the left panel, *how many clusters of samples appear to be supported by these data?* Make sure to explain your reasoning.

m. (2) Let's actually make this many clusters using k-means, from Clustering/"k-Means/Medians Clustering". Cluster the samples using the k you just chose, your favorite distance metric, and medians rather than means. *Which distance metric did you choose and why? Why are we using medians rather than means?*

n. (2) Take a look at the results in the left panel using the "Expression Images". *Which tumor types do these clusters appear to correspond with? Does k-means do a particularly good or bad job organizing these samples into clusters that correspond to known tumor types?*

o. (2) Under "Expression Graphs", select "All Clusters" and File/"Save Image" as *gds1761_kmeans.png*.

p. (4) Use Classification/"KNN" to classify the single unknown tumor sample in these data. You should use eight classes and bin the ovarian and prostate tumors together (default settings are fine for classifying these samples). *Into which tumor type class does the unknown sample fall using this method?*

2. A great deal of microarray bioinformatics is performed using R and Bioconductor, which also provide a wide array of excellent methods for genomic data manipulation in general.
I won't ask you to learn R just for this problem, however, and will instead provide a pre-processed file appropriate for use with MeV, gcrma.pcl, in the data at:

http://huttenhower.sph.harvard.edu/moodle/mod/resource/view.php?id=1106

PCL files containing well-normalized sets of diverse microarray conditions can be created without too many tears from raw CEL (and SOFT/other) files from GEO or other databases using R. In particular, see the book's problem 11-1 for a guide to how this file was generated from the human cerebellum expression data at http://bioinfbook.org/php/C11E3k (GEO dataset GSE9762).

a. (0) First, I've been utterly unable to get MeV to correctly load annotations for this file using "Append Sample Annotation" like we did before (another one to tell John Q!), so we're going to edit it very slightly by hand first. Open it up in your favorite text editor (jEdit?) and, just below the first row of headers, copy/paste in the four lines of text from the gcrma_annotations.txt file provided with the data above.

b. (0) Now, open this up in MeV (you can close the existing window and create a new one using File/"New Multiple Array Viewer" on the main MeV window bar at the top). When you File/"Load Data" this time, use the standard tab-delimited expression file loader. Browse to your PCL file, uncheck "Load Annotation" (which is for genes whose IDs haven't been converted yet - R took care of that for this file), and click OK. NB: Since you've left "Single-color Array" checked, MeV will automatically do some magic to normalize the rows of this matrix. You don't want this to happen for any data other than a legitimate log-transformed single-channel array. In most other cases, you'd want to select "Two-color array" instead, even if your data isn't a two-channel array per se (it just keeps MeV from messing with it). In this case... carry on.

c. (0) Ok, pretty colors again.
Before proceeding and changing the data source, also go to "Adjust Data"/"Gene/Row Adjustments"/"Normalize Genes/Rows" since we're processing a single channel array. Rinse and repeat essentially as above: filter to the top 2,000 variance genes this time, select the data source, and normalize again if necessary. Wait - why did everything just change colors? Normalizing the rows z-scores them, so we've taken values that were all positive and rescaled each row to have mean zero and standard deviation one. To view this appropriately, go to Display/"Set Color Scale Limits" and make the lower, midpoint, and upper values -3, 0, and 3, respectively. Aha - much prettier!

d. (3) To show you've gotten this far, first import clusters again as in problem 1f using the "Class" sample annotation. Then hierarchically cluster and save the image as *gcrma_hierarchical.png*.

e. (2) We have a clear two-factor design in this experiment - two tissue types, two phenotypes - which will allow us to do a nice supervised analysis to see which genes distinguish tissues and phenotypes. It should be pretty clear from glancing at the hierarchical clustering which of the two factors has the greater effect size, but let's quickly confirm this first using PCA. Select "Data Reduction"/"Principal Component Analysis", cluster the samples using the default settings, and look at the projections on the 2D components 1,2. First, right-click the 1,2 view, enable "Show sample names" and "Larger point size", and save this image as *gcrma_pc12.png*.

f. (2) Looking at the labels and keeping in mind that the horizontal axis is the first principal component, that is, the axis of greatest variation, *which of the two experimental factors has by far the biggest effect on gene expression?*

g. (2) *Do you notice anything unusual about one sample? If so, what do you think might be causing it?*

h. (0) PCA is unsupervised - it just tells us which samples are varying, not why in terms of significant genes.
Let's use our prior knowledge about sample types. We'll run two different tests that both take the two factors into account; first, a simple ANOVA. Choose Statistics/"Two-factor ANOVA", call factor A "Tissue" with two levels, factor B "Phenotype" with two levels, click Next, and under "Group Assignment", use your "Class" groups to bin the samples into tissues on the left (1/2/1/2) and phenotypes on the right (1/1/2/2). Never trust parametric distributions when you have a choice, so select "p-values based on permutation", and check "Construct Hierarchical Trees for" so we can get pretty pictures. Leave the rest of the settings on their defaults and click OK.

i. (6) Grab some coffee - this takes a minute or so - and then look at the three "Hierarchical Trees" results. Save these three images as *gcrma_anova-tissue.png*, *gcrma_anova-phenotype.png*, and *gcrma_anova-interaction.png*.

j. (2) *Explain in plain English what ANOVA tests to determine which genes are significant for each of these three sets, particularly the interaction set.*

k. (3) *Tell me something cool about the genes in the "phenotype" or "interaction" sets.* This one's repeatable, three points per legitimate factoid.

l. ( ) If you happen to have R installed alongside MeV, you can use a slightly fancier model for differential expression called limma, launched using Statistics/"Linear Models for Microarray Data". This is again a two-factor design, with factor A called "Tissue" with two levels and factor B called "Phenotype" with two levels, and you should enable "Construct Hierarchical Trees for" and hit Continue. Assign your groups again (same deal, 1/2/1/2 and 1/1/2/2) and hit OK. limma's much speedier, so check out the hierarchical trees again. Wait, where'd they go!? Check out the "Expression Images": limma only provides hierarchical trees for genes significant in exactly one test, and all of ours show an interaction.
This means they appear in the "Significant Genes" image, which isn't hierarchically clustered by default. So...

m. (4) ...right click on it, choose "Set as data source", and manually run hierarchical clustering from Clustering. Select the resulting image and save it as *gcrma_limma.png*. Note that it contains essentially all the same types of tissue/phenotype interactions as we found using ANOVA, just organized differently because limma's, well, different.

n. (2) *Why shouldn't we use SAM (as implemented in MeV) for these data?*

o. (8) GSEA is a great method for digesting these long lists of genes that are differential by tissue or phenotype down into pathways with specific names. For the life of me, I can't get it to work in MeV. If you can, or if you take the time to run it either through the GenePattern web site or through the downloadable GSEA tool, *let me know and submit the output files.* I'm very curious what it says about these data!
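
For the curious: the variance filter and row normalization from problem 2c can be sketched in a few lines of Python. This is only a rough illustration, not MeV's actual code. The function names and the toy data are made up, and I'm assuming the sample (n-1) variance and standard deviation, which may not match what MeV uses internally.

```python
import statistics

def top_variance_rows(rows, n):
    """Keep the n rows (genes) with the highest sample variance.

    rows maps a (hypothetical) gene name to its list of expression values.
    """
    ranked = sorted(rows, key=lambda name: statistics.variance(rows[name]),
                    reverse=True)
    return {name: rows[name] for name in ranked[:n]}

def zscore_row(values):
    """Rescale one row to mean zero and standard deviation one."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        # A perfectly flat row carries no information; report all zeros
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# Toy matrix: three made-up genes measured in four samples
toy = {
    "geneA": [8.0, 8.1, 7.9, 8.0],
    "geneB": [2.0, 6.0, 10.0, 14.0],
    "geneC": [5.0, 5.5, 4.5, 5.0],
}

filtered = top_variance_rows(toy, 2)
normalized = {name: zscore_row(vals) for name, vals in filtered.items()}

for name, vals in normalized.items():
    print(name + "\t" + "\t".join("%.2f" % v for v in vals))
```

On the toy data, geneA has the lowest variance and is dropped. Each surviving row ends up centered on zero with unit standard deviation, which is why a symmetric color scale such as -3 to 3 is a sensible choice after normalizing.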