Genomic Data Manipulation
BIO508 Spring 2014
Problems 01
Quantitative methods

1. (0) If you learn one thing from this class, it should be how to use a computer as a research tool. According to Bill Bialek of Princeton, the best way to do research is to ask someone who already knows; for almost every topic we'll cover, Google already knows. If you see a word you don't recognize, look it up. Wikipedia is surprisingly accurate for statistics and computer science, and it's worth learning to use PubMed efficiently.

2. (4) Find a publication in a journal with impact factor >3.0 that uses both descriptive and inferential statistical methods to interpret biological data. The paper should focus primarily on biology, not methodology. Provide the citation and one short paragraph highlighting A) what you consider the most important points made using quantitative methods, B) a description of what those methods entailed, and C) why you like (or dislike) the paper.

3. Download one of the gene expression datasets available at:

   http://growthrate.princeton.edu

   There are several included in the publications listed on the site, formatted as tab-delimited text files or PCL files (see https://genome.unc.edu/MicroArray/help/formats.shtml#pcl), both of which can be opened as plain text using Excel or OpenOffice/LibreOffice (either one of which you can use to complete this problem). Each column represents genome-wide transcript abundances for a single microarray (experimental condition). Please note that this is not the case for the "raw" data - you can choose any of the files that look like this (see #6 for a specific suggestion):

   a. (1) Specify which dataset you chose and how many conditions it contains.
   b. (2) How many genes (rows) in each condition are further than two standard deviations from the column mean?
   c. (3) How many are above the upper inner fence or below the LIF?
   d. (3 ) Using means, medians, standard deviations, and/or upper/lower fences, can you devise a quick-and-dirty test for whether each column is approximately normally distributed? How many columns aren't?

4. (4) Using the data you downloaded above, generate any three of a bar chart, histogram, density plot, cumulative distribution plot, stripchart, box plot, or scatter plot. You can use any software you'd like - possibilities include Excel, OpenOffice/LibreOffice, Python, R, Matlab, Octave, Stata, SPSS, Gnuplot, Matplotlib, Scilab, GraphPad, Gnumeric, Numbers, SciDAVis, Orange, RapidMiner, MeV, or about a half dozen others I'm sure I haven't thought of. Most of those are free, so no worries if you don't like commercial software.

5. Differential expression (also known as biomarker discovery or class comparison) is one of the most abused microarray analyses, second only to clustering (more on that later in the course). Fortunately, every test for differential expression boils down at some level to a glorified t-test. So let's glorify some t-tests:

   a. (1) Using Excel, OpenOffice, or your software of choice, determine whether the first gene in your dataset is significantly differentially expressed between the first 1/3 of the conditions and the latter 2/3. Assume that the gene's expression values are normally distributed with equal variance, making a t-test appropriate. Specify which dataset you used, which gene you tested, what formula and parameters you used, and why. Note that this should require a single formula in a single cell to compute (i.e. if it seems complicated, you're doing something wrong).
   b. (2) Now determine how many genes in the genome are significantly upregulated in the first 1/3 of the conditions relative to the latter 2/3. Specify what formula and parameters you used, whether they were different from part c, if so why, and how many genes were significant at p=0.05.
   c. (1) How many genes are in this particular genome? How many would you expect to be differentially expressed by chance using this test and significance threshold?
   d. (2) Let's discard our normality assumption and test for differential expression nonparametrically. Go back to your single-gene test from part a. Copy-paste your data from Excel/OpenOffice (you may have to transpose it first using Edit/Paste Special) into the online Mann-Whitney Wilcoxon calculator at:

      http://www.fon.hum.uva.nl/Service/Statistics/Wilcoxon_Test.html

      What p-value for differential expression do you obtain? Under what circumstances might a t-test be more appropriate than a MWW, or vice versa?
   e. (3 ) Are the gene's expression values actually normally distributed? Do they actually have equal variance? How can you tell?

6. Suppose we're developing a test for the presence of a protein that induces nocturnal behavior in Drosophila, FBN1. There are two events of interest in our sample space:

      F - FBN1 is present
      D - Our test claims to detect FBN1

   The problem is that F and D are not identical - our test can produce false positives (be true when F is false) or false negatives (be false when F is true). Suppose that we know:

      P(F) = 0.0001 - the chance of a fly carrying FBN1 is 1 in 10000
      P(D|F) = 1 - there are no false negatives
      P(D|~F) = 1/20000 - the false positive rate is 1 in 20000

   What we care about when phenotyping flies, though, is P(F|D) - the probability of carrying FBN1 if the test claims it's there.

   Hint: P(X) = P(X∩Y) + P(X∩~Y) for any events X and Y.

   a. (2) Find P(F|D).
   b. (2) If you vary P(F) - that is, if you make it a little bigger or a little smaller - does P(F|D) change significantly?
   c. (1) Is this a good test?
   d. (2 ) Explain why the hint is true.
   e. (2 ) A lesson in Drosophila gene naming philosophy: what do you think FBN stands for?

7. (0) How long, excluding extra credit, did this assignment take to complete?

8. An oldie but goodie: welcome, welcome, welcome, to Let's Make A Deal! Tonight, for your viewing pleasure, consider the following problem: before you are three doors, 1, 2, and 3. Behind one of these three doors lies an all-expenses-paid minimum wage fellowship that might barely cover your tuition for a year or two, courtesy of the NSF. Behind the other two doors lies NOTHING! You are allowed to choose one door; before you open it, your resident Program Officer will open one of the other doors to demonstrate that it does not contain the fellowship. After this action, you are given the choice of sticking with your original choice or switching to the other (non-opened) door. Do you stay or do you go?

   a. (1) Suppose our experiment is "the fellowship is behind some door." Write down the sample space and the events A1, A2, and A3 to represent "the fellowship is behind door (some number)." Also write down the events D1 through D3 to represent "the PO opens door (some number)."
   b. (1) Assuming a fair game, what are P(Ai) for i=1, 2, 3?
   c. (1) The PO will never open the door that the fellowship is behind, and he will never open the door that you pick. If you choose door one, what is P(D2|A1)? P(D2|A2)? P(D2|A3)?
   d. (2) Still supposing you chose door one, what is P(A3|D2)? In English, this means, "I just chose door one. The PO opened door two. What is the probability that the fellowship is behind door three?" Or in other words, what is the probability that I will get the fellowship if I change my decision?
   e. (1) Should you change your mind? Will this be true no matter which door we choose first and which door is opened?
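
If you'd like to sanity-check your answers to problem 8 empirically once you've worked them out by hand, a small simulation is one option. The sketch below is not part of the assignment; the function names `play` and `win_rate` are made up for illustration, and it assumes that when the PO has two doors he could legally open, he picks between them uniformly at random (the problem statement doesn't say).

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant ends up with the fellowship."""
    doors = [1, 2, 3]
    prize = rng.choice(doors)  # fair game: fellowship is equally likely behind each door
    pick = rng.choice(doors)   # contestant's initial choice
    # The PO never opens the picked door or the prize door.
    # Assumption: if two doors qualify, he chooses uniformly at random.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Switch to the single door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the probability of winning under a fixed stay/switch strategy."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

Compare the two estimated rates to the probabilities you derived in parts d and e; they should agree to within sampling noise, and increasing `trials` tightens the estimate.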