Genomic Data Manipulation
BIO508 Spring 2014
Problems 01
Quantitative methods
1.
(0) If you learn one thing from this class, it should be how to use a computer as a research tool. According to
Bill Bialek of Princeton, the best way to do research is to ask someone who already knows; for almost every
topic we'll cover, Google already knows. If you see a word you don't recognize, look it up. Wikipedia is
surprisingly accurate for statistics and computer science, and it's worth learning to use PubMed efficiently.
2.
(4) Find a publication in a journal with impact factor >3.0 that uses both descriptive and inferential statistical
methods to interpret biological data. The paper should focus primarily on biology, not methodology. Provide
the citation and one short paragraph highlighting A) what you consider the most important points made
using quantitative methods, B) a description of what those methods entailed, and C) why you like (or dislike)
the paper.
3.
Download one of the gene expression datasets available at:
http://growthrate.princeton.edu
There are several included in the publications listed on the site, formatted as tab-delimited text files or PCL
files (see https://genome.unc.edu/MicroArray/help/formats.shtml#pcl), both of which can be opened as plain
text using Excel or OpenOffice/LibreOffice (either one of which you can use to complete this problem). Each
column represents genome-wide transcript abundances for a single microarray (experimental condition).
Please note that this is not the case for the "raw" data - you can choose any of the files that look like this (see #6
for a specific suggestion):
a. (1) Specify which dataset you chose and how many conditions it contains.
b. (2) How many genes (rows) in each condition are further than two standard deviations from the column
mean?
c. (3) How many are above the upper inner fence or below the lower inner fence (LIF)?
d. (3 ) Using means, medians, standard deviations, and/or upper/lower fences, can you devise a quick-and-dirty
test for whether each column is approximately normally distributed? How many columns aren't?
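For anyone working outside a spreadsheet, the two counts in parts b and c can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the column is synthetic random data standing in for one real condition, the helper names are made up for this example, the inner fences are taken as Q1 - 1.5*IQR and Q3 + 1.5*IQR, and quartiles use the default method of `statistics.quantiles`, which may differ slightly from Excel's. Missing values, which real expression files often contain, would need to be dropped first.

```python
import random
import statistics


def count_beyond_two_sd(values):
    """Count values more than two standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return sum(1 for v in values if abs(v - mean) > 2 * sd)


def inner_fences(values):
    """Return (lower, upper) inner fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def count_outside_fences(values):
    """Count values below the lower inner fence or above the upper one."""
    lower, upper = inner_fences(values)
    return sum(1 for v in values if v < lower or v > upper)


if __name__ == "__main__":
    # Synthetic stand-in for one microarray column (assumption, not real data).
    rng = random.Random(0)
    column = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    print("beyond 2 SD:", count_beyond_two_sd(column))
    print("outside inner fences:", count_outside_fences(column))
```

Comparing these counts against what a normal distribution would predict is one possible starting point for part d, though other approaches are equally reasonable.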
4.
(4) Using the data you downloaded above, generate any three of a bar chart, histogram, density plot,
cumulative distribution plot, stripchart, box plot, or scatter plot. You can use any software you'd like; possibilities include Excel, OpenOffice/LibreOffice, Python, R, Matlab, Octave, Stata, SPSS, Gnuplot,
Matplotlib, Scilab, GraphPad, Gnumeric, Numbers, SciDAVis, Orange, RapidMiner, MeV, or about a half
dozen others I'm sure I haven't thought of. Most of those are free, so no worries if you don't like commercial
software.
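Whatever package draws the picture, a histogram and a cumulative distribution plot both reduce to simple summaries of one column. The sketch below shows those summaries in plain Python, assuming a list of numeric values with no missing entries; the function names and the equal-width binning rule are choices made for this illustration, not something the assignment prescribes.

```python
def histogram_counts(values, n_bins=10):
    """Return (bin_edges, counts) for equal-width bins spanning the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than past it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


def empirical_cdf(values):
    """Pair each sorted value with the cumulative fraction of data up to that position."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


if __name__ == "__main__":
    data = [0.1, 0.4, 0.5, 0.9, 1.3, 1.4, 2.2, 2.8, 3.0, 4.1]  # made-up values
    edges, counts = histogram_counts(data, n_bins=4)
    for left, count in zip(edges, counts):
        print(f"{left:5.2f} | {'#' * count}")  # crude text histogram
    for value, fraction in empirical_cdf(data):
        print(f"{value:5.2f} {fraction:.2f}")
```

The bin edges and counts can be handed to any plotting tool as a bar chart, and the sorted values against their cumulative fractions give the cumulative distribution plot.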
5.
Differential expression (also known as biomarker discovery or class comparison) is one of the most abused
microarray analyses, second only to clustering (more on that later in the course). Fortunately, every test for
differential expression boils down at some level to a glorified t-test. So let's glorify some t-tests:
a. (1) Using Excel, OpenOffice, or your software of choice, determine whether the first gene in your dataset
is significantly differentially expressed between the first 1/3 of the conditions and the latter 2/3. Assume
that the gene's expression values are normally distributed with equal variance, making a t-test
appropriate. Specify which dataset you used, which gene you tested, what formula and parameters you
used, and why. Note that this should require a single formula in a single cell to compute (i.e. if it seems
complicated, you're doing something wrong).
b. (2) Now determine how many genes in the genome are significantly upregulated in the first 1/3 of the
conditions relative to the latter 2/3. Specify what formula and parameters you used, whether they were
different from part a, if so why, and how many genes were significant at p=0.05.
c. (1) How many genes are in this particular genome? How many would you expect to be differentially
expressed by chance using this test and significance threshold?
d. (2) Let's discard our normality assumption and test for differential expression nonparametrically. Go
back to your single-gene test from part a. Copy-paste your data from Excel/OpenOffice (you may have to
transpose it first using Edit/Paste Special) into the online Mann-Whitney Wilcoxon calculator at:
http://www.fon.hum.uva.nl/Service/Statistics/Wilcoxon_Test.html
What p-value for differential expression do you obtain? Under what circumstances might a t-test be
more appropriate than a MWW, or vice versa?
e. (3 ) Are the gene's expression values actually normally distributed? Do they actually have equal
variance? How can you tell?
6.
Suppose we're developing a test for the presence of a protein that induces nocturnal behavior in Drosophila,
FBN1. There are two events of interest in our sample space:
F - FBN1 is present
D - Our test claims to detect FBN1
The problem is that F and D are not identical - our test can produce false positives (be true when F is false) or
false negatives (be false when F is true). Suppose that we know:
P(F) = 0.0001 - the chance of a fly carrying FBN1 is 1 in 10000
P(D|F) = 1 - there are no false negatives
P(D|~F) = 1/20000 - the false positive rate is 1 in 20000
What we care about when phenotyping flies, though, is P(F|D) - the probability of carrying FBN1 if the test
claims it's there. Hint: P(X) = P(X∩Y) + P(X∩~Y) for any events X and Y.
a. (2) Find P(F|D).
b. (2) If you vary P(F) - that is, if you make it a little bigger or a little smaller - does P(F|D) change
significantly?
c. (1) Is this a good test?
d. (2 ) Explain why the hint is true.
e. (2 ) A lesson in Drosophila gene naming philosophy: what do you think FBN stands for?
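The calculation this problem sets up combines Bayes' theorem with the hint, which expands P(D) into P(D∩F) + P(D∩~F). A minimal Python sketch of that combination follows. The function name and the numbers in the usage section are hypothetical and deliberately differ from the values given in the problem, so the code illustrates the mechanics without supplying the answer.

```python
def posterior(prior, p_pos_given_true, p_pos_given_false):
    """Return P(F|D) given P(F), P(D|F), and P(D|~F).

    P(D) is expanded with the law of total probability:
    P(D) = P(D|F) P(F) + P(D|~F) P(~F).
    """
    p_d = p_pos_given_true * prior + p_pos_given_false * (1.0 - prior)
    if p_d == 0:
        raise ValueError("P(D) is zero, so P(F|D) is undefined")
    return p_pos_given_true * prior / p_d


if __name__ == "__main__":
    # Hypothetical test characteristics, not the ones stated in the problem.
    sensitivity = 0.9
    false_positive_rate = 0.05
    for prior in (0.001, 0.01, 0.1):
        print(prior, posterior(prior, sensitivity, false_positive_rate))
```

Looping over several priors, as in the usage section, is one way to explore how sensitive the posterior is to P(F).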
7.
(0) How long, excluding extra credit, did this assignment take to complete?
8.
An oldie but goodie: welcome, welcome, welcome, to Let's Make A Deal! Tonight, for your viewing
pleasure, consider the following problem: before you are three doors, 1, 2, and 3. Behind one of these three
doors lies an all-expenses-paid minimum wage fellowship that might barely cover your tuition for a year or
two, courtesy of the NSF. Behind the other two doors lies NOTHING! You are allowed to choose one door;
before you open it, your resident Program Officer will open one of the other doors to demonstrate that it does
not contain the fellowship. After this action, you are given the choice of sticking with your original choice or
switching to the other (non-opened) door. Do you stay or do you go?
a. (1) Suppose our experiment is "the fellowship is behind some door." Write down the sample space and
the events A1, A2, and A3 to represent "the fellowship is behind door (some number)." Also write down
the events D1 through D3 to represent "the PO opens door (some number)."
b. (1) Assuming a fair game, what are P(Ai) for i=1, 2, 3?
c. (1) The PO will never open the door that the fellowship is behind, and he will never open the door that
you pick. If you choose door one, what is P(D2|A1)? P(D2|A2)? P(D2|A3)?
d. (2) Still supposing you chose door one, what is P(A3|D2)? In English, this means, "I just chose door one.
The PO opened door two. What is the probability that the fellowship is behind door three?" Or in other
words, what is the probability that I will get the fellowship if I change my decision?
e. (1) Should you change your mind? Will this be true no matter which door we choose first and which
door is opened?
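A quick simulation is a useful sanity check on whatever the conditional probabilities above turn out to be. The sketch below plays the game many times in Python. It assumes a fair game, and it assumes that when the PO has two doors available to open he picks between them uniformly at random, which the problem does not state explicitly. The function names are invented for this example.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant ends with the fellowship."""
    doors = [1, 2, 3]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # The PO opens a door that is neither the contestant's pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one remaining door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the probability of winning under a stay or switch strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

The two estimated win rates can be compared directly with the value of P(A3|D2) derived in part d.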