Genomic Data Manipulation
BIO508 Spring 2014
Problems 03
Python: Reading and Writing Data
1.
The goal of this problem - which spans the entire problem set! - will be to nonparametrically detect genes on
the ninth Saccharomyces cerevisiae chromosome that are differentially expressed under two different nutrient
limitations. Specifically, yeast can typically synthesize both leucine and uracil, but auxotrophy can be induced
by deleting one (or more) of the genes necessary for carrying out biosynthesis of these compounds (leucine is an
amino acid, uracil a nucleobase). When unable to synthesize them, yeast can survive by taking them up from the surrounding medium,
but this is an "unexpected" condition with respect to typical nutrient sensing by its transcriptional regulatory
network. We will test for transcripts that are differentially affected by a lack of leucine versus a lack of uracil.
We stressed earlier that a common way for computers to be used in high-dimensional data processing is to
perform nonparametric hypothesis tests by shuffling data around to generate an empirical null distribution.
For example, the null distribution of the difference in means between two samples drawn independently
from the same underlying normally distributed population is a t-distribution. The null distribution of the
difference in means between two sets of gene expression data drawn from some experimental process that
couldn't care less about normality or homoscedasticity is... who knows what! But we can perform our own
hypothesis tests regardless of the "shape" of the underlying population by repeatedly shuffling and
comparing test statistics.
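As a rough preview of where the problem set is heading, the shuffling idea can be sketched in a few lines. This is a minimal illustration only: the function name permutationPValue, the toy numbers, and the use of a plain difference in means as the test statistic are all assumptions made for this sketch, not the functions you will write below.

```python
import random

def mean(adValues):
    return sum(adValues) / float(len(adValues))

def permutationPValue(adOne, adTwo, iIterations=1000, iSeed=0):
    """One-sided empirical p-value for mean(adOne) - mean(adTwo)."""
    rng = random.Random(iSeed)  # fixed seed so the sketch is repeatable
    dObserved = mean(adOne) - mean(adTwo)
    adPooled = list(adOne) + list(adTwo)
    iAtLeastAsLarge = 0
    for _ in range(iIterations):
        rng.shuffle(adPooled)  # relabel the observations at random
        adShuffledOne = adPooled[:len(adOne)]
        adShuffledTwo = adPooled[len(adOne):]
        if mean(adShuffledOne) - mean(adShuffledTwo) >= dObserved:
            iAtLeastAsLarge += 1
    return iAtLeastAsLarge / float(iIterations)

# Toy data (made up for illustration): two clearly separated groups
print(permutationPValue([5.1, 4.9, 5.3, 5.0], [1.0, 1.2, 0.9, 1.1]))
```

No distributional assumption appears anywhere in the sketch; the shuffled relabelings themselves stand in for the null distribution.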
a.
(0) Recall the yeast growth rate expression data you downloaded for the first problem set, specifically
the fully-imputed set of transcriptional responses available here:
http://growthrate.princeton.edu/data/dilution_rate_02_knn.txt
If you don't still have this file sitting around, download it and save it in a safe place. This will be your
input file for this problem.
b.
(3) Let's start easy, with a few simple functions that will help us process the expression data we're going
to read in. Write a function startsWith that takes two arguments, a string and a list of target substrings
to search for. It should return True if the first string (first argument) starts with any of the targets given
in the list (second argument), and False otherwise. startsWith should require approximately five lines
of code, including the def.
startsWith( "yabba", ["dab", "do"] ) → False
startsWith( "yabba", ["yab", "dab"] ) → True
startsWith( "yabba", [] ) → False
startsWith( "dabba", ["d", "y"] ) → True
startsWith( "doo", ["d", "do", "doo"] ) → True
startsWith( "doo", ["doo"] ) → True
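One possible shape for startsWith is sketched here, assuming Python's built-in str.startswith method is allowed; treat it as an illustration rather than the expected answer. The argument names are chosen for this sketch.

```python
def startsWith(strValue, astrTargets):
    # Return True as soon as any target is a prefix of strValue
    for strTarget in astrTargets:
        if strValue.startswith(strTarget):
            return True
    return False

print(startsWith("yabba", ["yab", "dab"]))
print(startsWith("yabba", []))
```

An empty target list falls straight through the loop, which is why that case returns False.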
a.
(3) Continue with a function listStartsWith that also takes two arguments, both lists of strings. It
should use startsWith to return a list of each index in the first argument that starts with any of the
substrings given in the second argument. listStartsWith should require approximately six lines of code,
including the def, and its English equivalent should be something like, "starting with an empty return
list, for each index in the strings I'm searching, add in the current index if its corresponding search string
starts with any of the targets; after searching everything, return the list."
listStartsWith( ["yabba", "dabba", "doo"], ["da", "do"] ) → [1, 2]
listStartsWith( ["yabba", "dabba", "doo"], ["a", "b", "c"] ) → []
listStartsWith( ["yabba", "dabba", "doo"], ["x", "y", "z"] ) → [0]
listStartsWith( ["peter", "piper", "picked", "pickled", "peppers"], ["pe", "per"] ) → [0, 4]
listStartsWith( ["peter", "piper", "picked", "pickled", "peppers"], ["pip", "pop"] ) → [1]
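The English description above translates almost word for word into code. The sketch below is one way to do it; the startsWith helper is re-declared in compact form (using the fact that str.startswith accepts a tuple of prefixes) only so that the block runs on its own.

```python
def startsWith(strValue, astrTargets):
    # str.startswith accepts a tuple of prefixes; an empty tuple gives False
    return strValue.startswith(tuple(astrTargets))

def listStartsWith(astrValues, astrTargets):
    aiRet = []
    for i in range(len(astrValues)):
        # Keep the index, not the string, when a target matches
        if startsWith(astrValues[i], astrTargets):
            aiRet.append(i)
    return aiRet

print(listStartsWith(["yabba", "dabba", "doo"], ["da", "do"]))
```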
b.
(2) Python makes it easy to concatenate lists using the + operator:
[1, 2] + [3, 4] → [1, 2, 3, 4]
[4] + [] + [3, 2] + [] + [1, 2, 3, 4] → [4, 3, 2, 1, 2, 3, 4]
Using this piece of information, write a function testStatistic that takes two arguments, both lists
of numbers (of arbitrary length). It should return the difference of their means divided by their total
standard deviation, that is, (μ_x - μ_y) / σ_{x+y}, where σ_{x+y} is the standard deviation of the two lists concatenated. Note that:
i. You should import problems02 in order to use your mean and stdev functions from the
previous assignment. In order to make this work, copy your problems02.py file to the same
directory as your problems03.py file (if you know how, you can modify the PYTHONPATH
environment variable instead; if you don't know what this means, ignore it).
ii. The body of testStatistic should be exactly one line of code using two calls to
problems02.mean and one to problems02.stdev.
iii. For the statistics purists, this is not the correct definition of pooled standard deviation, but I'm
simplifying it for sanity's sake.
testStatistic( [1, 2, 3], [4, 5] ) → -1.581...
(equal to ( mean( [1, 2, 3] ) - mean( [4, 5] ) ) / stdev( [1, 2, 3, 4, 5] ))
testStatistic( [4, 5], [1, 2, 3] ) → 1.581...
testStatistic( [], [2] ) →
testStatistic( [1], [2] ) → -1.414...
testStatistic( [3, 4], [-5, -6] ) → 1.721...
c.
(1) Using the import random command provides access to the random module:
http://docs.python.org/library/random.html
This includes a function random.sample that takes two arguments, a list and an integer, and chooses a
sublist of the requested length in random order from the given list:
ai = [1, 2, 3]
random.sample( ai, 2 ) → [1, 2] or [1, 3] or [2, 1] or [2, 3] or [3, 1] or [3, 2]
random.sample( ai, 3 ) → [1, 2, 3] or [3, 2, 1] or [3, 1, 2] or [1, 3, 2] or ...
Thus if you call random.sample with an integer equal to the list's length, it returns a shuffled copy of
the list. Use this to write a simple one-line function shuffled that takes one argument, a list, and returns
a copy of that list in random order.
shuffled( [1, 2, 3] ) → [1, 2, 3] or [3, 2, 1] or [3, 1, 2] or [1, 3, 2] or ...
shuffled( [1] ) → [1]
shuffled( [1, -1] ) → [1, -1] or [-1, 1]
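A minimal sketch of the one-liner, using only random.sample as described above (the argument name is chosen for this sketch):

```python
import random

def shuffled(aValues):
    # Sampling every element without replacement yields a shuffled copy
    return random.sample(aValues, len(aValues))

aiOriginal = [1, 2, 3]
aiMixed = shuffled(aiOriginal)
print(aiMixed)     # some ordering of 1, 2, 3
print(aiOriginal)  # the input list itself is left untouched
```

Returning a copy matters later on: the same data row is reshuffled many times, so the original must not be modified in place.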
d. (0) Python functions can only return one value. However, sometimes it's useful to pretend to return
multiple values by returning a list. Python allows you to "blow up" a list into multiple values by
separating them with commas (technically referred to as multiple assignment, not blowing up). This
provides the best of both worlds - functions still get to return just one list, but that list will act like two (or
more) different values:
def returnThreeValues( ):
    return [0, "one", [2, "three"]]
iNumber, strString, aList = returnThreeValues( )
print( iNumber + 1 )
print( "This is a string: " + strString )
print( "The oneth array element is: " + aList[1] )
→ 1, "This is a string: one", "The oneth array element is: three"
def returnTwoArrays( iValue, strValue ):
    aiBananas = [30000, 8, 0]
    astrBananas = ["pounds", "foot", "today"]
    aiBananas.append( iValue )
    astrBananas.append( strValue )
    return [aiBananas, astrBananas]
aiBananas, astrBananas = returnTwoArrays( 7, "phone" )
for i in range( len( aiBananas ) ):
    print( "\t".join( [str(aiBananas[i]), astrBananas[i]] ) )
→ "30000 pounds", "8 foot", "0 today", "7 phone"
e.
(4) Python allows you to select slices from the beginning or end of a list using the colon operator as an
index. You can select the first n elements of a list by placing a value for n after a colon as an index:
ai = [1, 2, 3, 4, 5]
ai[:1] → [1]
ai[:3] → [1, 2, 3]
ai[:15] → [1, 2, 3, 4, 5]
i = 2
ai[:i] → [1, 2]
ai[:( i + 1 )] → [1, 2, 3]
Or you can select the last elements of a list starting from n by placing a value for n before a colon as an
index:
ai[1:] → [2, 3, 4, 5]
ai[3:] → [4, 5]
ai[15:] → []
i = 2
ai[i:] → [3, 4, 5]
ai[( i + 1 ):] → [4, 5]
Combining the slice operator with your shuffled function, write a function subsample that takes three
arguments: a list of numbers, a first length iOne (integer), and a second length iTwo (integer).
subsample should return two lists (using the trick described above) containing the first iOne and last
iTwo elements of a shuffled copy of the given list. That is, the body of the function implements the
following four lines of code:
i. Call shuffled to create a scrambled copy of the input list.
ii. Create one shuffled list adOne by slicing off the beginning of the mixed list. This will make a new
list of length iOne.
iii. Create a second shuffled list adTwo by slicing off the end of the mixed list. This will make a new
list of length iTwo.
iv. Return [adOne, adTwo].
subsample( [1, 2, 3], 2, 1 ) → [[1, 2], [3]] or [[3, 2], [1]] or [[3, 1], [2]] or ...
subsample( [1], 1, 1 ) → [[1], [1]]
subsample( [1, -1], 1, 1 ) → [[1], [-1]] or [[-1], [1]]
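The four steps can be sketched as below. The shuffled helper is repeated so the block runs on its own, and the name adMixed is an assumption of this sketch.

```python
import random

def shuffled(aValues):
    return random.sample(aValues, len(aValues))

def subsample(adData, iOne, iTwo):
    adMixed = shuffled(adData)
    adOne = adMixed[:iOne]                    # first iOne elements
    adTwo = adMixed[(len(adMixed) - iTwo):]   # last iTwo elements
    return [adOne, adTwo]

adOne, adTwo = subsample([1, 2, 3], 2, 1)
print(adOne, adTwo)
```

The end slice is written with an explicit start index rather than adMixed[-iTwo:] as a design choice: when iTwo is 0, the negative form would return the whole list instead of an empty one.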
f.
(3) Time to start putting some of these together - don't worry, we're making great progress! Write a
function called nullDistribution that takes the same three arguments as subsample, a list and two
integers. It should return a list of length exactly 1,000 containing the values of our test statistic that would
be expected by chance if the given list was randomly subsampled using the given lengths.
nullDistribution should call subsample to shuffle and subsample the given list and
testStatistic to calculate the null test statistic for the resulting randomized sublists. You can
implement it by filling in the following blanks:
def nullDistribution( adData, iOne, iTwo ):
    adRet = []
    for i in range( ____ ):
        adOne, adTwo = _________( ______, ____, ____ )
        adRet.append( _____________( adOne, adTwo ) )
    return _____
g.
(2) We're now going to write three short utility functions that we'll use to wrap this thing up. Let's first
augment Python's built-in list slicing colon operator by writing a function listSlice that takes two
arguments, a list (of anything) and a list of integers. It should return a list containing only the elements of
the first list at the indices specified by the second list. You can write listSlice using about five lines of
code (including the def).
listSlice( [1, 2, 3], [0, 2] ) → [1, 3]
listSlice( [1, 2, 3], [2, 0] ) → [3, 1]
listSlice( [1, 2, 3], [0, 3] ) →
listSlice( [1], [0, 0, 0] ) → [1, 1, 1]
listSlice( [-1, 0, 1], [2, 0, 0, 1] ) → [1, -1, -1, 0]
listSlice( [], [] ) → []
listSlice( [], [0] ) →
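A minimal sketch of listSlice follows. In this version an out-of-range index simply raises Python's usual IndexError; that behavior is an assumption of the sketch, since no result is shown above for those cases.

```python
def listSlice(aValues, aiIndices):
    aRet = []
    for i in aiIndices:
        # Indices may repeat and may appear in any order
        aRet.append(aValues[i])
    return aRet

print(listSlice([-1, 0, 1], [2, 0, 0, 1]))
```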
h. (4) So a few subparts ago, you wrote nullDistribution to calculate an empirical null distribution
given some data and our testStatistic. Given a test statistic value from real data, how can we
combine it with the null distribution to generate a p-value? Recall that a p-value is the probability of
obtaining a test statistic value at least as extreme as ours by chance - in other words, the fraction of the
random test statistic values in the null distribution that are greater than or equal to a given real test statistic
value. Stop and think about that for a second:
If the null distribution is [8, 2, 5] and...
...our test statistic is 1, p = 1
...our test statistic is 2, p = 1
...our test statistic is 4, p = 2/3
...our test statistic is 6, p = 1/3
...our test statistic is 9, p = 0
Again for the stats purists, we're ignoring a few things here (specifically fenceposts), but again, it'll keep
life a bit simpler. What's really exciting is the import bisect command:
http://docs.python.org/library/bisect.html
Here, Python provides you with a function bisect.bisect_left that will calculate the insertion point
for a new value in a sorted list (don't ask why it's called "bisect"...):
bisect.bisect_left( [2, 5, 8], 1 ) → 0
bisect.bisect_left( [2, 5, 8], 2 ) → 0
bisect.bisect_left( [2, 5, 8], 4 ) → 1
bisect.bisect_left( [2, 5, 8], 6 ) → 2
bisect.bisect_left( [2, 5, 8], 9 ) → 3
Using this gem, implement a function called pvalue that takes two arguments, a number (our real test
statistic) and a list of numbers (the null distribution) and returns the empirical one-sided p-value (which
is what we just described) using the following three lines of code:
i. Create a copy of the input null distribution sorted in ascending order using sorted.
ii. Find the index of the first value in this sorted list greater than or equal to the given test statistic
using bisect_left.
iii. Dividing this index by the length of the list yields the fraction of null values less than the given
test statistic. Subtract this from one to return the fraction greater than or equal to the value, i.e.
our p-value. Hint: what does Python do if you divide two integers? How can you avoid this?
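The three steps can be sketched as follows; the variable names are assumptions of this sketch. The explicit float conversion is one way to answer the hint, because it forces real-valued division whether or not the interpreter would otherwise truncate the quotient of two integers.

```python
import bisect

def pvalue(dStatistic, adNull):
    adSorted = sorted(adNull)
    # Number of null values strictly less than the real statistic
    iIndex = bisect.bisect_left(adSorted, dStatistic)
    return 1 - (float(iIndex) / len(adSorted))

for dStatistic in [1, 2, 4, 6, 9]:
    print(dStatistic, pvalue(dStatistic, [8, 2, 5]))
```

Running the loop against the null distribution [8, 2, 5] reproduces the fractions worked out by hand above: 1, 1, 2/3, 1/3, and 0.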
i.
(1) Easy one-liner (literally!). Write a function oneToTwoTailed that takes as input a float representing a
one-tailed p-value and returns the value it would have as a two-tailed test. You can do this in one line of
code using some trickery with subtracting 0.5, abs, and multiplying by two, but you don't have to.
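One way to read the "trickery" is sketched here: measure how far the one-tailed value sits from 0.5, double that distance, and subtract it from one. This formula is an interpretation made for this sketch; it does agree with the example values listed for this part.

```python
def oneToTwoTailed(dP):
    # 0.5 maps to 1.0; values near either tail map toward 0
    return 1 - 2 * abs(dP - 0.5)

for dP in [0, 1, 0.5, 0.1, 0.75]:
    print(dP, oneToTwoTailed(dP))
```

An equivalent way to write the same thing is 2 * min(dP, 1 - dP), which some readers may find easier to interpret.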
oneToTwoTailed( 0 ) → 0
oneToTwoTailed( 1 ) → 0
oneToTwoTailed( 0.5 ) → 1.0
oneToTwoTailed( 0.1 ) → 0.2
oneToTwoTailed( 0.75 ) → 0.5
j.
(6) Now the rubber hits the road... write a function named readPCL that takes one argument, an I/O
stream that reads in a microarray PCL file. It should return a list of four values:
i. A list of strings identifying the microarray's conditions (column headers, the first row starting at
the first data column).
ii. A list of strings identifying the microarray's genes (row headers, the first column starting at the
first data row).
iii. A list of strings naming the microarray's genes in human-readable format (i.e. the values from the
PCL file's second column).
iv. A list of lists of floats representing the microarray's data. Each internal list should represent one
row of data.
Write readPCL by implementing the following pseudocode (in which each line represents exactly one
line of real code):
def readPCL( fileIn ):
    astrConditions = a new, empty list
    astrGenes = a new, empty list
    astrNames = a new, empty list
    aadData = a new, empty list
    for each line of the input file:
        astrLine = the line split into a list by tab characters "\t"
        if astrConditions isn't empty:
            append the first element of the list to astrGenes
            append the second element of the list to astrNames
            adData = a new, empty list
            for each string in the list starting at the third column:
                convert the string to a float
                append the float to adData
            append adData to aadData
        else:
            astrConditions = astrLine[starting at the third column:]
    return [astrConditions, astrGenes, astrNames, aadData]
Note that you can test readPCL using code like the following:
import sys

if __name__ == "__main__":
    astrConditions, astrGenes, astrNames, aadData = readPCL( sys.stdin )
    # You could also use readPCL( open( "dilution_rate_02_knn.txt" ) )
    # Or you could use readPCL( open( sys.argv[1] ) )
    print( "\t".join( astrConditions ) )
    print( "\t".join( astrGenes ) )
    print( "\t".join( astrNames ) )
    for adData in aadData:
        astrData = []
        for dDatum in adData:
            astrData.append( str(dDatum) )
        print( "\t".join( astrData ) )
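For comparison, the pseudocode maps onto Python roughly as sketched below, assuming Python 3. Two things are assumptions of this sketch rather than part of the pseudocode: the trailing newline is stripped from each line (otherwise the last condition name would keep it), and the inner float-conversion loop is folded into a list comprehension. The tiny in-memory table, including its condition and gene names, is made up for illustration and is not taken from the real data file.

```python
import io

def readPCL(fileIn):
    astrConditions = []
    astrGenes = []
    astrNames = []
    aadData = []
    for strLine in fileIn:
        astrLine = strLine.rstrip("\r\n").split("\t")
        if astrConditions:
            astrGenes.append(astrLine[0])
            astrNames.append(astrLine[1])
            aadData.append([float(s) for s in astrLine[2:]])
        else:
            # First line: everything from the third column on is a condition
            astrConditions = astrLine[2:]
    return [astrConditions, astrGenes, astrNames, aadData]

# Made-up two-gene, three-condition table, for illustration only
strToy = (
    "ID\tNAME\tCondA\tCondB\tCondC\n"
    "YAL001C\tNAME1\t0.1\t-0.2\t0.3\n"
    "YAL002W\tNAME2\t1.5\t0\t-1\n"
)
astrConditions, astrGenes, astrNames, aadData = readPCL(io.StringIO(strToy))
print(astrConditions)
print(astrGenes)
print(aadData)
```

Wrapping a string in io.StringIO gives an object that can be iterated line by line like an open file, which makes it convenient for testing without touching the disk.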
k.
(0) Fooled you - you're done! Well, almost; you're done writing code, because here's the last function you
need in order to analyze some real gene expression data:
def testSomeGenes( fileIn, astrTargetsGene, astrTargetsOne, astrTargetsTwo ):
    astrConditions, astrIDs, astrNames, aadData = readPCL( fileIn )
    aiGenes = listStartsWith( astrIDs, astrTargetsGene )
    aiOne = listStartsWith( astrConditions, astrTargetsOne )
    aiTwo = listStartsWith( astrConditions, astrTargetsTwo )
    aastrRet = []
    for iGene in aiGenes:
        adData = aadData[iGene]
        adNull = nullDistribution( adData, len( aiOne ), len( aiTwo ) )
        dP = oneToTwoTailed( pvalue( testStatistic( listSlice( adData, aiOne ),
            listSlice( adData, aiTwo ) ), adNull ) )
        if dP < 0.01:
            aastrRet.append( [astrIDs[iGene], astrNames[iGene], str(dP)] )
    return aastrRet
Note that I've taken advantage of all of your hard work up to this point to make this "outermost" function
as simple as possible. Specifically, it:
i. Turns an input file into Python objects using readPCL.
ii. Finds the gene (row) indices whose identifiers start with a requested prefix.
iii. Finds the condition (column) indices whose names start with the two requested sets of prefixes to compare.
iv. Randomizes each of the requested genes' data to generate a null distribution.
v. Finds the p-value for each requested gene's real test statistic by comparing the two sets of
conditions.
vi. Returns a list of lists containing the identifiers, names, and p-values of all significant genes (we're
using a moderately conservative p-value of 0.01 instead of adjusting for multiple hypothesis
tests).
You can run the whole shebang in a few equivalent ways, depending on your Pythonic preference. My
first choice would be:
if __name__ == "__main__":
    # Genes from chromosome X differential between phosphate and nitrate
    aastrGenes = testSomeGenes( sys.stdin, ["YJ"], ["P"], ["N"] )
    for astrGene in aastrGenes:
        print( "\t".join( astrGene ) )
Then run python problems03.py < dilution_rate_02_knn.txt
If you'd rather provide a file name as input on the command line rather than redirect standard input, you
could instead run:
if __name__ == "__main__":
    # Genes from chromosome I differential between glucose and sulfate
    aastrGenes = testSomeGenes( open( sys.argv[1] ), ["YA"], ["G"], ["S"] )
    for astrGene in aastrGenes:
        print( "\t".join( astrGene ) )
Then run python problems03.py dilution_rate_02_knn.txt
Note that the yeast S. cerevisiae has 16 chromosomes that are referred to as A through P, and its open
reading frames (genes) are identified using identifiers like YAL001C:
i. Y for "yeast".
ii. A for chromosome I (can be A through P).
iii. L for the left arm (can be L or R).
iv. 001 for the first ORF from the centromere (counts up from one).
v. C for the Crick strand (can be C or W).
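The naming scheme above can be decoded mechanically, which is a handy way to work out the prefix for a given chromosome (chromosome IX is the ninth letter, so its ORFs start with "YI"). The helper below is a hypothetical illustration, not part of the assignment; it only handles identifiers of exactly the form described and returns None for anything else.

```python
import re

def describeORF(strID):
    """Split a systematic ORF name such as YAL001C into its parts."""
    match = re.match(r"^Y([A-P])([LR])(\d{3})([CW])$", strID)
    if match is None:
        return None
    strChromosome, strArm, strNumber, strStrand = match.groups()
    return {
        "chromosome": ord(strChromosome) - ord("A") + 1,  # A -> 1 ... P -> 16
        "arm": "left" if strArm == "L" else "right",
        "number": int(strNumber),
        "strand": "Crick" if strStrand == "C" else "Watson",
    }

print(describeORF("YAL001C"))
print(describeORF("YIL001W")["chromosome"])
```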
l.
(4) Insert a comment/docstring listing the genes on chromosome IX differential between the auxotrophic
limitations for leucine and uracil (there should be somewhere around 7-11, depending on exactly how
your random null distributions end up). From reading the names of these genes, what biological
processes do you think might be differentially regulated in yeast limited for leucine versus uracil?
m. (3 ) Why?
n. (3 ) Insert a comment/docstring listing the genes genome-wide differential between the prototrophic
limitations (glucose, nitrate, phosphate, and sulfate) and the auxotrophic limitations (leucine and uracil).
Also explain in the comment what search strings you used to select these genes and conditions.
o. (4 ) Insert a comment/docstring listing some other set of genes differential between any two or more
conditions - your choice. Explain the resulting set of genes and why their biology might be interesting.
p. (4 ) Modify testSomeGenes to use a p-value Bonferroni-corrected for multiple hypothesis tests. Use
testSomeGenesBonferroni as your modified function name.
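As a reminder of what the correction does, the sketch below divides the significance level by the number of tests and keeps only the p-values under that stricter cutoff. The helper name and its arguments are hypothetical; how you fold the idea into testSomeGenesBonferroni is up to you.

```python
def bonferroniSignificant(adP, dAlpha=0.01):
    """Return indices of p-values passing a Bonferroni-corrected cutoff."""
    if not adP:
        return []
    dCutoff = dAlpha / float(len(adP))  # one cutoff shared by all tests
    return [i for i in range(len(adP)) if adP[i] < dCutoff]

print(bonferroniSignificant([0.001, 0.02, 0.0], 0.01))
```

One thing worth thinking about: with a null distribution of 1,000 values, the two-tailed p-values produced by the pipeline above come in steps of 0.002, so once more than a handful of genes are tested, only a p-value of exactly 0 can fall below the corrected cutoff.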
q. (8 ) Modify testSomeGenes to calculate FDR q-values rather than p-values. Use
testSomeGenesFDR as your modified function name.
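The problem does not say which FDR procedure to use. As one common option, the sketch below computes Benjamini-Hochberg adjusted values from a list of p-values; the choice of that procedure, the function name, and the variable names are all assumptions of this sketch.

```python
def benjaminiHochberg(adP):
    """Return Benjamini-Hochberg adjusted values, in the input order."""
    iTests = len(adP)
    # Indices of the p-values from smallest to largest
    aiOrder = sorted(range(iTests), key=lambda i: adP[i])
    adQ = [0.0] * iTests
    dMinimum = 1.0
    # Walk from the largest p-value down so each value is capped by
    # the adjusted values of all larger p-values (keeps them monotone)
    for iRank in range(iTests, 0, -1):
        i = aiOrder[iRank - 1]
        dMinimum = min(dMinimum, adP[i] * iTests / float(iRank))
        adQ[i] = dMinimum
    return adQ

print(benjaminiHochberg([0.01, 0.04, 0.03, 0.005]))
```

For the four toy p-values shown, the adjusted values work out to approximately 0.02, 0.04, 0.04, and 0.02: each p-value is multiplied by the number of tests, divided by its rank, and then capped by the values for the larger p-values.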