Genomic Data Manipulation
BIO508 Spring 2011
Problems 03
Python: I/O and Regular Expressions

1. The goal of this problem - which spans the entire problem set! - will be to nonparametrically detect genes on the ninth Saccharomyces cerevisiae chromosome that are differentially expressed under two different nutrient limitations. Specifically, yeast can typically synthesize both leucine and uracil, but auxotrophy can be induced by deleting one (or more) of the genes necessary for carrying out biosynthesis of these compounds. When unable to synthesize them, yeast can survive by taking them up from the surrounding medium, but this is an "unexpected" condition with respect to typical nutrient sensing by its transcriptional regulatory network. We will test for transcripts that are differentially affected by a lack of leucine versus a lack of uracil.

We stressed earlier that a common way for computers to be used in high-dimensional data processing is to perform nonparametric hypothesis tests by shuffling data around to generate an empirical null distribution. For example, the null distribution of the (suitably scaled) difference in means between two samples drawn independently from the same underlying normally distributed population is a t-distribution. The null distribution of the difference in means between two sets of microarrays drawn from some Agilent machine that couldn't care less about normality or homoscedasticity is... who knows what! But we can perform our own hypothesis tests regardless of the "shape" of the underlying population by repeatedly shuffling and comparing test statistics.

a. (0) Recall the yeast growth rate microarray data you downloaded for the first problem set, specifically the fully-imputed set of transcriptional responses available here:

    http://growthrate.princeton.edu/data/dilution_rate_02_knn.txt

If you don't still have this file sitting around, download it and save it in a safe place. This will be your input file for this problem.

b.
(2) Let's start easy, with a few simple functions that will help us process the microarray data we're going to read in. Write a function listREMatches that takes two arguments, a list of strings and a string representing a regular expression. It should return a list of the indices of the strings that match the given RE. listREMatches should require approximately six lines of code, including the def.

    listREMatches( ["yabba", "dabba", "doo"], r'abba' )  →  [0, 1]
    listREMatches( ["here", "i", "go", "again"], r'[aeiou][rng]' )  →  [0, 3]
    listREMatches( ["queen", "myself", "flash", "1000", "let's"], r'^[a-z].*\'' )  →  [4]

c. (2) Python makes it easy to concatenate lists using the + operator:

    [1, 2] + [3, 4]  →  [1, 2, 3, 4]
    [4] + [] + [3, 2] + [] + [1, 2, 3, 4]  →  [4, 3, 2, 1, 2, 3, 4]

Using this piece of information, write a function testStatistic that takes two arguments, both lists of numbers (of arbitrary length). It should return the difference of their means divided by their total standard deviation. That is, (mean(x) - mean(y)) / stdev(x + y), where x + y is the concatenation of the two lists. Note that:

i. You should import problems02 in order to use your mean and stdev functions from the previous assignment. In order to make this work, copy your problems02.py file to the same directory as your problems03.py file (if you know how, you can modify the PYTHONPATH environment variable instead; if you don't know what this means, ignore it).
ii. The body of testStatistic should be exactly one line of code using two calls to problems02.mean and one to problems02.stdev.
iii. For the statistics purists, this is not the correct definition of pooled standard deviation, but I'm simplifying it for sanity's sake.

    testStatistic( [1, 2, 3], [4, 5] )  →  -1.25
        (equal to ( mean( [1, 2, 3] ) - mean( [4, 5] ) ) / stdev( [1, 2, 3, 4, 5] ))
    testStatistic( [4, 5], [1, 2, 3] )  →  1.25
    testStatistic( [], [2] )
    testStatistic( [1], [2] )  →  -0.603
    testStatistic( [3, 4], [-5, -6] )  →  1.732

d.
(1) Using the import random command provides access to a function random.sample that takes two arguments, a list and an integer, and chooses a sublist of the requested length in random order from the given list:

    ai = [1, 2, 3]
    random.sample( ai, 2 )  →  [1, 2] or [1, 3] or [2, 1] or [2, 3] or [3, 1] or [3, 2]
    random.sample( ai, 3 )  →  [1, 2, 3] or [3, 2, 1] or [3, 1, 2] or [1, 3, 2] or ...

Thus if you call random.sample with an integer equal to the list's length, it returns a shuffled copy of the list. Use this to write a simple one-line function shuffled that takes one argument, a list, and returns a copy of that list in random order.

    shuffled( [1, 2, 3] )  →  [1, 2, 3] or [3, 2, 1] or [3, 1, 2] or [1, 3, 2] or ...
    shuffled( [1] )  →  [1]
    shuffled( [1, -1] )  →  [1, -1] or [-1, 1]

e. (0) Python functions can only return one value. However, sometimes it's useful to pretend to return multiple values by returning a list. Python allows you to "blow up" a list into multiple values by separating them with commas (technically referred to as multiple assignment, not blowing up). This provides the best of both worlds - functions still get to return just one list, but that list will act like two (or more) different values:

    def returnThreeValues( ):
        return [0, "one", [2, "three"]]

    iNumber, strString, aList = returnThreeValues( )
    print( iNumber + 1 )
    print( "This is a string: " + strString )
    print( "The oneth array element is: " + aList[1] )
    → 1, "This is a string: one", "The oneth array element is: three"

    def returnTwoArrays( iValue, strValue ):
        aiBananas = [30000, 8, 0]
        astrBananas = ["pounds", "foot", "today"]
        aiBananas.append( iValue )
        astrBananas.append( strValue )
        return [aiBananas, astrBananas]

    aiBananas, astrBananas = returnTwoArrays( 7, "phone" )
    for i in range( len( aiBananas ) ):
        print( "\t".join( [str(aiBananas[i]), astrBananas[i]] ) )
    → "30000 pounds", "8 foot", "0 today", "7 phone"

f.
(4) Python allows you to select slices from the beginning or end of a list using the colon operator as an index. You can select the first n elements of a list by placing a value for n after a colon as an index:

    ai = [1, 2, 3, 4, 5]
    ai[:1]  →  [1]
    ai[:3]  →  [1, 2, 3]
    ai[:15]  →  [1, 2, 3, 4, 5]
    i = 2
    ai[:i]  →  [1, 2]
    ai[:( i + 1 )]  →  [1, 2, 3]

Or you can select the last elements of a list starting from index n by placing a value for n before a colon as an index:

    ai[1:]  →  [2, 3, 4, 5]
    ai[3:]  →  [4, 5]
    ai[15:]  →  []
    i = 2
    ai[i:]  →  [3, 4, 5]
    ai[( i + 1 ):]  →  [4, 5]

Combining the slice operator with your shuffled function, write a function subsample that takes three arguments: a list of numbers, a first length iOne (integer), and a second length iTwo (integer). subsample should return two lists (using the trick described above) containing the first iOne and last iTwo elements of a shuffled copy of the given list. That is, the body of the function implements the following four lines of code:

i. Call shuffled to create a scrambled copy of the input list.
ii. Create one shuffled list adOne by slicing off the beginning of the mixed list. This will make a new list of length iOne.
iii. Create a second shuffled list adTwo by slicing off the end of the mixed list. This will make a new list of length iTwo.
iv. Return [adOne, adTwo].

    subsample( [1, 2, 3], 2, 1 )  →  [[1, 2], [3]] or [[3, 2], [1]] or [[3, 1], [2]] or ...
    subsample( [1], 1, 1 )  →  [[1], [1]]
    subsample( [1, -1], 1, 1 )  →  [[1], [-1]] or [[-1], [1]]

g. (3) Time to start putting some of these together - don't worry, we're making great progress! Write a function called nullDistribution that takes the same three arguments as subsample, a list and two integers. It should return a list of length exactly 1,000 containing the values of our test statistic that would be expected by chance if the given list was randomly subsampled using the given lengths.
nullDistribution should call subsample to shuffle and subsample the given list and testStatistic to calculate the null test statistic for the resulting randomized sublists. You can implement it by filling in the following blanks:

    def nullDistribution( adData, iOne, iTwo ):
        adRet = []
        for i in range( ____ ):
            adOne, adTwo = _________( ______, ____, ____ )
            adRet.append( _____________( adOne, adTwo ) )
        return _____

h. (2) We're now going to write three short utility functions that we'll use to wrap this thing up. Let's first augment Python's built-in list slicing colon operator by writing a function listSlice that takes two arguments, a list (of anything) and a list of integers. It should return a list containing only the elements of the first list at the indices specified by the second list. You can write listSlice using at most five lines of code (including the def).

    listSlice( [1, 2, 3], [0, 2] )  →  [1, 3]
    listSlice( [1, 2, 3], [2, 0] )  →  [3, 1]
    listSlice( [1, 2, 3], [0, 3] )
    listSlice( [1], [0, 0, 0] )  →  [1, 1, 1]
    listSlice( [-1, 0, 1], [2, 0, 0, 1] )  →  [1, -1, -1, 0]
    listSlice( [], [] )  →  []
    listSlice( [], [0] )

i. (4) So a few subparts ago, you wrote nullDistribution to calculate an empirical null distribution given some data and our testStatistic. Given a test statistic value from real data, how can we combine it with the null distribution to generate a p-value? Recall that a p-value is the probability of obtaining a test statistic value at least as extreme as ours by chance - in other words, the fraction of the random test statistic values in the null distribution that are greater than or equal to a given real test statistic value. Stop and think about that for a second: If the null distribution is [8, 2, 5] and...
    ...our test statistic is 1, p = 1
    ...our test statistic is 2, p = 1
    ...our test statistic is 4, p = 2/3
    ...our test statistic is 6, p = 1/3
    ...our test statistic is 9, p = 0

Again for the stats purists, we're ignoring a few things here (specifically fenceposts), but again, it'll keep life a bit simpler. What's really exciting is that by calling import bisect, Python provides you with a function bisect.bisect_left that will calculate the insertion point for a new value in a sorted list (don't ask why it's called "bisect"...):

    bisect.bisect_left( [2, 5, 8], 1 )  →  0
    bisect.bisect_left( [2, 5, 8], 2 )  →  0
    bisect.bisect_left( [2, 5, 8], 4 )  →  1
    bisect.bisect_left( [2, 5, 8], 6 )  →  2
    bisect.bisect_left( [2, 5, 8], 9 )  →  3

Using this gem, implement a function called pvalue that takes two arguments, a number (our real test statistic) and a list of numbers (the null distribution) and returns the empirical one-sided p-value (which is what we just described) using the following three lines of code:

i. Create a copy of the input null distribution sorted in ascending order using sorted.
ii. Find the index of the first value in this sorted list greater than or equal to the given test statistic using bisect_left.
iii. Dividing this index by the length of the list yields the fraction of null values less than the given test statistic. Subtract this from one to return the fraction greater than or equal to the value, i.e. our pvalue. Hint: what does Python do if you divide two integers? How can you avoid this?

j. (1) Easy one-liner (literally!) Write a function oneToTwoTailed that takes as input a float representing a one-tailed p-value and returns the value it would have as a two-tailed test. You can do this in one line of code using some trickery with subtracting 0.5, abs, and multiplying by two, but you don't have to.

    oneToTwoTailed( 0 )  →  0
    oneToTwoTailed( 1 )  →  0
    oneToTwoTailed( 0.5 )  →  1.0
    oneToTwoTailed( 0.1 )  →  0.2
    oneToTwoTailed( 0.75 )  →  0.5

k. (6) Now the rubber hits the road...
write a function named readPCL that takes one argument, an I/O stream from which a microarray PCL file can be read. It should return a list of four values:

i. A list of strings identifying the microarray's conditions (column headers, the first row starting at the first data column).
ii. A list of strings identifying the microarray's genes (row headers, the first column starting at the first data row).
iii. A list of strings naming the microarray's genes in human-readable format (i.e. the values from the PCL file's second column).
iv. A list of lists of floats representing the microarray's data. Each internal list should represent one row of data.

Write readPCL by implementing the following pseudocode (in which each line represents exactly one line of real code):

    def readPCL( fileIn ):
        astrConditions = a new, empty list
        astrGenes = a new, empty list
        astrNames = a new, empty list
        aadData = a new, empty list
        for each line of the input file:
            astrLine = the line split into a list by tab characters "\t"
            if astrConditions isn't empty:
                append the first element of the list to astrGenes
                append the second element of the list to astrNames
                adData = a new, empty list
                for each string in the list starting at the fourth column:
                    convert the string to a float
                    append the float to adData
                append adData to aadData
            else:
                astrConditions = astrLine[starting at the fourth column:]
        return [astrConditions, astrGenes, astrNames, aadData]

Note that you can test readPCL using code like the following:

    import sys

    if __name__ == "__main__":
        astrConditions, astrGenes, astrNames, aadData = readPCL( sys.stdin )
        # You could also use readPCL( open( "dilution_rate_02_knn.txt" ) )
        # Or you could use readPCL( open( sys.argv[1] ) )
        print( "\t".join( astrConditions ) )
        print( "\t".join( astrGenes ) )
        print( "\t".join( astrNames ) )
        for adData in aadData:
            astrData = []
            for dDatum in adData:
                astrData.append( str(dDatum) )
            print( "\t".join( astrData ) )

l. (0) Fooled you - you're done!
Well, almost; you're done writing code, because here's the last function you need in order to analyze some real microarray data:

    def testSomeGenes( fileIn, strREGene, strREOne, strRETwo ):
        astrConditions, astrIDs, astrNames, aadData = readPCL( fileIn )
        aiGenes = listREMatches( astrIDs, strREGene )
        aiOne = listREMatches( astrConditions, strREOne )
        aiTwo = listREMatches( astrConditions, strRETwo )
        aastrRet = []
        for iGene in aiGenes:
            adData = aadData[iGene]
            adNull = nullDistribution( adData, len( aiOne ), len( aiTwo ) )
            dP = oneToTwoTailed( pvalue( testStatistic(
                listSlice( adData, aiOne ), listSlice( adData, aiTwo ) ), adNull ) )
            if dP < 0.01:
                aastrRet.append( [astrIDs[iGene], astrNames[iGene], str(dP)] )
        return aastrRet

Note that I've taken advantage of all of your hard work up to this point to make this "outermost" function as simple as possible. Specifically, it:

i. Turns an input file into Python objects using readPCL.
ii. Finds the gene (row) indices that match a requested RE pattern.
iii. Finds the condition (column) indices that match two requested RE patterns to compare.
iv. Randomizes each of the requested genes' data to generate a null distribution.
v. Finds the p-value for each requested gene's real test statistic by comparing the two sets of conditions.
vi. Returns a list of lists containing the identifiers, names, and p-values of all significant genes (we're using a moderately conservative p-value of 0.01 instead of adjusting for multiple hypothesis tests).

You can run the whole shebang in a few equivalent ways, depending on your Pythonic preference.
My first choice would be:

    if __name__ == "__main__":
        # Genes from chromosome X differential between uracil and leucine
        aastrGenes = testSomeGenes( sys.stdin, r'^YJ', r'^U', r'^L' )
        for astrGene in aastrGenes:
            print( "\t".join( astrGene ) )

Then run:

    python problems03.py < dilution_rate_02_knn.txt

If you'd rather provide a file name as input on the command line rather than redirect standard input, you could instead use:

    if __name__ == "__main__":
        # Genes from chromosome I differential between glucose and sulfate
        aastrGenes = testSomeGenes( open( sys.argv[1] ), r'^YA', r'^G', r'^S' )
        for astrGene in aastrGenes:
            print( "\t".join( astrGene ) )

Then run:

    python problems03.py dilution_rate_02_knn.txt

m. (2) Insert a comment or docstring in your submitted Python file explaining what the six regular expressions in the code above match and why they're appropriate to use in these examples. Note that the yeast S. cerevisiae has 16 chromosomes that are referred to as A through P, and its open reading frames (genes) are identified using identifiers like YAL001C:

i. Y for "yeast".
ii. A for chromosome I (can be A through P).
iii. L for the left arm (can be L or R).
iv. 001 for the first ORF from the centromere (counts up from one).
v. C for the Crick strand (can be C or W).

n. (4) Insert a comment/docstring listing the genes on chromosome IX differential between the auxotrophic limitations for leucine and uracil (there should be somewhere around 7-11, depending on exactly how your random null distributions end up). From reading the names of these genes, what biological processes do you think might be differentially regulated in yeast limited for leucine versus uracil?

o. (3 ) Why?

p. (3 ) Insert a comment/docstring listing the genes genome-wide differential between the prototrophic limitations (glucose, nitrate, phosphate, and sulfate) and the auxotrophic limitations (leucine and uracil). Also explain what regular expressions you used to select these genes and conditions.
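
As a small illustration of the ORF naming scheme described in part m, the sketch below decomposes a systematic identifier with a regular expression. The helper name parseORF and the pattern itself are assumptions made for this example and are not part of the assignment; identifiers that do not follow the scheme are simply not matched.

```python
import re

# Hypothetical pattern for systematic ORF names such as YAL001C:
# Y, chromosome letter A-P, arm L or R, three-digit ORF number, strand C or W.
ORF_PATTERN = re.compile(r'^Y([A-P])([LR])(\d{3})([CW])')

def parseORF(strID):
    """Return [chromosome number, arm, ORF number, strand], or None if no match."""
    pMatch = ORF_PATTERN.search(strID)
    if not pMatch:
        return None
    strChr, strArm, strNum, strStrand = pMatch.groups()
    # Chromosome letters A..P correspond to chromosome numbers 1..16.
    iChr = ord(strChr) - ord('A') + 1
    return [iChr, strArm, int(strNum), strStrand]

print(parseORF("YAL001C"))
print(parseORF("YJR045C"))
print(parseORF("not an ORF"))
```

Because the pattern is anchored with ^, it only matches identifiers that begin with the systematic prefix, which is the same reason the gene patterns passed to testSomeGenes start with ^.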
q. (4 ) Insert a comment/docstring listing some other set of genes differential between any two or more conditions - your choice. Explain the resulting set of genes and why their biology might be interesting.

r. (4 ) Modify testSomeGenes to use a p-value Bonferroni-corrected for multiple hypothesis tests.

s. (8 ) Modify testSomeGenes to calculate FDR q-values rather than p-values.
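
Parts r and s both concern multiple hypothesis testing. As a rough sketch of the two ideas on a plain list of p-values, kept separate from testSomeGenes, the hypothetical helpers below scale each p-value by the number of tests (Bonferroni) and apply the Benjamini-Hochberg step-up procedure, which is one common way to obtain FDR q-values. The function names and the choice of Benjamini-Hochberg are assumptions made for this example.

```python
def bonferroni(adP):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    iTests = len(adP)
    return [min(1.0, dP * iTests) for dP in adP]

def benjaminiHochberg(adP):
    """Benjamini-Hochberg q-values, returned in the same order as the input."""
    iTests = len(adP)
    # Indices of the p-values ordered from smallest to largest.
    aiOrder = sorted(range(iTests), key=lambda i: adP[i])
    adQ = [0.0] * iTests
    dMin = 1.0
    # Walk from the largest p-value down so q-values never increase
    # as the p-values get smaller.
    for iRank in range(iTests, 0, -1):
        iIndex = aiOrder[iRank - 1]
        dMin = min(dMin, adP[iIndex] * iTests / iRank)
        adQ[iIndex] = dMin
    return adQ

adP = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(adP))
print(benjaminiHochberg(adP))
```

Note that with a null distribution of 1,000 shuffles the smallest nonzero empirical p-value is 0.001, so either correction applied across many genes may leave few or no genes below a fixed threshold unless more shuffles are used.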