Problem Set 1

Please make sure to show your work and calculations and state any assumptions you make in answering the following questions. Include the names of the people you worked with at the top of your problem set.

I. Biology (35 points total)

1 DNA and RNA structure: Nucleic acid polymers are the basis for genetic information storage and transfer in the cell (5 points)
1.1 What is the monomeric unit of DNA called? (1 point)
1.2 What is the monomeric unit of RNA called? (1 point)
1.3 What is the generic term for both of these units? (1 point)
1.4 DNA is usually present in the cell in the form of a double helix. Explain what this structure is and why it is important to the function of DNA. (Keywords: replication, redundancy, anti-parallel, complementary base pairs.) (2 points)

2 Proteins are polymers that perform the intended function of most genes (5 points)
2.1 What is the monomeric unit of a protein? (1 point)
2.2 How many different types of monomers typically exist? (1 point)
2.3 Because of the direction of protein synthesis and the structure of proteins, a protein is usually described as having an N-terminal and a C-terminal end. Why are the symbols N and C used? (1 point)
2.4 Proteins are described as having four levels of structure (primary, secondary, tertiary, and quaternary structure). For each of the following, list the category (or categories) it belongs to. Note that some may belong to more than one category. (2 points)
2.4.1 Amino acid sequence?
2.4.2 Alpha helices?
2.4.3 HIV protease dimer?
2.4.4 Disulfide bonds?

3 An understanding of the central dogma and the structure of DNA, RNA and proteins will help you answer these questions (10 points)
3.1 In the eukaryotic cell, the mature mRNA that is translated is smaller than the gene sequence. Why is this the case? (Keywords: transcription, promoter, exon, intron, splicing.)
3.2 For a given expressed gene, the length (in monomers) of the resultant protein (assuming no post-processing) is less than 1/3 the size of the mature mRNA. Why? (Keywords: tri-nucleotide codon, start codon, stop codon, tRNA, 3’ and 5’ UTR, ORF.)

4 You will need to understand the genetic code to answer these questions. (5 points)
4.1 What six codons encode serine? (1 point)
4.2 List all of the codons that do not code for an amino acid. What is their purpose? (2 points)
4.3 What amino acid does the codon ATG encode? Other than coding for an amino acid, does this codon perform any special function in protein biosynthesis? (2 points)

5 Eukaryotic and prokaryotic organisms differ in many aspects. For each of the cellular structures or characteristics listed below, please identify whether it belongs to eukaryotic organisms, prokaryotic organisms, or both. (5 points)
5.1 Membrane-bound organelles (1 point)
5.2 Nucleus (1 point)
5.3 70S ribosome (1 point)
5.4 RNA splicing (1 point)
5.5 microRNAs (1 point)

6 Speculate on a biological problem that might be interesting to investigate with computational methods. Think of this as a possible subject for your final project. (5 points)

II. Perl Program (35 points total + 5 bonus points)

Please submit your code and output in separate files.

You are working in a lab investigating the properties of the SARS virus. This virus was recently sequenced and identified as a member of the coronavirus family. One of your labmates has recently received frozen respiratory tract cell samples taken from Toronto-area patients either suspected or confirmed to have SARS. Initial PCR-based tests by clinicians in Toronto-area hospitals suggest that a new strain has emerged: while its disease pathology is similar to that observed for the original strain isolated in China, its mortality rate is two-fold higher than that of the original strain.
Your lab is in the middle of preparing samples from the new strain for sequencing. In the meantime, your advisor has asked you to prepare a software tool that identifies putative ORFs and designs oligonucleotide probes to add to your core facility's human DNA microarrays. To craft this program, you will start with the existing SARS sequence so that when your lab finishes the sequencing, you will be able to quickly generate the necessary oligos for the new strain.

Your core facility uses 70-mers for its human oligonucleotide microarray probe set, with a mean melting temperature (Tm) of 67-69 °C. As an initial step, write a Perl script that takes the existing sequence data of the SARS genome and generates a non-overlapping list of 70-mer oligonucleotides within the specified Tm window. You can use the skeleton code to help get you started. Your code should be able to do the following:

1. Calculate GC content for a test 70-mer (10 points for this section):
   a. In order to do this, you will need the following line of code:
      $c = $oligo =~ s/c//gi;
      Explain what this line does as a comment in your Perl script (hint in skeleton). (5 points)
   b. Output the full oligo sequence and its GC content to the screen. (5 points)
2. Calculate Tm (described in the skeleton code) for a test 70-mer. (5 points)
3. Read in the SARS genome and step through all possible 70-mers, calculate GC content and Tm for each 70-mer in the SARS genome, filter out 70-mers that don’t satisfy the Tm requirement (between 67 and 69 °C), and store the filtered results in an array variable. (15 points)
4. The oligos you obtained above may overlap with each other. Filter out the overlapping ones and output a list of non-overlapping oligos with the starting position, oligonucleotide sequence, GC content, and Tm. You should have a total of four tab-separated columns. (5 points)

Hint: to send STDOUT (i.e. the screen output) to a file, redirect the output with the following syntax:
program.pl [switches] > output.txt

Bonus: An important factor in oligo design is masking repetitive sequences to minimize non-specific hybridization. Most oligo design programs have a fairly comprehensive set of repetitive sequences that are masked from the oligo design space. Here, filter your list of results by removing oligonucleotides that have a homo-polynucleotide tract 5 or more bases in length. Note that you should perform your filtering on the list of all possible qualifying oligos, not the list of non-overlapping oligos. (5 points)

III. Excel tutorial (30 points total + 5 bonus points)

This exercise is designed to get you used to using Excel for general data analysis tasks. You will need some data to work with. You will be looking at genomic expression profiles used to classify cancer types. Don’t worry about how they were created or what they mean; you will learn more about expression profiling later. If you want to learn more about the data, look here: http://wwwgenome.wi.mit.edu/mpr/publications/projects/Leukemia/Files_descriptions.txt.

Download the following files:
http://wwwgenome.wi.mit.edu/mpr/publications/projects/Leukemia/table_ALL_AML_samples.txt
http://wwwgenome.wi.mit.edu/mpr/publications/projects/Leukemia/data_set_ALL_AML_train.txt

Open table_ALL_AML_samples.txt and data_set_ALL_AML_train.txt in Excel and use the Text Import Wizard to parse the tab-delimited data. Push Next, then push Finish.

Now merge the two separate workbooks into one workbook. Go to the Excel window named table_ALL_AML_samples.txt and right-click the tab at the bottom. Select “Move or Copy” and move the selected sheet to the data_set_ALL_AML_train.txt book. Save the merged workbook as an Excel file named ps1_firstname_lastname.xls.

The worksheet table_ALL_AML_samples describes all of the samples in the study. You need to discriminate between ALL and AML samples in the training set (INITIAL SET).
Use this sheet to determine which samples are which.

The worksheet data_set_ALL_AML_train contains all of the expression data for each of the samples. The first two columns are Gene Description and Gene Accession. Each subsequent pair of columns holds the expression value and the call value for one sample. Look at row 1946. The gene name is TNFR2 Tumor necrosis factor receptor 2 (75kD). Cell C1946 is the expression value for sample 1 and cell D1946 is the call value. The three possible call values are A (gene is absent), P (gene is present) and M (the call is marginal).

Insert two new worksheets into the workbook. Name them data_set_ALL_AML_train_exp and data_set_ALL_AML_train_call. Copy the data from data_set_ALL_AML_train to each of these new worksheets. In worksheet data_set_ALL_AML_train_exp, delete every column with the header call. In worksheet data_set_ALL_AML_train_call, rename each call header with the preceding sample ID number, and then delete the columns containing expression values.

Go to data_set_ALL_AML_train_exp. We are going to apply a two-population distance measure to the ALL and AML expression values. First we will try Student's t-test. Go to cell AO1, which should be the first blank cell if you deleted the call columns. Type tTest in that cell. Go to the menu Tools -> Add-Ins … and select Analysis ToolPak. Go to cell AO2 and insert the function TTEST by using the Insert -> Function dialog. Array1 will contain the row 2 values for the ALL samples, and Array2 will contain the row 2 values for the AML samples. Tails is 1 for a one-tailed distribution, and Type is 3 for two-sample unequal variance. It should look like this.

Now we will try another distance measure called signal-to-noise, or S2N. For a description see http://www-genome.wi.mit.edu/cancer/software/genecluster2/gc_ref.html.
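As a rough cross-check outside Excel, the signal-to-noise score for a single gene (the difference of the two class means divided by the sum of the two class standard deviations) can be sketched in Python. The function name and the expression values are made up for illustration and are not taken from the data set. The sketch assumes the sample standard deviation (n - 1 denominator), which is what Excel's STDEV computes.

```python
from statistics import mean, stdev


def signal_to_noise(class1, class2):
    """Return (mean1 - mean2) / (sd1 + sd2) for two lists of expression values."""
    # stdev() is the sample standard deviation (n - 1), like Excel's STDEV.
    spread = stdev(class1) + stdev(class2)
    if spread == 0:
        raise ValueError("both classes have zero spread; S2N is undefined")
    return (mean(class1) - mean(class2)) / spread


# Hypothetical expression values for one gene (not from the leukemia data).
all_values = [10.0, 12.0, 14.0]
aml_values = [1.0, 2.0, 3.0]

score = signal_to_noise(all_values, aml_values)
print(round(score, 3))
```

Swapping the two arguments flips the sign of the score, which is why the assignment asks for both the highest and the most negative rows.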
The signal-to-noise statistic measures the difference of the class means scaled by the sum of the class standard deviations:

$S2N = (\mu_1 - \mu_2) / (\sigma_1 + \sigma_2)$

where $\mu_1$ is the mean of class 1 and $\sigma_1$ is the standard deviation of class 1 (and likewise $\mu_2$, $\sigma_2$ for class 2). Apply this formula to the data in a similar manner as you did with the t-test. You can use the STDEV and AVERAGE functions.

Now go to worksheet data_set_ALL_AML_train_call. Create three columns at the far left titled “Count_present_ALL”, “Count_present_AML” and “Hyper_Geometric”. In Count_present_ALL, use the Excel COUNTIF function to count the number of present calls in ALL for each gene, and in Count_present_AML do the same for AML. Now in the column Hyper_Geometric, use the Excel function HYPGEOMDIST to calculate the hypergeometric distribution for each gene. Don’t worry too much about the hypergeometric distribution, as you will get more background on it later. Look at the screen shot below to get a hint of how to use this function.

List each of the following in a new Excel sheet. Make sure you keep the separate entries separate.
a) The six rows, including Gene Description, with the lowest t-test values. (10 points)
b) The three highest and the three lowest (most negative) S2N rows. (10 points)
c) The six rows with the lowest hypergeometric distribution values. (10 points)

Do you see any common genes in the three lists above? Submit only the data you created for a), b) and c) with your homework.

Bonus: Linear Programming using Excel (5 points)

Please note that you only need to do one of the following two bonus problems (either A or B) in order to get the extra credit. We recommend Bonus B since it’s more biologically relevant.

Bonus A: In a later problem set, you’ll be asked to solve a linear programming problem using Excel. To prepare yourself for this, you will use the online tutorial Teaching Linear Programming using Microsoft Excel Solver, by Ziggy MacDonald at the University of Leicester.
The tutorial can be accessed here: http://www.economics.ltsn.ac.uk/cheer/ch9_3/ch9_3p07.htm. Provide the sensitivity report in Excel format as proof that you did the exercise. (5 points) Note that there are typos on that website: the objective function should be 3x1 + 5x2 (in the problem formulation, before Figure 1) and 3*B9+5*B10 (on entering in Excel, right after Figure 1).

Bonus B: Using Excel to Solve the Flux Balance Example in Lecture 2

In this tutorial, you are going to learn how a linear program (LP), for example the flux balance example in Lecture 2, can be solved using the LP solver in Microsoft Excel.

In Lecture 2, the basic concept of flux balance analysis was illustrated through a simple example involving the following metabolic network. A is a nutrient (raw material). After it enters the cell, A can be converted to either B + C or B + D. B, C, and D exit the cell as products. The input flux of A is limited to 1 mol/sec. Given that the (relative) values of C and D are 1 and 3, the objective is to maximize the total value.

Figure 1: Metabolic network for the FBA example

Using only two decision variables, this problem can be formulated as the following LP:

maximize    x1 + 3 x2        (objective)
subject to  x1 + x2 <= 1     (limit of input flux)
            x1, x2 >= 0      (non-negativity requirements)

where
x1 = flux of the reaction converting A to B and C
x2 = flux of the reaction converting A to B and D.

Having formulated the problem (yours in the future may have substantially more decision variables and constraints), you can then use Excel to solve it. First, you need to make sure that the Solver add-in is installed in your Excel. This feature is installed if you can see the Solver option in the Tools menu. If Excel was set up on your machine by a default installation, this feature is usually not included. In that case you need to add it by carrying out the following once-only steps:

1. Select the menu option Tools | Add-Ins (this will take a few moments to load the necessary file).
2. From the dialogue box presented, check the box for Solver Add-In.
3. On clicking OK, you will then be able to access the Solver option from the new menu option Tools | Solver.

Now you can start entering the LP into Excel. The best approach is first to list in a column the names of the objective function, decision variables and constraints. You can then enter some arbitrary starting values in the cells for the decision variables, usually zero, as shown below. Excel will vary the values of these cells as it determines the optimal solution. Having assigned the decision variables some arbitrary starting values, you can then use these cell references explicitly in writing the formulae for the objective function and constraints, remembering to start each formula with an '='.

Figure 2: Setting up the problem in Excel

The objective function in B5 will be given by:
=B9+3*B10

The constraints will be given by (putting the right-hand side {RHS} values in the adjacent cells):

Constraint     Cell    Formula
Input limit    B14     =B9+B10
Non-neg 1      B15     =B9
Non-neg 2      B16     =B10

You are now ready to use Solver. On selecting the menu option Tools | Solver, the dialogue box shown in Figure 3 is revealed, and if you select the objective cell before invoking Solver, the correct Target Cell will be identified. This is the value Solver will attempt either to maximize or minimize.

Figure 3: The Solver Dialogue Box

Select whether you wish to minimize or maximize the target cell; in this case you want to set the target cell (the objective) to Max. Note that you can use Solver to find the outcome that will achieve a specified value for the target cell by clicking 'Value of:'. In doing this you can use Solver as a glorified goal seeker. Next you enter the range of cells you want Solver to vary, the decision variables. Click on the white box and select cells B9 and B10, or alternatively type them in.
Note that you can try to get Solver to guess which cells you want to vary by clicking the 'Guess' button. If you have defined your problem in a logical way, Solver should usually get these right.

You can now enter the constraints by first clicking the 'Add ..' button. This reveals the dialogue box shown in Figure 4.

Figure 4: Entering Constraints

The cell reference is to the cell containing your constraint formula, so for the first constraint you enter B14. By default <= is selected, but you can change this by clicking on the drop-down arrow to reveal a list of other constraint types. In the right-hand white box you enter the cell reference to the cell containing the RHS value, which for the first constraint is cell C14. You then click 'Add' to add the rest of the constraints, remembering to include the non-negativity constraints. Having added all the constraints, click 'OK' and the Solver dialogue box should look like that shown in Figure 5.

Figure 5: The Completed Solver Dialogue Box

Before clicking 'Solve', it is good practice when doing LPs to go into the Options and check the 'Assume Linear Model' box, unless, of course, your model isn't linear (Solver can handle most mathematical program types, including non-linear and integer problems). Doing this can reduce the time Solver takes to find a solution; it also ensures the correct result and, quite importantly, provides the relevant sensitivity report.

Having selected this option, you are now ready to click 'Solve' and see Solver find the optimal values for x1 and x2. On doing this, Excel will inform you of Solver's progress at the bottom of the screen; then, on finding an optimal solution, the dialogue box shown in Figure 6 will appear. You will also observe that Solver has altered the values in your spreadsheet, replacing them with the optimal results. You can use the Solver Results dialogue box to generate three reports.
To select all three at once, click each one in turn.

Figure 6: Solver Results

At the same time, it's often a good idea to get Solver to restore your original values in the spreadsheet so that you can return to the original problem formulation and make adjustments to the model, such as altering the availability of resources.

The three reports are generated in new sheets in the current Excel workbook. The Answer Report, shown in Figure 7, gives details of the solution (in this case, the objective is maximized at 3 when x2=1 and x1=0) and the status of each constraint, with accompanying slack/surplus values.

Figure 7: Answer Report

The Sensitivity Report provides information about how sensitive your solution is to changes in the constraints. The report is fairly standard, providing information on shadow values, reduced costs, and the upper and lower limits for the decision variables and constraints. The Limits Report also provides sensitivity information on the RHS values.

All the reports can simply be copied and pasted into Word, and this is perhaps one of the big advantages of using Excel over a DOS-based LP solver. Although the reports paste into Word as tables, they are easily converted into text and can then be manipulated if you are producing a written report on your findings.

Finally, there are several Solver options that allow you to amend or intervene in the solution-generating process. The 'Options' button in the Solver dialogue box reveals the dialogue box shown in Figure 8. You can use this to affect how accurate your solution is, how much 'effort' Solver puts into finding the solution, and whether you want to see the results of each iteration.

Figure 8: Solver Options

The Tolerance option is only required for integer programs (IP). It allows Solver to use 'near integer' values, within the tolerance you specify, which helps speed up the IP calculations.
Checking the Show Iteration Results box allows you to see each step of the calculation, but be warned: if your model is complex, this can take an inordinate length of time. Use Automatic Scaling is useful if there is a huge difference in magnitude between your decision variables and the objective value.

The bottom three options, Estimates, Derivatives and Search, affect the way Solver approaches finding a basic feasible solution, how Solver finds partial differentials of the objective and constraints, and how Solver decides which way to search for the next iteration. Essentially, these options affect how Solver uses memory and the number of calculations it makes. For most LP problems, they are best left at their default values.

The 'Save Model' button is very useful, particularly if you save your model as a named scenario. Clicking this button allows you to assign a name to the current values of your variable cells. This option then allows you to perform further 'what-if' analysis on a variety of possible alternative outcomes, which is very useful for exploring your model in greater detail.

In conclusion, Excel Solver provides a simple yet effective medium for exploring linear programs. You will be using it to solve more complicated linear or even nonlinear programs later in this class.

Provide the Excel file containing the three reports (answer, sensitivity and limits reports). (5 points)
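
To sanity-check the value Solver reports for the flux balance LP, the same two-variable problem can be solved by hand or with a few lines of Python. The sketch below is not what Solver does internally. It assumes the feasible region is bounded and non-empty, in which case an LP attains its optimum at a corner point, so it intersects every pair of constraint boundaries, keeps the feasible intersections, and picks the one with the best objective. The names `CONSTRAINTS`, `OBJECTIVE` and `solve_by_vertices` are illustrative.

```python
from itertools import combinations

# Each constraint is (a1, a2, b), meaning a1*x1 + a2*x2 <= b.
CONSTRAINTS = [
    (1.0, 1.0, 1.0),    # x1 + x2 <= 1   (input flux limit)
    (-1.0, 0.0, 0.0),   # -x1 <= 0, i.e. x1 >= 0
    (0.0, -1.0, 0.0),   # -x2 <= 0, i.e. x2 >= 0
]
OBJECTIVE = (1.0, 3.0)  # maximize x1 + 3*x2


def solve_by_vertices(constraints, objective, tol=1e-9):
    """Return (best_value, (x1, x2)) by checking every feasible corner point.

    Assumes a bounded, non-empty feasible region; returns None otherwise
    only when no feasible corner exists.
    """
    best = None
    for (a1, a2, b), (c1, c2, d) in combinations(constraints, 2):
        det = a1 * c2 - a2 * c1
        if abs(det) < tol:
            continue  # parallel boundary lines never cross
        # Cramer's rule for the intersection of the two boundary lines.
        x1 = (b * c2 - a2 * d) / det
        x2 = (a1 * d - b * c1) / det
        # Keep the point only if it satisfies every constraint.
        if not all(p * x1 + q * x2 <= r + tol for p, q, r in constraints):
            continue
        value = objective[0] * x1 + objective[1] * x2
        if best is None or value > best[0]:
            best = (value, (x1, x2))
    return best


best_value, (x1, x2) = solve_by_vertices(CONSTRAINTS, OBJECTIVE)
print(f"objective = {best_value:g} at x1 = {x1:g}, x2 = {x2:g}")
```

The result agrees with the Answer Report described above: the objective is 3 at x1 = 0, x2 = 1. Corner enumeration is only practical for tiny problems like this one; for larger LPs, use Solver as described in the tutorial.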