Problem Set 1
Please make sure to show your work and calculations and state any assumptions you
make in answering the following questions. Include the names of the people you worked
with at the top of your problem set.
I. Biology (35 Points total)
1 DNA and RNA structure: Nucleic acid polymers are the basis for genetic information
storage and transfer in the cell (5 points)
1.1 What is the monomeric unit of DNA called? (1 point)
1.2 What is the monomeric unit of RNA called? (1 point)
1.3 What is the generic term for both of these units? (1 point)
1.4 DNA is usually present in the cell in the form of a double-helix. Explain what
this structure is and why it is important to the function of DNA. (Keywords:
Replication, redundancy, anti-parallel, complementary base pairs.) (2 points)
2 Proteins are polymers that perform the intended function of most genes (5 points)
2.1 What is the monomeric unit of a protein? (1 point)
2.2 How many different types of monomers typically exist? (1 point)
2.3 Because of the direction of protein synthesis and the structure of proteins, a
protein is usually described as having an N-terminal and a C-terminal end. Why
are the symbols N and C used? (1 point)
2.4 Proteins are described as having four levels of structure (primary, secondary,
tertiary, and quaternary structure). For each of the following, list the category (or
categories) it belongs to. Note that some may belong to more than one
category. (2 points)
2.4.1 Amino Acid Sequence?
2.4.2 Alpha Helices?
2.4.3 HIV protease dimer?
2.4.4 Disulfide bonds?
3 An understanding of the central dogma and the structure of DNA, RNA and proteins
will help you answer these questions (10 points)
3.1 In the eukaryotic cell the size of the mature mRNA that is translated is smaller
than the gene sequence. Why is this the case? (Keywords: transcription,
Promoter, Exon, Intron, Splicing)
3.2 For a given expressed gene, the length (in monomers) of the resultant protein
(assuming no post-processing) is less than 1/3 the length of the mature mRNA.
Why? (Keywords: Tri-Nucleotide Codon, Start Codon, Stop Codon, tRNA, 3’
and 5’ UTR, ORF.)
4 You will need to understand the genetic code to answer these questions. (5 Points)
4.1 What six codons encode for Serine? (1 point)
4.2 List all of the codons that do not code for an amino acid. What is their purpose?
(2 points)
4.3 Which amino acid does the codon ATG encode? Other than coding for an
amino acid, does this codon perform any special function in protein
biosynthesis? (2 points)
5 Eukaryotic and prokaryotic organisms differ in many aspects. For each of the cellular
structures or characteristics listed below, please identify whether it belongs to
eukaryotic organisms, prokaryotic organisms, or both. (5 points)
5.1 Membrane-bound organelles (1 point)
5.2 Nucleus (1 point)
5.3 70S ribosome (1 point)
5.4 RNA splicing (1 point)
5.5 microRNAs (1 point)
6 Speculate on a biological problem that might be interesting to investigate with
computational methods. Think of this as a possible subject for your final project (5
Points)
II. Perl Program (35 points total + 5 bonus points)
Please submit your code and output in separate files.
You are working in a lab investigating the properties of the SARS virus. This virus was
recently sequenced and identified as a member of the coronavirus family. One of
your labmates has recently received frozen respiratory tract cell samples taken from
Toronto-area patients either suspected or confirmed to have SARS. Initial PCR-based
tests by the clinicians in Toronto-area hospitals suggest that a new strain has emerged:
while its disease pathology is similar to that observed for the original strain isolated in
China, its mortality rate is two-fold higher than that of the original strain.
Your lab is in the middle of preparing samples from the new strain for sequencing. In the
meantime, your advisor has asked you to prepare a software tool that would identify
putative ORFs and design oligonucleotide probes to add to your core facility's human
DNA microarrays. In order to craft this program, you will start with the existing SARS
sequence so that when your lab finishes the sequencing, you will be able to quickly
generate the necessary oligos for the new strain.
Your core facility uses 70-mers for its human oligonucleotide microarray probe set, with
a mean melting temperature (Tm) of 67-69 °C.
As an initial step for this process, write a Perl script that takes the existing sequence data
of the SARS genome and generates a non-overlapping list of 70-mer oligonucleotides
within the specified Tm window.
You can use the skeleton code to help get you started. Your code should be able to do the
following:
1. Calculate GC content for a test 70-mer (10 points for this section):
a. In order to do this, you will need the following line of code:
$c = $oligo =~ s/c//gi;
Explain what this line does as a comment in your Perl script (hint in
skeleton). (5 points).
b. Output the full oligo sequence and its GC content to the screen (5 points).
2. Calculate Tm (described in the skeleton code) for a test 70-mer. (5 points)
3. Read in the SARS genome and parse through all possible 70-mers, calculate GC
content and Tm for each 70-mer in the SARS genome, filter out 70-mers that don’t
satisfy the Tm requirement (between 67 and 69 °C), and store the filtered results in an
array variable (15 points).
4. The oligos you obtained above may overlap with each other. Here
you’re asked to filter out the overlapping ones and output a list of non-overlapping
oligos with the starting position, oligonucleotide sequence, GC
content, and Tm. You should have a total of four tab-separated columns. (5 points)
Hint: to output the STDOUT (i.e. the screen) into a file, reroute the output into
the file with the following syntax:
program.pl [switches] > output.txt
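The four steps above can be sketched end to end as follows. This is an illustration in Python rather than Perl, so it does not replace the script the assignment asks for. The function names are hypothetical, and `placeholder_tm` is a made-up stand-in, not the Tm formula from the skeleton code, which is not reproduced here. The comment in `gc_content` points out one property of the Perl line from step 1a that is easy to miss: the substitution returns a count but also changes the oligo.

```python
import re

OLIGO_LENGTH = 70
TM_MIN, TM_MAX = 67.0, 69.0  # Tm window from the assignment, in degrees C


def gc_content(oligo):
    """Return the fraction of G and C bases in oligo (case-insensitive)."""
    # findall only counts; the sequence is left untouched. The Perl idiom
    # `$c = $oligo =~ s/c//gi;` also yields a count, but as a side effect
    # it deletes every matched base from $oligo.
    return len(re.findall(r"[gc]", oligo, flags=re.IGNORECASE)) / len(oligo)


def placeholder_tm(oligo):
    """Stand-in only: NOT the skeleton's formula, just linear in GC content."""
    return 50.0 + 40.0 * gc_content(oligo)


def candidate_oligos(genome, tm_func=placeholder_tm, length=OLIGO_LENGTH):
    """Return (start, oligo, gc, tm) for every window inside the Tm range."""
    hits = []
    for start in range(len(genome) - length + 1):
        oligo = genome[start:start + length]
        tm = tm_func(oligo)
        if TM_MIN <= tm <= TM_MAX:
            hits.append((start + 1, oligo, gc_content(oligo), tm))  # 1-based
    return hits


def drop_overlaps(hits, length=OLIGO_LENGTH):
    """Greedy pass: keep a hit only if it starts after the last kept one ends."""
    kept, next_free = [], 1
    for hit in hits:
        if hit[0] >= next_free:
            kept.append(hit)
            next_free = hit[0] + length
    return kept


def format_rows(hits):
    """Render hits as four tab-separated columns: start, oligo, GC, Tm."""
    return ["{}\t{}\t{:.3f}\t{:.2f}".format(*hit) for hit in hits]


genome = ("GC" * 15 + "AT" * 20) * 3  # made-up sequence, not SARS data
for row in format_rows(drop_overlaps(candidate_oligos(genome))):
    print(row)
```

Because every oligo has the same length, scanning from left to right and always keeping the earliest available oligo yields the largest possible non-overlapping set. Other selection rules (for example, preferring oligos closest to the middle of the Tm window) would also satisfy the assignment.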
Bonus: an important factor in oligo design is to mask repetitive sequences to minimize
non-specific hybridization. Most oligo design programs have a fairly comprehensive set
of repetitive sequences that are masked from the oligo design space. Here, filter your list
of results by removing oligonucleotides that have a homo-polynucleotide tract 5 or more
bases in length. Note that you should perform your filtering on the list of all possible
qualifying oligos, not the list of non-overlapping oligos (5 points).
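The homo-polynucleotide filter in the bonus can be expressed with a single back-referencing regular expression. The following is a minimal sketch in Python, with hypothetical function names, assuming oligos are plain strings of bases. As the bonus states, it should be applied to the full list of qualifying oligos before overlaps are removed.

```python
import re

# One character followed by four or more copies of itself: a run of 5+.
HOMOPOLYMER = re.compile(r"(.)\1{4,}")


def has_homopolymer(oligo):
    """True if oligo contains the same base 5 or more times in a row."""
    # Upper-casing first makes the check case-insensitive.
    return HOMOPOLYMER.search(oligo.upper()) is not None


def mask_homopolymers(oligos):
    """Return only the oligos without a homo-polynucleotide tract."""
    return [oligo for oligo in oligos if not has_homopolymer(oligo)]
```

For example, `mask_homopolymers(["ACGTACGT", "ACGGGGGT"])` keeps only the first sequence, because the second contains a run of five G bases.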
III. Excel tutorial (30 points total + 5 bonus points)
This exercise is designed to get you used to using Excel for general data analysis tasks.
You will need some data to work with. You will be looking at genomic expression
profiles used to classify cancer types. Don’t worry about how they were created or what
they mean; you will learn more about expression profiling later. If you want to learn more
about the data, see http://www-genome.wi.mit.edu/mpr/publications/projects/Leukemia/Files_descriptions.txt.
Download the following files:
http://www-genome.wi.mit.edu/mpr/publications/projects/Leukemia/table_ALL_AML_samples.txt
http://www-genome.wi.mit.edu/mpr/publications/projects/Leukemia/data_set_ALL_AML_train.txt
Open table_ALL_AML_samples.txt and data_set_ALL_AML_train.txt in Excel and use
the Text Import Wizard to parse the tab-delimited data.
Push Next
Push Finish
Now merge the two separate workbooks into one workbook. Go to the Excel window
named table_ALL_AML_samples.txt and right-click the tab at the bottom. Select “Move
or Copy” and move the selected sheet to the data_set_ALL_AML_train.txt book.
Save the merged workbook as an Excel file named ps1_firstname_lastname.xls. The
worksheet table_ALL_AML_samples describes all of the samples in the study. You need to
discriminate between ALL and AML samples in the training set (INITIAL SET). Use this
sheet to determine which samples are which. The worksheet data_set_ALL_AML_train
contains all of the expression data for each of the samples. The first two columns are Gene
Description and Gene Accession. Every pair of columns after that holds the gene
expression and call values for one sample. Look at row 1946. The gene name is TNFR2
Tumor necrosis factor receptor 2 (75kD). Cell C1946 is the expression value for sample 1
and cell D1946 is the call value. The three different values for calls are A (gene is
absent), P (gene is present) and M (the call is marginal).
Insert two new worksheets into the workbook. Name them
data_set_ALL_AML_train_exp and data_set_ALL_AML_train_call. Copy the data from
data_set_ALL_AML_train to each of these new worksheets. In worksheet
data_set_ALL_AML_train_exp, delete every column with the header call. In worksheet
data_set_ALL_AML_train_call, rename each call header with the sample ID number from
the column before it, and then delete the columns containing expression values.
Go to data_set_ALL_AML_train_exp. We are going to apply a two-population distance
measurement to the ALL and AML expression values. First we will try Student's
t-Test. Go to cell AO1, which should be the first blank cell if you deleted the call columns.
Type t-Test in that cell. Go to the menu Tools -> Add-Ins … and select Analysis ToolPak. Go
to cell AO2 and insert the function TTEST by using the Insert -> Function dialog. Array1
will contain the row 2 values for the ALL samples, and Array2 will contain the
row 2 values for the AML samples. Tails will contain 1 for a one-tailed
distribution, and Type is 3 for two-sample unequal variance. It should look like this.
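To see what a two-sample unequal-variance test works with, the sketch below computes the t statistic and approximate degrees of freedom in Python, assuming the standard Welch formulation. The function name is hypothetical. Note that TTEST returns a probability, not the statistic itself; converting the statistic to a one-tailed probability needs the t distribution, which the Python standard library does not provide, so that step is left out here.

```python
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for the unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each mean, from the sample variance (n - 1).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / (se2_a + se2_b) ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Made-up expression values for one gene, not taken from the data set.
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t, df)
```

A large absolute t with the same sign across many samples corresponds to a small TTEST probability, which is why the lowest t-Test values are the most interesting rows.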
Now we will try another distance measurement called Signal-to-Noise or S2N. For a
description see http://www-genome.wi.mit.edu/cancer/software/genecluster2/gc_ref.html.
The Signal-to-Noise measure is the difference of the class means scaled by
the sum of the class standard deviations:
S2N = (μ1 - μ2) / (σ1 + σ2)
Here μ1 is the mean of class 1 and σ1 is the standard deviation of class 1; μ2 and σ2
are defined the same way for class 2. Apply this formula
to the data in a similar manner as you did with the t-Test. You can use the STDEV and
AVERAGE functions.
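As a cross-check for the spreadsheet formula, the Signal-to-Noise measure can be sketched in Python. The function name and the sample values are made up for illustration. `statistics.stdev` is the sample standard deviation (dividing by n - 1), which is the same convention Excel's STDEV uses.

```python
from statistics import mean, stdev


def signal_to_noise(class1, class2):
    """Difference of class means divided by the sum of class standard deviations."""
    return (mean(class1) - mean(class2)) / (stdev(class1) + stdev(class2))


# Made-up expression values for one gene in two classes.
print(signal_to_noise([2, 4, 6], [1, 2, 3]))
```

The sign tells you which class has the higher mean: positive when class 1 is higher, negative when class 2 is higher.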
Now go to worksheet data_set_ALL_AML_train_call. Create three columns to the far left
titled “Count_present_ALL”, “Count_present_AML” and “Hyper_Geometric”. In
Count_present_ALL use the Excel COUNTIF function to count the number of present
calls in ALL for each gene, and in Count_present_AML do the same for AML. Now in
the column Hyper_Geometric use the Excel function HYPGEOMDIST to calculate the
hypergeometric distribution for each gene. Don’t worry too much about the hypergeometric
distribution, as you will get more background on that later. Look at the screen shot below
to get a hint of how to use this function.
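The quantity HYPGEOMDIST returns can be written out directly, which may help when deciding what goes in each argument. The sketch below uses the same argument order as the Excel function. The mapping of counts to arguments in the example is an assumption (one plausible reading of the exercise, since the screen shot is not reproduced here), and the sample counts are made up.

```python
from math import comb


def hypgeom_pmf(sample_s, number_sample, population_s, number_pop):
    """Probability of exactly sample_s successes in number_sample draws
    without replacement from a population of number_pop items that
    contains population_s successes."""
    return (
        comb(population_s, sample_s)
        * comb(number_pop - population_s, number_sample - sample_s)
        / comb(number_pop, number_sample)
    )


# Made-up study: 10 samples, of which 6 are ALL and 4 are AML.
# Suppose a gene is called present in 5 ALL samples and 1 AML sample.
present_all, present_aml = 5, 1
n_all, n_total = 6, 10
print(hypgeom_pmf(present_all, n_all, present_all + present_aml, n_total))
```

A small value means that the observed split of present calls between the two classes would be unlikely if present calls were spread across the samples at random.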
List each of the following in a new Excel sheet. Make sure you keep the separate entries
separate.
a) The six rows, including Gene Description, with lowest t-Test values. (10 points)
b) The three highest and the three lowest (Most Negative) S2N rows. (10 points)
c) The six rows with the lowest hypergeometric distribution values. (10 points)
Do you see any common genes in the three lists above?
Submit only the data you created for a), b) and c) with your homework.
Bonus: Linear Programming using Excel (5 points)
Please note that you only need to do one of the following two bonus problems (either A
or B) in order to get the extra credit. We recommend Bonus B since it’s more
biologically relevant.
Bonus A:
In a later problem set, you’ll be asked to solve a linear programming problem
using Excel. To prepare for this, work through the online tutorial Teaching Linear
Programming using Microsoft Excel Solver, by Ziggy MacDonald at the University of
Leicester. The tutorial can be accessed at
http://www.economics.ltsn.ac.uk/cheer/ch9_3/ch9_3p07.htm. Provide the sensitivity
report in Excel format as proof that you did the exercise. (5 points)
Note that there are typos on that website: the objective function should be 3x1 + 5x2 (in
the problem formulation, before Figure 1) and 3*B9+5*B10 (on entering in Excel, right
after Figure 1).
Bonus B:
Using Excel to Solve the Flux Balance Example in Lecture 2
In this tutorial, you are going to learn how a linear program (LP), for example, the flux
balance example in Lecture 2, can be solved using the LP solver in Microsoft Excel.
In Lecture 2, the basic concept of flux balance analysis was illustrated through a simple
example involving the following metabolic network. A is a nutrient (raw material). After
it enters the cell, A can be converted to either B + C or B + D. B, C, and D exit the cell as
products. The input flux of A is limited to 1 mol/sec. Given that the (relative) values of C
and D are 1 and 3, the objective is to maximize the total value.
Figure 1: Metabolic network for the FBA example
Using only two decision variables, this problem can be formulated as the following LP:
maximize    x1 + 3 x2       (objective)
subject to  x1 + x2 <= 1    (limit of input flux)
            x1, x2 >= 0     (non-negativity requirements)
where
x1 = flux of the reaction converting A to B and C
x2 = flux of the reaction converting A to B and D.
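Because this LP has only two decision variables, its answer can be checked by hand or with a few lines of code before turning to Solver. The sketch below, in Python with hypothetical names, relies on the fact that a bounded LP attains its optimum at a corner of the feasible region: it intersects every pair of constraint boundaries, keeps the feasible points, and picks the one with the best objective value. It is not how Solver works internally and is only practical for tiny problems.

```python
from itertools import combinations

# Each constraint is (a1, a2, b), meaning a1*x1 + a2*x2 <= b.
CONSTRAINTS = [
    (1.0, 1.0, 1.0),    # x1 + x2 <= 1  (limit of input flux)
    (-1.0, 0.0, 0.0),   # -x1 <= 0, i.e. x1 >= 0
    (0.0, -1.0, 0.0),   # -x2 <= 0, i.e. x2 >= 0
]
OBJECTIVE = (1.0, 3.0)  # maximize x1 + 3*x2


def solve_two_variable_lp(constraints, objective, tol=1e-9):
    """Return (value, x1, x2) at the best feasible corner, or None."""
    best = None
    for (a1, a2, b), (c1, c2, d) in combinations(constraints, 2):
        det = a1 * c2 - a2 * c1
        if abs(det) < tol:
            continue  # parallel boundaries never meet in a single point
        # Cramer's rule for the intersection of the two boundary lines.
        x1 = (b * c2 - a2 * d) / det
        x2 = (a1 * d - b * c1) / det
        if all(p * x1 + q * x2 <= r + tol for p, q, r in constraints):
            value = objective[0] * x1 + objective[1] * x2
            if best is None or value > best[0]:
                best = (value, x1, x2)
    return best


print(solve_two_variable_lp(CONSTRAINTS, OBJECTIVE))
```

The feasible region here is the triangle with corners (0, 0), (1, 0) and (0, 1), and the objective is largest at x1 = 0, x2 = 1, where it equals 3. All of the input flux is routed to the reaction producing D, the more valuable product.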
Having formulated the problem (yours in the future may have substantially more
decision variables and constraints), you can use Excel to solve it. First, you need to
make sure that the Solver add-in is installed in your Excel. This feature is installed if you
can see the Solver option in the Tools menu. If Excel was set up on your machine by a
default installation, this feature is usually not included. In that case, add it by
carrying out the following once-only steps:
1. Select the menu option Tools | Add-Ins (this will take a few moments to load the
necessary file).
2. From the dialogue box presented, check the box for Solver Add-In.
3. On clicking OK, you will then be able to access the Solver option from the new
menu option Tools | Solver.
Now you can start entering the LP into Excel. The best approach to entering the problem
into Excel is first to list in a column the names of the objective function, decision
variables and constraints. You can then enter some arbitrary starting values in the cells
for the decision variables, usually zero, as shown below. Excel will vary the values of the
cells as it determines the optimal solutions. Having assigned the decision variables with
some arbitrary starting values you can then use these cell references explicitly in writing
the formulae for the objective function and constraints, remembering to start each
formula with an '=' .
Figure 2: Setting up the problem in Excel
The objective function in B5 will be given by:
=B9+3*B10
The constraints will be given by (putting the right hand side {RHS} values in the adjacent
cells):
Input limit   (B14)   =B9+B10
Non-neg 1     (B15)   =B9
Non-neg 2     (B16)   =B10
You are now ready to use Solver.
On selecting the menu option Tools | Solver, the dialogue box shown in Figure 3 is
revealed. If you select the objective cell before invoking Solver, the correct Target
Cell will be identified. This is the value Solver will attempt either to maximize or
minimize.
Figure 3: The Solver Dialogue Box
Select whether you wish to minimize or maximize the target cell; in this case you
would set the target cell (the objective) to Max. Note that you can use Solver to
find the outcome that will achieve a specified value for the target cell by clicking 'Value
of:'. In doing this you can use Solver as a glorified goal seeker. Next you enter the range
of cells you want Solver to vary, the decision variables. Click on the white box and select
cells B9 & B10, or alternatively type them in. Note that you can try to get Solver to guess
which cells you want to vary by clicking the 'Guess' button. If you have defined your
problem in a logical way, Solver should usually get these right.
You can now enter the constraints by first clicking the 'Add ..' button. This reveals the
dialogue box shown in Figure 4.
Figure 4: Entering Constraints
The cell reference is to the cell containing your constraint formula, so for the first
constraint you enter B14. By default <= is selected but you can change this by clicking on
the drop down arrow to reveal a list of other constraint types. In the right hand white box
you enter the cell reference to the cell containing the RHS value, which for the first
constraint is cell C14. You then click 'Add' to add the rest of the constraints,
remembering to include the non-negativity constraints.
Having added all the constraints, click 'OK' and the Solver dialogue box should look like
that shown in Figure 5.
Figure 5: The Completed Solver Dialogue Box
Before clicking 'Solve' it is good practice when doing LPs to go into the Options and
check the 'Assume Linear Model' box, unless, of course, your model isn't linear (Solver
can handle most mathematical program types, including non-linear and integer
problems). Doing this can reduce the time Solver takes to find a solution
to the problem. It also ensures the correct result and, importantly,
provides the relevant sensitivity report. Having selected this option, you are now ready to
click 'Solve' and see Solver find the optimal values for x1 and x2. On doing this, at the
bottom of the screen Excel will inform you of Solver's progress; then, on finding an
optimal solution, the dialogue box shown in Figure 6 will appear. You will also observe
that Solver has altered the values in your spreadsheet, replacing them with the optimal
results.
You can use the Solver Results dialogue box to generate three reports. To select all three
at once, click each one in turn.
Figure 6: Solver Results
At the same time it's often a good idea to get Solver to restore your original values in the
spreadsheet so that you can return to the original problem formulation and make
adjustments to the model such as altering the availability of resources. The three reports
are generated in new sheets in the current workbook of Excel.
The Answer Report, as shown in Figure 7, gives details of the solutions (in this case, the
objective is maximized at 3 when x2=1 and x1=0) and information concerning the status
of each constraint with accompanying slack/surplus values is provided.
Figure 7: Answer Report
The Sensitivity Report provides information about how sensitive your solution is to
changes in the constraints. The report is fairly standard, providing information on shadow
values, reduced cost and the upper and lower limits for the decision variables and
constraints. The Limits Report also provides sensitivity information on the RHS values.
All the reports can simply be copied and pasted into Word, and this is perhaps one of the
big advantages of using Excel over a DOS-based LP solver. Although the reports paste
into Word as tables, they are easily converted into text and can then be manipulated if
you are producing a written report on your findings.
Finally, there are several options to Solver that can allow you to amend/intervene in the
solution generating process. The 'Options' button in the Solver dialogue box reveals the
dialogue box shown in Figure 8. You can use this to affect how accurate your solution is,
how much 'effort' Solver puts into finding the solution and whether you want to see the
results of each iteration.
Figure 8: Solver Options
The Tolerance option is only required for integer programs (IP), and allows Solver to use
'near integer' values, within the tolerance you specify, and this helps speed up the IP
calculations. Checking the Show Iteration Results box allows you to see each step of the
calculation, but be warned, if your model is complex this can take an inordinate length of
time. Use Automatic Scaling is useful if there is a huge difference in magnitude between
your decision variables and the objective value.
The bottom three options, Estimates, Derivatives and Search affect the way Solver
approaches finding a basic feasible solution, how Solver finds partial differentials of the
objective and constraints, and how Solver decides which way to search for the next
iteration. Essentially the options affect how Solver uses memory and the number of
calculations it makes. For most LP problems, they are best left as the default values.
The 'Save Model' button is very useful, particularly if you save your model as a named
scenario. Clicking this button allows you to assign a name to the current values of your
variable cells. This option then allows you to perform further 'what-if' analysis on a
variety of possible alternative outcomes - very useful for exploring your model in greater
detail.
In conclusion, Excel Solver provides a simple, yet effective, medium for allowing you to
explore linear programs. You will be using it to solve more complicated linear or even
nonlinear programs later in this class.
Provide the Excel file containing the three reports (answer, sensitivity and limits reports).
(5 points)