Computing in the Life Sciences
Term 2, Winter 2016–2017
Lab 07
Count the Nucleotides
Due: Sunday, March 19, 9:00am
MARK: [15]
Submitting Student
  Name:
  Number:
  Account:
Partner
  Name:
  Number:
  Account:
Email of student who submitted the lab:
Approximate time to finish
  Before Lab:
  In Lab:
  After Lab:
The FASTA format is a simple text file format often used to store sequence data for DNA, RNA
or proteins. Fast, accurate and flexible manipulation of such files is one of the key capabilities
provided by a general-purpose programming language, like Python, that pre-packaged
applications cannot match. In this lab you are given a Python program that counts the amino
acids in a protein (as described in a FASTA file), and your task is to modify it so that it will count
the nucleotides in a DNA sequence (also described in a FASTA file).
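For orientation, a FASTA record is a header line beginning with ">" followed by one or more lines of sequence. The snippet below is a minimal sketch of splitting such a record into its two parts. The identifier and sequence are made up for illustration and are not taken from the lab's data files.

```python
# A made-up FASTA record for illustration only (not from the lab archive).
example_record = """>example_id hypothetical DNA fragment
ACGTACGTAC
GGTTAACC
"""

lines = example_record.splitlines()
header = lines[0]               # the identifier line starts with '>'
sequence = "".join(lines[1:])   # sequence lines are joined into one string

print(header[1:])     # description without the leading '>'
print(sequence)       # ACGTACGTACGGTTAACC
print(len(sequence))  # 18
```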
Objectives
After this lab you will be able to:
• Manually manipulate FASTA files using a text editor.
• Read and describe features of an already written Python program.
• Read data from a file into a Python program.
• Run a Python program that performs a simple analysis of the sequence stored in a
FASTA file.
• Modify a Python program written by somebody else and test those modifications.
Version: 30-Apr-17
© 2016–2017 Jessica Dawson, Ian M. Mitchell, George Tsiknis
CPSC 301: Lab 07 -- page 1
1. [2] Get Some Test Data
As you have seen in previous labs, computer programs do not always perform as intended (i.e.
they have bugs). In order to detect and diagnose errors in programs we need to test them on
input for which we already know the correct output.
1.1. Before lab: Read through the description of the FASTA file format on Wikipedia.
1.2. Before lab: Download from the course web site the archive fasta_lab_seqs.zip
and save it into your Lab07/ subdirectory. Unzip it (On Mac, double click. On
Windows, right click and choose "Extract All"). You will find two directories:
dna_sequences/ and protein_sequences/. Inside each directory are 10
FASTA format files containing sequences of the appropriate type. In the rest of this
lab, you will use files called dna_sequence#.fasta in the dna_sequences/
directory and files called protein_sequence#.fasta in the
protein_sequences/ directory of this archive.
1.3. Before lab: Find the files dna_sequence#.fasta, where "#" is the last digit of
each of your student numbers (yours and your partner's). If the two last digits
differ, use those two files; otherwise, use one file chosen according to the shared
last digit and one according to the first digit of one of your student numbers. If you
are working by yourself, use the first and last digits of your student number. Copy
the two files you picked from the dna_sequences/ subdirectory into your Lab07/
directory. You will test your program using these files.
1.4. [/1] Before lab: Open the two files in Spyder's text editor. Spyder's "open file" dialog
window normally only shows Python files (with a .py or similar extension), so you will
probably have to tell it to show all files (*.* or *) before you can open them. Do not
open these files in a word processing application (e.g., Microsoft Word). You will notice
that these sequences are long enough that counting the frequency of each nucleotide
by hand would be prone to error and extremely tedious. However, we need test data
for use in debugging our program, so some minimal tedium is unavoidable. For your
two sequences, count the frequencies by hand for the first 30 bases.
Identifier number of the first sequence:
Number of A's:
Number of C's:
Number of G's:
Number of T's:
Identifier number of the second sequence:
Number of A's:
Number of C's:
Number of G's:
Number of T's:
1.5. [/1] Before lab: Delete all of the sequence data in each file except the first 30 bases of
the sequence (the part that you counted in the previous step). When you are done,
each file should be one identifier line and 30 bases of sequence. Save the modified
files under the names short_sequence#.fasta (where “#” is the same digit as in
the name of the original file) in your Lab07/ directory. Be careful that you save it
with a .fasta extension (Spyder normally defaults to using a .py extension). You
will use these files below for testing (and possibly debugging).
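Before relying on the shortened files, it can help to confirm that each one really has the expected shape. The following is a minimal sketch of such a check. The helper name check_short_fasta and the sample content are hypothetical and not part of the lab code.

```python
def check_short_fasta(text, expected_bases=30):
    """Return True if text is one '>' identifier line plus expected_bases bases.

    Hypothetical helper for illustration; it is not part of the lab code.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        return False
    sequence = "".join(lines[1:])
    return len(sequence) == expected_bases


# Made-up example content: an identifier line and exactly 30 bases.
sample = ">made_up_identifier\n" + "ACGT" * 7 + "AC" + "\n"
print(check_short_fasta(sample))

# To check a real file, read its contents first, for example:
#   with open("short_sequence3.fasta") as f:   # the digit 3 is only an example
#       print(check_short_fasta(f.read()))
```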
2. [3] Reading a FASTA file
The goal of this lab is to write a program that counts the frequency of nucleic acid bases in a
sequence stored in FASTA format. However, there's no reason to reinvent the wheel, so let's
look at a program with a similar purpose—counting amino acid frequencies in a FASTA file—to
see if we can reuse or modify certain components for use in our nucleic acid counting program.
There are two reasons that code reuse is a good idea:
• In a way similar to how we learn to write human languages, one of the best ways to
learn how to write computer code is to read code written by other people. The Python
programming language lends itself particularly well to readable code, which makes it an
excellent place to start (it is also a very powerful and popular language in real-world
bioinformatics applications).
• However, unlike written works in a human language, nearly every programmer starts
most new programming tasks by borrowing code written for a different project with a
similar goal (assuming that this is permitted by the author of the code). Consequently,
it is important to be able to read and understand existing code so that we can
determine what (if any) parts of it can be used for our current project.
In our case, we have the aforementioned amino acid counting program to examine. It is
designed to count the “essential” amino acids in a protein sequence FASTA file. The essential
amino acids counted here are: isoleucine (I), leucine (L), lysine (K), methionine (M), phenylalanine (F),
threonine (T), tryptophan (W) and valine (V). Since both amino acids and DNA bases are
encoded in FASTA files using single character codes, this program performs a nearly identical
function to the one that we wish to create, so it's worth spending some time taking a look at it.
2.1. Before lab: Download the file aa_counter_buggy.py from the course website
into your Lab07/ directory, and then open it in Spyder.
2.2. [/3] Before lab: Read through the function readFasta(). Despite the “_buggy”
at the end of the filename, this function is correctly implemented in the code that we
gave you. Briefly explain how this function decides whether the file is in FASTA format
or not. What lines of the file (give the line numbers) do this job?
Line numbers:
Brief explanation:
Briefly explain what happens if a blank line is encountered while reading the sequence
data. What lines cause this behavior to happen?
Line numbers:
Brief explanation:
The FASTA format allows sequences to continue over multiple lines in the file. If such a
FASTA file is read by readFasta(), will the returned string have multiple lines?
Briefly explain why or why not.
Brief explanation:
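For comparison while reading readFasta(), here is a minimal sketch of one way a FASTA reader can handle the three issues raised above: the format check, blank lines, and sequences that span several lines. The function name read_fasta_sketch and all of its behaviour are assumptions made for illustration. The lab's readFasta() may make different choices, so answer the questions from the code you were given.

```python
def read_fasta_sketch(lines):
    """Return (identifier, sequence) from an iterable of FASTA text lines.

    Illustrative only: the checks and blank-line handling here are
    assumptions and may differ from the lab's readFasta().
    """
    lines = iter(lines)
    first = next(lines, "")
    if not first.startswith(">"):
        raise ValueError("not FASTA: first line must start with '>'")
    identifier = first[1:].strip()

    pieces = []
    for line in lines:
        stripped = line.strip()   # drops the trailing newline and spaces
        if stripped == "":
            continue              # this sketch skips blank lines
        pieces.append(stripped)
    return identifier, "".join(pieces)


# Made-up data: the sequence spans two lines with a blank line between them.
ident, seq = read_fasta_sketch([">demo\n", "ACGT\n", "\n", "TTGA\n"])
print(ident, seq)   # demo ACGTTTGA
```

Because the function only iterates over lines, an open file object can be passed in place of the list.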
3. [2] An Amino Acid Frequency Analysis Program
Now we will experiment with this module as a whole.
3.1. Before lab: Copy two protein sequence files protein_sequence#.fasta from
the protein_sequences/ subdirectory into your Lab07/ directory, where the
#’s are the same two numbers you used earlier for choosing your DNA sequences.
Look at the contents of these files in a text editor and think about what you expect to
see when the number of essential amino acids in each sequence is counted (you need
not count precise numbers – just think about a rough estimate).
3.2. [/1] In lab: Load the file aa_counter_buggy.py into Spyder’s editor, make sure
the Python console is using your Lab07/ directory (if in doubt, start a new console),
and click the green arrow run button. The program will ask you to input the name of a
FASTA file. Type in the name of one of your protein_sequence#.fasta files
and press <Enter>. If an error is reported, check that you typed in the filename
correctly, that the Python interpreter is in the Lab07/ directory (and then start a
new console just in case), and that the FASTA file is in that directory. Once the counts
are displayed, you will probably notice that they are not what you expected. Fix the
code. Briefly explain what the problem is and how you fixed it. Hint: Look at the
docstrings. Are the functions being called appropriately in the main program?
Brief explanation:
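A docstring states what arguments a function expects and what it returns, so comparing it with the call site is a quick way to spot a mismatched call. The sketch below uses a hypothetical function, count_letters, that is not taken from aa_counter_buggy.py. It shows how to read a docstring at run time and then call the function as documented.

```python
def count_letters(sequence, letters):
    """Return a dict mapping each character in letters to its count in sequence.

    Hypothetical function, used only to show how a docstring documents
    the arguments a caller is expected to pass.
    """
    return {letter: sequence.count(letter) for letter in letters}


# The docstring is available at run time as __doc__ (help() prints it too).
print(count_letters.__doc__)

# Calling the function the way its docstring describes:
print(count_letters("MKVLLI", "ILKM"))   # {'I': 1, 'L': 2, 'K': 1, 'M': 1}
```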
3.3. [/1] In lab: Once you have fixed the code, run it for each of your protein sequence
files. What are the identifier numbers for the sequences that you analyzed, and what
is the output of the program? Feel free to cut and paste. Show the TA that you have
managed to fix and run the amino acid counting program.
Identifier number for first sequence (0-9):
Counts:
Identifier number for second sequence (0-9):
Counts:
4. [4] A Nucleic Acid Frequency Analysis Program
Now that we've had a chance to examine and run the amino acid frequency program, it is time
to modify it to count nucleotides. Keep in mind that editing the code is only part of what we
have to do; we also need to test the modified code to make sure that it works properly. This
testing and debugging process is one of the most important (and time consuming) parts of
creating programs, so it is worth practising.
4.1. [/1] In lab: Modify count_essential_aa() so that it counts the nucleotides
in a DNA sequence FASTA file (the codes are 'A' for adenine, 'C' for cytosine, 'G'
for guanine and 'T' for thymine). Call the resulting function count_dna() and save
the resulting file (which will also include the readFasta() function) as
dna_counter.py so that you can run the program on the DNA test data. Do not
modify the header for either function (apart from renaming
count_essential_aa() to be count_dna()). Briefly describe what changes
you had to make:
Answer :
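One common way to tally single-character codes is a dictionary updated in a loop. The following is a minimal sketch that works on an in-memory string. The function name count_bases_sketch and its return format are assumptions, and they are not the header required for count_dna().

```python
def count_bases_sketch(sequence):
    """Return a dict of counts for 'A', 'C', 'G' and 'T' in sequence."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    for base in sequence:
        if base in counts:      # characters other than A, C, G, T are ignored
            counts[base] += 1
    return counts


print(count_bases_sketch("ACGTTGCAAT"))   # {'A': 3, 'C': 2, 'G': 2, 'T': 3}
```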
4.2. In lab: Test your program on the two test sequences short_sequence#.fasta
that you counted by hand. If your program’s results on the test sequences do not
match the hand-counted results, find and eliminate any bugs. Note that these
bugs can be in the code and/or in your hand count.
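A side-by-side comparison makes a disagreement between the hand count and the program easy to spot. The sketch below uses a made-up 30-base sequence and made-up hand counts, with collections.Counter standing in for the program under test.

```python
from collections import Counter

# Made-up 30-base test sequence and the counts a person might record by hand.
test_sequence = "ACGT" * 7 + "AC"
hand_counts = {"A": 8, "C": 8, "G": 7, "T": 7}

program_counts = Counter(test_sequence)   # stands in for the program's output

for base in "ACGT":
    status = "ok" if program_counts[base] == hand_counts[base] else "MISMATCH"
    print(base, hand_counts[base], program_counts[base], status)
```

A mismatch on any line means either the code or the hand count needs another look.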
4.3. [/3] In lab: Run your final program on the original, unmodified versions of
dna_sequence#.fasta. Show the TA your output. What are the counts for each
sequence? Note: Make sure you have tested with your short sequences first and you
are confident your code is correct: This question is worth 1 point for each correct
sequence count plus 1 point for getting it right the first time you show the TA.
Identifier number for first sequence:
Output for first sequence:
Identifier number for second sequence:
Output for second sequence:
5. [4] Improving the Nucleic Acid Analysis Program
You should now have a functioning analysis program that counts the frequencies of the
nucleotides in a given sequence. You should also be relatively familiar with the inner workings
of the program. You will now add features that may make your analysis program more useful.
5.1. [/2] After lab: Modify count_dna() to calculate the percentage of each nucleotide
in the given sequence. Display these percentages (rounded to one decimal place) along
with all the other output from your counter. Save your modified module as
improved_counter.py. Download counter_test_percent.txt from the
course website. Check that your module improved_counter.py can pass
doctest.testfile('counter_test_percent.txt'). Hint: Look inside
counter_test_percent.txt to see the format expected for the output.
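The percentage calculation itself can be sketched as follows, assuming the counts are held in a dictionary. The function name base_percentages and the printed layout are hypothetical. The layout that the doctest actually expects is defined in counter_test_percent.txt, not here.

```python
def base_percentages(counts):
    """Return each count as a percentage of the total, rounded to 1 decimal."""
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in counts}   # avoid dividing by zero
    return {base: round(100 * n / total, 1) for base, n in counts.items()}


percentages = base_percentages({"A": 3, "C": 2, "G": 2, "T": 3})
for base, pct in percentages.items():
    print(f"{base}: {pct:.1f}%")   # formatting with .1f keeps one decimal place
```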
5.2. [/2] After lab: Modify count_dna() to ensure that only sequences containing
definite and valid DNA base character codes are counted. For example, the character
“J” is not a valid code for a nucleotide. If “J” appears in the FASTA file, your program
should stop counting, print a message to the user that the sequence is invalid, and
return False (Hint: you can use the break command to stop a loop). Save your
modified module as improved_counter.py. Download
counter_test_invalid.txt from the course website. Check that your module
improved_counter.py can pass
doctest.testfile('counter_test_invalid.txt'). Hint: Look inside
counter_test_invalid.txt to see the format expected for the output.
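The stop-on-invalid pattern described above can be sketched with a flag and a break statement. This is a minimal illustration that treats only A, C, G and T as valid. The function name and the message wording are assumptions, and the exact output the doctest expects is defined in counter_test_invalid.txt.

```python
def count_valid_dna_sketch(sequence):
    """Return base counts, or False if an invalid character is found.

    Illustrative only; the message wording is an assumption, and the exact
    output expected by the lab is defined in counter_test_invalid.txt.
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    valid = True
    for base in sequence:
        if base not in counts:
            valid = False
            break               # stop counting at the first invalid character
        counts[base] += 1
    if not valid:
        print("Invalid sequence")
        return False
    return counts


print(count_valid_dna_sketch("ACGT"))    # {'A': 1, 'C': 1, 'G': 1, 'T': 1}
print(count_valid_dna_sketch("ACJT"))    # prints the message, then False
```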
Submission Checklist
Only one person in each group will submit the lab using the handin tool.
The assignment name you should use with handin to submit this assignment is of the form
“Lab07x” where x is the lower case last letter of your lab section, that is, a for section L2A, b
for section L2B, c for section L2C, d for section L2D, e for section L2E, and f for section L2F. If
you are working with a partner in another section, submit to the section which you attended.
While you may submit multiple times using handin (before the deadline), only your last
submission will be graded. Therefore, you must submit and re-submit all the relevant files in a
single zip archive. For this lab your submission archive should include:
• A completed version of this lab document. Do not forget to fill in the table at the top
with your identity information, your partner’s identity information, and a rough
estimate of how long each component of the lab took (to the nearest 10 minutes is
fine).
• Your solution to the various Python functions: dna_counter.py and
improved_counter.py. (There is no need to submit your bug fixes for
aa_counter_buggy.py, since they are incorporated in dna_counter.py.)
• All FASTA and test files that you used (which should not have been modified).