Computing in the Life Sciences
Term 2, Winter 2016–2017
Lab 07: Count the Nucleotides
Due: Sunday, March 19, 9:00am
MARK: [15]

Submitting Student    Name:    Number:    Account:    Email:
Partner               Name:    Number:    Account:
Approximate time to finish    Before Lab:    In Lab:    After Lab:

The FASTA format is a simple text file format often used to store sequence data for DNA, RNA or proteins. Fast, accurate and flexible manipulation of such files is one of the key capabilities provided by a general-purpose programming language, like Python, that pre-packaged applications cannot match. In this lab you are given a Python program that counts the amino acids in a protein (as described in a FASTA file), and your task is to modify it so that it will count the nucleotides in a DNA sequence (also described in a FASTA file).

Objectives

After this lab you will be able to:
Manually manipulate FASTA files using a text editor.
Read and describe features of an already written Python program.
Read data from a file into a Python program.
Run a Python program that performs a simple analysis of the sequence stored in a FASTA file.
Modify a Python program written by somebody else and test those modifications.

Version: 30-Apr-17 © 2016–2017 Jessica Dawson, Ian M. Mitchell, George Tsiknis. CPSC 301: Lab 07

1. [2] Get Some Test Data

As you have seen in previous labs, computer programs do not always perform as intended (i.e. they have bugs). In order to detect and diagnose errors in programs we need to test them on input for which we already know the correct output.

1.1. Before lab: Read through the description of the FASTA file format on Wikipedia.

1.2. Before lab: Download from the course web site the archive fasta_lab_seqs.zip and save it into your Lab07/ subdirectory. Unzip it (on Mac, double click; on Windows, right click and choose "Extract All"). You will find two directories: dna_sequences/ and protein_sequences/.
Inside each directory are 10 FASTA format files containing sequences of the appropriate type. In the rest of this lab, you will use files called dna_sequence#.fasta in the dna_sequences/ directory and files called protein_sequence#.fasta in the protein_sequences/ directory of this archive.

1.3. Before lab: Find the files dna_sequence#.fasta, where "#" is the last digit of your student numbers (yours and your partner's). If you and your partner have different last digits in your student numbers, use those two files; otherwise, use one file chosen according to your student numbers’ last digit and one according to one of your student numbers’ first digit. If you are working by yourself, use the first and last digit of your student number. Copy the two files you picked from the dna_sequences/ subdirectory and save them in your Lab07/ directory. You will test your program using these files.

1.4. [/1] Before lab: Open the two files in Spyder's text editor. Spyder's "open file" dialog window normally only shows Python files (with a .py or similar extension), so you will probably have to tell it to show all files (*.* or *) before you can open them. Do not open these files in a word processing application (e.g. Microsoft Word). You will notice that these sequences are long enough that counting the frequency of each nucleotide by hand would be prone to error and extremely tedious. However, we need test data for use in debugging our program, so some minimal tedium is unavoidable. For your two sequences, count the frequencies by hand for the first 30 bases.

Identifier number of the first sequence:
Number of A's:
Number of C's:
Number of G's:
Number of T's:

Identifier number of the second sequence:
Number of A's:
Number of C's:
Number of G's:
Number of T's:

1.5.
[/1] Before lab: Delete all of the sequence data in each file except the first 30 bases of the sequence (the part that you counted in the previous step). When you are done, each file should be one identifier line and 30 bases of sequence. Save the modified files under the names short_sequence#.fasta (where “#” is the same digit as in the name of the original file) in your Lab07/ directory. Be careful that you save it with a .fasta extension (Spyder normally defaults to using a .py extension). You will use these files below for testing (and possibly debugging).

2. [3] Reading a FASTA file

The goal of this lab is to write a program that counts the frequency of nucleic acid bases in a sequence stored in FASTA format. However, there's no reason to reinvent the wheel, so let's look at a program with a similar purpose (counting amino acid frequencies in a FASTA file) to see if we can reuse or modify certain components for use in our nucleic acid counting program. There are two reasons that code reuse is a good idea:

In a way similar to how we learn to write human languages, one of the best ways to learn how to write computer code is to read code written by other people. The Python programming language lends itself particularly well to readable code, which makes it an excellent place to start (it is also a very powerful and popular language in real-world bioinformatics applications).

However, unlike written works in a human language, nearly every programmer starts most new programming tasks by borrowing code written for a different project with a similar goal (assuming that this is permitted by the author of the code). Consequently, it is important to be able to read and understand existing code so that we can determine what (if any) parts of it can be used for our current project.

In our case, we have the aforementioned amino acid counting program to examine. It is designed to count the “essential” amino acids in a protein sequence FASTA file.
The essential amino acids are: isoleucine (I), leucine (L), lysine (K), methionine (M), phenylalanine (F), threonine (T), tryptophan (W) and valine (V). Since both amino acids and DNA bases are encoded in FASTA files using single character codes, this program performs a nearly identical function to the one that we wish to create, so it's worth spending some time taking a look at it.

2.1. Before lab: Download the file aa_counter_buggy.py from the course website into your Lab07/ directory, and then open it in Spyder.

2.2. [/3] Before lab: Read through the function readFasta(). Despite the “_buggy” at the end of the filename, this function is correctly implemented in the code that we gave you.

Briefly explain how this function decides whether the file is in FASTA format or not. Which lines of the file (give the line numbers) do this job?
Line numbers:
Brief explanation:

Briefly explain what happens if a blank line is encountered while reading the sequence data. Which lines cause this behavior?
Line numbers:
Brief explanation:

The FASTA format allows sequences to continue over multiple lines in the file. If such a FASTA file is read by readFasta(), will the returned string have multiple lines? Briefly explain why or why not.
Brief explanation:

3. [2] An Amino Acid Frequency Analysis Program

Now we will experiment with this module as a whole.

3.1. Before lab: Copy two protein sequence files protein_sequence#.fasta from the protein_sequences/ subdirectory into your Lab07/ directory, where the #’s are the same two numbers you used earlier for choosing your DNA sequences.
Look at the contents of these files in a text editor and think about what you expect to see when the number of essential amino acids in each sequence is counted (you need not count precise numbers; just think about a rough estimate).

3.2. [/1] In lab: Load the file aa_counter_buggy.py into Spyder’s editor, make sure the Python console is using your Lab07/ directory (if in doubt, start a new console), and click the green arrow run button. The program will ask you to input the name of a FASTA file. Type in the name of one of your protein_sequence#.fasta files and press <Enter>. If an error is reported, check that you typed in the filename correctly, that the Python interpreter is in the Lab07/ directory (and then start a new console just in case), and that the FASTA file is in that directory. Once the counts are displayed, you will probably notice that they are not what you expected. Fix the code. Briefly explain what the problem is and how you fixed it. Hint: Look at the docstrings. Are the functions being called appropriately in the main program?
Brief explanation:

3.3. [/1] In lab: Once you have fixed the code, run it for each of your protein sequence files. What are the identifier numbers for the sequences that you analyzed, and what is the output of the program? Feel free to cut and paste. Show the TA that you have managed to fix and run the amino acid counting program.

Identifier number for first sequence (0-9):
Counts:

Identifier number for second sequence (0-9):
Counts:

4. [4] A Nucleic Acid Frequency Analysis Program

Now that we've had a chance to examine and run the amino acid frequency program, it is time to modify it to count nucleic acids. Keep in mind that editing the code is only part of what we have to do; we also need to test the modified code to make sure that it works properly.
This testing and debugging process is one of the most important (and time consuming) parts of creating programs, so it is worth practising.

4.1. [/1] In lab: Modify count_essential_aa() so that it counts the nucleotide bases in a DNA sequence FASTA file (the codes are 'A' for adenine, 'C' for cytosine, 'G' for guanine and 'T' for thymine). Call the resulting function count_dna() and save the resulting file (which will also include the readFasta() function) as dna_counter.py so that you can run the program on the DNA test data. Do not modify the header for either function (apart from renaming count_essential_aa() to count_dna()). Briefly describe what changes you had to make:
Answer:

4.2. In lab: Test your program on the two test sequences short_sequence#.fasta that you counted by hand. If your program’s results on the test sequences do not match the hand-counted results, find and eliminate any bugs. Note that these bugs can be in the code and/or in your hand count.

4.3. [/3] In lab: Run your final program on the original, unmodified versions of dna_sequence#.fasta. Show the TA your output. What are the counts for each sequence? Note: Make sure you have tested with your short sequences first and are confident your code is correct. This question is worth 1 point for each correct sequence count plus 1 point for getting it right the first time you show the TA.

Identifier number for first sequence:
Output for first sequence:

Identifier number for second sequence:
Output for second sequence:

5. [4] Improving the Nucleic Acid Analysis Program

You should now have a functioning analysis program that counts the frequencies of the nucleotides in a given sequence. You should also be relatively familiar with the inner workings of the program. You will now add features that may make your analysis program more useful.

5.1.
[/2] After lab: Modify count_dna() to calculate the percentage of each nucleotide in the given sequence. Display these percentages (rounded to one decimal place) along with all the other output from your counter. Save your modified module as improved_counter.py. Download counter_test_percent.txt from the course website. Check that your module improved_counter.py can pass doctest.testfile('counter_test_percent.txt'). Hint: Look inside counter_test_percent.txt to see the format expected for the output.

5.2. [/2] After lab: Modify count_dna() to ensure that only sequences containing definite and valid DNA base character codes are counted. For example, the character “J” is not a valid code for a nucleic acid. If “J” appears in the FASTA file, your program should stop counting, print a message to the user that the sequence is invalid, and return False (Hint: you can use the break statement to stop a loop). Save your modified module as improved_counter.py. Download counter_test_invalid.txt from the course website. Check that your module improved_counter.py can pass doctest.testfile('counter_test_invalid.txt'). Hint: Look inside counter_test_invalid.txt to see the format expected for the output.

Submission Checklist

Only one person in each group will submit the lab using the handin tool.

The assignment name you should use with handin to submit this assignment is of the form “Lab07x” where x is the lower case last letter of your lab section, that is, a for section L2A, b for section L2B, c for section L2C, d for section L2D, e for section L2E, and f for section L2F. If you are working with a partner in another section, submit to the section which you attended.

While you may submit multiple times using handin (before the deadline), only your last submission will be graded. Therefore, you must submit and re-submit all the relevant files in a single zip archive.
For this lab your submission archive should include:

A completed version of this lab document. Do not forget to fill in the table at the top with your identity information, your partner’s identity information, and a rough estimate of how long each component of the lab took (to the nearest 10 minutes is fine).
Your solution to the various Python functions: dna_counter.py and improved_counter.py. (There is no need to submit your bug fixes for aa_counter_buggy.py, since they are incorporated in dna_counter.py.)
All FASTA and test files that you used (which should not have been modified).
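
As a point of comparison for the workflow in Sections 1, 4 and 5 (join a multi-line FASTA sequence into one string, count each base, compare with a hand count, then add percentages and a validity check), here is a minimal sketch. It is not the course's aa_counter_buggy.py and not the expected solution: the names parse_fasta_text and base_report are hypothetical, the 30-base record is made up, the return format does not follow counter_test_percent.txt or counter_test_invalid.txt, and an invalid sequence yields None here rather than a printed message and False.

```python
from collections import Counter

VALID_BASES = "ACGT"


def parse_fasta_text(text):
    """Return (identifier, sequence) from the text of a single-record FASTA file.

    Hypothetical helper: assumes exactly one record whose first line starts with '>'.
    """
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not FASTA: first line must start with '>'")
    identifier = lines[0][1:].strip()
    # Join the remaining lines so a multi-line sequence becomes one string.
    sequence = "".join(line.strip() for line in lines[1:]).upper()
    return identifier, sequence


def base_report(sequence):
    """Return {base: (count, percent)}, or None if any character is not A, C, G or T."""
    if not sequence or any(ch not in VALID_BASES for ch in sequence):
        return None
    counts = Counter(sequence)
    total = len(sequence)
    # Percentages are rounded to one decimal place.
    return {b: (counts[b], round(100.0 * counts[b] / total, 1)) for b in VALID_BASES}


# Made-up 30-base record standing in for a short_sequence#.fasta file.
example = ">example_id made-up record\nACGTACGTACGTACG\nTTTTTGGGGGCCCAA\n"
identifier, sequence = parse_fasta_text(example)
report = base_report(sequence)
```

For this made-up record, a hand count gives 6 A's, 7 C's, 9 G's and 8 T's, so report["A"] is (6, 20.0). Comparing a program's counts with a hand count on a short input is the same check that step 4.2 asks for.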