
COM110 Lab 9: Classes, Objects, Genetics Application

In class we looked at pieces of DNA molecules, which were coded as sequences of bases: A, T, G, or C. We created a class called BSequence that read in a file, named the species, and created a list of the base sequence. Recall that a base sequence may contain many genes as well as gaps of junk where the coding does not matter. In this lab we will create several instances of BSequence. Some of these may be genes. Genes are like BSequences in that they have a sequence of bases.

The same gene (perhaps in slightly altered form) can be found in many different organisms, and similarity of genes (if not the same gene) can tell us about the evolutionary relationships among organisms. Measuring the degree of similarity is an interesting problem; for our purposes we only want to know when identical genes are contained in different organisms, so we will concentrate on bacteria. One question, then, is whether a gene is found in a particular BSequence. Recall that BSequences can be read in from the NCBI (National Center for Biotechnology Information) web site at http://www.ncbi.nlm.nih.gov/mapview/ Fasta sequences and Python programs are in the Lab9 folder.

How does one recognize a gene, and what are the implications for the production of protein? Proteins contain combinations of up to 20 amino acids, which are based on the codes in the RNA transcribed from a DNA sequence. One gets to the RNA sequence by exchanging every T (thymine) for a U (uracil). Different amino acids are identified by different codons, which are triplets of bases (for example, AGC is a codon). There are 64 possible codons (4x4x4) of the 4 bases (T, G, C, A). One particular codon is a start codon and three of the codons (triplets) are stop codons. The rest of the codons code for amino acids. Several codons can code for the same amino acid.
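The T-for-U exchange and the 4x4x4 count described above can be checked with a few lines of Python. This is only a sketch: the helper names transcribe and codons are made up for illustration and are not part of the lab's modules, and the sequence is assumed to be a plain string of bases.

```python
from itertools import product

BASES = "TGCA"

def transcribe(dna):
    """Return the RNA string for a DNA string by replacing every T with U."""
    return dna.upper().replace("T", "U")

def codons(seq):
    """Split a sequence into consecutive triplets, dropping leftover bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

# Every ordered triplet of the 4 bases: 4 * 4 * 4 = 64 codons.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

print(len(all_codons))          # 64
print(transcribe("ATGAGCTAA"))  # AUGAGCUAA
print(codons("ATGAGCTAA"))      # ['ATG', 'AGC', 'TAA']
```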
Below is a table giving all 64 possible triplets of the 4 bases and their corresponding codes (start, stop, or one of the 20 amino acids). If we have a BSequence, how do we know when a gene starts? The clue is the specific sequence TATAA, followed (at some later point) by the translation start codon (ATG). From that point on, triplets are read until one of the stop codons is found. Each of the intermediate codons identifies an amino acid.

We will see if an arbitrary BSequence contains a gene and, if it does, we will count the numbers of different amino acids or codons (in a dictionary). We will also write a method to identify how many amino acids there are of each type in a given sequence. The BSequences we will look at will be the complete genomes for several species of bacteria. For each one we will look for a TATAA sequence, then look for an ATG sequence after it, and from that point on count the different codons (triplets) we find until we reach one of the three stop codons.

The computer structures we will be looking at are a class for the BSequence (contained in the module BSequenceClass.py); a class for genes, called Gene, which you will create, and which will be a subclass of BSequence; and a class called GenomeDatabase, which will have a list of BSequence objects. Beginning code can be found in the modules BSequence.py and GenomeDatabase.py in the lab9 folder in the Class folder.

This table shows the 64 codons and the amino acid each codon codes for.
1st base  2nd base U                      2nd base C             2nd base A                 2nd base G
U         UUU (Phe/F) Phenylalanine       UCU (Ser/S) Serine     UAU (Tyr/Y) Tyrosine       UGU (Cys/C) Cysteine
U         UUC (Phe/F) Phenylalanine       UCC (Ser/S) Serine     UAC (Tyr/Y) Tyrosine       UGC (Cys/C) Cysteine
U         UUA (Leu/L) Leucine             UCA (Ser/S) Serine     UAA Ochre (Stop)           UGA Opal (Stop)
U         UUG (Leu/L) Leucine             UCG (Ser/S) Serine     UAG Amber (Stop)           UGG (Trp/W) Tryptophan
C         CUU (Leu/L) Leucine             CCU (Pro/P) Proline    CAU (His/H) Histidine      CGU (Arg/R) Arginine
C         CUC (Leu/L) Leucine             CCC (Pro/P) Proline    CAC (His/H) Histidine      CGC (Arg/R) Arginine
C         CUA (Leu/L) Leucine             CCA (Pro/P) Proline    CAA (Gln/Q) Glutamine      CGA (Arg/R) Arginine
C         CUG (Leu/L) Leucine             CCG (Pro/P) Proline    CAG (Gln/Q) Glutamine      CGG (Arg/R) Arginine
A         AUU (Ile/I) Isoleucine          ACU (Thr/T) Threonine  AAU (Asn/N) Asparagine     AGU (Ser/S) Serine
A         AUC (Ile/I) Isoleucine          ACC (Thr/T) Threonine  AAC (Asn/N) Asparagine     AGC (Ser/S) Serine
A         AUA (Ile/I) Isoleucine          ACA (Thr/T) Threonine  AAA (Lys/K) Lysine         AGA (Arg/R) Arginine
A         AUG (Met/M) Methionine (Start)  ACG (Thr/T) Threonine  AAG (Lys/K) Lysine         AGG (Arg/R) Arginine
G         GUU (Val/V) Valine              GCU (Ala/A) Alanine    GAU (Asp/D) Aspartic acid  GGU (Gly/G) Glycine
G         GUC (Val/V) Valine              GCC (Ala/A) Alanine    GAC (Asp/D) Aspartic acid  GGC (Gly/G) Glycine
G         GUA (Val/V) Valine              GCA (Ala/A) Alanine    GAA (Glu/E) Glutamic acid  GGA (Gly/G) Glycine
G         GUG (Val/V) Valine              GCG (Ala/A) Alanine    GAG (Glu/E) Glutamic acid  GGG (Gly/G) Glycine

Source: http://en.wikipedia.org/wiki/Genetic_code

1) Find a gene in a BSequence: Add a method to BSequence called findGene, which will return the first gene it finds, or an empty string otherwise. Initialize gene to ''. In the BSequence, look for the sequence 'TATAA' followed (at some point) by 'ATG'. If found, call the index of 'ATG' start, initialize a counter i at start+3, and set a codon to the next three elements: codon=self.seq[i:i+3]. Then, in a while loop, as long as codon is not one of the three stop codons (the table uses U where the DNA sequence has T, so be sure to exchange them), move i forward by 3, get the next codon, and go back up to the while again. The s.find() function should come in handy.
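The search described in step 1 might be sketched as a standalone function like the one below. It is not the BSequence method itself, and it rests on several assumptions: the sequence is a plain uppercase string, 'ATG' must begin after the end of 'TATAA', the returned gene runs from 'ATG' up to (not including) the stop codon, and a sequence that ends before any stop codon counts as "no gene". The name find_gene is made up for illustration.

```python
# The three stop codons from the table (UAA, UAG, UGA), written with T.
STOP_CODONS = ("TAA", "TAG", "TGA")

def find_gene(seq):
    """Return the first gene found in seq, or '' if there is none."""
    gene = ""
    tata = seq.find("TATAA")
    if tata == -1:
        return gene
    # str.find takes an optional start index: search only after TATAA.
    start = seq.find("ATG", tata + 5)
    if start == -1:
        return gene
    i = start + 3
    codon = seq[i:i + 3]
    # Step by triplets until a stop codon or the end of the sequence.
    while len(codon) == 3 and codon not in STOP_CODONS:
        i += 3
        codon = seq[i:i + 3]
    if codon in STOP_CODONS:
        gene = seq[start:i]  # assumption: the stop codon is left out
    return gene

print(find_gene("CCTATAACCATGAGCTATTAAGG"))  # ATGAGCTAT
```

The length check in the while condition keeps the loop from running forever when the sequence ends without a stop codon.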
At the end of the while loop, be sure to assign the subsequence you just found to gene. Return the gene and test it in the main program. Be sure to add a docstring for this function.

☺Get check 1

2) Creation of subclasses (inheritance): We will also look at genes by reading them in from files. Expand the BSequence module by adding a new class called Gene, which will be a subclass of BSequence (this is inheritance). It will use the BSequence constructor but also have a speciesList, which will be initialized to []. Later on we will add elements to this list (if those species have this gene). Write code for this constructor. Now, in order to add species to the list for the gene, write a method called addSpecies that takes in a string and adds it to the speciesList for the gene. Now write a printInfo() method for Gene that uses the printInfo method from BSequence but also prints out the speciesList. In a main function, test out this gene by bringing in the file rpoNGene.fasta (this gene is present in many microbes; in some it regulates nitrogen fixation genes and in others it regulates responses to oxygen). At this point you can add an arbitrary species name to test it. In your work, be sure to put in docstrings for modules, classes, and methods (in triple quotes).

☺Get check 2

3) Write the method findAminos: In the class for Gene, write a method called findAminos that constructs a dictionary counting the different codons (called histo) and a dictionary counting some of the amino acids (called aminos), and adds them to the Gene class. In the method, first initialize self.histo to {} and self.aminos to {}. In a while loop, go through the geneSeq by 3's and, for each codon, put it in the dictionary:

    if codon in self.histo:
        self.histo[codon] += 1
    else:
        self.histo[codon] = 1

Now the self.histo dictionary will have a count of all the codons in the gene.
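To see the counting loop on its own, here is a small sketch written as a plain function rather than a Gene method. The name codon_histogram and the short test string are made up for illustration, and the gene sequence is assumed to be a string whose codons start at index 0.

```python
def codon_histogram(gene_seq):
    """Count each complete triplet in gene_seq, stepping by 3."""
    histo = {}
    i = 0
    # Stop when fewer than three bases remain.
    while i + 3 <= len(gene_seq):
        codon = gene_seq[i:i + 3]
        if codon in histo:
            histo[codon] += 1
        else:
            histo[codon] = 1
        i += 3
    return histo

# Codons here are ATG, TAT, TAC, TAT, AGC.
print(codon_histogram("ATGTATTACTATAGC"))
# {'ATG': 1, 'TAT': 2, 'TAC': 1, 'AGC': 1}
```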
Using the table above and the histo dictionary just created, count at least five of the amino acids and put them in the dictionary self.aminos. For example, Tyrosine is coded by two codons, so we would have (don't forget that the table's U is written as T in the sequence):

    self.aminos['Tyrosine'] = self.histo.get('TAT', 0) + self.histo.get('TAC', 0)

Add these two dictionaries to the printInfo method. Try calling the new method with e_coli. Be sure to print out the dictionaries.

☺Get checks 3 and 4

4) Creating another method: Open the module GenomeDatabase.py and try out its methods. Add a boolean method to GenomeDatabase called checkSubseq that takes a BSequence object and a Gene object as parameters and returns whether or not the gene's sequence is a subsequence of the BSequence (i.e., the gene is contained in the BSequence). If it is, then the method should also add the BSequence species name to the speciesList of the gene. You should add code in main() that will test whether the rpoN gene is a subsequence of any of the other sequences (e_coli, rhizobium).

☺Get check 5
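The containment test in step 4 could look roughly like the sketch below. The real BSequence, Gene, and GenomeDatabase classes are not reproduced here, so the example uses small stand-in classes (DemoSequence, DemoGene) and a plain function check_subseq, all hypothetical. It assumes each object stores its bases as a string in seq and its name in an attribute called species (an assumed attribute name), and the sequences are toy data rather than real genomes.

```python
class DemoSequence:
    """Stand-in for BSequence: a species name plus a string of bases."""

    def __init__(self, species, seq):
        self.species = species  # hypothetical attribute name
        self.seq = seq


class DemoGene(DemoSequence):
    """Stand-in for Gene: reuses the parent constructor, adds speciesList."""

    def __init__(self, species, seq):
        super().__init__(species, seq)
        self.speciesList = []

    def addSpecies(self, name):
        self.speciesList.append(name)


def check_subseq(bseq, gene):
    """Return True if gene.seq occurs in bseq.seq; record the species if so."""
    found = gene.seq in bseq.seq  # substring test on strings
    if found:
        gene.addSpecies(bseq.species)
    return found


genome = DemoSequence("toy_bacterium", "CCATGAGCTATTAAGG")
gene = DemoGene("toy_gene", "ATGAGCTAT")
print(check_subseq(genome, gene))  # True
print(gene.speciesList)            # ['toy_bacterium']
```

The in operator only performs this containment test when both sequences are strings; if the bases are stored in a list, they would need to be joined into a string first.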
