COM110 Lab 9: Classes, Objects, Genetics Application
In class we looked at pieces of DNA molecules, which were coded as sequences of bases: A, T, G, or C. We
created a class called BSequence that read in a file, named the species and created a list of the base sequence.
Recall that in a base sequence there may occur many genes and also gaps of junk where the coding does not
matter. In this lab we will create several instances of BSequences. Some of these may be genes. Genes are like
BSequences in that they have a sequence of bases. The same gene (perhaps in slightly altered form) can be
found in many different organisms and similarity of genes (if not the same gene) can tell us about the
evolutionary relationships among organisms. Measuring the degree of similarity is an interesting problem; for
our purposes we will want to know when identical genes are contained in different organisms and so we will be
concentrating on bacteria. So one question is whether a gene is found in a particular BSequence. Recall that
BSequences can be read in from the NCBI (National Center for Biotechnology Information) web site at
http://www.ncbi.nlm.nih.gov/mapview/. FASTA sequences and Python programs are in the Lab9 folder.
How does one recognize a gene and what are the implications for the production of protein? Proteins contain
combinations of up to 20 amino acids, which are based on the codes in the RNA transcribed from a DNA
sequence. One gets to the RNA sequence by exchanging every T (thymine) for a U (uracil). Different amino
acids are identified by different codons, which are triplets of bases (for example, AGC is a codon). There are
64 possible codons (4x4x4) of the 4 bases (T, G, C, A). One particular codon is a start codon and three of the
codons (triplets) are stop codons. The remaining codons code for amino acids. Several codons can code for the same
amino acid. Below is a table giving all the 64 possible triplets of the 4 bases and their corresponding codes
(start, stop or one of the 20 amino acids). If we have a BSequence how do we know when a gene starts? The
clue is the specific sequence TATAA, followed (at some later point) by the translation start codon (ATG). From
that point on, triplets are read until one of the stop codons is found. Each of the intermediate codons identifies
an amino acid. We will see if an arbitrary BSequence contains a gene and if it does, we will count the numbers
of different amino acids or codons (in a dictionary). We will also write a method to identify how many amino
acids there are of each type in a given sequence. The BSequences we will look at will be the complete genomes
for several species of bacteria. For each one we will look for a TATAA sequence, then look for an ATG
sequence after it and from that point on start counting the different codons (triplets) we find until we reach one
of the three stop codons.
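The two basic operations described above, exchanging T for U and reading a sequence in triplets, can be sketched in a few lines of Python. This is only an illustration: the function names transcribe and codons are made up for this example and are not part of the lab modules, and the sequence is assumed to be a plain string of bases.

```python
from itertools import product


def transcribe(dna):
    """Return the RNA form of a DNA string by exchanging every T for a U."""
    return dna.upper().replace("T", "U")


def codons(seq):
    """Split a sequence into consecutive triplets, dropping an incomplete tail."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


# Every possible triplet of the four bases: 4 x 4 x 4 = 64 codons.
ALL_CODONS = ["".join(triplet) for triplet in product("TGCA", repeat=3)]

if __name__ == "__main__":
    print(transcribe("ATGAGCTAA"))
    print(codons("ATGAGCTAAG"))
    print(len(ALL_CODONS))
```

Dropping an incomplete triplet at the end is a choice made for this sketch; a different program might prefer to report it.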
The computer structures we will be looking at are a Class for the BSequence (contained in the module
BSequenceClass.py); a Class for Genes, called Gene, which you will create, and which will be a subclass of
BSequence; and a Class called GenomeDatabase, which will have a list of BSequence objects. Beginning code
can be found in the modules BSequence.py and GenomeDatabase.py in the lab9 folder in the Class folder.
This table shows the 64 codons and the amino acid each codon codes for.
1st   2nd base
base  U                               C                      A                          G

U     UUU (Phe/F) Phenylalanine       UCU (Ser/S) Serine     UAU (Tyr/Y) Tyrosine       UGU (Cys/C) Cysteine
      UUC (Phe/F) Phenylalanine       UCC (Ser/S) Serine     UAC (Tyr/Y) Tyrosine       UGC (Cys/C) Cysteine
      UUA (Leu/L) Leucine             UCA (Ser/S) Serine     UAA Ochre (Stop)           UGA Opal (Stop)
      UUG (Leu/L) Leucine             UCG (Ser/S) Serine     UAG Amber (Stop)           UGG (Trp/W) Tryptophan

C     CUU (Leu/L) Leucine             CCU (Pro/P) Proline    CAU (His/H) Histidine      CGU (Arg/R) Arginine
      CUC (Leu/L) Leucine             CCC (Pro/P) Proline    CAC (His/H) Histidine      CGC (Arg/R) Arginine
      CUA (Leu/L) Leucine             CCA (Pro/P) Proline    CAA (Gln/Q) Glutamine      CGA (Arg/R) Arginine
      CUG (Leu/L) Leucine             CCG (Pro/P) Proline    CAG (Gln/Q) Glutamine      CGG (Arg/R) Arginine

A     AUU (Ile/I) Isoleucine          ACU (Thr/T) Threonine  AAU (Asn/N) Asparagine     AGU (Ser/S) Serine
      AUC (Ile/I) Isoleucine          ACC (Thr/T) Threonine  AAC (Asn/N) Asparagine     AGC (Ser/S) Serine
      AUA (Ile/I) Isoleucine          ACA (Thr/T) Threonine  AAA (Lys/K) Lysine         AGA (Arg/R) Arginine
      AUG (Met/M) Methionine (Start)  ACG (Thr/T) Threonine  AAG (Lys/K) Lysine         AGG (Arg/R) Arginine

G     GUU (Val/V) Valine              GCU (Ala/A) Alanine    GAU (Asp/D) Aspartic acid  GGU (Gly/G) Glycine
      GUC (Val/V) Valine              GCC (Ala/A) Alanine    GAC (Asp/D) Aspartic acid  GGC (Gly/G) Glycine
      GUA (Val/V) Valine              GCA (Ala/A) Alanine    GAA (Glu/E) Glutamic acid  GGA (Gly/G) Glycine
      GUG (Val/V) Valine              GCG (Ala/A) Alanine    GAG (Glu/E) Glutamic acid  GGG (Gly/G) Glycine
Source: http://en.wikipedia.org/wiki/Genetic_code
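For use in a program, the codon table can be held in a dictionary. The sketch below is one possible way to build it, not something taken from the lab modules: the names BASES, AMINO_LETTERS, CODON_TABLE and STOP_CODONS are invented here. It lists the one-letter amino acid codes in the order the nested loops generate the codons (first, second and third base each running through U, C, A, G) and uses "*" for a stop codon.

```python
BASES = "UCAG"

# One-letter codes for all 64 codons, in the order the nested loops below
# generate them; "*" marks a stop codon. Each 16-letter chunk covers one
# first base: U, C, A, G.
AMINO_LETTERS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {}
index = 0
for first in BASES:
    for second in BASES:
        for third in BASES:
            CODON_TABLE[first + second + third] = AMINO_LETTERS[index]
            index += 1

# The codons whose entry is "*" are the three stop codons.
STOP_CODONS = sorted(c for c, letter in CODON_TABLE.items() if letter == "*")

if __name__ == "__main__":
    print(CODON_TABLE["AUG"])  # Methionine, which is also the start codon
    print(STOP_CODONS)
```

To look up codons read from a DNA sequence, first exchange each T for a U, or build the table with "TCAG" as the base string instead.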
1) Find a gene in a BSequence: Add a method to BSequence called findGene, which returns the first gene
it finds, or an empty string if there is none. Initialize gene to ''. In the BSequence, look for the sequence
'TATAA' followed (at some later point) by the start codon 'ATG'. If both are found, call the index of 'ATG' start,
initialize a counter i at start+3, and set codon to the next three elements:
codon=self.seq[i:i+3]
Then, in a while loop, as long as codon is not one of the three stop codons (the table lists them in RNA form, so
exchange each U for a T to get the DNA forms TAA, TAG and TGA), move i forward by 3, get the next codon and go
back up to the while test. Make sure the loop also ends if you run off the end of the sequence.
The s.find() function should come in handy.
At the end of the while loop be sure to assign the subsequence you just found to gene. Return the gene and test
it in the main program. Be sure to add a docstring for this function. ☺Get check 1
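The search described in step 1 might look roughly like the following. This is a sketch under stated assumptions, not the expected solution: it is written as a standalone function find_gene (a made-up name) that takes the sequence as a string, whereas the lab asks for a method on BSequence using self.seq. It also assumes that 'ATG' must begin after the end of 'TATAA', that the returned gene runs from 'ATG' up to but not including the stop codon, and that a sequence with no stop codon yields ''.

```python
# DNA spellings of the stop codons UAA, UAG and UGA.
STOP_CODONS = ("TAA", "TAG", "TGA")


def find_gene(seq):
    """Return the first gene found in seq, or '' if there is none.

    Assumptions made for this sketch: seq is a string of A, T, G, C;
    the gene starts at the first ATG after a TATAA and ends just before
    the first in-frame stop codon.
    """
    gene = ""
    tata = seq.find("TATAA")
    if tata == -1:
        return gene
    # Look for the start codon only after the end of the TATAA sequence.
    start = seq.find("ATG", tata + 5)
    if start == -1:
        return gene
    i = start + 3
    codon = seq[i:i + 3]
    # Step through triplets until a stop codon or the end of the sequence.
    while len(codon) == 3 and codon not in STOP_CODONS:
        i += 3
        codon = seq[i:i + 3]
    if codon in STOP_CODONS:
        gene = seq[start:i]
    return gene


if __name__ == "__main__":
    print(find_gene("GGTATAACCATGAGCTATTAAGG"))
```

The len(codon) == 3 test is what keeps the loop from running forever when the sequence ends without a stop codon, since slicing past the end of a string returns a shorter (eventually empty) string.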
2) Creation of subclasses (inheritance): We will also look at genes by reading them in from files. Expand the
BSequence module by adding a new class called Gene, which will be a subclass of BSequence (this is
inheritance). It will use the BSequence constructor but also have a speciesList, which will be initialized to
[]. Later on we will add elements to this list (if those species have this gene). Write code for this
constructor. Now, in order to add species to the list for the gene, write a method called addSpecies that takes
in a string and adds it to the speciesList for the gene. Now write a printInfo() method for Gene that uses the
printInfo method from BSequence but also prints out the speciesList. In a main function test out this gene by
bringing in the file rpoNGene.fasta (this gene is present in many microbes; in some it regulates nitrogen
fixation genes and in others it regulates responses to oxygen). At this point you can add an arbitrary species
name to test it. In your work be sure to put in docstrings for modules, classes and methods (in triple quotes).
☺Get check 2
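As an illustration of the inheritance pattern in step 2, here is a minimal sketch. The BSequence class below is only a stand-in written for this example (the real one in the lab module reads a file and may have a different constructor and attributes); the attribute names species and seq and the constructor arguments are assumptions. Only the shape of the Gene subclass is the point.

```python
class BSequence:
    """Stand-in for the lab's BSequence class (hypothetical constructor)."""

    def __init__(self, species, seq):
        self.species = species
        self.seq = seq

    def printInfo(self):
        """Print the species name and the length of the sequence."""
        print("Species:", self.species)
        print("Length:", len(self.seq))


class Gene(BSequence):
    """A BSequence that also records which species contain it."""

    def __init__(self, species, seq):
        # Reuse the BSequence constructor, then add the new attribute.
        super().__init__(species, seq)
        self.speciesList = []

    def addSpecies(self, name):
        """Add a species name (a string) to this gene's speciesList."""
        self.speciesList.append(name)

    def printInfo(self):
        """Print the BSequence information followed by the speciesList."""
        super().printInfo()
        print("Species with this gene:", self.speciesList)


if __name__ == "__main__":
    gene = Gene("rpoN", "ATGAGCTAT")
    gene.addSpecies("test species")
    gene.printInfo()
```

Calling super().printInfo() inside Gene.printInfo is what lets the subclass extend the parent's output instead of duplicating it.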
3) Write the method findAminos: In the class for Gene, write a method called findAminos that constructs a
dictionary that counts the different codons (called histo) and a dictionary that counts some of the amino
acids (called aminos) and adds them to the Gene class. In the method, first initialize self.histo to {} and
self.aminos to {}. In a while loop go through the geneSeq by 3's and for each codon, put it in the
dictionary:
if codon in self.histo: self.histo[codon]+=1
else: self.histo[codon]=1
Now the self.histo dictionary will have a count of all the codons in the gene. Using the table above and the
histo dictionary just created, count at least five of the amino acids and put them in the dictionary self.aminos.
For example, Tyrosine is coded by two codons and so we would have (don't forget to change the table's Us to Ts):
self.aminos['Tyrosine']= self.histo.get('TAT',0)+ self.histo.get('TAC',0)
Add these two dictionaries to the printInfo method. Try calling the new method with e_coli. Be sure to print out
the dictionaries.
☺Get checks 3 and 4
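The counting in step 3 can be sketched with two small standalone functions. The names count_codons, count_aminos and AMINO_CODONS are invented for this example; in the lab the same logic goes inside the findAminos method and stores its results in self.histo and self.aminos. The five amino acids chosen here are arbitrary, and their codons are taken from the table above with U changed to T.

```python
# DNA codons (U changed to T) for five amino acids from the table above.
AMINO_CODONS = {
    "Tyrosine": ["TAT", "TAC"],
    "Phenylalanine": ["TTT", "TTC"],
    "Cysteine": ["TGT", "TGC"],
    "Histidine": ["CAT", "CAC"],
    "Tryptophan": ["TGG"],
}


def count_codons(gene_seq):
    """Return a dictionary mapping each codon in gene_seq to its count."""
    histo = {}
    i = 0
    # Go through the sequence by 3's, ignoring an incomplete final triplet.
    while i + 3 <= len(gene_seq):
        codon = gene_seq[i:i + 3]
        histo[codon] = histo.get(codon, 0) + 1
        i += 3
    return histo


def count_aminos(histo):
    """Return amino acid counts by summing the counts of their codons."""
    aminos = {}
    for name, codon_list in AMINO_CODONS.items():
        aminos[name] = sum(histo.get(codon, 0) for codon in codon_list)
    return aminos


if __name__ == "__main__":
    histo = count_codons("TATTACTGGTATCA")
    print(histo)
    print(count_aminos(histo))
```

Using histo.get(codon, 0) in both places avoids a KeyError for codons that never occur in the gene.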
4) Creating another method: Open the module GenomeDatabase.py and try out its methods. Add a boolean
method to GenomeDatabase called checkSubseq that takes a BSequence object and a Gene object as
parameters and returns whether or not the gene's sequence is a subsequence of the BSequence (i.e., the gene's
sequence is contained in the BSequence's sequence). If it is, then the method should also add the BSequence species name to the
speciesList of the gene. You should add code in main() that will test whether the rpoN gene is a subsequence
of any of the other sequences (e_coli, rhizobium).
☺Get check 5
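A possible shape for checkSubseq is sketched below. The three classes are minimal stand-ins written for this example, not the lab's actual modules: the attribute names species, seq and sequences are assumptions, and sequences are assumed to be stored as strings so that Python's in operator tests for a contiguous match.

```python
class BSequence:
    """Stand-in for the lab's BSequence class (hypothetical attributes)."""

    def __init__(self, species, seq):
        self.species = species
        self.seq = seq


class Gene(BSequence):
    """Stand-in for the Gene subclass, with its speciesList."""

    def __init__(self, species, seq):
        super().__init__(species, seq)
        self.speciesList = []


class GenomeDatabase:
    """Stand-in for the lab's GenomeDatabase: holds BSequence objects."""

    def __init__(self):
        self.sequences = []

    def checkSubseq(self, bseq, gene):
        """Return True if the gene's sequence occurs inside bseq's sequence.

        When it does, the species name of bseq is added to the gene's
        speciesList (skipping duplicates, a choice made for this sketch).
        """
        found = gene.seq in bseq.seq
        if found and bseq.species not in gene.speciesList:
            gene.speciesList.append(bseq.species)
        return found


if __name__ == "__main__":
    db = GenomeDatabase()
    gene = Gene("rpoN", "ATGAGC")
    for bseq in (BSequence("e_coli", "GGATGAGCTT"), BSequence("rhizobium", "CCCCC")):
        print(bseq.species, db.checkSubseq(bseq, gene))
    print(gene.speciesList)
```

If the sequences are stored as lists of bases rather than strings, "".join(...) can turn them into strings before the comparison.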