Download Perl exercise 4 (Due on 15/12/2009)

Perl exercise 4
(Due on 07/12/2010)
Don't forget to write well-organized scripts, use meaningful variable names, and write
comments about what you are doing in each part of the code. Always test your script on
several examples. If you need more biological sequences, search for appropriate
examples in GenBank. Note that the obligatory questions are 1, 2, 3a, and 5.
Pattern Matching
1. Write the following regular expressions (test them in a script, but send us just
the regular expressions):
a. Match a number which is composed only of even digits, including 0 (don't
allow 0 to be the first digit!)
For example it should match: 248, 4200, 6
Should not match: 100, 020, 5
b. Match a number which may be negative or positive, may have a decimal
point, but it must be smaller than 1000, and larger than -1000.
For example it should match: -132, 3.1415, 0
Should not match: 1000, -2001
c. Match a number like in (b), but if there is a decimal point, the digits after it
must be 0 or 5 (any number of '0' or '5' is allowed).
For example it should match: -132, 3.5555, 22.0505
Should not match: 1000, 2.27, -2.5050502
d. Match an RNA sequence that begins with "AUG" and ends with either of
"UAA","UAG", or "UGA". Both upper case and lower case letters are
allowed.
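One way to test a regular expression against the "should match" and "should not match" lists is a small harness like the sketch below. The helper name count_failures is hypothetical, and the pattern shown for (a) is only one possible answer, not the expected one.

```perl
use strict;
use warnings;

# Count how many test strings a pattern classifies incorrectly.
# $pattern  - a compiled regex (qr//)
# $should   - array ref of strings that must match
# $shouldnt - array ref of strings that must not match
sub count_failures {
    my ($pattern, $should, $shouldnt) = @_;
    my $failures = 0;
    for my $s (@$should) {
        next if $s =~ $pattern;
        print "FAIL: '$s' should match\n";
        $failures++;
    }
    for my $s (@$shouldnt) {
        next if $s !~ $pattern;
        print "FAIL: '$s' should not match\n";
        $failures++;
    }
    return $failures;
}

# One possible pattern for (a): first digit is 2, 4, 6 or 8,
# followed by any number of even digits (0 allowed here).
my $even_digits = qr/^[2468][02468]*\z/;

my $failed = count_failures($even_digits, [qw(248 4200 6)], [qw(100 020 5)]);
print "Failures: $failed\n";
```

The anchors ^ and \z matter here: without them the pattern would also accept strings that merely contain an even-digit number, such as 5248.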
2. Write a script that reads a phone book file – name, address and phone number,
separated by semicolons (;). Here are two examples:
Yael Ginsburg;19 Herzel St. Dimona;08-3792999
Rahel Levi;36 Yefet St. Tel Aviv-Yafo;03-6447338
Print out all family names in the 03 region.
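As a rough sketch of the record handling, each line can be split on the semicolon and the phone field checked for its area code. The helper name family_name_in_region is hypothetical, and the code assumes the family name is the last word of the name field, which the exercise does not state. The records are held in an array here instead of being read from a file.

```perl
use strict;
use warnings;

# Return the family name if the record's phone number starts with the
# given area code; otherwise return nothing (undef in scalar context).
sub family_name_in_region {
    my ($line, $area_code) = @_;
    chomp $line;
    my ($name, $address, $phone) = split /;/, $line;
    return unless defined $phone && $phone =~ /^\Q$area_code\E-/;
    my @name_parts = split ' ', $name;
    return $name_parts[-1];    # assumption: family name is the last word
}

my @records = (
    'Yael Ginsburg;19 Herzel St. Dimona;08-3792999',
    'Rahel Levi;36 Yefet St. Tel Aviv-Yafo;03-6447338',
);

for my $record (@records) {
    my $family = family_name_in_region($record, '03');
    print "$family\n" if defined $family;
}
```

With a real phone book file, the loop would run over the lines of an opened filehandle in place of the @records array.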
3. Write a script that reads a FASTA file of DNA sequences and finds the first
open reading frame in each one, if one exists.
Note: a reading frame starts with "ATG", contains any number of codons
(nucleotide triplets) and ends with either "TAA","TAG", or "TGA".
a. Print the coding sequence you found.
b*. Print the positions of the beginning of the methionine codon and the
last nucleotide of the stop codon.
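The definition in the note can be written almost directly as a regular expression. The sketch below is one possible approach, not the required solution: it works on a single sequence string (FASTA reading is left out), the helper name first_orf is hypothetical, and it assumes the sequence contains only A, C, G and T. Positions come from Perl's built-in @- and @+ arrays and are reported 1-based.

```perl
use strict;
use warnings;

# Find the first open reading frame in a DNA string.
# Returns (orf, start, end) with 1-based positions, or an empty list.
sub first_orf {
    my ($dna) = @_;
    $dna = uc $dna;
    # ATG, then as few whole codons as possible, then an in-frame stop codon.
    if ($dna =~ /(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA))/) {
        my $start = $-[1] + 1;    # position of the A of ATG
        my $end   = $+[1];        # position of the last stop-codon nucleotide
        return ($1, $start, $end);
    }
    return;
}

my ($orf, $start, $end) = first_orf('ccatgaaatggtagtt');
print "$orf $start $end\n" if defined $orf;    # prints "ATGAAATGGTAG 3 14"
```

The non-greedy quantifier *? makes the match stop at the first in-frame stop codon. In the example, the TGA that overlaps the start codon is skipped because it is out of frame.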
4*. Now also search for an open reading frame on the opposite strand. Can you
find ALL possible reading frames on both strands?
Note: the opposite strand is the "complement" (obtained by pairing nucleotides A-T
and G-C), read in reverse order.
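The opposite strand described in the note can be computed with Perl's reverse function and the tr/// operator. This is a minimal sketch: the helper name reverse_complement is hypothetical, and characters other than A, C, G and T (for example N) are left unchanged.

```perl
use strict;
use warnings;

# Return the reverse complement of a DNA string.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;    # scalar context reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # swap A<->T and C<->G, keeping case
    return $rc;
}

print reverse_complement('ATGCCGTAA'), "\n";    # prints "TTACGGCAT"
```

Any search written for the given strand can then be run again on the returned string.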
5. Write a script to read and parse a GenBank genomic record (use the adenovirus
GenBank record available from the course site).
Find the lines of coding sequence annotation (CDS), then extract and print the separate
coordinates (get each number into a separate variable). Try to extract them
correctly for as many CDS lines as you can!
Note: There could be several coordinates in a CDS line,
e.g., CDS join(503..1070,1145..1377),
where 503..1070 and 1145..1377 are 2 sets of coding sequence coordinates.
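A global match in a while loop is one way to pull every start..end pair out of a location string. This sketch covers only the location text itself. The helper name cds_coordinates is hypothetical, and locations that continue onto the next line of the record are not handled. It tolerates the < and > markers that GenBank uses for partial features.

```perl
use strict;
use warnings;

# Extract all start..end pairs from a CDS location string.
# Returns a list of [start, end] array references.
sub cds_coordinates {
    my ($location) = @_;
    my @ranges;
    while ($location =~ /[<>]?(\d+)\.\.[<>]?(\d+)/g) {
        my ($start, $end) = ($1, $2);    # each number in its own variable
        push @ranges, [$start, $end];
    }
    return @ranges;
}

for my $range (cds_coordinates('join(503..1070,1145..1377)')) {
    my ($start, $end) = @$range;
    print "start=$start end=$end\n";
}
```

For the example location this prints start=503 end=1070 and then start=1145 end=1377.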
6*. For each CDS, extract and print the coding sequence of the gene from the
FASTA file of the genome sequence (available on the course website).
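Once the coordinates are known, substr can cut the matching pieces out of the genome string. The sketch below uses a short made-up sequence in place of the real genome, and the helper name extract_cds is hypothetical. It assumes 1-based inclusive coordinates on the given strand only; a location wrapped in complement(...) would also need the opposite strand.

```perl
use strict;
use warnings;

# Concatenate the segments of $genome given as [start, end] pairs
# (1-based, inclusive), in the order supplied.
sub extract_cds {
    my ($genome, @ranges) = @_;
    my $cds = '';
    for my $range (@ranges) {
        my ($start, $end) = @$range;
        # substr is 0-based, so shift the start by one
        $cds .= substr($genome, $start - 1, $end - $start + 1);
    }
    return $cds;
}

my $toy_genome = 'AAATGCCCTTTGGGTAGAA';    # made-up sequence, not real data
print extract_cds($toy_genome, [3, 5], [12, 17]), "\n";    # prints "ATGGGGTAG"
```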