Download In-Lab Handout

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Ancestral sequence reconstruction wikipedia , lookup

Molecular evolution wikipedia , lookup

Expanded genetic code wikipedia , lookup

Genetic code wikipedia , lookup

Biosynthesis wikipedia , lookup

Point mutation wikipedia , lookup

Nucleic acid analogue wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Transcript
BIO 102 Renn_Lab#1 (in-lab handout)
Name ___________________
Intro-bio 102 Lab #1 Perl Programming for Pattern Matching.
There are 9 exercises in total:
USE PERL PROGRAM regex.pl
1. Scrabble in English
2. “Scrabble” in DNA Sequences
3. Direct Repeats
4. Direct Repeats in DNA Sequences
5. Mirror Repeats also called palindromes in English (English Word Play)
6. Mirror Repeats in DNA Sequences
USE PERL PROGRAM book_search.pl
7. Pattern Matching in Large Texts.
USE PERL PROGRAM omic_search.pl
8. Pattern Matching for Protein Sequences
9. Pattern Matching for Protein Sequences in DNA
There is more if you finish early and are still enjoying yourself!
Note: If you run a Perl program and it runs on and on ... and on and on, you’ll need to
HALT the program.
Hold down the keys: Option-Apple-Escape, select the TextWrangler application, and
click "Force Quit".
In your lab notebook keep a record of the regex’s that you try. Keep track of
whether or not they worked the way you expected them to. Write down your errors
as diligently as your successes, these are what you will learn from. Whenever you see
gray highlighted text, you should be writing in your lab notebook. At the end of lab,
you will trade notebooks to learn from each other’s efforts.
__ Logon to your computer using your own name and password
__ Connect to the courses server and open the week1_Perl_lab folder in Behavior-Renn.
__ Drag the folder regexPlay onto your desktop.
__ Open this folder and you should see the following files:
anagram.pl
book_search.pl
Perl program to use regex to find anagrams
Perl program to use regex to return full
paragraphs
ecoli_K12_genes.fasta
DNA coding sequence for predicted E.coli
proteins in FASTA format
ecoli_K12_proteins.fasta amino acid sequence for predicted E.coli
proteins.
ecoli.txt
7-mers found in E.coli genome
english.txt
English dictionary (list of words one per
line)
omic_search.pl
Perl program to use regex to return genes
with specified pattern
origin.txt
Full text of Darwin’s “Origin of species”
regex.pl
Perl program to use regex to find patterns
Perl is already installed on the intro-bio lab computers for you.
pg. 1 of 14
BIO 102 Renn_Lab#1 (in-lab handout)
Name ___________________
__ Open the application TextWranglerTM from the Dock (Gold “W”).
__ From the FILE menu in TextWrangler, select Open (or use o) and open the Perl
program: regex.pl from the folder on your desktop. The Perl program will appear in
your programming environment. It will be color coded according to Perl syntax. The text
that you will work with will be the color coded pink text under the words “ENTER
HERE”. Notice the “#” sign is used to frame the code and include comments.
__ You should also see a “Documents Drawer” on the right. If not, go to the View menu
and select Show Documents Drawer.
__ Before making any changes, make your own version of this script by saving it as
regex_yourname.pl From the File menu select Save As, give the file its new name,
check that you are saving it in the regexPlay folder on the desktop.
__ If you do not see line numbers next to the code use the Edit menu to open Text
Options and check the box under Display: for Line Numbers.
This is the only section you will be changing:
######################################################################
# ==========ENTER HERE============
my $regex
=
'q[^u]';
my $filename
# my $filename
=
=
"english.txt";
"ecoli.txt";
######################################################################
(note: The word “my” is Perl’s way of initializing variables. It is necessary here.)
To write a new regex, you will change the pink text between the single quotes.
Note: you must keep the single quotes!
The initial regex supplied in this Perl program means:
“match all words that contain a ‘q’ that is followed by a letter that is not ‘u’.”
____You must also choose the file to search:
either the English dictionary (english.txt)
or the list of 7-mers in E.coli (ecoli.txt).
You will make this selection by putting a comment (#) symbol in front of the line that
you do not want to use.
The initial program is set to search the file "english.txt", thus there is a # to
“comment out” (shut off) the "ecoli.txt” file.
Note: After you change the regex expression, you must SAVE your changes to the
file before you re-run the new program. Your program is NOT saved if it has a dark
diamond next to the title in the Documents panel. You can save your Perl Program
with S or using the pulldown menu from File.
____To run your program from TextWrangler use the #! Menu and select Run
(this combination of symbols is call hashbang or she-bang).
Your program will run, and a new file with your results will open.
You will see a new document name added to the Documents panel called “Unix Script
Output.” This file is not currently saved, and you do not need to save it. Subsequent runs
pg. 2 of 14
BIO 102 Renn_Lab#1 (in-lab handout)
Name ___________________
will be added below, and it maybe a very long file by the end of the day. You should
record all of your coding attempts in your lab notebook as you work. When your
attempted code fails to give the expected results, try to figure out why and correct it.
Record this learning process in your lab notebook.
___ Run your program now before you change anything in the code.
How many words were there with a ‘q’ that is followed by a letter that is not ‘u’. (Write
your findings in your lab notebook.).
____To Return to the script, click on the document icon to the left of your saved
regex_yourname.pl in the document drawer, or use the back-arrow button
Backarrow
button
Document Drawer
pink regex text
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Yellow highlighted
text at cursor
Open diamond = saved file
Black diamond = unsaved file
Bold text = working file
OUTPUT of initial program:
counts of words
returned
Work your way through the rest of the handout.
Always start at the top of each left hand page.
This is a demonstration using the English dictionary and English letters. Each
page then contains a “going back for more” that you must solve, and record your
solution in your lab notebook. You are asked to be creative using the regular
expressions that you have learned on each page.
Then, move to the facing page, the right hand page.
This is a demonstration using DNA or Protein code rather than English.
Again, there is an example followed by one or more assigned problems, and an
opportunity to be creative.
The answers are available from your instructor.
Working with your fellow students is encouraged.
Please ask for and offer help to each other before looking up the answer.
pg. 3 of 14
BIO 102 Renn_Lab#1 (in-lab handout)
Name ___________________
Scrabble in English
Question: Are there any words in an English dictionary where the letter ‘q’ is not
immediately followed by the letter ‘u’?
Pattern to match expressed in English: Search for two adjacent characters: the letter ‘q’
immediately followed by any character that is not the letter ‘u’.
[…]
[^…]
{n}
{n,}
{x,y}
[…]{x,y}
|
match any one of the characters in the set
match any one character that is not in the set
match any of the characters in the set exactly n times
match any of the characters in the set at least n times
match at least x and not more than y times
these can be combined
match any of the characters in the set at least x and not
more than y times
a vertical bar separates alternatives
this is OR in Boolean logic.
regex = ‘q[^u]’
(1) Iraqi
(2) Iraqis
(3) Qatar
Note: In these facing page examples, we are assuming that the Perl program is instructing
each regex to ignore the case of letters. You need not worry about upper vs. lower case in
your regex. In the example above, “q” is really handling both upper and lower case ‘q’.
This is part of the work that the Perl program is doing for you. After you play with regex,
we can look at the code if you are interested.
Going back for more:
Select regex_yourname.pl in the Documents Drawer to return to your Perl script.
Edit your new regex in pink text between the single quotes, and then save your changes
before running the script from the hash-bang menu #! . If you forget to save, it will run
the previous regexs.
(1a) All words that contain the three-letter string “ghi”.
Note: Do not confuse this with the request to find words with any one of the letters ‘g’, ‘h’, or
‘i’. [ghi] is not the regex you want. [ghi] finds words with one of the letters in the set.
(1b) All words that contain “fin” or “phin”.
(1c) All words with ‘yz’ that are not immediately followed by an ‘e’ or an ‘i’ .
(1d) All words with.... (make up your own).
pg. 4 of 14
BIO 102 Renn_Lab#1 (in-lab handout)
Name ___________________
“Scrabble” in DNA Sequences
Our context for DNA sequences is a file with one DNA motif (or word) per line. The file
is comprised of 16,384 7-mers, each line holding one DNA sequence or motif of length
seven(7), where a motif is defined as “a pattern with putative biological meaning”. The 7mers were collected by us from NCBI’s publicly available DNA sequence for the E.coli
bacteria.
Change the file that will be searched by inserting and deleting the hash sign # as shown.
# my $filename
= "english.txt";
my $filename = "ecoli.txt";
Question: Are there any 7-mers in E. coli where the nucleotide ‘G’ (guanine) is not
immediately followed by the nucleotide ‘T’ (thymine)?
Search for two adjacent nucleotides: guanine ‘G’ followed by any nucleotide that is not
thymine ‘T’.
regex = ‘G[^T]’
TAAAGAA
ACGTGCC
GATATTT … 11267 total
The program is checking All Motifs for a G followed by a letter that is not T
TCAGTGT no
GTTCACG no
TAAAGAA yes
AGTAGTG no
CTTTTTT no
ACGTGCC yes
ACTCATT no
etc
Going back for more:
(2a) All 7-mers that contain GTGAC.
(2b) All 7-mers that contain the dimer GC, and this dimer is immediately followed by a
letter that is not a G nor a C.
(2c) All 7-mers where ... (write your own)?
pg. 5 of 14
BIO 102 Renn_Lab#1 (in-lab handout)
Name ___________________
Direct Repeats
Question: Are there any words in which a sequence of three letters is repeated
elsewhere in the word?
Note: Don’t forget to switch back to the English text.
Search for words where a sequence of three letters is repeated. This pattern might appear
in the middle of a longer word.
.
(.)
.*
.+
.?
\1
\2
\3
Match any character except a new line
Match any character and remember it
Match any character zero or more times
Match any character one or more times
Match any character zero or one time
Recall the character from the 1st match
Recall the character from the 2nd match
Recall the character from the 3rd match
regex = ‘(.)(.)(.).*\1\2\3’
(1) acclimatization
(2) agglutinating
(3) aggressiveness
(4) Albuquerque
(5) alfalfa
(6) allegorically
(7) amalgamate
(8) amalgamated
(9) amalgamates
(10) amalgamating … 305 total
Going back for more:
(3a) Are there any words in which a sequence of two letters is directly repeated at least
three times in the word?
(3b) Are there any words ... (write your own)?
(Note, “directly” refers to “in the same order”. It is different than “immediately” in that it
allows for additional characters to come between the repeated motif.) (How would you
change the regex to require an immediately repeated 3 letter motif?)
pg. 6 of 14
BIO 102 Renn_Lab#1 (in-lab handout)
Name ___________________
Direct Repeats in DNA Sequences
Question: Are there any 7-mers in the DNA sequence of E.coli where a sequence of
three nucleotides is repeated elsewhere in the motif?
Search for motifs where a sequence of three nucleotides is repeated.
regex = ‘(.)(.)(.).*\1\2\3’
1) AAAAAAA
(2) AAAAAAC
(3) AAAAAAG
(4) AAAAAAT
(5) AAACAAA
(6) AAACAAC
(7) AAAGAAA
(8) AAAGAAG
(9) AAATAAA
(10) AAATAAT … 700 total
Going back for more:
(4a) Are there any motifs in which a sequence of two nucleotides is repeated at least three
times in the motif?
(4b) Are there any motifs ... (write your own)
pg. 7 of 14
BIO 102 Renn_Lab#1 (in-lab handout)
Name ___________________
Mirror Repeats (also called palindromes in English)
Question: Are there any words in which the first three letters of the beginning of a
word are the same as the reverse of those letters at the end of the word?
Search for words where the initial sequence of three letters ends with the reverse of those
letters.
regex = ‘^(.)(.)(.).*\3\2\1$’
(1) despised
(2) detected
(3) deteriorated
(4) detested
(5) foolproof
(6) Greenberg
(7) Hannah
(8) redder
(9) reviver
(10) revolver
(11) rotator
remember:
^ab
Match ‘ab’ at the beginning of a word
xyz$ Match ‘xyz’ at the end of a word
\2
Recall the character from the 2nd match
Going back for more:
(5a) Are there any words in which three consecutive letters anywhere in a word are
followed by the reverse of those letters anywhere in the word? (Hint: you do not need ^
(start of word) and $ (end of word) for this).
(Note: the following two examples combine both direct and mirror patterns)
(5b) Are there any words in which a pair of letters is followed by a direct repeat of those
pair of letters followed by two occurrences of the reverse of the pair of letters? Each of
the pairs can be zero or more letters apart), e.g., “AB...AB...BA...BA”
(5c) Are there any words in which this pattern occurs twice: a pair of letters is followed
by its reverse of those pair of letters? (Each of the pairs can be zero or more letters apart).
pg. 8 of 14
BIO 102 Renn_Lab#1 (in-lab handout)
Name ___________________
Mirror Repeats in DNA Sequences
Question: Are there any motifs in which the first three nucleotides of the beginning
of a motif are the same as the reverse of those letters at the end of the motif?
Search for motifs where the reverse of the initial three nucleotides is repeated at the end
of the motif.
regex = ‘^(.)(.)(.).*\3\2\1$’
1) AAAAAAA
(2) AAACAAA
(3) AAAGAAA
(4) AAATAAA
(5) AACACAA
(6) AACCCAA
(7) AACGCAA
(8) AACTCAA
(9) AAGAGAA
(10) AAGCGAA … 256
remember:
^ATG
TAC$
(.)
\2
Match ‘ATG’ at the beginning of a motif
Match ‘TAC’ at the end of a motif
Match any character and remember it
Recall the character from the 2nd match
Going back for more:
(6a) Are there any motifs in which three consecutive nucleotides anywhere in a motif are
followed by the reverse of those nucleotides anywhere in the motif?
(Hint: you do not need ^ (start of motif) and $ (end of motif) for this).
(Note: the following two examples combine both direct and mirror patterns)
(6b) Are there any motifs in which a pair of adjacent nucleotides is followed by a direct
repeat of those pair of nucleotides followed by two occurrences of the reverse of the pair
of nucleotides? (Each of the pairs can be zero or more nucleotides apart), e.g.,
“GT...GT...TG...TG”.
(6c) Are there any motifs in which this pattern occurs twice: a pair of adjacent
nucleotides is followed by its reverse of those pair of nucleotides?
(Each of the pairs can be zero or more nucleotides apart).
pg. 9 of 14
BIO 102 Renn_Lab#1 (in-lab handout)
Name ___________________
Pattern Matching in Large Texts.
Perl can be a powerful tool for searching and manipulating larger text documents.
We have downloaded the full text of The Origin of Species by means of Natural
Selection, 6th Edition by Charles Darwin.
You will write regex’s to find interesting words in this text.
Choose words or concepts from Bob Kaplan’s lectures first semester.
In TextWrangler close (w) the regex.pl script and open (o) book_search.pl
This script will return the entire paragraph that contains your interesting word so
you can learn how Charles Darwin used these terms.
Search the text origin.txt
You could search any other downloaded text by changing line #13 in the program.
(my $filename = "origin.txt";
#enter the text file name here)
(9a) How many times do you think Darwin used the word “evolution” in his text?
Write your guess in your lab notebook.
(9b) Write a regex to find the correct answer.
This would be easy to do in a simple text program like Word using the Find tool, but
now try to find any term related to evolution (evolve, evolution, evolved).
(9c) Write a regex that will find any term related to evolution.
(9d) How many are there?
(9e) Look carefully. Did you get the words you weren’t expecting (“revolved”
“revolution”)? Look back at your answer to 9b now also
(9f) Write a regex that will avoid these words? (Hint \s means “white space” or \b means
“word boundary”)
Several special characters can be helpful regex tools in this type of search in Perl.
\n new line character
\t tab
\s whitespace
\S non-whitespace
\d digit
\D non-digit
\b word boundary
\B not word boundary
If you are finding it difficult to find your word in the full printed paragraph, comment out
line 52 and use line 53 by removing that comment mark.
52
53#
print @paragraph;
print "$match\n";
# and print the paragraph
# will print only the line that matched
pg. 10 of 14
BIO 102 Renn_Lab#1 (in-lab handout)
Name ___________________
Pattern Matching for Protein Sequences
Remember Janis Shampay’s lectures last semester?
Proteins are a string of amino acids of which there are 20. Each amino acid has a three
letter abbreviation but also a one letter abbreviation used for writing protein sequences.
* G - Glycine (Gly)
* P - Proline (Pro)
* A - Alanine (Ala)
* V - Valine (Val)
* L - Leucine (Leu)
* I - Isoleucine (Ile)
* M - Methionine (Met)
* C - Cysteine (Cys)
* F - Phenylalanine (Phe)
* Y - Tyrosine (Tyr)
* W - Tryptophan (Trp)
* H - Histidine (His)
* K - Lysine (Lys)
* R - Arginine (Arg)
* Q - Glutamine (Gln)
* N - Asparagine (Asn)
* E - Glutamic Acid (Glu)
* D - Aspartic Acid (Asp)
* S - Serine (Ser)
* T - Threonine (Thr)
What amino acid sequence would be represented by the protein sequence GENE?
In TextWrangler close (w) book_search.pl; open (o) omic_search.pl
This program will return either DNA or protein sequence for genes. It is expecting a
FASTA formatted file, rather than a book.
Use the file E_coli_K12_proteins.fasta by commenting the appropriate line.
If you want to see only one matching line of protein code, comment out line 43 and use 44.
(10a) Write a regex to find how many proteins in E.coli include the protein code
GENE.
(10b) Write a regex to find how many proteins in E.coli include the amino acids
DARWIN in any order.
(10c) Write a regex to find proteins in E.coli include …. (make up your own).
(10d) Is your name part of E.coli proteome? (If you don’t know what the word proteome
means, look it up in Wikipedia quickly http://en.wikipedia.org/wiki/Proteome ).
Regular expressions can be used to search for secondary structure.
Different amino acids have different chemical characteristics. The secondary structure of a
protein depends on the number and spacing of these different chemical characteristics.
Different amino-acid sequences have different propensities for forming -helical structure.
Methionine, alanine, leucine, glutamic acid, and lysine ("MALEK" in the amino-acid 1letter codes) all have high helix-forming propensities. Proline (“P”) tends to break or kink
helices. However, proline is often seen as the first residue of a helix.
(10e) Write a regex for an amino acid pattern that is likely to form a long (10 amino
acid) alpha helix secondary structure.
There are many web-based tools that search for complex protein or DNA patterns such as
Neuropeptide cleavage predictors (see appendix for an example). These tools rely on
programming in languages like Perl or Python. One advantage of learning to write your
own programs is that you can search multiple sequences (or whole genomes as you have
done today). You do not need an internet connection, you do not have to wait for results,
and you can format the output to be compatible with your own files. You can adjust the
parameters. Often it is possible to obtain the program code that is used by these web-based
tools, and you can then adjust specific parameters to meet your own needs.
pg. 11 of 14
BIO 102 Renn_Lab#1 (in-lab handout)
___________________
Name
Pattern Matching for Protein Sequences in DNA
Each amino acid is encoded by 3 letters in the genomic DNA (translated via mRNA).
These three bases are called a “codon”. Remember that some of the amino acids can be
encoded by more than one codon, so the code is called “degenerate”.
Do you remember the Universal Genetic Code? Of course not, nobody does.
Use the Perl script omic_search.pl
Use the file E_coli_K12_genes.fasta by commenting out the appropriate file line.
A regular expression for alanine is
regex = ‘gc[acgt]’
A regular expression for Arginine is a bit tougher
remember the vertical line symbol | means OR
regex= ‘(cg[acgt])|(ag[ag])’
(11a) Write a regular expression that will search the genomic sequence for the
possible amino acid sequence GENE.
(11b) Do you find this sequence the same number of times in the DNA as you do when
you search the Protein file for GENE? If not, what is the reason?
(11c) Write a regular expression to find a Lysine rich region (3 or more in a row).
Or make it harder. Find a region where every other amino acid is a Lysine,
Think about how you could specify which amino acids are allowed to be interspersed
with the Lysines.
(11d) Why might a researcher be interested in looking for secondary structure in a
given DNA sequence rather than looking directly at an amino acid sequence?
(11e) Why might a researcher want to write their own program rather than using a
web-based tool?
pg. 12 of 14
BIO 102 Renn_Lab#1 (in-lab handout)
___________________
Name
When you have finished
__The answer key is available in lab if you wish to check your own work.
__There is an additional exercise on finding anagrams if you are having fun, have finished
early and want to try more difficult regular expressions.
__If you want to save your output file, use the Save As command, give it a name and save
it to your Home Server. This file contains only the output, not the regular expressions you
used to get what you were looking for. Your lab notebook is the real record of your efforts.
__Before you leave, find another student who has also finished the exercises. Trade
notebooks with this student. Complete the post-lab evaluation form while you learn what
creative regular expressions your classmate has attempted.
__Take your own lab notebook home with you.
__Leave your post-lab evaluation form with Carey or Ned.
Appendix
This lab is based on Exercise 3 from “Programming in PERL for Biology” a text by Mark
LeBlanc and Betsy Dreyer from Wheaton College. The full text will be published later in
2007 and is recommended for any student wishing to go further with this work. Until that
book is published, I recommend Beginning Perl for Bioinformatics from O'Reilly &
Associates as it addresses the needs of biologists who want to learn Perl programming.






If you want to install Perl on your own computer; it comes installed with Mac OS X or it
can be downloaded for free from http://www.perl.com/download.csp
If you want to install it on your own computer, TextWrangerTM is available at
http://www.barebones.com/products/textwrangler/download.shtml TextWrangler is a
freeware programming environment provided by BBEdit. TextWranglerTM is a “lighter”
version of BBEdit®, the industry standard Perl programming environment for MAC OS
X.
If you want to learn more Perl, there are several books available as well as several online
courses. Perl is open source language and many scripts and modules can be freely found
and downloaded from the web. One good online tutorial that I have seen is available at
http://bioportal.weizmann.ac.il/course/prog/
The english.txt dictionary and 7mer text that you used in class can be downloaded from
http://cs.wheatonma.edu/mleblanc/dna/
The full text of “On the Origin of Species” (and other works) can be downloaded for
free from project Gutenberg at http://www.gutenberg.org/etext/2009 There are
>20,000 free books for download from project Gutenberg. Project Gutenberg is the first
and largest single collection of free electronic books, or eBooks. It was founded by
Michael Hart, who invented eBooks in 1971. http://www.gutenberg.org/wiki/Main_Page
Whole genomes, collections of genes, protein coding regions, upstream regulatory
regions etc. can be downloaded from many sources. For this lab, the DNA coding
sequence file and the protein file for E. coli strain K12 were downloaded from
http://cmr.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi The Institute for Genome
Research, Comprehensive Microbial Resource. C elegans genomes can downloaded
pg. 13 of 14
BIO 102 Renn_Lab#1 (in-lab handout)
___________________

Name
from biomart http://www.biomart.org/index.html if you want to play with larger
genomes.
NeuroPred is a web application designed to predict cleavage sites at basic amino acid
locations in neuropeptide precursor sequences. The user can study one amino acid
sequence or multiple sequences simultaneously, selecting from several prediction
models and optional, user-defined functions. http://neuroproteomics.scs.uiuc.edu/cgibin/neuropred.py http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1538825
FASTA format
In bioinformatics, FASTA format is a text-based format for representing either nucleic acid
sequences or protein sequences, in which base pairs or amino acids are represented using
single-letter codes. The format also allows for sequence names and comments to precede
the sequences. The simplicity of FASTA format makes it easy to manipulate and parse
sequences using text-processing tools and scripting languages like Perl.
A sequence in FASTA format begins with a single-line description, followed by lines of
sequence data. The description line is distinguished from the sequence data by a greaterthan (">") symbol in the first column. The word following the ">" symbol is the identifier
of the sequence, and the rest of the line is the description (both are optional). The sequence
ends if another line starting with a ">" appears; this indicates the start of another sequence.
pg. 14 of 14