BIO4342 Exercise 1: Detecting and Interpreting
Genetic Homology
Jeremy Buhler
February 25, 2009
In this lab, we’ll annotate an interesting piece of the D. melanogaster genome. Along the way,
you’ll get some practice running command-line BLAST and reading its output. You’ll also have to
do some interpretation of the results to figure out what is going on.
To begin the lab:
1. Log into the BIO 4342 server (gander.wustl.edu).
2. Create a working directory (e.g. “lab1”) and cd into it.
3. Copy the following files to your working directory:
~jbuhler/lab1seq1.fna
~jbuhler/lab1seq2.fna
The file lab1seq1.fna contains a FASTA-formatted DNA sequence, which represents roughly
100 kilobases from chromosome X of D. melanogaster. The file lab1seq2.fna contains a much
smaller subsequence (about 4500 bases) from this region that you will use in Section 2.
Much of this lab consists of questions, which you should answer as you go. You can use your
lab notebook or simply open a text file on your laptop to write your answers, so long as you save
them somewhere. You should also write down somewhere the exact BLAST and RepeatMasker
commands you use, so that you can refer to them later if there is any question about how you ran
these programs.
1 Finding Interspersed Repeats
Before we go hunting for genes in a sequence, we should first annotate its repetitive elements using
RepeatMasker. If you haven’t yet done so, run RepeatMasker |& more at the command line to
see the program’s range of possible options.
RepeatMasker can use any of several repeat libraries, depending on what kind of sequence you
are annotating. Our sequence is from a fruit fly, so we’ll use RepeatMasker’s Drosophila repeat
library (command line option -species drosophila). Note that you must specify the library –
the default is for primates, and you don’t want to waste time looking for Alus in your fly sequence!
You can see other supported phylogenetic clades for the -species option by running RepeatMasker
with no inputs and reading the help screen.
By default, RepeatMasker also masks out simple repeats (e.g. dinucleotides and trinucleotides),
as well as some so-called low-complexity DNA. Low-complexity sequences may not have highly
repetitive structure, but they consist primarily of one or two out of the four possible nucleotides.
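To make the idea of low complexity concrete, the following Python sketch scores a window by the fraction of its bases that belong to its two most common nucleotides. It only illustrates the definition above; it is not the algorithm that RepeatMasker or BLAST actually uses, and the function name is hypothetical.

```python
from collections import Counter


def top_two_fraction(window: str) -> float:
    """Fraction of a window made up of its two most common nucleotides."""
    counts = Counter(window.upper())
    top_two = sum(n for _, n in counts.most_common(2))
    return top_two / len(window) if window else 0.0


# A stretch of only A and T is dominated by two letters; a mixed window is not.
print(top_two_fraction("ATATAAATTTATAATA"))  # every base is A or T
print(top_two_fraction("ACGTACGTACGTACGT"))  # each base makes up a quarter
```

A window scoring close to 1.0 fits the description of low-complexity DNA; a window near 0.5 has roughly even composition.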
Question 1: Why might it be a good idea to remove low-complexity DNA from a
sequence before running blastn? Why might it be a bad idea to do so before running
blastx? (Hint: consider proteins such as collagen with highly regular sequences.)
BLAST automatically does low-complexity filtering of DNA when appropriate, so we’ll tell
RepeatMasker not to do so by using its -nolow option.
Now that we know what to do, let’s do it:
RepeatMasker -species drosophila -nolow lab1seq1.fna
After you run this command, you should have several useful files:
• a copy of the original sequence with its repeats replaced by Ns, in lab1seq1.fna.masked;
• a summary of the repetitive elements found in the sequence, in lab1seq1.fna.tbl;
• a detailed list of repetitive elements found, in lab1seq1.fna.out.
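As a hedged aid for Question 2, the Python sketch below tallies the repeat class/family column of a RepeatMasker .out file. It assumes the usual layout: header lines followed by whitespace-separated rows whose eleventh column is the repeat class/family. Check this against your own file before relying on it. The sample rows and element names are made up for illustration.

```python
from collections import Counter


def count_repeat_classes(out_text: str) -> Counter:
    """Tally the class/family column of RepeatMasker .out text."""
    counts = Counter()
    for line in out_text.splitlines():
        fields = line.split()
        # Data rows start with an integer score; skip headers and blank lines.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        counts[fields[10]] += 1
    return counts


# Made-up rows in the assumed layout, for illustration only.
sample = """\
   SW   perc perc perc  query     position in query    matching  repeat      position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family begin end (left) ID

  900   5.0  0.0  0.0   seq1      100   400   (99600) +  ELEM_A  LTR/Gypsy    1    300  (0)    1
  450   9.1  1.2  0.3   seq1      5000  5200  (94800) C  ELEM_B  DNA/P        (10) 200  1      2
  300   2.0  0.0  0.0   seq1      9000  9100  (90900) +  ELEM_A  LTR/Gypsy    1    100  (200)  3
"""
print(count_repeat_classes(sample))
```

To apply it to your own results, pass in the contents of lab1seq1.fna.out, for example count_repeat_classes(open("lab1seq1.fna.out").read()).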
Question 2: How many repetitive elements does your sequence contain, and what are
their types?
In the next section, we will be working with your other sequence, lab1seq2.fna. Go ahead
and run RepeatMasker on that sequence now with the same options as above.
Question 3: What is your result? Given the length of this sequence, would you expect
the same result if it had come from, say, a primate?
2 BLASTX: the Gene Hunter
We now have a number of options for how to proceed. We could look for matches to our sequence
at either the DNA or the protein level, using any one of several databases. In deciding which
comparison tool to use to begin annotating our sequence, we should consider a few factors:
1. How sensitive will the comparison be? Is it likely to find genes or other meaningful features
in our sequence?
2. How specific will the matches returned by our tool be? Will they cover the entire region (as
might be the case for a Drosophila genomic clone), or will they be confined to specific features
of interest?
3. How good is the information associated with any matches we may find? Will we be able to
interpret those matches?
4. How long will the tool take to run?
Taking all these factors into consideration, a reasonable first analysis for any organism is to
compare the DNA sequence to the Swissprot protein database using blastx. Although blastx is
more expensive than most other types of BLAST search, it is both sensitive and specific to coding
DNA and so should give us a good picture of potential genes in the sequence without a lot of
other clutter. We could increase our chances of seeing a match by searching against all proteins in
Genbank (the so-called “nr,” or nonredundant, protein database). However, any matches we see
in Swissprot will come with lots of information about the protein that matched, while the average
quality of information in protein nr hits is often much lower.
NB: there are also specialized databases for fruit fly, in particular FlyBase. For the moment,
we’ll just use the generic databases, but feel free to poke around at http://www.flybase.org/.
Let’s set up the BLAST command line for this search.
• The program to be used is blastall.
• We want to perform a blastx search, so use the option -p blastx.
• We want to produce HTML output, so add the flag -T.
• The -i option specifies query sequence, in this case lab1seq2.fna.
• The -d option specifies the database, in this case Swissprot. On our server, this database is
simply called swissprot; BLAST knows where to find it on the machine.
• We should save the output of BLAST in a file. You can either use UNIX redirection or specify
the -o option, followed by an output filename. The output file’s name should end in .html
so that the web server recognizes it as HTML.
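The options listed above can be assembled into a single command line. The following Python sketch only builds and prints the command rather than running it. The output filename lab1seq2.blastx.html is an arbitrary choice, not something the lab prescribes, and the flags are exactly those described in the list.

```python
import shlex
from typing import List


def blastall_command(program: str, query: str, database: str, output: str) -> List[str]:
    """Build a blastall argument list from the options described above."""
    return [
        "blastall",
        "-p", program,   # which BLAST program to run
        "-T",            # HTML output
        "-i", query,     # query sequence file
        "-d", database,  # database name
        "-o", output,    # output file, ending in .html
    ]


cmd = blastall_command("blastx", "lab1seq2.fna", "swissprot", "lab1seq2.blastx.html")
print(shlex.join(cmd))
```

Printing the command before running it also gives you the exact text to record in your notes, as requested at the start of the lab.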
Go ahead and run the BLAST search now.
To view your BLAST output, copy your HTML output file to the subdirectory “public_html”
of your home directory. You can then access this file via the web server on gander. For example,
if your output was called foo.html, you could say “cp foo.html ~/public_html”, then view the
file via the URL http://gander.wustl.edu/~your_userid/foo.html.
Question 4: How many BLAST hits to distinct sequences were returned? What are
the best and worst E-values reported? Are the matches with poor E-values consistent
with those with better E-values?
When searching a large database, it’s good practice to ignore matches with poor E-values. In
principle, a match with an E-value less than 1 is unlikely to occur by chance alone. However, you
should allow a large margin of safety in interpreting E-values, mainly because the probabilistic
model on which they are based is a crude approximation of real biological sequences. As a rule of
thumb, you should be suspicious of matches with E-values higher than about 10^-10, and extremely
suspicious of matches with E-values above 10^-5.
By default, BLAST reports matches with E-values as high as 10, but you can change this
default using the -e option. For example, adding -e 1e-5 to the BLAST command line discards
any matches with E-value greater than 10^-5.
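The rule of thumb above can be written down as a small triage function. This Python sketch uses the two thresholds from the text; the category labels and the example E-values are invented for illustration and do not come from any real search.

```python
def triage_evalue(evalue: float) -> str:
    """Classify a BLAST E-value using the rule of thumb in the text."""
    if evalue > 1e-5:
        return "extremely suspicious"
    if evalue > 1e-10:
        return "suspicious"
    return "probably reliable"


# Hypothetical E-values, not real BLAST output.
for e in (3.2, 4e-7, 1e-40):
    print(f"{e:g}: {triage_evalue(e)}")
```

The labels are a convenience for sorting hits, not a verdict: a match in the lowest category still needs the biological checks described in the rest of the lab.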
Question 5: Considering only the most reliable matches, what does BLAST say about
the content of this sequence? What caveats might you consider in interpreting these
results?
3 Interpreting the BLASTX Output
If all went well, you should now have strong BLAST hits to the Swallow protein. So, what do you
think? Does this sequence contain the melanogaster ortholog of Swallow? Is that all it contains?
We need to investigate further before deciding how to annotate the sequence.
You can find out more about Swallow on the web. A good place to start is to use information
from the Swissprot database, which is hand-curated and has links to many other databases. To
access the information for a protein, you need its Swissprot accession string, which is found in
the BLAST output and looks something like SWA_DROME. A Swissprot accession string consists of
an abbreviated gene name, followed by an abbreviation indicating which organism the particular
protein in this entry came from.
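Swissprot writes these strings with an underscore between the two parts, as in SWA_DROME, so they can be taken apart mechanically. The helper below is a hypothetical convenience for sorting your hits by organism; it is not part of any BLAST or Swissprot tool.

```python
from typing import Tuple


def split_swissprot_name(name: str) -> Tuple[str, str]:
    """Split a string such as SWA_DROME into (gene part, organism part)."""
    gene, _, organism = name.partition("_")
    return gene, organism


print(split_swissprot_name("SWA_DROME"))
```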
To access a Swissprot entry by its accession, go to the Expasy web site (U.S. mirror at
http://us.expasy.org) and enter the accession in the search dialog at the top of the page. If you
like, you can also do a keyword search, e.g. “Swallow,” to find multiple related entries.
Question 6: What does Swallow do? Does your BLAST output match Swallow genes
from more than one species, and if so, which species? If you want to talk about this
gene in your own work, whom should you cite as having discovered it? (Hint: look at
the GenBank record.)
Now that we know a bit more about the candidate matches to our gene, let’s take a closer look
at the BLAST output. To produce an annotation, we need to verify that the query sequence really
does contain the D. melanogaster Swallow gene. In particular, the match should be full-length,
including all the coding exons of the gene.
Question 7: Which orientation is the Swallow gene in relative to your query sequence?
Question 8: Looking over all the matches to SWA_DROME in your BLAST output, is
the entire protein matched? If not, which residues are missing? Are any regions of the
protein matched more than once at different places in the query sequence?
You should see a considerable amount of confusion in this BLAST output – missing residues,
duplicated residues, etc. As an annotator, your job is to produce order from this chaos. Let’s start
with the missing residues. Go back to the Swissprot entry for SWA_DROME and find the part of the
protein that is not represented by any BLAST hits.
Question 9: Which amino acids predominate in the missing region? Given that blastx
likes to mask low-complexity sequence in the query before a search, do you have a reasonable explanation for why this part of the protein is missing?
Around residue 190 of the protein, you will see a string of X’s representing masked residues in the
query. BLAST apparently decided that the protein in that region was a little too serine/asparagine-rich and so marked it low-complexity. Aligning a residue to an X yields a negative score.
Question 10: Given that BLAST seems happy enough to include masked residues in
its alignments, why didn’t it include residues 78-90 of the protein? (Hint: look at the
frames of the matches ending at 77 and beginning at 91. What happens if you add
negative-scoring residue pairs to the end of an alignment?)
Now we need to deal with the duplicated matches. The best way to make sense of the output
is to sketch out the relative positions of all the matches to SWA_DROME in the query on a piece of
paper. Note which residues of the protein match each part of the query.
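If you prefer to start the sketch on screen, a few lines of Python can lay the matches out in query order. The tuples below are placeholders with invented coordinates; replace them with the query range, protein range, and frame of each match in your own BLAST output. The Hit and layout names are hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query_start: int    # first query base of the match
    query_end: int      # last query base of the match
    protein_start: int  # first matched residue of the protein
    protein_end: int    # last matched residue of the protein
    frame: int          # blastx reading frame, e.g. +2 or -1


def layout(hits: List[Hit]) -> List[str]:
    """Return one line per hit, ordered by leftmost position in the query."""
    ordered = sorted(hits, key=lambda h: min(h.query_start, h.query_end))
    return [
        f"query {min(h.query_start, h.query_end):>5}-{max(h.query_start, h.query_end):<5} "
        f"frame {h.frame:+d}  protein {h.protein_start}-{h.protein_end}"
        for h in ordered
    ]


# Placeholder coordinates, invented for illustration only.
hits = [Hit(3000, 2500, 1, 160, -1), Hit(700, 1000, 50, 150, 2)]
print("\n".join(layout(hits)))
```

Reading down the protein column of the sorted list makes it easier to spot residues that are covered twice at different places in the query.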
Question 11: How many distinct features seem to be present at this locus? Which one
seems most likely to be the true Swallow gene? What might the other matches be, and
what biological mechanism might have produced them?
4 Further Exploration at the Genomic Level
To make further progress in determining the right annotation of the sequence, we will pull in
additional evidence from the DNA level. To do this, we will use nucleotide sequences derived from
Drosophila mRNAs.
To detect DNA-to-DNA matches, we’ll use blastn instead of blastx. Our choices for a database
against which to test our query include the Genbank nonredundant nucleotide database (also known
as “nt”, not to be confused with the protein “nr” database), or one of a few EST databases. ESTs
are pretty noisy and don’t come with easily accessible annotations, so we’ll use the nt database.
On our server, give the name nt in the -d option to BLAST to use this database.
Modify your BLAST search to do blastn against the nucleotide database, and run the modified
search against your query. Copy the output to your public_html directory for viewing.
Question 12: Had your query contained a repetitive element such as a transposon,
what would have happened if you had forgotten to repeat-mask the query before running
it?
The sequences in the Genbank nucleotide database come from numerous sources, including genomic contigs from genome sequencing projects and mRNAs/cDNAs. A particularly useful class of
mRNA entries are the NCBI Refseqs, which come from a curated database of full-length mRNAs for
various genes. You can find out more about the Refseq database at http://www.ncbi.nih.gov/RefSeq/.
Refseq matches are recognizable in the BLAST output because they start with the string “ref”.
Question 13: What is the best Refseq match to the query? How good is the match to
what you think is the true Swallow gene? Based on this alignment, how many exons
does the gene have, and roughly where do the introns occur?
Question 14: How well does the Refseq match the other part of the query? Can you
see matches that were not visible at the protein level? Why might this be?
The main question at this point is whether the other set of matches to Swallow outside the
likely ortholog indicate a real gene or a pseudogene. Pseudogenes are pretty rare in Drosophila
compared to mammals, but they are not unknown.
There are at least two types of mutation that strongly suggest that a putative match to a gene
might be a pseudogene. One is a stop codon that truncates the protein prematurely in the middle
of a coding exon. You can see such internal stop codons in a blastx alignment as star (“*”)
characters. Another diagnostic mutation is a gap in the middle of an exon that would cause a
frameshift. Typically, such gaps are visible only at the DNA level, since a frameshift will terminate
a blastx alignment.
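Both signs can be checked mechanically once you have the alignment strings in hand. The Python sketch below looks for “*” in a translated query line and for gap runs whose length is not a multiple of three in a DNA alignment line. The example strings are invented, and the function names are hypothetical. Real alignments span several output blocks that you would need to join first, and gaps can occur on either line of a DNA alignment, so check both.

```python
import re
from typing import List


def internal_stops(translated: str) -> List[int]:
    """0-based positions of '*' that are not the final character."""
    return [i for i, aa in enumerate(translated[:-1]) if aa == "*"]


def frameshift_gaps(aligned_dna: str) -> List[str]:
    """Gap runs ('-') whose length is not a multiple of three."""
    return [g for g in re.findall(r"-+", aligned_dna) if len(g) % 3 != 0]


# Invented alignment fragments, for illustration only.
print(internal_stops("MKT*LLSV"))               # stop in the middle of the line
print(frameshift_gaps("ATGAAA--CTGCTG---TCC"))  # a 2-base gap and a 3-base gap
```

A gap of three bases removes or adds a whole codon and preserves the reading frame, which is why only the other gap lengths are reported.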
Question 15: Keeping in mind the exon boundaries you inferred above, can you find
evidence of premature stop codons and/or frameshift-inducing gaps that would cause
you to diagnose a pseudogene adjacent to the Swallow gene? Describe any evidence you
find.
5 Summary
Question 16: Based on all the evidence gathered in this lab, how would you annotate
the query sequence? What uncertainties remain? Compose a short (a few sentences)
paragraph that you could add to an annotation database summarizing your findings.