Analyzing DNA Sequence Similarity on the Computer
You are studying biodiversity in the Kalahari Desert of southern Africa by
collecting and sequencing DNA samples from the environment. During your
studies, you have recovered a fragment of genomic DNA, but you are unsure
what organism it is from. First, you will identify the species from which this
DNA fragment is derived. Then, you will find all known genes that are related
to this gene's sequence and compare the sequence of your gene to
those relatives for which the sequence is known. You will learn
how to use this information to determine the evolutionary relationships between
different species.
Instructions:
Part A: Identify the origin of this DNA fragment.
1. Open a web browser and go to
https://blast.ncbi.nlm.nih.gov/Blast.cgi or search for “NCBI BLAST”.
2. Select nucleotide BLAST and copy (command+c) the following DNA
sequence and paste it (command+v) into the dialogue box titled “enter
query sequence”:
>UnknownDNA
GGCCGCATCTGGAGTGCAAATGCTGCCAGTGTCCCCGAACCAAA
TACAAATTTAGCGAGCGGCATCGGGACGGGCAGGGTTGGAGGG
CCGAGTGCTCGAGTCGGCCGGACAAGTCGGGATTCCTTAAACAG
CTGCGGCACCAGCTCAGAACTATCGGCTGCATTAGACCCGGAAT
ACCAAGCTCAAGCCCATGGCCATGGGCAATTGGGTGGACAGTCG
CAACGACATGGTCATGGACATAGTAAGCCACATATTAAGAGGCCC
ATGAATGCGTTCATGGTCTGGGCCAAGGATGAGAGGCGCAAAAT
ATTGAAAGCCTGTCCGGACATGCACAATTCAAATATTTCAAAAATC
TTGGGCGCTCGCTGGAAAGCGATGTCGAACGCTGACAAGCAGCC
CTATTACGAGGAGCAATCGCGTCTATCCAAGCTGCACATGGAACA
ACATCCCGACTATCGCTATCGTCCACGTCCAAAACGCACTTGCAT
CGTCGACGGCAAAAAAATGCGCATATCTGAATATAAAGTTCTGAT
GCGTAATCGACGTGCGGAAATGCGACAGCTTTGGTGTAGAGGTG
GCGGACCTGGTGGTCCGACAGGTTCCAGCTCTAATGGACATAGC
CACGATGCTGGCTCCGCCGTGCAAGCTGCAGCAGCAGCAGCCG
CTGCCGCCTACCACCTGCAGGAGATGAGTCATGCAATTGTCGAC
GATTGTGAAACGCCACCGCCGCCGGCTCAACTACTAGACTCTGG
AGCCATCGCGAGCAGCAGTAGCAATTTTTATTATCCACCGGAAAG
TTTATCGCCATCGGGTTTCTCCTCCGAGGACATGGAAATTTTATCA
TTGCGTGATGACGACTAAGAAACAAGATTATGTCGTTTATATACGT
ATACATATGTATAGAGACATACGCGGATTCAAATAGATTGTACTTA
TAGCTTGTAAATACTAAAATCATTGTATAGGCCTTTAAGTTTTAGTT
CAAGTGCCA
3. Set the Database option to “Others (nr etc.)” and the Program option to
“Somewhat similar sequence (blastn)”, then click the BLAST button in
the lower left hand corner.
4. What species does this DNA fragment come from?
_______________________________
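The query above is in FASTA format: a header line that starts with ">" followed by the sequence, which may be wrapped across several lines. As a rough illustration of how a program reads such a record, here is a minimal Python sketch. The function names `parse_fasta` and `gc_fraction` are illustrative choices, not part of BLAST or any library, and only the first two lines of the query are used to keep the example short.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Wrapped sequence lines are joined into one string.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def gc_fraction(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# First two lines of the worksheet query, wrapped as in the handout.
example = """>UnknownDNA
GGCCGCATCTGGAGTGCAAATGCTGCCAGTGTCCCCGAACCAAA
TACAAATTTAGCGAGCGGCATCGGGACGGGCAGGGTTGGAGGG
"""

records = parse_fasta(example)
name, seq = records[0]
print(name, len(seq), round(gc_fraction(seq), 3))
```

The same function can later read the multi-sequence text file you build in Part C, since each ">" line simply starts a new record.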
Part B: Search a genome database for all known closely related genes to
your gene of interest.
Your initial search with this DNA sequence has yielded a match to a gene for
an uncharacterized protein with no immediate matches to a well-characterized
reference sequence protein. Find out more about this gene by
translating the DNA sequence to protein sequence.
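To make the translation step concrete: blastx translates the DNA query in all six reading frames (three on each strand) and searches each translation against the protein database. The sketch below shows what six-frame translation looks like using the standard genetic code. It is a simplified illustration, not how BLAST is implemented; the function names and the short test sequence are made up for this example.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate DNA codon by codon; '*' is a stop, 'X' an unknown codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    frames = {}
    reverse = reverse_complement(seq)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


# Short made-up sequence, used only to show the output format.
for frame, protein in six_frames("ATGGCCTAA").items():
    print(frame, protein)
```

Usually only one of the six frames gives a long stretch without stop codons ("*"), which is a hint about which frame encodes the protein.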
1. Return to the BLAST Home Page and select “blastx.” Under the title
“enter query sequence,” paste the same DNA sequence as in Part A.3.
2. In the “Choose Search Set” section, select from the pull-down menu the
option “Non-redundant protein sequences (nr).”
3. Click the checkbox to Exclude “Models (XM/XP).”
4. Click the blue button that says “BLAST.”
The database of all known sequences is now being searched for those that
resemble your sequence. The page will automatically update until the search is
done. This may take a minute or two. Be patient.
5. Scroll down past the box of red lines, and you will see information in a
chart that looks like this. This chart is just an example – it is not the one
you will actually see.
6. Look at your matches, which are already sorted from highest to lowest
match score. Using the right-hand column labeled
“Accession”, find the highest scoring match that has an Accession code
starting with the letters “NP_”, which is an indication that this protein
sequence has been characterized and annotated. What species and gene
are identified in the description of this match?
_________________________________________________________
7. Record the eight different species that contain the closest known
matches to your sequence. For each species, write its Latin name, the
E-value that indicates how statistically significant the match is (smaller
values are better), and the identity value that indicates the percentage of
your sequence that matches the database sequence.
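** Do NOT list the same species more than once in your chart. List 8 different
species.

     Species name in Latin (Genus species)     E-value        % Identity
1.   _____________________________________     ___________    ___________
2.   _____________________________________     ___________    ___________
3.   _____________________________________     ___________    ___________
4.   _____________________________________     ___________    ___________
5.   _____________________________________     ___________    ___________
6.   _____________________________________     ___________    ___________
7.   _____________________________________     ___________    ___________
8.   _____________________________________     ___________    ___________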
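The bookkeeping in step 7 can be sketched in a few lines of Python: keep only the best hit for each species, and compute percent identity as the share of alignment columns where the two residues are the same. This is an illustration under simplifying assumptions. The hit list and the two aligned sequences are made-up placeholders, not real BLAST output, and the function names are hypothetical.

```python
def best_hit_per_species(hits, limit=8):
    """Keep the first (best) hit for each species.

    `hits` is a list of (species, e_value, percent_identity) tuples that is
    already sorted best-first, as in the BLAST results table.
    """
    seen = set()
    unique = []
    for species, e_value, identity in hits:
        if species in seen:
            continue  # this species is already in the chart
        seen.add(species)
        unique.append((species, e_value, identity))
        if len(unique) == limit:
            break
    return unique


def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns with identical residues ('-' is a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Placeholder rows (not real results), sorted best-first.
hits = [
    ("Genus alpha", 1e-120, 98.5),
    ("Genus alpha", 1e-118, 97.0),
    ("Genus beta", 1e-90, 80.2),
]
for row in best_hit_per_species(hits):
    print(row)

# Placeholder aligned pair: 4 of 6 columns are identical.
print(round(percent_identity("MKV-LA", "MKITLA"), 1))
```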
Part C: Creating a phylogenetic tree based on the genomic information for
this gene of interest
Now that you know this gene is one that is conserved across multiple species,
you can compare your DNA sequence to the sequences of other species in
order to create a phylogenetic tree that shows just how closely related species
are. While BLAST conducts pairwise comparisons, you now need to align and
compare multiple sequences. You can do so using a program called Clustal.
1. Click on each of the 8 accession numbers associated with the species you
have identified above in Part B.7. Clicking on an accession number will
link you to a new page that gives you more information about that
particular sequence.
2. On the left side just under the title heading, click on the link that says
“FASTA”. This will show you the amino acid sequence.
3. Select and copy the amino acid sequence into a new plain text document.
The first line should begin with a right-pointing bracket (>) followed by the
name of that species, for example:
>speciesX
ZZZZZZZZZZZZZZZZZZZZZZ
4. Do the same for each of the 8 species you identified above.
5. Once you have all 8 sequences in one text file, go to the Clustal webpage:
http://www.ebi.ac.uk/Tools/msa/clustalo/ or search for Clustal Omega
6. In the dropdown menu, set the option to “Protein”, then
paste all of the sequences from your text file into the input window labeled
“sequences in any supported format”. Finally, scroll down and click “Submit”.
7. Using the alignment information, create a rough sketch of the phylogenetic
tree, given which species are most similar based on this one gene.
a. (An asterisk “*” indicates complete alignment across all samples, a
colon “:” indicates conservation between amino acids with strongly
similar properties, a period “.” indicates conservation between
amino acids with weakly similar properties)
b. It helps to look for shared patterns. For example:
Figure 1: Shared sequences between species 2 and 3 (in green), species 5 and 6 (in purple), or species 7
and 8 (in blue) indicate that these pairs of species are likely to be very closely related. Shared residues
highlighted in red indicate that Species 5-8 may have shared a common ancestor at one point, and that
species 1-4 are more distantly related.
8. Compare your phylogenetic tree to the one that Clustal generated by
clicking on the “Phylogenetic tree” tab just under the Results title.
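When sketching the tree by hand in step 7, it can help to turn the alignment into numbers. The sketch below marks fully conserved columns with "*" and ranks every pair of sequences by the fraction of aligned positions that differ, so the most similar pairs come first. This is a simplified illustration and not the method Clustal uses to build its tree. The three short sequences are made-up placeholders, and the ":" and "." symbols are left out because they depend on Clustal's amino acid similarity groups.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing residues, ignoring columns with a gap ('-')."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # nothing comparable, treat as maximally distant
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def fully_conserved(alignment):
    """Return a line with '*' under columns identical in every sequence."""
    rows = list(alignment.values())
    return "".join(
        "*" if len(set(col)) == 1 and col[0] != "-" else " "
        for col in zip(*rows)
    )


def ranked_pairs(alignment):
    """List (distance, name_a, name_b) for all pairs, closest pair first."""
    return sorted(
        (p_distance(alignment[a], alignment[b]), a, b)
        for a, b in combinations(alignment, 2)
    )


# Placeholder alignment (not real data); all rows have the same length.
alignment = {
    "species1": "MKVLAAG",
    "species2": "MKVLSAG",
    "species3": "MRI-SAG",
}

print(fully_conserved(alignment))
for distance, a, b in ranked_pairs(alignment):
    print(f"{a} vs {b}: {distance:.3f}")
```

Pairs at the top of the ranked list are the ones to join first in a rough sketch; pairs at the bottom belong on more distant branches.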
Questions
1. For a given gene, will the DNA sequence or the protein sequence show
a greater degree of conservation between species?
2. If mutation processes randomly affect the genome without any specific
targeting, why are some regions of the genome more faithfully conserved
than other regions?
3. Is a single gene region sufficient to determine speciation? Why or why
not?
4. How does your phylogenetic tree match the table of E-values and %
identity scores that you found in your BLAST search?
Bonus: Why is the species you identified an unexpected result?