Biology 164 Laboratory
Introduction to Bioinformatics and Molecular Genetics
(Based on a lab exercise developed by Henrik Kibak, 2004)
Skills developed in this lab
• Use of National Center for Biotechnology Information (NCBI) databases
• Retrieval of sequences from NCBI
• Alignment of homologous sequences using Clustalx
• Using ClustalX output to prepare phylogenies
• Testing evolutionary hypotheses
Overview
In the Protein Fingerprinting exercise you performed in lab a few weeks ago, we used molecular
genetic techniques to develop a phylogeny of related vertebrate species. In that exercise, species were
distinguished using the electrophoretic mobilities of their various muscle proteins. In today’s exercise
we will distinguish different species by comparing the amino acid sequence of a specific mitochondrial
protein. In a subsequent exercise you will distinguish different species by comparing the nucleotide
sequence of a particular gene.
Bioinformatics is emerging as a hugely important field affecting all areas of biology. While
bioinformatics is formally the application of computer technologies to biological sciences - ranging from
automated analysis of microarrays containing thousands of individual experiments to the development
of browser tools for looking at whole genomes - students in all areas of biology need to be familiar with
software tools developed by bioinformaticians to accomplish routine tasks in biology.
As a demonstration exercise, we will look at the taxonomic position of Euglena using the sequence of
amino acids for the mitochondrial protein, Cytochrome C.
After that exercise you will have the skills necessary to answer the question: “Did Darwin’s finches
arise from an ancestral species that migrated to the Galapagos Islands directly from the distant
mainland, or did they arise from a species inhabiting a more closely located island?” You will answer
that question during next week's lab.
Important note: for all of the web-based activities it is important to use the Safari web browser for
accessing websites and downloading files.
Taxonomic Position of Euglena
Algae are protists with chloroplasts. However, Euglena is a protist genus where some species have
chloroplasts and others don't. So, are they more closely related to animals or plants?!!
To answer that question we will use the resources of the National Center for Biotechnology Information
(NCBI). First, we will download protein sequence data for Euglena and a small number of plants and
animals. We will then use the sequence alignment software, Clustalx, to analyze the similarities and
differences among the sequences. Based on this analysis, Clustalx will compute the genetic distances
among our group of organisms. Finally, we will plug the genetic distances into the phylogenetic tree-building software, N-J Plot, to develop a phylogeny of the organisms.
Log onto the NCBI homepage using the bookmark established in Safari.
Introduction to Bioinformatics
Page 1
It is impossible to provide a reasonable guide to even a small section of this tremendous resource...
you will have to explore it yourself. For example, search “All Databases” for “Alces”.
This will bring you to the Entrez search engine, which lets you navigate through the data resources of
NCBI. As you can see, there is a vast amount of information cataloged for Alces, a very popular Maine
genus.
Try clicking on “PubMed: biomedical literature citations and abstracts.”
You will see a diverse array of literature citations relating to the genus Alces.
Go back to the “Entrez” page. Note that in addition to literature searching, Entrez allows you to search a
variety of genetic data resources such as nucleotide and protein sequences.
To see what is available for Euglena let's enter that instead of Alces. Go ahead and refine the search a
bit by clicking "Protein" and adding the search modifier for "organism" like this:
Euglena [orgn]
That should reduce the number of hits a bit. Adding "cytochrome c" with quotes like this should help a
lot:
Euglena [orgn] "cytochrome c"
Finally, if you add the search modifier for "protein" like this:
Euglena [orgn] "cytochrome c" [prot]
...it should knock it down to two hits that include the Cytochrome C sequences for Euglena viridis and
Euglena gracilis that were obtained many years ago by direct protein sequencing.
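The same search can also be expressed as a web address. The sketch below only builds the URL and does not contact NCBI. It assumes NCBI's E-utilities "esearch" service as the entry point; the function name build_protein_search_url and the max_hits parameter are made up for this illustration and are not part of the lab procedure.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities "esearch" service (an assumption of this
# sketch; the lab itself uses the Entrez web page in the browser).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_protein_search_url(term, max_hits=20):
    """Return an esearch URL that runs `term` against the Protein database."""
    params = {"db": "protein", "term": term, "retmax": max_hits}
    # urlencode escapes spaces, quotes, and square brackets for us.
    return ESEARCH + "?" + urlencode(params)


# The refined search string from the text, typed exactly as in the browser.
query = 'Euglena [orgn] "cytochrome c" [prot]'
url = build_protein_search_url(query)
print(url)
```

Pasting the printed address into a browser should return a list of matching record IDs rather than the formatted results page you see in Entrez.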
Click on the first accession number for the Euglena viridis sequence. This will bring you to a reference
page that documents background information on the origin of the sequence, principal investigators,
journal references, etc.
In the window next to the “Display” button, select “FASTA” and then click “Display.” This will bring you
to the amino acid sequence of the Euglena viridis Cytochrome C protein in FASTA format. The FASTA
format is the primary format for sequence data that is recognized by bioinformatics software.
In the window next to the “Send” button, select “this page to text” and then click “Send.”
Create a New Folder on the desktop named “Seqs” to store your sequence data, and save the Euglena
viridis sequence to that folder as a text file called "Cyt_c_Eug_vir.txt".
The letters in the sequence data correspond to the specific amino acids that comprise the Cytochrome
C protein. The following single letter code is used to designate the various amino acids.
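For reference, the standard one-letter amino acid code can be written out as a small Python lookup table. This is a sketch for illustration only; the function name spell_out is made up for this example.

```python
# Standard one-letter codes for the 20 common amino acids.
AMINO_ACIDS = {
    "A": "Alanine", "R": "Arginine", "N": "Asparagine", "D": "Aspartic acid",
    "C": "Cysteine", "Q": "Glutamine", "E": "Glutamic acid", "G": "Glycine",
    "H": "Histidine", "I": "Isoleucine", "L": "Leucine", "K": "Lysine",
    "M": "Methionine", "F": "Phenylalanine", "P": "Proline", "S": "Serine",
    "T": "Threonine", "W": "Tryptophan", "Y": "Tyrosine", "V": "Valine",
}


def spell_out(sequence):
    """Translate a string of one-letter codes into amino acid names."""
    return [AMINO_ACIDS.get(letter, "unknown") for letter in sequence.upper()]


# The first five residues of the Euglena viridis Cytochrome C sequence.
for name in spell_out("GDAER"):
    print(name)
```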
Return to the Protein search page by backclicking, and erase your previous search terms and try typing
in "Cytochrome C" in quotes... what results do you get when you search?
You should see "Page 1" of at least "2,567 pages" of results!!! A bit more than Alces...
To refine the search try adding [prot] after the "Cytochrome C" - that should get it down to only 23
pages of results (!).
Finally try adding "mammalia" to the search terms as in the example below:
What do you see? You should see that the results have been narrowed to 47 items (2005) on 3 pages.
Click on P0007. What organism was this sequence obtained from?
If you wanted to download this sequence, you would follow the same steps outlined above for Euglena
viridis.
In order to save time I have downloaded five homologous Cyt C sequences from five different
organisms for us to use in this exercise. The five sequences have been saved in a file called
“all_five.txt”.
Make a copy of this file using the “Duplicate” command of the “File” menu of the Finder, and place the
copy in the folder you created named “Seqs”.
The Cytochrome C sequences we will use (in FASTA format):
>Arabidopsis gi|4539007 Cytochrome c [Arabidopsis thaliana]
MASFDEAPPGNPKAGEKIFRTKCAQCHTVEKGAGHKQGPNLNGLFGRQSGTTPGYSYSAA
NKSMAVNWEEKTLYDYLLNPKKYIPGTKMVFPGLKKPQDRADLIAYLKEGTA
>Euglena GI|117985:1-102 Cytochrome c [Euglena viridis]
GDAERGKKLFESRAGQCHSSQKGVNSTGPALYGVYGRTSGTVPGYAYSNANKNAAIVWED
ESLNKFLENPKKYVPGTKMAFAGIKAKKDRLDIIAYMKTLKD
>Hippo gi|65451 Cytochrome c [Hippopotamus amphibius]
GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQSPGFSYTDANKNKGITWG
EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKQATNE
>Mosquito gi|31202411|ref|XP_310154.1| [Anopheles gambiae]
MGVPAGDVEKGKKLFVQRCAQCHTVEAGGKHKVGPNLHGLFGRKTGQAAGFSYTDANKAK
GITWNEDTLFEYLENPKKYIPGTKMVFAGLKKPQERGDLIAYLKSATK
>Rice gi|218249 Cytochrome C [Oryza sativa (japonica cultivar-group)]
MASFSEAPPGNPKAGEKIFKTKCAQCHTVDKGAGHKQGPNLNGLFGRQSGTTPGYSYSTA
NKNMAVIWEENTLYDYLLNPKKYIPGTKMVFPGLKKPQERADLISYLKEATS
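To see how software reads the FASTA format, here is a minimal sketch of a FASTA reader. It assumes only what is visible above: each record starts with a ">" header line, and the sequence may be wrapped over several lines. The function name parse_fasta is made up for this example, and real tools handle more edge cases.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A ">" line starts a new record; drop the ">" itself.
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Any other line is a piece of the current sequence.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = """>Euglena GI|117985:1-102 Cytochrome c [Euglena viridis]
GDAERGKKLFESRAGQCHSSQKGVNSTGPALYGVYGRTSGTVPGYAYSNANKNAAIVWED
ESLNKFLENPKKYVPGTKMAFAGIKAKKDRLDIIAYMKTLKD
"""

records = parse_fasta(example)
for name, seq in records.items():
    # Print the first word of the header and the number of residues.
    print(name.split()[0], len(seq))
```

The residue count printed for Euglena agrees with the "1-102" range given in its header line.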
Preparing sequences for alignment using Clustalx
The Clustalx software runs a mathematical algorithm that aligns multiple sequences in ways that
minimize the differences between them. If you think about the types of changes that occur to genes
over time, e.g., point mutations, reading frame shifts, codon transpositions or deletions, etc., you begin
to see how proteins can change as well. During the alignment procedure, Clustalx uses a variety of
approaches to account for these different types of changes by shifting sequences in relation to one
another, and by adding small gaps to make up for deletions that may have occurred (keeping track of
these modifications of course). Once the sequences are aligned, Clustalx then computes the genetic
distance between every possible combination of sequences. Calculation of the genetic distance is fairly
complex but, in simple terms, it is a measure of evolutionary divergence between homologous
sequences. The greater the genetic distance, the more distantly related the sequences are.
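To make the idea of genetic distance concrete, the sketch below computes the simplest possible version: the fraction of aligned positions at which two sequences differ, skipping positions where either sequence has a gap. This is a simplification chosen for illustration. Clustalx's own calculation may treat gaps differently and can apply corrections, so its numbers need not match. The function name p_distance and the toy sequences are made up.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of ungapped aligned positions where the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip positions where either sequence has a gap
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differences / compared


# Hypothetical toy alignment (not real data); "-" marks a gap.
aligned = {
    "seq1": "GDVEKGKK",
    "seq2": "GDAERGKK",
    "seq3": "-DVEKGRK",
}

print(p_distance(aligned["seq1"], aligned["seq2"]))  # 2 differences in 8 positions
print(p_distance(aligned["seq1"], aligned["seq3"]))  # 1 difference in 7 positions
```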
Use MS Word to open your renamed copy of “all_five.txt”.
In addition to the list of amino acid residues for each protein, the sequence data downloaded from NCBI
contains other descriptive information such as the accession number, common and scientific names of
the organism from which the sequence was obtained, etc. You will need to edit out some of the
descriptive information now so that it does not clutter up the phylogenetic tree you will build later.
Edit out all of the descriptive information except for the common name. Make sure not to remove the
“>” character, since that is how Clustalx knows a sequence is beginning.
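The same edit can be done by a short script instead of by hand. This is a minimal sketch that assumes, as in the five sequences above, that the common name is the first word after the ">" character. The function name keep_common_name is made up for this example.

```python
def keep_common_name(fasta_text):
    """Shorten every FASTA header to '>' plus its first word."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            # Keep the ">" marker and only the first word of the description.
            line = ">" + (words[0] if words else "")
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


original = (
    ">Hippo gi|65451 Cytochrome c [Hippopotamus amphibius]\n"
    "GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQSPGFSYTDANKNKGITWG\n"
    "EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKQATNE\n"
)
print(keep_common_name(original))
```

Sequence lines are passed through unchanged, so only the header lines are affected.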
Save your changes and close out of Word.
Launch Clustalx and use the File menu to Load your copy of “all_five.txt”.
Use the size box in the lower right-hand corner to stretch out the window so that you can see the entire
length of all the sequences. At this point the sequences are in their original state and have not been
aligned yet.
If you find the colors distracting, you can set them to Black and White in the Colors menu. The colors
correspond to different groups of amino acids. For example, red corresponds to the amino acids with
basic R groups, while green corresponds to those bearing non-polar R groups.
Note how the lengths of the sequences differ from one another. Which organism has the shortest
sequence? Which have the longest? Which appear to be most similar to one another? Which are
most different? If you want to position two sequences adjacent to one another so you can compare
them without the other sequences in between, use the cut and paste commands in the Edit menu to do
so.
The gray bars under the ruler indicate how well-conserved amino acids are at each position. The
higher the bar, the more conserved the sequences are at that position. Obviously, in the un-aligned
state not many bars are very tall. If corresponding conserved regions exist among the homologous
sequences, you will see that the gray bars will be much taller after the alignment procedure . . . So go
ahead and align the sequences.
From the “Alignment” menu choose “Do Complete Alignment.”
You should note that three things have occurred during the alignment. First, the gray bars under the
ruler have grown in length!! Shifting the shorter sequences to the right has revealed stretches within
the sequences that are well-conserved. Second, Clustalx has rearranged the positions of the
sequences to place those most similar next to one another. Third, symbols have been placed in the
gray horizontal bar above the sequences to indicate highlights of the conserved features as follows:
“*” indicates a single fully conserved residue
“:” indicates ‘strong’ amino acid R groups conserved
“.” indicates ‘weak’ amino acid R groups conserved
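The “*” symbol is the easiest of the three to reproduce yourself. The sketch below marks each column of an alignment in which every sequence has the same residue. It does not attempt the “:” and “.” symbols, because those depend on the groupings of similar amino acids built into Clustalx, which are not listed in this handout. The function name conserved_columns and the toy sequences are made up.

```python
def conserved_columns(aligned_seqs, gap="-"):
    """Return a string with '*' under fully conserved columns, ' ' elsewhere."""
    lengths = {len(seq) for seq in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned_seqs):
        residues = set(column)
        # Fully conserved: a single residue type in the column, and no gaps.
        if len(residues) == 1 and gap not in residues:
            marks.append("*")
        else:
            marks.append(" ")
    return "".join(marks)


# Hypothetical toy alignment (not real data); "-" marks a gap.
toy = ["GDVEKGKK", "GDAERGKK", "-DVEKGRK"]
for seq in toy:
    print(seq)
print(conserved_columns(toy))
```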
Before leaving Clustalx you will need to run an algorithm that calculates the genetic distances between
the aligned sequences.
From the “Tree” menu choose “Draw N-J Tree”. An actual tree will not be drawn, but a file will be
written that will contain the genetic distance data that can be used to construct a phylogeny. We will
use another application to actually draw the tree.
Go to the Finder and look inside your “Seqs” folder to see the files that Clustalx has written to it. Your
folder should look something like this:
You should be able to recognize the files you originally placed in the folder, but in addition there will be
three files that were written by Clustalx. The file with the .aln extension is the sequence alignment file.
It can be opened by Clustalx if you would like to see the aligned sequences again. The files with the
.dnd extension and with the .ph extension store tree descriptions, with branch lengths based on the
genetic distances, formatted for use by software other than Clustalx. We will use the file with the .ph
extension (an abbreviation for PHYLIP, a popular tree-building application) to actually construct our
phylogenetic tree.
Building a phylogenetic tree using the application N-J Plot
In 1968, a graduate student in Japan, Masatoshi Nei, was reading a paper about the proportion of
different isozymes in sibling species of Drosophila when he realized it could be possible to build a
phylogenetic tree based on the similarities and differences (genetic distance) that existed between the
isozymes. The mathematical method he developed in conjunction with Naruya Saitou for building such
trees came to be known as the ‘Neighbor-joining Method’, hence the name N-J tree. In the
phylogenetic trees you have made using MacClade, the relationship between species is determined by
their membership in one or more clades. In the MacClade approach, all branches of the tree are
essentially of equal length, only the location of the branch points on the tree differs from one species to
the next. With the Neighbor-joining method, genetic distances are used to actually calculate the
lengths of the tree branches. This is a very powerful approach for building phylogenies that was made
possible by the advent of molecular genetics.
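To show how genetic distances turn into branch lengths, here is a minimal sketch of the first joining step of the Neighbor-joining method as it is usually described: pick the pair of sequences that minimizes the criterion Q(i, j) = (n - 2) d(i, j) - (sum of all distances from i) - (sum of all distances from j), then give each member of the pair a branch to their new common node. The distance values and names below are a made-up toy example, not results from Clustalx, and the function name nj_first_join is hypothetical. A full implementation repeats this step until the whole tree is built.

```python
from itertools import combinations


def nj_first_join(names, dist):
    """Pick the first pair to join and return (pair, branch_a, branch_b).

    `dist` maps frozenset({a, b}) to the distance between a and b.
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least three sequences")

    def d(a, b):
        return dist[frozenset((a, b))]

    # Total distance from each sequence to all of the others.
    totals = {a: sum(d(a, b) for b in names if b != a) for a in names}

    best_pair, best_q = None, None
    for a, b in combinations(names, 2):
        q = (n - 2) * d(a, b) - totals[a] - totals[b]
        if best_q is None or q < best_q:
            best_pair, best_q = (a, b), q

    a, b = best_pair
    # Branch lengths from each member of the pair to their new common node.
    branch_a = d(a, b) / 2 + (totals[a] - totals[b]) / (2 * (n - 2))
    branch_b = d(a, b) - branch_a
    return best_pair, branch_a, branch_b


# Hypothetical distances among five sequences labeled a through e.
names = ["a", "b", "c", "d", "e"]
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
dist = {frozenset(pair): value for pair, value in pairs.items()}

pair, branch_a, branch_b = nj_first_join(names, dist)
print(pair, branch_a, branch_b)
```

Notice that the two branches need not be equal: the member of the pair that is farther from everything else gets the longer branch.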
Launch N-J Plot, and open the genetic distance file created by Clustalx that has the .ph extension. The
file should be located in your “Seqs” folder.
Your tree should look something like this:
Notice that most of the branches are not the same length. In fact there is a scale in the upper right-hand corner that shows how the length of an individual branch corresponds to the genetic distance to
its nearest neighbor. In most cases the ‘nearest’ neighbor will be a presumed ancestral species. The
genetic distance between any two current species (e.g., the species at the tips of the branches) is
found by adding up the lengths of the branches in between. This level of precision was not possible in
phylogenetic systematics before the Neighbor-joining method was developed.
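The "adding up the branches" rule can be sketched in a few lines of code. The example assumes the tree is written as nested parentheses with a branch length after each colon (the Newick style used by PHYLIP-type tree files); the tree text, tip names, and function names below are made up for illustration and are not output from Clustalx or N-J Plot.

```python
import re


def parse_newick(text):
    """Return {node: [parent, branch_length]}; internal nodes are named #1, #2, ..."""
    tokens = re.findall(r"[(),;]|[^(),;\s]+", text)
    parents = {}
    stack = []   # internal nodes that are still open
    last = None  # node that the next ":length" token belongs to
    count = 0
    for tok in tokens:
        if tok == "(":
            count += 1
            node = f"#{count}"
            parents[node] = [stack[-1] if stack else None, 0.0]
            stack.append(node)
            last = None
        elif tok == ",":
            last = None
        elif tok == ")":
            last = stack.pop()
        elif tok == ";":
            break
        else:
            name, _, length = tok.partition(":")
            if last is None:  # a tip (leaf) of the tree
                last = name
                parents[last] = [stack[-1] if stack else None, 0.0]
            if length:
                parents[last][1] = float(length)
    return parents


def tip_distance(parents, tip_a, tip_b):
    """Sum the branch lengths on the path between two tips."""
    up = {}  # distance from tip_a to each of its ancestors
    node, total = tip_a, 0.0
    while node is not None:
        up[node] = total
        parent, length = parents[node]
        total += length
        node = parent
    node, total = tip_b, 0.0
    while node not in up:  # climb from tip_b until the paths meet
        parent, length = parents[node]
        total += length
        node = parent
    return up[node] + total


# Hypothetical three-tip tree with made-up branch lengths.
tree_text = "((TipA:0.1,TipB:0.2):0.05,TipC:0.3);"
parents = parse_newick(tree_text)
print(round(tip_distance(parents, "TipA", "TipB"), 3))
print(round(tip_distance(parents, "TipA", "TipC"), 3))
```

TipA and TipB share a branch point, so only their own two branches are added; to reach TipC from TipA, the internal branch of length 0.05 is counted as well.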
Try clicking on the “Branch lengths” box to display the specific genetic distances on the branches of the
tree.
You should now be able to answer the question that was originally posed concerning where to place
Euglena with respect to the plants and animals used in our study.
Is Euglena more closely related to plants or animals?
Is Euglena more closely related to a hippo or a mosquito?
Do the results surprise you?
Remember that this phylogeny is only based on genetic distances calculated from sequence data
derived from just one specific protein!!
For the second half of lab today you will use the skills you developed in this exercise to evaluate two
hypotheses regarding the evolution of Darwin’s finches. However, instead of using protein sequences,
you will use finch mitochondrial DNA sequences that have been archived at NCBI.