BIT150 – Lab3
Multiple sequence alignment and Phylogenetics
Copy 09_Lab3 from Z: to C:, and open the file ‘FT proteins for MEGA.doc’.
Objective: Perform multiple sequence alignments, calculate distance matrices, and
construct phylogenetic trees, to understand and interpret relationships between species.
Activities:
A. Creating Multiple Sequence Alignments (MSA)
In this example, we will create a multiple alignment of protein sequences that will be
imported into the alignment editor using different methods. Multiple protein sequence
alignment is a central tool to infer protein function, predict protein secondary structure,
and identify residues important for protein specificity.
A1. Start MEGA4 by using Start\Programs\BioInformatics\MEGA4.
A2. In the MEGA4 window, go to Alignment|Alignment Explorer/CLUSTAL. Select
‘Create a new alignment’, and click on OK. Click on [NO] for protein sequence
alignment.
A3. Sequences can be entered either from FASTA files or by hand. We will enter the
sequences by hand, one by one. In the Alignment Explorer window, go to
Edit|Insert Blank Sequence (or click the corresponding toolbar button), and repeat
it to generate 8 blank
sequences. Right-click on the blank sequence name and edit the sequence name
for each protein sequence, as it is in the Word document ‘FT Proteins for MEGA’.
Copy and paste each sequence.
A4. Go to Edit|Select All to select every site for all the protein sequences in the
alignment.
A5. Go to Alignment|Align by ClustalW (or click the corresponding toolbar button)
to align the selected protein sequences using the ClustalW algorithm.
A6. Save the current alignment by selecting Data|Save Session, and name it ‘FT.mas’.
This will allow the current alignment to be restored for future editing. Also,
export it (Data|Export Alignment) as both a FASTA file (‘FT.fas’, FASTA format)
and a MEGA file (‘FT.meg’, MEGA format).
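The exported FASTA file is plain text, so it can also be inspected outside MEGA. The following is a minimal sketch of reading an aligned FASTA file into a dictionary and checking that every sequence has the same aligned length. The function name parse_fasta and the toy records are illustrative assumptions, not part of the lab files.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: everything after '>' is used as the sequence name.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped; collect the pieces.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Made-up toy alignment (NOT the FT proteins); gaps are written as '-'.
example = """>seqA
MKV-LT
GA
>seqB
MKVALT
GA
"""

alignment = parse_fasta(example)
lengths = {len(seq) for seq in alignment.values()}
print(alignment)
print("all sequences have the same aligned length:", len(lengths) == 1)
```

To try it on the real export, the contents of ‘FT.fas’ could be read with open("FT.fas").read() and passed to parse_fasta.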
B. Generating a publishable MSA using BoxShade
B1. Using Word, open the previously created FASTA file (‘FT.fas’). Copy the FASTA
sequences (including gaps) and paste them into BoxShade:
http://www.ch.embnet.org/software/BOX_form.html. For ‘Output format’,
select RTF_new, and for ‘Input sequence format’, select other. Click on Run
BOXSHADE, then click on ‘here is your output number 1’. The alignment will
open in a Word document.
C. Exploring the MSA and identifying patterns
C1. Back in MEGA4, exit the Alignment Explorer window by selecting Data|Exit
AlnExplorer. A dialog box will appear asking whether you would like to open the
data file in MEGA; click on ‘Yes’.
C2. Observe different coloring schemes by clicking on: C: conserved residues (the same
amino acid at a given site in all the aligned sequences), V: variable residues (at
least 2 different amino acids at a given site), Pi: Parsimony informative (at least 2
different amino acids at a given site and at least 2 of them occurring with a
minimum frequency of 2), S: singletons (at least 2 different amino acids at a given
site with at most 1 of them occurring multiple times).
(When you have a coding DNA sequence, you can translate it into a protein
sequence by clicking on UUC->Phe. Clicking again returns you to the DNA
sequence.)
- Can you discover some groups by looking at the Pi characters?
- Move sequences to have OsFT2 close to TaFT2, and also TaFT, OsFTa, and
  OsFTb close to each other. Can you see patterns now?
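The site categories used in C2 can be written down as a small rule on the residue counts of one alignment column. The sketch below follows the definitions given in this handout; it assumes columns without gaps or missing data (MEGA has its own options for those), and the toy alignment is made up.

```python
from collections import Counter


def classify_site(column):
    """Classify one alignment column as 'C', 'Pi' or 'S'.

    C  = conserved: the same residue in every sequence.
    Pi = parsimony informative: at least 2 residue types, and at least
         2 of them occur 2 or more times.
    S  = singleton: at least 2 residue types, and at most 1 of them
         occurs more than once.
    Every 'Pi' or 'S' site is also a variable (V) site.
    """
    counts = Counter(column)
    if len(counts) == 1:
        return "C"
    repeated = sum(1 for n in counts.values() if n >= 2)
    return "Pi" if repeated >= 2 else "S"


# Made-up toy alignment (assumption: no gaps).
toy = ["MKVLA",
       "MKVLA",
       "MRILA",
       "MRITA"]

# zip(*toy) walks through the alignment column by column.
labels = [classify_site(col) for col in zip(*toy)]
print(labels)
```

In the toy data, columns 2 and 3 split the sequences into the same two groups, which is the kind of pattern the Pi view is meant to reveal.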
C3. To see the format of a MEGA file, in the MEGA4 window, go to File|Export Data,
and click on OK to take a look at it. Exit (File|Exit Editor) this window.
C4. The table shows 3 mutations found in a TILLING screen of TaFT2.

Mutations   T   V   L
TaFT2       Q   D   P

Which of the 3 mutations would you prioritize for characterizing a non-functional
TaFT2 gene?
BLOSUM62 information for mutations: T/Q = -1; V/D = -3; L/P = -3
BLOSUM62 information for changes at the mutation positions: T/I = -1; V/I = 3; I/L = 2; E/L = -2
Maximize the conservation of the position and the negative impact of the mutation…
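One way to make the hint above concrete is to combine the two numbers into a single ranking. The sketch below is only an illustrative heuristic, not a method prescribed by this lab: it uses the BLOSUM62 values exactly as listed in the handout, assumes that each natural change belongs to the mutation position that shares its residue letter, and scores a mutation as (lowest score tolerated naturally at that position) minus (score of the mutation).

```python
# BLOSUM62 scores for the three mutations, copied from the handout.
mutation_score = {"T/Q": -1, "V/D": -3, "L/P": -3}

# Natural changes seen at each mutation position, with the handout's scores.
# ASSUMPTION: the grouping by position is inferred from the shared residue
# letter (T, V or L); the handout does not state it explicitly.
natural_changes = {
    "T/Q": {"T/I": -1},
    "V/D": {"V/I": 3},
    "L/P": {"I/L": 2, "E/L": -2},
}


def priority(mutation):
    """Hypothetical heuristic: a high value means the position only tolerates
    conservative changes AND the mutation itself is a drastic change."""
    lowest_tolerated = min(natural_changes[mutation].values())
    return lowest_tolerated - mutation_score[mutation]


ranked = sorted(mutation_score, key=priority, reverse=True)
for mutation in ranked:
    print(mutation, priority(mutation))
```

Other weightings are equally defensible; the point is only that both position conservation and mutation severity enter the decision.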
C5. Using T-COFFEE as a consistency based program
Copy the sequences below and open T-COFFEE in your web browser: http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi. Use the Regular form of T-COFFEE. Paste
the sequences in the INPUT window and press submit. Click on the link for score_pdf
and save the file. (NOTE: Once the file is saved you may need to rename it so that it is a
.pdf file, or it may not open properly.)
>OsVIL1
MASSAGGDPPPPGLFAAALHACSGASALEEHIHADDSNTISDNTLEQLGFLDQESNDASVNTEKIQSSTPKCKSVEDIPIAPAAKRCKN
MDSKKLVPNSNNNSCLTGSQAPRKLPRKGDYPVQLRRNETFQDTKPPSTWICKNAACKAVLTADNTFCKRCSCCICHLFDDNKDPSLWL
VCSSETGDRDCCESSCHIECALQHQKVGCVDLGQSIQLDGNYCCAACGKVIGILGFWKRQLMVAKDARRVDILCSRIYLSHRLLDGTTR
FKEFHKIVEDAKAKLETEVGPLDGTSSKMARGIVGRLPVAADVQKLCSLAIDMADAWLKSNCKAETKQIDTLPAACRFRFEDITTSSLV
VVLKEAASSQYHAIKGYKLWYWNSREQPSTRVPAIFPKDQRRILVSNLQPCTEYAFRIISFTEYGDLGHSECKCFTKSVEIIHKNMEHG
AEGCSSTAKRDSKSRNGWSSGFQVHQLGKVLRKAWAEENGCPSEACKDEIEDSCCQSDSALHDKDQAAHVVSHELDLNESSVPDLNAEV
VMPTESFRNENICSPGKNGLRKSNGSSDSDICAEGLVGEAPAMESRSQSRKQTSDLEQETYLEQETGADDSTLLISPPKHFSRRLGQLD
DNYEYCVKVIRWLECSGHIEKDFRMKFLTWFSLRSTEQERRVVITFIRTLADDPSSLAGQLLDSFEEIVSSKKPRTGFCSKLWH*
>TmVIL1
MESTGGDPSGFAAAALHASSDVSEHEEIKPADDSNTISDYAQEPLNFFPEQESNDASVSTEKKESVVSKCKSVEEIPREATVKRCKNID
SKKLFSNNKNSPSLTGIQALRKPPRKGPHPIQLRESEMFQDKKPPSTWICKNAACKAVLTSENTFCKRCSCCICHLFDDNKDPSLWLVC
SSETGDTDCCESSCHVECALQRRKAGRIDLGQSMHLDGNYCCAACGKVIGILGFWKRQLAVAKDARRVDILCSRIYLSHRLLDGTTRFK
ELHQIVQDAKAKLETEVGPLDGSSKMARCIVGRLPVAADVQKLCSLAMEKVDDWLQSNSQAETKQIDTLPTACRFRFEDITASSLVIVL
KETASSQYHAIKGYKLWYWNSREPPSTGEPVIFPKDQRRILISNLQPCTEYAFRIISFVEDGELGHSESKCFTRSVEIMHKNIEHGAEG
CSSTAKRNVKRHNGRSSGFKVRQLGKVLRRAWEEDGFPSEFCKDEIEDSCDQSDSVILEKGQVAHVVSRKLDLNETSVPDLNAEVVMPT
ECLRNENAYSSGKNDLRKSNGCGDFATCTEGHVGEAPAMESRSQSRKQTSDLEQETCAEDGNLVIGSQRHFSRRLGELDNNYEYCVKTI
RWLECCGHIEKEFRMRFLTWFSLRSTEQERRVVLTFIRTLVDEPGSLAGQLLDSFEEIVASKRPRTGFCTKLWH*
>OsVIL2
MDPPYAGVPIDPAKCRLMSVDEKRELVRELSKRPESAPDKLQSWSRREIVEILCADLGRERKYTGLSKQRMLEYLFRVVTGKSSGGGVV
EHVQEKEPTPEPNTANHQSPAKRQRKSDNPSRLPIVASSPTTEIPRPASNARFCHNLACRATLNPEDKFCRRCSCCICFKYDDNKDPSL
WLFCSSDQPLQKDSCVFSCHLECALKDGRTGIMQSGQCKKLDGGYYCTRCRKQNDLLGSWKKQLVIAKDARRLDVLCHRIFLSHKILVS
TEKYLVLHEIVDTAMKKLEAEVGPISGVANMGRGIVSRLAVGAEVQKLCARAIETMESLFCGSPSNLQFQRSRMIPSNFVKFEAITQTS
VTVVLDLGPILAQDVTCFNVWHRVAATGSFSSSPTGIILAPLKTLVVTQLVPATSYIFKVVAFSNYKEFGSWEAKMKTSCQKEVDLKGL
MPGGSGLDQNNGSPKANSGGQSDPSSEGVDSNNNTAVYADLNKSPESDFEYCENPEILDSDKASHHPNEPTNNSQSMPMVVARVTEVSG
LEEAPGLSASALDEEPNSAVQTQLLRESSNSMEQNQRSEVPGSQDASNAPAGNEVVIVPPRYSGSIPPTAPRYMENGKDISGRSLKAKP
GDNILQNGSSKPEREPGNSSNKRTSGKCEEIGHKDGCPEASYEYCVKVVRWLECEGYIETNFRVKFLTWYSLRATPHDRKIVSVYVNTL
IDDPVSLSGQLADTFSEAIYSKRPPSVRSGFCMELWH*
>TmVIL2
MDPPYAGAIIEPAKCRLMSVDEKKDLVRELSKRPQTAPDKLQSWSRRDIVEILCADLGRERKYTGLSKQRMLDYLFRVVTGKSSGPVVH
VQEKEPTLDPNTSNHQYPAKRQRKSDNPSRLPIAVNNPQTAVVPVQINNVRSCRNIACRAILSMEDKFCRRCSCCICFKYDDNKDPTIW
LSCSSDHPMQKDSCGLSCHLECALKDGRTGILPSGQCKKLDGAYYCPNCRKQHDLLRSWKKQLMLAKEARRLDILCYRIFLGHKVLFST
EKYSVLHKFVDIAKQKLEAEVGSVAGHGSMGRGIVSRLTCGAEVQKLCAEALDVMQSKFPVESPTNSQFERSNMMPSSFIKFEPITPTS
ITVVFDLARCPYISQGVTGFKVWHQVDGTGFYSLNPTGTVHLMSKTFVVTALKPATCYMIKVTAFSNSSEFVPWEARVSTSSLKESDLK
GLAPGGAGLVDQNNRSPKTNSGGQSDRSSEGVDSNNNATVYTDLNKSPESDFEYCENPEILDSDKVPHHPNGPSNNLQNMQIVAARVPE
VTELEEAPGLSASALDEEPNSTVQAALLRESSNSMEQNQRSEVPISQDASNATAGVELALVPRFVGSMPPTAPRVMETGKETGGRSFNT
KPSDNIFQNGSSKPDREPGNSSNKRSGKFEDAGHKDGCPEATYEYCVRVVRWLETEGYIETNFRVKFLTWYSLRATPHDRKIVSVYVDT
LINDPASLCGQLTDTFSEAIYSKKPPSVPSGFCMNLWH*
C6. Creating a graphical representation of amino acid conservation.
A FASTA file of the first 50 amino acids of the FT protein alignment has been saved in
the 09_Lab3 folder. Open the FASTA file, ‘FTclipped.fas’, using Microsoft Word.
Copy the FASTA alignment and paste it in the Multiple Sequence Alignment window of
WebLogo: http://weblogo.berkeley.edu/logo.cgi. Click Create Logo.
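In a sequence logo, the total height of the letter stack at each position reflects the information content of that alignment column. A minimal sketch of that calculation for proteins follows; it assumes a 20-letter alphabet, gap-free columns, and omits the small-sample correction that logo programs typically apply, so the numbers will not match WebLogo exactly. The toy columns are made up.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=20):
    """Information content (in bits) of one alignment column:
    log2(alphabet size) minus the Shannon entropy of the residue frequencies.
    Sketch only: no small-sample correction, no gap handling."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


# Made-up columns: fully conserved, two residues at 50% each, four different residues.
conserved = information_content("MMMM")
half_half = information_content("KKRR")
all_different = information_content("ACDE")
print(round(conserved, 3), round(half_half, 3), round(all_different, 3))
```

A fully conserved column reaches the maximum of log2(20) bits, and every doubling of the number of equally frequent residues removes one bit.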
D. Calculating a Distance Matrix
D1. In the MEGA4 window, go to Distances|Compute Pairwise. In the ‘Analysis
Preferences’ window, change ‘Model’ to Amino Acid|No. of differences (leave
the default parameters in the other options). Click on Compute.
D2. See the Pairwise Distances matrix.
- Which sequences are the closest ones?
- Which sequences are the most distant ones?
D3. To see the matrix in a MEGA file and save it, go to File|Export/Print Distances,
and change the ‘Output Format’ from ‘Publication’ to ‘MEGA’. Click on
Print/Save Matrix.
D4. After you have inspected the matrix, go to File|Quit Viewer to close the Pairwise
Distances matrix.
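The ‘No. of differences’ model used in D1 simply counts the aligned positions at which two sequences differ. A minimal sketch on a made-up alignment follows; it skips positions where either sequence has a gap, which is an assumption about gap treatment and may differ from the option selected in MEGA.

```python
from itertools import combinations


def n_differences(a, b):
    """Count aligned positions where two sequences differ.
    ASSUMPTION: positions with a gap ('-') in either sequence are skipped."""
    return sum(1 for x, y in zip(a, b) if "-" not in (x, y) and x != y)


# Made-up toy alignment (NOT the FT proteins).
toy = {"s1": "MKVLA", "s2": "MKVLT", "s3": "MRILT"}

# Pairwise distance matrix stored as {(name1, name2): distance}.
distances = {
    pair: n_differences(toy[pair[0]], toy[pair[1]])
    for pair in combinations(toy, 2)
}

closest = min(distances, key=distances.get)
farthest = max(distances, key=distances.get)
print(distances)
print("closest:", closest, "most distant:", farthest)
```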
E. Drawing a Phylogenetic Tree
E1. In the MEGA4 window, go to Phylogeny|Construct Phylogeny|Neighbor-Joining
(NJ). In the ‘Analysis Preferences’ window, in the ‘Options Summary’ tab,
change ‘Model’ to Amino Acid|No. of differences. (leave the default parameters
in the other options). Click on Compute.
E2. See the tree in the Tree Explorer window.
E3. To select a branch, left-click on it. If you right-click on a branch, you will find
several options to perform different operations on the ‘Selected subtree’. To edit
the accession labels, double-click on them. Change the branch style by selecting
the View|Tree/Branch Style.
E4. To save the tree to the clipboard and then be able to save it in a Word document, go
to Image|Copy to clipboard. Open a Word document and paste this tree. Exit the
Tree Explorer window (File|Exit Tree Explorer), without saving.
- Use Phylogeny|Construct Phylogeny to produce minimum evolution, maximum
  parsimony and UPGMA trees. Copy and paste each of them into the same Word
  document to compare them. Are the results consistent?
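Of the distance-based methods listed above, UPGMA is the simplest to write out: repeatedly merge the two closest clusters and replace their distances to every other cluster by a size-weighted average. The sketch below returns only the tree topology in Newick notation (no branch lengths). The taxon names and distances are made up, and the function is an illustration rather than MEGA's implementation.

```python
def upgma(names, dist):
    """Build a UPGMA tree topology.

    names: list of leaf names.
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b.
    Returns a Newick string without branch lengths.
    """
    clusters = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair; ties are broken alphabetically for reproducibility.
        pair = min(d, key=lambda p: (d[p], sorted(p)))
        a, b = sorted(pair)
        merged = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        del d[pair]
        # Distance from the merged cluster to each remaining cluster is the
        # average over all leaf pairs, i.e. a size-weighted mean.
        for other in clusters:
            da = d.pop(frozenset((a, other)))
            db = d.pop(frozenset((b, other)))
            d[frozenset((merged, other))] = (size_a * da + size_b * db) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters)) + ";"


# Made-up distances between four hypothetical taxa.
raw = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 8,
       ("B", "C"): 6, ("B", "D"): 8, ("C", "D"): 4}
dist = {frozenset(pair): value for pair, value in raw.items()}

tree = upgma(["A", "B", "C", "D"], dist)
print(tree)
```

UPGMA assumes a constant rate of change along all lineages, which is one reason its tree can differ from the Neighbor-Joining tree for the same data.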
F. Evaluating a Phylogenetic Tree
F1. In the MEGA4 window, go to Phylogeny|Construct Phylogeny|Neighbor-Joining
(NJ). In the ‘Analysis Preferences’ window, in the ‘Test of Phylogeny’ tab, select
‘Bootstrap’ with 1,000 replications. Click on Compute.
F2. See the tree and the bootstrap values in the Tree Explorer window.
- What is the confidence of the OsFTa-TaFT branch?
F3. Go to Image|Copy to clipboard and paste the tree into your Word document. Exit
the Tree Explorer window (File|Exit Tree Explorer), without saving.
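The bootstrap test in F1 works by resampling alignment columns with replacement, rebuilding the tree from each resampled alignment, and reporting how often each branch reappears. The sketch below shows only the resampling step on a made-up alignment; the tree-building and branch-counting steps are left out, and the function name is an illustrative assumption.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Return a pseudo-alignment of the same size, built by sampling
    columns of the original alignment with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


# Made-up toy alignment (NOT the FT proteins).
toy = {"s1": "MKVLA", "s2": "MKVLT", "s3": "MRILT"}

rng = random.Random(0)  # fixed seed so the run is repeatable
replicates = [bootstrap_replicate(toy, rng) for _ in range(1000)]
print(replicates[0])
```

With 1,000 replications, a bootstrap value of 95 on a branch means that the branch was recovered in about 950 of the 1,000 resampled trees.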
Some fun tools and features for your benefit….
G. Within MEGA Alignment Explorer we can retrieve sequences directly from
GenBank
We have discovered a MADS box protein from barley (GenBank # CAB97352) and we
want to determine the closest protein among the following three Arabidopsis proteins:
AP1 = CAA78909; AGL2 = AAA32732; AGL6 = AAA79328.
G1. In the MEGA4 window, go to Alignment|Query databanks.
G2. In the NCBI Entrez site, select Protein database, enter the first GenBank number
CAB97352 into the search box, and click on Go. When the search result is
displayed, open it and then click on ‘Add to Alignment’.
G3. Repeat step G2. for the three Arabidopsis sequences.
G4. Align the protein sequences using ClustalW as before, save the alignment as
‘MADS.mas’, exit and open the file in MEGA.
G5. Perform a Neighbor-Joining (NJ) analysis. Copy and paste the phylogenetic tree
into your Word document.
- Which Arabidopsis protein is the closest one to the MADS box protein from
  barley?
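The same four records can also be retrieved without the MEGA browser, through the NCBI E-utilities web service. The sketch below builds efetch request URLs for the accession numbers given in this section; the helper names are illustrative assumptions, and the download function is defined but not called here because it needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build the URL that requests one protein record in FASTA format."""
    params = {"db": "protein", "id": accession,
              "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accession):
    """Download one record as FASTA text (requires an internet connection)."""
    with urlopen(efetch_url(accession)) as response:
        return response.read().decode("utf-8")


# Accession numbers as listed in this section of the handout.
accessions = {
    "barley MADS": "CAB97352",
    "AP1": "CAA78909",
    "AGL2": "AAA32732",
    "AGL6": "AAA79328",
}

urls = {label: efetch_url(acc) for label, acc in accessions.items()}
for label, url in urls.items():
    print(label, url)
```

Calling fetch_fasta("CAB97352") would return text that can be pasted into the Alignment Explorer like any other FASTA record.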
H. Viewing the 3D structure of a protein
H1. Cn3D is an application that allows you to view 3-dimensional structures of proteins.
Go to protein BLAST (blastp)
(http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?PAGE=Proteins&PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on).
Copy and paste the AtFT protein sequence and click on BLAST.
H2. Once your results are completely displayed, go to Show Conserved Domains.
- What is the name of the conserved domain?
- Click on it to find more information about the conserved domain.
- What biological functions have been attributed to this conserved domain?
H3. Click on Structure to go to Entrez, Structure database. In the Structure database,
insert the name of the conserved domain you found and click on Go. Click on the link
displayed as your results. In the Structure Summary window, click on Structure View
in Cn3D. Open the file with Cn3D. Cn3D tutorial:
http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3dtut.shtml.
H4. Go to View|Animation|Spin for a complete view of the 3D structure of the conserved
domain. You can change the Style in which you want to see the 3D structure.
The default display presented in the figure for single structures is a combination of
Style/Rendering Shortcuts: Worms and Style/Coloring Shortcuts: Secondary
Structure, which show a worm backbone, no side chains, and solid objects - arrows
and cylinders - to represent strands and helices. The colors are green for helices,
orange for strands, and blue for coils. Arrows point in the N-to-C direction.
H5. In the Sequence/Alignment Viewer, you can see where in the 3D structure the selected
amino acids are located, by simply selecting them with your mouse. The 3D structure
will be highlighted in the position where the selected amino acids are located.
I. From Multiple Sequence Alignment to Multiple Sequence Assembly
I.1. Using MEGA4, perform a new ClustalW alignment with the 8 exported sequences
used in 09_Lab1 (simply select them all from the Word document called
‘09_Lab1 DNA for MEGA’, copy them (Ctrl C) and paste them (Ctrl V) in the
MEGA4 Alignment Explorer window).
- Could you get a good alignment of the sequences? Why?
- How would you find the alignment between the overlapping regions that are
  present in these sequences?
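Sequence assembly differs from multiple alignment in that the fragments overlap only at their ends rather than along their whole length. The sketch below illustrates the core idea with exact matching: find the longest suffix of one fragment that equals a prefix of another, then merge the two. The fragments are made up, and real assembly programs also tolerate mismatches and consider reverse complements, which this sketch does not.

```python
def longest_overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that exactly matches a
    prefix of `right`; returns 0 if it is shorter than `min_length`."""
    max_len = min(len(left), len(right))
    for k in range(max_len, min_length - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge(left, right, min_length=3):
    """Join two fragments at their overlap, or return None if they do not overlap."""
    k = longest_overlap(left, right, min_length)
    return left + right[k:] if k else None


# Made-up DNA fragments that share the 4-base overlap 'GTAC'.
frag1 = "ATGCGTAC"
frag2 = "GTACCTGA"

overlap = longest_overlap(frag1, frag2)
contig = merge(frag1, frag2)
print(overlap, contig)
```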