Phylogenetic Analysis

YTSLLLSRQ
YASLLW-RQA
PASIILSRQA
GRSIVLTRQM

Phylogenetics: What do I need to do?
- Get related sequences of interest
- Perform multiple sequence alignments
- Edit alignment
- Estimate phylogenetic relationships
- Interpret results correctly

Phylogenetics: Get related sequences of interest

So you have a sequence… now what?
MKILLLCIIFLYYVNAFKNTQKDGVSLQILK
KKRSNQVNFLNRKNDYNLIKNKNPSSSLK
STFDDIKKIISKQLSVEEDKIQMNSNFTKDL
GADSLDLVELIMALEEKFNVTISDQDALKI
NTVQDAIDYIEKNNKQ

#1: What is it?
Does the source organism have its own genome database?
- Unknown/No: BLAST @ PubMed
- Yes: BLAST @ genome database (GeneDB, PlasmoDB, etc.)

Why start with a genome-specific database?
- Genome location/structure
- Strain variability
- BLAST
- Expression data
- Pathway data

[Screenshots: PubMed BLAST; Blastp; Protein families – Conserved Domains; BLAST Hits; Downloading sequences – FASTA format; Getting sequences – FASTA format; Saving and editing FASTA files]

Phylogenetics: Perform multiple sequence alignments

Pair-wise sequence alignment
- Global:
    GYTSLLLSRQNED--G
    G--SLLLSHK-D-HTG
- Overlap:
    GYTSLLLSRQNEDG---GSLLLSHK-D-HTG
- Local (Smith-Waterman):
    TSLLLSR
    TSLLLSH

Aligning 2 sequences globally
Sequences: YTSLLLSRQ and YASLLWRQA
[Figure: dynamic-programming score matrix with YTSLLLSRQ across the top and YASLLWRQA down the side; the gap-penalty border runs from -4 to -36 in steps of -4.]
Resulting alignment:
YTSLLLSRQ
YASLLW-RQA

Multiple sequence alignment: Progressive
YTSLLLSRQ
YASLLW-RQA
  Align 2 closest sequences
YTSLLLSRQ
YASLLW-RQA
PASIILSRQA
  Add in next closest sequence
YTSLLLSRQ
YASLLW-RQA
PASIILSRQA
GRSIVLTRQM
  Continue adding….
Highly dependent on the initial matches.

Multiple sequence alignment: Iterative
Initial MSA, score (low):
YTTSLLLSRQ-
YATSLLWRQA-
PASIILSRQA-
GRTSIVLTRQMA
Optimize MSA score:
YTTSLLLSRQ-
YATSLLW-RQ-A
PA-SIILSRQ-A
GRTSIVLTRQMA
Probabilistic methods don’t always generate the same answer.

Multiple sequence alignment programs
Classified by MSA alignment type (progressive, iterative) and pair-wise alignment type (local, global):
- Progressive: POA, ClustalX, T-Coffee
- Iterative: Dialign, HMMs, GAs

Multiple Sequence Alignments
- POAVIZ – progressive local
- CLUSTAL – progressive global

[Screenshots: POAVIZ; CLUSTALX Parameters]

CLUSTALX – Protein Weight Matrices
• 1) BLOSUM (Henikoff). These matrices appear to be the best available for carrying out database similarity (homology) searches.
• 2) PAM (Dayhoff). These have been extremely widely used since the late '70s.
• 3) GONNET. These matrices were derived using almost the same procedure as the Dayhoff one (above) but are much more up to date and are based on a far larger dataset.
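The global pair-wise step that progressive aligners build on can be sketched with the Needleman-Wunsch dynamic-programming recurrence. This is a minimal illustration, not what ClustalX actually runs: the function name `needleman_wunsch` and the flat match/mismatch scores are assumptions made for the example (a real run would score residue pairs with a weight matrix such as BLOSUM, PAM, or GONNET). Only the gap penalty of -4 is taken from the border of the score matrix on the 'Aligning 2 sequences globally' slide.

```python
def needleman_wunsch(a, b, match=4, mismatch=-2, gap=-4):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    match/mismatch are illustrative flat scores (assumption); gap is a
    linear per-position gap penalty.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align the two residues
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("YTSLLLSRQ", "YASLLWRQA")
    print(total)
    print(top)
    print(bottom)
```

With flat scores the alignment found may differ from the one on the slide; the point is only the fill-then-traceback structure.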
BLOSUM (BLOck SUbstitution Matrix)
- BLOSUM62: substitution rates are observed directly in aligned blocks of related proteins, with sequences at least 62% identical clustered together
- Scale: BLOSUM99 (>99% identity) -----> BLOSUM62 (>62% identity)
- Pros: best bet for distantly related (highly divergent) sequences

PAM (point accepted mutation)
- Gather the substitution rates for PAM1 (99% identical sequences)
- Assuming that those substitution rates are consistent over time, extrapolate to larger distances (# point mutations / 100 amino acids)
- Scale: PAM1 (99% identity) -----> PAM250 (20% identity)
- Pros: very good for closely related sequences
- Cons: rare mutations under-represented; substitution rates not constant over time (both are problems for phylogenetic estimation)

[Screenshots: CLUSTALX; CLUSTALX – Aligning; CLUSTALX – Alignment view; CLUSTAL vs POAVIZ (global vs local)]

Phylogenetics: Edit alignment

BioEdit – Alignment manipulation
- Open the “.aln” file
- “Back colored view” gives more contrast
- Select “Edit” from the mode dropdown
- Select “Insert” so that you don’t accidentally lose part of your sequence
- Then select the unaligned beginning (or end) sequence and delete it….
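The trimming done by hand in BioEdit can also be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the helper names `trim_ragged_ends` and `to_fasta` are made up for this sketch, "unaligned" is taken to mean leading or trailing columns in which at least one sequence has a gap, and the two toy sequences are invented.

```python
def trim_ragged_ends(aligned, gap="-"):
    """Drop leading/trailing alignment columns in which any sequence has a gap.

    aligned: dict mapping sequence name -> aligned sequence (all equal length).
    Internal gapped columns are left untouched.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    n = lengths.pop()

    def column_is_full(col):
        return all(seq[col] != gap for seq in aligned.values())

    start = 0
    while start < n and not column_is_full(start):
        start += 1
    end = n
    while end > start and not column_is_full(end - 1):
        end -= 1
    return {name: seq[start:end] for name, seq in aligned.items()}


def to_fasta(sequences):
    """Format a name -> sequence dict as FASTA text (one line per sequence)."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())


if __name__ == "__main__":
    toy = {"seq1": "--TSLLLSRQ", "seq2": "GYTSLLLSH-"}  # invented example data
    print(to_fasta(trim_ragged_ends(toy)))
```

Trimming by code is repeatable, but looking at the alignment in an editor first is still the safer way to decide what counts as unaligned.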
BioEdit – Alignment manipulation
- Now save as a different file (.fasta)

Phylogenetics: Estimate phylogenetic relationships

Tree terminology
- root
- common ancestor (node, branch point)
- Operational taxonomic units (OTUs, leaves): A B C D E F G
- Topology 1, monophyletic: A B C D E F G
- Topology 2, paraphyletic: A B C E D F G
- Topology 3, polyphyletic: A E B F G C D

Sequence homology – orthologues and paralogues
[Figure: an ancestral gene duplicates into A and B in the last common ancestor; after speciation, Human A and Rat A are orthologues, Human B and Rat B are orthologues, and the A and B copies are paralogues.]

Methods of estimating phylogenetic relationships
- Character-based: Maximum Parsimony (MP)
- Distance-based: Neighbor-Joining (NJ), Minimum Evolution (ME)
- Probabilistic: Maximum likelihood (ML), Bayesian inference

Methods of estimating phylogenetic relationships: Maximum Parsimony (MP)
[Figure: Taxa1 to Taxa4 with sequences AAG, AAA, GGA and AGA, drawn on three alternative topologies. One topology requires 3 changes (best tree); the other two require 4 changes each.]

Methods of estimating phylogenetic relationships: Distance-based
Neighbor-Joining (NJ) Method
- The NJ method involves clustering of neighbor species that are joined by one node.
- It does not evaluate all the possible tree topologies.
- Not guaranteed to obtain the optimal tree.
Minimum Evolution (ME) Method
- Estimates the total branch length of each topology exhaustively, then chooses the topology with the least total branch length.
- Time intensive for large numbers of taxa.
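The change counts in the Maximum Parsimony example can be checked with Fitch's small-parsimony algorithm, which counts the minimum number of substitutions a given topology needs. This is a sketch under assumptions: the slide does not say how the counts were obtained, the assignment of AAG, AAA, GGA, AGA to Taxa1 through Taxa4 follows the order in which they are listed, and the function names are made up for the example.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, changes) for one alignment column.

    tree: nested 2-tuples of taxon names, e.g. (("Taxa1", "Taxa2"), ("Taxa3", "Taxa4")).
    states: dict mapping taxon name -> character at this column.
    """
    if isinstance(tree, str):          # a leaf: its state is known, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                         # children agree: no extra change needed
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum the Fitch change counts over every column of the alignment."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[col] for name, seq in seqs.items()})[1]
        for col in range(length)
    )


# Sequences from the slide; the taxon-to-sequence mapping is an assumption.
SEQS = {"Taxa1": "AAG", "Taxa2": "AAA", "Taxa3": "GGA", "Taxa4": "AGA"}

# The three possible unrooted topologies for four taxa.
TOPOLOGIES = [
    (("Taxa1", "Taxa2"), ("Taxa3", "Taxa4")),
    (("Taxa1", "Taxa3"), ("Taxa2", "Taxa4")),
    (("Taxa1", "Taxa4"), ("Taxa2", "Taxa3")),
]

if __name__ == "__main__":
    for topology in TOPOLOGIES:
        print(topology, parsimony_score(topology, SEQS))
```

Under that mapping the three topologies score 3, 4 and 4 changes, matching the slide. With four taxa every topology can be scored; the number of topologies grows so quickly with more taxa that real programs search heuristically.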
Methods of estimating phylogenetic relationships: Probabilistic methods
Prob(data | model + tree)
- Maximum likelihood (ML): search all possible topologies to optimize the probability; the more likely topology is kept
- Bayesian inference: P(you getting picked) depends on prior information and on a model for selection; you need both for everyone in the class (you, and the other people in the class)

Estimating Phylogenetic Relationships
- MEGA
- MrBayes

MEGA – Molecular Evolutionary Genetics Analysis
First we have to get a MEGA-formatted file made.
- Select ‘All Files [“ “]’ from the dropdown ‘Files of Type’ menu
- Then choose the ‘.aln’ file you just made…

MEGA – making a MEGA formatted file
- MEGA recognizes that you didn’t enter a MEGA-formatted file… Click ‘OK’
- Now click on the ‘Convert to MEGA format’ button at the top left-hand side of the screen
- Make sure that the file is the right one and that the formatting is correct. Click ‘OK’.
- Now we have to make sure that the file looks good before starting any analysis:
  - Make sure all sequences are the same length
  - Remove all traces of the consensus marks
- When the file looks good, save it and close both text formatter windows…
- Now try ‘Activating the data file’ again, this time with the ‘.meg’ file you just made…

MEGA – input a MEGA formatted file
- Make sure that the correct sequence type is selected
- Make sure that the correct characters are selected for missing data and gaps.
- You should now see the ‘sequence data explorer’
- Minimize this window and you can begin analyzing your data…

MEGA – choose an algorithm
From the phylogeny window you can choose an appropriate algorithm. In this case we’ll use Minimum Evolution.
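The two file checks from the conversion step (all sequences the same length, no consensus marks left over) can be sketched as a small script. This is an illustration only: it assumes the usual Clustal `.aln` layout (a header line starting with `CLUSTAL`, blocks of `name sequence` lines, and a consensus line that begins with whitespace), the function names are made up, and the sample alignment is invented. It does not write MEGA's own `.meg` format.

```python
def parse_clustal(text):
    """Collect sequences from Clustal .aln text, skipping header and consensus lines."""
    sequences = {}
    for line in text.splitlines():
        if not line.strip():
            continue                      # blank line between blocks
        if line.startswith("CLUSTAL") or line[0].isspace():
            continue                      # header, or consensus marks (* : .)
        fields = line.split()
        if len(fields) < 2:
            continue                      # not a "name sequence" line
        name, chunk = fields[0], fields[1]
        sequences[name] = sequences.get(name, "") + chunk
    return sequences


def check_equal_length(sequences):
    """Return the common alignment length, or raise if the sequences differ."""
    lengths = {name: len(seq) for name, seq in sequences.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"sequences differ in length: {lengths}")
    return next(iter(lengths.values()), 0)


# Invented two-block sample in Clustal layout.
SAMPLE = "\n".join([
    "CLUSTAL X (1.83) multiple sequence alignment",
    "",
    "seqA      ACGT-A",
    "seqB      ACGTTA",
    "          **** *",
    "",
    "seqA      GG",
    "seqB      GC",
    "          * ",
    "",
])

if __name__ == "__main__":
    parsed = parse_clustal(SAMPLE)
    print(parsed)
    print(check_equal_length(parsed))
```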
MEGA – set parameters
There are two major things to think about first: ‘Model’ and ‘Rates among Sites’.
In this example, I’ll use the Poisson model with gamma (γ = 2.0) rate variation.

Substitution models (nucleic acid)
Properties compared: substitution rates; base frequencies (B); transition (si) and/or transversion (sv) frequencies; symmetrical substitution (G->A = A->G). Each is either equal (E) or variable (V).
In order of increasing sophistication:
- Identity
- Kimura 2-parameter: B(E), si(V), sv(V)
- Tamura-Nei: B(V), si(V), sv(E)
- Kimura 3-parameter: B(V), si(E), sv(V)
- General Time Reversible (GTR): B(V), Sym
Rate variation across sites:
- Gamma (Γ) distribution of rate variation among sites
- Proportion of Invariable Sites (I)
- Combined: Γ + I + GTR

Substitution models (amino acid)
Models, as listed on the slide: Mixture models; site-specific residue frequencies; Poisson; mtREV; JTT; PAM; Identity.
Notes, in the order given:
- Each site can choose its own substitution model, coupled with maximum likelihood probability estimations or MCMC/Bayesian methods
- High-dimensional model, but requires a large dataset
- Probabilistic substitution rates
- Extrapolation of observed substitution rates
- No model

MEGA – choose tree test options
- Now switch over to the ‘Test of Phylogeny’ tab.
- In order to determine the validity of your tree you’ll need to bootstrap it. Since our sequence isn’t very long, only a couple hundred replications are needed.
- Now click the check button, then click ‘Compute’ in the main window…

MEGA – edit your tree
- Your tree should appear. Not a very good one in this case. Why? Because the sequences were too similar.
- The icons on the left allow you to reroot, flip branches, etc.
- You can also change the format of the tree.
- But let’s also compute a condensed tree (select that from the ‘Compute’ menu) using a cutoff of 50%.
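What the bootstrap does can be sketched as follows: alignment columns are resampled with replacement, the analysis is repeated on each pseudo-alignment, and the fraction of replicates that recover a grouping is reported as its support. The code is a toy under stated assumptions, not MEGA's implementation: instead of rebuilding a tree per replicate it only asks which two sequences are closest by uncorrected p-distance, and all function names and the toy data are invented.

```python
import random


def bootstrap_replicate(aligned, rng):
    """Resample alignment columns with replacement (same columns for every sequence)."""
    n = len(next(iter(aligned.values())))
    cols = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[c] for c in cols) for name, seq in aligned.items()}


def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(aligned):
    """Return the (name1, name2) pair with the smallest p-distance, names sorted."""
    names = sorted(aligned)
    pairs = [
        (p_distance(aligned[x], aligned[y]), x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
    ]
    return min(pairs)[1:]


def bootstrap_support(aligned, pair, replicates=200, seed=0):
    """Fraction of bootstrap replicates in which `pair` is the closest pair.

    `pair` must be given in sorted name order, e.g. ("A", "B").
    A stand-in for per-replicate tree building (assumption for brevity).
    """
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_replicate(aligned, rng)) == pair
        for _ in range(replicates)
    )
    return hits / replicates


if __name__ == "__main__":
    toy = {"A": "ACGTACGTAC", "B": "ACGTACGTAA", "C": "ACGAACTTAC", "D": "TCGAACTTGC"}
    print(bootstrap_support(toy, ("A", "B")))
```

A grouping recovered in fewer than half of the replicates is what a condensed tree with a 50% cutoff collapses.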
MEGA – interpret the tree
Four of the sequences cluster indistinguishably together, while a single other sequence stands out. If we look back at our alignments we could predict this…

Estimating Phylogenetic Relationships
- MEGA
- MrBayes

[Screenshots: MrBayes – Making a NEXUS (.nex) file; MrBayes – Running MrBayes]

Phylogenetics: Interpret results correctly
- Quality of aligned sequences
  - One bad egg
  - Sequence similarity (think goldilocks)
- Use an appropriate model
- Use an appropriate estimation method
- Use appropriate parameters
- Try different things and compare results wisely
- Determine the validity of each part of your tree
- Develop a model to explain your tree
  - how does it square with known information?
  - what can you learn from your sequences?
  - what can’t you learn from your analysis?

The Intelligent Consumer
(You don’t have to completely understand everything in order to use it properly, but it helps to have a rough idea…)
BLAST
- stochastic processes
- random walks
Sequence alignments
- Markov processes
- dynamic programming
- Viterbi, Forward, and Backward algorithms
Bayesian phylogenetic inference
- Bayes theorem
- Bayesian inference
- Metropolis algorithm

Many uses for multiple sequence analysis…
- Protein family analysis
  [Figure: multiple sequence alignment -> profile -> profile–HMM (hidden Markov model) -> find new proteins with same domains]
- RNA secondary structure prediction
- Protein secondary structure prediction
- Protein structure prediction – homology modeling (protein sequence with known structure; aligned sequences with unknown structure)
- Comparative genomics
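For the 'MrBayes – Making a NEXUS (.nex) file' step, whose screenshots did not survive in this transcript, a minimal NEXUS data block can be written by hand or by a short script. The sketch below assumes a protein alignment with `-` for gaps and `?` for missing data. The function name `to_nexus` and the two toy sequences are invented for the example, and taxon names are assumed to contain no spaces.

```python
def to_nexus(aligned, datatype="protein"):
    """Format a name -> aligned-sequence dict as a minimal NEXUS data block."""
    names = list(aligned)
    nchar = len(aligned[names[0]])
    if any(len(seq) != nchar for seq in aligned.values()):
        raise ValueError("all sequences must have the same aligned length")
    width = max(len(name) for name in names) + 2   # pad names into one column
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(names)} nchar={nchar};",
        f"  format datatype={datatype} gap=- missing=?;",
        "  matrix",
    ]
    lines += [f"  {name.ljust(width)}{aligned[name]}" for name in names]
    lines += ["  ;", "end;"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy = {"seq1": "TSLLLSR", "seq2": "TSLLLSH"}  # invented example data
    print(to_nexus(toy))
```

The analysis settings themselves (model, number of generations, and so on) are entered separately in MrBayes and are not covered by this sketch.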