Multiple Sequence Alignments: Profiles and Progressive Alignment

Profiles for families of sequences can be built from MSAs

Position    1     2     3
A          50%   75%   25%
C          25%    0%    0%
G          25%   25%    0%
T           0%    0%   25%
-           0%    0%   50%

Note: While profiles can be used for any kind of sequence data, we'll focus on protein sequences

Profiles
• Profile: A table that lists the frequency of each amino acid at each position of a protein sequence.
• Frequencies are calculated from an MSA containing a domain of interest
• Allows us to identify a consensus sequence
• Derived scoring scheme allows us to align a new sequence to the profile
  – Profile can be used in database searches
  – Find new sequences that match the profile
• Profiles are also used to compute multiple alignments heuristically
  – Progressive alignment

Profiles: Position-Specific Scoring Matrix (PSSM)
• To compare a sequence to a profile, we need to assign a score for each amino acid
• The score of the profile for amino acid a at position p is
  $$M(p,a) = \sum_{b=1}^{20} f(p,b)\, s(a,b)$$
  where
  – f(p,b) = frequency of amino acid b in position p
  – s(a,b) is the score of (a,b) (from, e.g., BLOSUM or PAM)

Profiles: PSSM
[Figure: PSSM; label: Insertion/deletion penalty]
Gribskov et al., PNAS 84 (13): 4355 (1987)

Profiles: Consensus Sequence
• A consensus residue C(p) is generated at each position of the profile to aid the display of alignments of target sequences with the profile.
• The consensus residue c is the amino acid at p that has the highest score M(p,c).
  – c is the amino acid most mutationally similar to all the aligned residues of the probe sequences at p, rather than the most common one

Aligning a sequence to a profile
[Figure: a four-sequence MSA over K, L, M and gaps, with its five-column profile (rows K, L, M, -)]
New sequence: K K L L M
Align with profile:
  New sequence:    K  K  L  -  L  M
  Profile column:  1  -  2  3  4  5

Scoring a sequence-to-profile alignment
• Score each column separately according to the PSSM
• Each character contributes to the score, weighted by its frequency
[Figure: the profile with the new sequence K, K, L, L, M placed on columns 1, -, 2, 4, 5]
Column 1 score: 0.75 s(K,K) + 0.25 s(K,M)

Profile-to-sequence alignments
• Optimum alignment can be found by dynamic programming
  – Extension of Needleman-Wunsch
• Spaces are only added to the MSA, never removed
  – Once a gap, always a gap
• Can align profiles to profiles

Evolutionary Profiles
• Profiles just seen are called average profiles
• Generally perform well, but disregard some of the biology
  – How did each position evolve?
  – Amount of conservation varies from position to position
  – Type of conservation varies from position to position
• Alternative: Evolutionary profiles
  – Gribskov, M. and Veretnik, S., Methods in Enzymology 266, 198-212, 1996

Evolutionary Profiles
• Idea: Fit a different model at each position
• For each position i:
  – For each possible ancestor b for position i
    • Try various evolutionary distances x (assume PAM model), and choose the one that minimizes the cross entropy
      $$H = -\sum_{a=1}^{20} f_a \ln p_a$$
      where
      – f_a = observed frequency of a
      – p_a = predicted frequency of a assuming b is the ancestor and x is the distance
• This generates 20 distributions for position i

Evolutionary Profiles
• For each position i
  – Compute "mixture coefficient," W_ai, measuring the likelihood that residue a generated the observed distribution (see text)
  – Profile is given by [equation not shown], where
    • p_aij = frequency of residue j in the ancestral residue distribution a at position i
    • p_random,j = frequency of residue j in the database

Progressive multiple alignment
• Feng & Doolittle 1987, Higgins and Sharp 1988
• Idea: Sequences to be aligned are phylogenetically related
  – These relationships are used to guide the alignment
• Popular implementations: CLUSTALW, PILEUP, T-Coffee

CLUSTALW
1. Perform pairwise alignments between all pairs of sequences (n × (n-1)/2 possibilities)
2. Generate distance matrix.
   • Distance between a pair = number of mismatched positions in the alignment divided by the total number of aligned (ungapped) positions
3. Generate a Neighbor-Joining 'guide tree' from the distance table
4. Use the guide tree to progressively align sequences in pairs from tips to root of tree.
   • Actually, align profiles
   • "Once a gap, always a gap"

CLUSTALW Tree
[Figure: tree calculated from an alignment of more than 1100 ring finger domains, using ClustalW 1.83.]

CLUSTALW heuristics
1. Individual weights are assigned to each sequence in a partial alignment in order to down-weight similar sequences and up-weight highly divergent ones.
2. Varying substitution matrices at different alignment stages according to sequence divergence.
3. Gaps
   • Positions in early alignments where gaps have been opened receive locally reduced gap penalties
   • Residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure.

Progressive Alignment: Discussion
• Strengths:
  – Speed
  – Progression biologically sensible (aligns using a tree)
• Weaknesses:
  – No objective function.
  – No way of quantifying whether or not the alignment is good

Problems with CLUSTALW
• Local minimum problem:
  – Alignment depends on sequence addition order.
  – With each alignment some proportion of residues are misaligned
    • Worse for divergent sequences
  – Errors get "locked in" and propagate as sequences are added
  – Can result in arbitrary and incorrect alignments
• Clustal uses global alignment, which may not be accurate for all parts of the sequence
  – T-Coffee considers local similarity as well as global

Iterative alignment
• To avoid local minima, realign subgroups of sequences and then incorporate them into a growing multiple sequence alignment
  – Improves overall alignment score.
  – May involve rebuilding the guide tree
  – May be randomized
• Programs:
  – MultAlin
  – PRRP
  – DIALIGN

Phylogenetic Alignment
Given a tree for a set of species S, find ancestral species such that total distance is minimized.
[Figure: tree with nodes labeled GTGG, CTGG, CTGG, GTGG, CCGG, CTAA, GTAA, CTTC]
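
The phylogenetic alignment problem above can be illustrated with a minimal sketch, under several assumptions that are not in the slides: the sequences are already aligned and of equal length, "distance" means the number of differing positions (Hamming distance) summed over tree edges, and the tree is binary. Under those assumptions Fitch's two-pass small-parsimony procedure labels the ancestors column by column. The tree shape and the choice of four leaves here are hypothetical (the strings are borrowed from the figure, but the figure's actual topology is not known), and the names `Node`, `bottom_up`, `top_down`, and `small_parsimony` are illustrative only.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    seq: Optional[str] = None  # known for leaves only


def bottom_up(node, col, sets):
    """Fitch pass 1: candidate residues per node; returns substitutions needed."""
    if node.left is None:  # leaf: its residue is fixed
        sets[node.name] = {node.seq[col]}
        return 0
    cost = bottom_up(node.left, col, sets) + bottom_up(node.right, col, sets)
    a, b = sets[node.left.name], sets[node.right.name]
    if a & b:
        sets[node.name] = a & b      # children agree on something: no extra cost
    else:
        sets[node.name] = a | b      # children disagree: one substitution
        cost += 1
    return cost


def top_down(node, col, sets, parent_char, labels):
    """Fitch pass 2: keep the parent's residue when allowed, else pick one."""
    cand = sets[node.name]
    ch = parent_char if parent_char in cand else min(cand)  # min() is a fixed tie-break
    labels[node.name].append(ch)
    if node.left is not None:
        top_down(node.left, col, sets, ch, labels)
        top_down(node.right, col, sets, ch, labels)


def small_parsimony(root, length):
    """Label every node of a binary tree, one alignment column at a time."""
    labels = defaultdict(list)
    total = 0
    for col in range(length):
        sets = {}
        total += bottom_up(root, col, sets)
        top_down(root, col, sets, None, labels)
    return total, {name: "".join(chars) for name, chars in labels.items()}


# Hypothetical tree ((CTGG, CCGG), (CTAA, GTAA)); not the tree from the slide.
root = Node("root",
            Node("left", Node("s1", seq="CTGG"), Node("s2", seq="CCGG")),
            Node("right", Node("s3", seq="CTAA"), Node("s4", seq="GTAA")))
cost, labels = small_parsimony(root, 4)
print(cost, labels["root"], labels["left"], labels["right"])
```

For this toy tree the total comes to 4 substitutions, with ancestors CTAA (root), CTGG, and CTAA. Other ancestor labelings with the same total exist; the `min()` tie-break simply picks one. Handling unaligned sequences, where gaps must be placed as well, is a harder problem that this sketch does not attempt.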