Pairwise Alignments 1
Roadmap

The topics:
- basic concepts of molecular biology
- more on Perl
- overview of the field
- biological databases and database searching
- sequence alignments
- phylogenetics
- structure prediction
- microarray data analysis

Sequence alignments
- Introduction
    - What is an alignment?
    - Why do alignments?
    - A bit of history
- Dot matrix comparison
- Scoring alignments
- Alignment methods
- Significance of alignments

What is sequence alignment?
Sequence alignment is an arrangement of two or more sequences that highlights their similarity.

Why do alignments?
Sequence alignment is useful for discovering structural, functional, and evolutionary information in biological sequences.

Over time, genes accumulate mutations
- Environmental factors
    - Radiation
    - Oxidation
- Mistakes in replication/repair
- Deletions, duplications
- Insertions
- Inversions
- Point mutations

Comparing two sequences
- Point mutations, easy:
    ACGTCTGATACGCCGTATAGTCTATCT
    ACGTCTGATTCGCCCTATCGTCTATCT
- Insertions/deletions, must align:
    ACGTCTGATACGCCGTATAGTCTATCT
    CTGATTCGCATCGTCTATCT

    ACGTCTGATACGCCGTATAGTCTATCT
    ----CTGATTCGC---ATCGTCTATCT

Sequence Alignment
Doolittle RF, Hunkapiller MW, Hood LE, Devare SG, Robbins KC, Aaronson SA, Antoniades HN. Science 221:275-277, 1983.
(Russell F. Doolittle)
- A sequence for platelet-derived growth factor (PDGF) from mammalian cells was virtually identical to the sequence of the retrovirus-encoded oncogene known as v-sis (an oncogene is a gene that causes cancer in animals).
- The retrovirus had acquired the gene from the host cell in some kind of genetic exchange event and had then produced a mutant that could alter the function of the normal protein when it infected another animal.
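The "point mutations, easy" case from Comparing two sequences can be sketched in a few lines of Python: when two sequences have the same length and no insertions or deletions, a column-by-column comparison is enough. The function name `point_differences` is illustrative and not part of the original slides; the two sequences are the ones shown there.

```python
def point_differences(seq_a, seq_b):
    """Return (position, base_a, base_b) for every column where the bases differ.

    Positions are 0-based. Only valid when no insertions/deletions are involved,
    which is why sequences of unequal length are rejected.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length; align them first")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# The two sequences from the point-mutation example in the slides.
reference = "ACGTCTGATACGCCGTATAGTCTATCT"
mutated = "ACGTCTGATTCGCCCTATCGTCTATCT"

differences = point_differences(reference, mutated)
for position, base_a, base_b in differences:
    print(f"position {position}: {base_a} -> {base_b}")
```

Passing the shorter sequence from the insertion/deletion example raises `ValueError`, which mirrors the point of the slide: once lengths differ, the sequences must be aligned before they can be compared.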
Dot Matrix Comparison

[Figure: dot matrix of sequence A against sequence B, with an X at each position where the two bases match]

Interpretation of dot matrix
- Regions of similarity appear as diagonal runs of dots
- Reverse diagonals (perpendicular to the main diagonal) indicate inversions
- Separate diagonals can be linked or "joined" to form an alignment with "gaps"

More on Dot Matrix
- Detection of matching regions can be improved by filtering: use a sliding window to compare the two sequences. For example, print a dot at a matrix position only if either
    - 7 out of the next 11 positions in the sequence are identical, or
    - the similarity score of the next 11 positions in the sequence is greater than 5.

Sequence repeats
- Many sequences contain repetitive regions.
- [Figure: a retrovirus vector sequence plotted against itself using a window size of 9 and a mismatch limit of 2 (http://arbl.cvmbs.colostate.edu/molkit/dnadot/bkg.html)]

More on Dot Matrix
- A dot matrix graphically presents regions of identity or similarity between two sequences
- The use of windows and thresholds can reduce “noise” in a dot matrix
- Inversions and duplications have unique “signatures” in a dot matrix

Software
- Dotlet (Java applet) – www.ch.embnet.org
- Dnadot – arbl.cvmbs.colostate.edu/molkit/dnadot/
- Dotter – www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
- Dottup – www.emboss.org

How to measure the similarity
Basically, three kinds of changes can occur at any given position within a sequence:
- Mutation (substitution)
- Insertion
- Deletion
Insertions and deletions have been found to occur in nature at a significantly lower frequency than substitutions.
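The windowed dot matrix described under More on Dot Matrix can be sketched as follows. This is a minimal illustration rather than the method of any of the listed programs: the function names `dot_matrix` and `render` are made up for this example, the window counts identities only (the "7 out of 11" rule), and the two test sequences are borrowed from the point-mutation example earlier in the deck.

```python
def dot_matrix(seq_a, seq_b, window=11, min_matches=7):
    """Build a filtered dot matrix.

    dots[i][j] is True when at least `min_matches` of the `window` positions
    starting at seq_a[i] and seq_b[j] are identical. With window=1 and
    min_matches=1 this reduces to the plain, unfiltered dot matrix.
    """
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    dots = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= min_matches)
        dots.append(row)
    return dots


def render(dots):
    """Draw the matrix as text: X for a dot, . for an empty cell."""
    return "\n".join("".join("X" if d else "." for d in row) for row in dots)


seq_a = "ACGTCTGATACGCCGTATAGTCTATCT"
seq_b = "ACGTCTGATTCGCCCTATCGTCTATCT"

unfiltered = dot_matrix(seq_a, seq_b, window=1, min_matches=1)
filtered = dot_matrix(seq_a, seq_b, window=11, min_matches=7)
print(render(filtered))
```

Because the two sequences differ only by three point mutations, every window along the main diagonal still passes the 7-of-11 threshold, so the diagonal survives filtering while most isolated chance matches from the unfiltered matrix disappear.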
Scoring Matrices for Aligning DNA Sequences

Identity matrix
         A    T    C    G
    A    1    0    0    0
    T    0    1    0    0
    C    0    0    1    0
    G    0    0    0    1

BLAST matrix
         A    T    C    G
    A    5   -4   -4   -4
    T   -4    5   -4   -4
    C   -4   -4    5   -4
    G   -4   -4   -4    5

Transition-Transversion matrix
         A    T    C    G
    A    1   -5   -5   -1
    T   -5    1   -1   -5
    C   -5   -1    1   -5
    G   -1   -5   -5    1

- Transitions: substitutions in which a purine (A/G) is replaced by another purine (A/G), or a pyrimidine (C/T) is replaced by another pyrimidine (C/T).
- Transversions: substitutions between a purine (A/G) and a pyrimidine (C/T).

Scoring a sequence alignment
- Match score: +1
- Mismatch score: +0
- Gap penalty: -1

    ACGTCTGATACGCCGTATAGTCTATCT
        ||||| |||   || ||||||||
    ----CTGATTCGC---ATCGTCTATCT

- Matches: 18 × (+1)
- Mismatches: 2 × 0
- Gaps: 7 × (-1)
- Score = +11

Gap opening and extension penalties
- We want to find alignments that are evolutionarily likely.
- Which of the following alignments seems more likely to you?

    ACGTCTGATACGCCGTATAGTCTATCT
    ACGTCTGAT-------ATAGTCTATCT

    ACGTCTGATACGCCGTATAGTCTATCT
    AC-T-TGA--CG-CGT-TA-TCTATCT

- We can favor the more likely alignment by penalizing more for opening a new gap than for extending an existing gap.

Scoring a sequence alignment
- Match/mismatch score: +1/+0
- Open/extension penalty: -2/-1

    ACGTCTGATACGCCGTATAGTCTATCT
        ||||| |||   || ||||||||
    ----CTGATTCGC---ATCGTCTATCT

- Matches: 18 × (+1)
- Mismatches: 2 × 0
- Open: 2 × (-2)
- Extension: 5 × (-1)
- Score = +9

Amino Acid Substitution Matrices
- PAM: point accepted mutation, based on global alignment [evolutionary model]
- BLOSUM: block substitutions, based on local alignments [similarity among conserved sequences]

Part of PAM 250 Matrix
         C    S    T    P    A    G
    C   12    0   -2   -3   -2   -3
    S         2    1    1    1    1
    T              3    0    1    0
    P                   6    1   -1
    A                        2    1
    G                             5

Log-odds = log( (chance to see the pair in homologous proteins) / (chance to see the pair in unrelated proteins by chance) )

PAM matrices
The PAM 1 matrix reflects an amount of evolution producing on average one mutation per hundred amino acids (1 unit of evolution).
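Under the Markov model behind PAM, a matrix for n units of evolution is commonly described as the PAM 1 mutation probability matrix multiplied by itself n times. The sketch below illustrates only that extrapolation step. The 2 × 2 matrix `toy_pam1` is a hypothetical two-letter alphabet invented for this example, not Dayhoff's data, and the helper names `mat_mul` and `mat_power` are likewise illustrative.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    # Start from the identity matrix: zero units of evolution change nothing.
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical 1-unit matrix: each letter stays unchanged with probability
# 0.99 and mutates to the other letter with probability 0.01.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

Each row of the result still sums to 1, and the off-diagonal probabilities grow with n because repeated and reverted changes at the same position are all folded into the product.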
PAM 250: 250 units of evolution

    Amino acid change    Probability
                         PAM 1     PAM 250
    Phe to Ala           0.0002    0.04
    Phe to Arg           0.0001    0.01
    Phe to Asn           0.0001    0.02
    Phe to Asp           0.0000    0.01
    Phe to Cys           0.0000    0.01
    ...                  …         …

Limitations of PAM Matrices
- Constructed based on the phylogenetic relationships prior to scoring mutations;
- Difficulty of determining ancestral relationships among sequences;
- Based on a small set of closely related proteins;
- …

BLOSUM Matrices
- Based on the observed amino acid substitutions in a large set of ~2000 conserved amino acid patterns (blocks). The blocks are found in a database of protein sequences representing more than 500 families of related proteins and act as signatures of these protein families.
- The matrices are measured on the multiple alignments of the blocks.
- The entries of the matrices are computed on the same principle used in PAM: log(odds ratio).

Part of BLOSUM 62 Matrix
BLOSUM62 was measured on pairs of sequences with an average of 62% identical amino acids.

         C    S    T    P    A    G
    C    9   -1   -1   -3    0   -3
    S         4    1   -1    1    0
    T              5   -1    0   -2
    P                   7   -1   -2
    A                        4    0
    G                             6

Log-odds = log( (chance to see the pair in homologous proteins) / (chance to see the pair in unrelated proteins by chance) )

PAM vs. BLOSUM
PAM
- Based on a mutational model of evolution (Markov process)
- PAM1 is based on sequences of 85% similarity
- Designed to track the evolutionary origins
BLOSUM
- Based on the multiple alignment of blocks
- Good for comparing distant sequences
- Designed to find proteins’ conserved domains

Gap Penalty
- Optimal penalties vary from sequence to sequence, and finding the most adequate value is a matter of empirical trial and error.
- When comparing distantly related sequences, a high gap-opening penalty and a very low gap-extension penalty often give better results
- When comparing closely related sequences, gaps should be penalized on both a gap-opening and gap-extension
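The effect of separate gap-opening and gap-extension penalties can be checked with a small scoring function. This is a sketch under the convention used in the worked example earlier in the deck, where the first position of a gap costs the opening penalty and each further position costs the extension penalty. The function name `score_alignment` and its parameter names are illustrative; the alignments are the ones from the slides.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score two aligned rows ('-' marks a gap) with affine gap penalties.

    The first column of each gap run costs `gap_open`; every following column
    of the same run costs `gap_extend`. Setting both to the same value gives
    a simple linear gap penalty.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_row = None  # which row the current gap run is in: "a", "b", or None
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both rows")
        current = "a" if a == "-" else "b" if b == "-" else None
        if current is None:
            score += match if a == b else mismatch
        elif current == gap_row:
            score += gap_extend
        else:
            score += gap_open
        gap_row = current
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
worked_example = "----CTGATTCGC---ATCGTCTATCT"
one_long_gap = "ACGTCTGAT-------ATAGTCTATCT"
many_short_gaps = "AC-T-TGA--CG-CGT-TA-TCTATCT"

linear = dict(gap_open=-1, gap_extend=-1)
print(score_alignment(top, worked_example, **linear))  # linear gap penalty
print(score_alignment(top, worked_example))            # open -2, extend -1
print(score_alignment(top, one_long_gap), score_alignment(top, many_short_gaps))
```

With the linear penalty, the two alternative alignments from the gap-penalty slide score the same, since both have 20 matches and 7 gap positions. With the -2/-1 affine penalty, the single long gap scores higher than the six scattered gaps, which is the behavior the slides argue for.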