Sequence - Université d'Ottawa
EVOLUTIONARY CHANGE IN DNA SEQUENCES

- usually too slow to monitor directly… spontaneous mutation rates? (p. 35-37)
  - for mammalian nuclear DNA (regions not under functional constraint): ~ 4 x 10^-9 nt substitutions per site per year
  - much higher for viruses, e.g. 10^-6 to 10^-3 nt substitutions per site per generation
- … so use comparative analysis of 2 sequences which share a common ancestor
- determine number and nature of nt substitutions that have occurred (i.e. measure degree of divergence)

Potential pitfalls

1. Are all evolutionary changes being monitored?
   - if closely related, high probability of only one change at any given site… but if distant, may have been multiple substitutions (“hits”) at a site
   - can use algorithms to correct for this
2. If indels between two sequences, can they be aligned with confidence?
   - algorithms with gap penalties

[Fig. 3.6: ancestral sequence and present-day sequences]

Homoplasy: same nt, but not directly inherited from ancestral sequence
(If comparing long stretches, highly unlikely they would have converged to the same sequence)
Page & Holmes Fig. 5.9

Nucleotide substitutions within protein-coding sequences

1. Synonymous vs. non-synonymous
   [Figure: single-step vs. multiple-step changes between codons; codons shown: AAT, ACT]
   Is one pathway more likely than another? (p. 82)
2. Nomenclature related to “degeneracy”:
   - Non-degenerate: all possible changes at site are non-synonymous
   - 2-fold degenerate: one of the 3 possible changes is synonymous
   - 4-fold degenerate: all possible changes at site are synonymous

ALIGNMENT OF SEQUENCES FOR COMPARATIVE ANALYSIS

1. By manual inspection
   - if sequences very similar and no (or few) gaps
2. By sequence distance methods (often followed by “correction by visual inspection”)
   - use algorithms which minimize mismatches and gaps
   - gap penalty > mismatch penalty

Fig. 3.12: Alignment of human and chicken pancreatic hormone proteins
(a) no gap penalty
(b) with gap penalty
(c) alignment as in (b), with biochemically similar aa

Multiple sequence alignments - CLUSTALW
www.ebi.ac.uk/clustalw (European Bioinformatics Institute)

    CLUSTAL W (1.81) Multiple Sequence Alignments

    Sequence 1: ArabidopsisAAG52143   798 aa
    Sequence 2: ArabidopsisAAC26676   845 aa
    Sequence 3: yeast                 664 aa

    Sequences (2:3) Aligned. Score: 23
    Sequences (1:2) Aligned. Score: 93
    Sequences (1:3) Aligned. Score: 22

    ArabAAG52143  FIVDEADLLLDLGFRRDVEKIIDCLPRQR-------QSLLFSATIPKEVRRVS-QLVLKR 539
    ArabAAC26676  FIVDEADLLLDLGFKRDVEKIIDCLPRQR-------QSLLFSATIPKEVRRVS-QLVLKR 586
    yeast         -VLDEADRLLEIGFRDDLETISGILNEKNSKSADNIKTLLFSATLDDKVQKLANNIMNKK 323
                   ::**** **::**: *:*.* . * .:.       ::******: .:*:::: ::: *:

Symbols used? * : .

Alignment of human α-globin and β-globin proteins
- Human α globin = 141 aa
- Human β globin = 146 aa
Was D-helix loss neutral or adaptive mutation? (Nature 352: 349-51, 1991)
Avers Fig. 3.23

Reminder about definition of the word “homology”
- In sequence comparisons, refer to nt (or aa) sequence relatedness as “… % identity” or “… % similarity” BUT NOT “… % homology”, because “homology” means “shares a common ancestor”
- “Non-evolutionary biologists”: Petsko, Genome Biol. 2:1002, 2001

“Normalized alignment score”
NAS = (# identities x 10) + (# Cys identities x 20) - (# gaps x 25)
Doolittle, R. “URFs & ORFs” p. 14

BLAST searches
www.ncbi.nlm.nih.gov/BLAST/
- to detect similarity between “sequence of interest” & databank entries

Query = yeast mt ribosomal protein L8 gene (1275 nt)

Example of high score “hit” (red)

    Score = 383 bits (193), Expect = 1e-102
    Identities = 196/197 (99%), Gaps = 0/197 (0%)

    Query  AGCGTCAGGATAGCTCGCTCGATGTGGTCAGGCTAACACAATGAACAACGAGACTAGTG
           |||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||
    Sbjct  AGCGTCAGGATAGCTCGCTCGATGTGATCAGGCTAACACAATGAACAACGAGACTAGTG

E-values: statistical measure of the likelihood that sequences with this degree of similarity occur randomly, i.e. reflects the number of hits expected by chance

Example of low score “hit” (blue or black)

    Score = 40.1 bits (20), Expect = 3.6
    Identities = 23/24 (95%), Gaps = 0/24 (0%)

    Query  GTTTTCTTAATATTTATTTAAAAA
           |||||||||||||||| |||||||
    Sbjct  GTTTTCTTAATATTTAATTAAAAA

“low complexity sequence”

Why is “sequence complexity” important when judging whether two sequences are homologous?

[Figure: human DNA and chimp DNA, each containing AAGAGGAG; Pu-rich region #1; Pu-rich region #2 (not homologous to #1); region of unbiased base composition, G=C=A=T]

How frequently is AAGAGGAG (8-nt sequence) expected to occur by chance in a DNA sequence?

If sequence A is of low complexity (or short length), high % identity with sequence B may not reflect shared evolutionary origin

Advantages of using aa (rather than nt) sequences for identifying homologous genes among organisms?
- lower chance of “spurious” matches
  - 20 amino acids vs. 4 nucleotides
  - unrelated nt sequences (non-homologous) expected to show 25% identity by random chance (if unbiased base composition)
- degeneracy of genetic code & different codon usage patterns (and G+C% of genomes) among organisms
- for distantly related sequences: “saturation” of synonymous sites within codons (multiple hits)

But… for certain phylogenetic analyses, number of informative characters may be higher at DNA than protein level

What if BLAST search were done at protein (instead of nt) level?
Query = yeast mitochondrial ribosomal protein L8 (238 aa)
[Figure: hits labelled Fungal and Bacterial]

Dot matrix method for aligning sequences
- 2 sequences to be compared along X and Y axis of matrix
- dots put in matrix when nts in the 2 sequences are identical
- mismatch = “gap” (or break) in line (Fig. 3.7)
- indel = shift in diagonal (Fig. 3.7)

Dot matrix method
- normally compare blocks rather than individual nts
- spurious matches (background noise) influenced by
  1. window size: overlapping fixed-length windows whereby sequence 1 compared with seq 2
  2. stringency: minimum threshold value (% identity) at each step to score as hit
- for coding regions, could use aa instead of nt sequences to reduce “noise”

Comparison of human chromosome 7 “draft” sequence (2001) with “near-complete” sequence (2004)
[Figure: 2001 sequence vs. 2004 sequence (fewer errors); blowup of 500 kb region]
How do you interpret the data in this figure?
Nature 431:935, 2004
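Worked example: dot matrix with window size and stringency

The dot matrix method described above (dots where the two sequences agree, with window size and stringency controlling background noise) can be sketched roughly as follows. This is only an illustrative sketch, not the program used to make the figures; the function names `dot_matrix` and `render_dot_matrix` and the parameters `window` and `min_identity` are made up for this example.

```python
def dot_matrix(seq1, seq2, window=1, min_identity=1.0):
    """Return the set of (i, j) window start positions scored as hits.

    seq1 runs along the X axis, seq2 along the Y axis.
    window       -- length of the overlapping fixed-length windows (assumed name)
    min_identity -- stringency: minimum fraction of identical positions
                    within a window for it to count as a hit (assumed name)
    """
    hits = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            block1 = seq1[i:i + window]
            block2 = seq2[j:j + window]
            matches = sum(a == b for a, b in zip(block1, block2))
            if matches / window >= min_identity:
                hits.add((i, j))
    return hits


def render_dot_matrix(seq1, seq2, hits):
    """Draw the matrix as text: seq1 across the top, seq2 down the side."""
    lines = ["  " + seq1]
    for j, base in enumerate(seq2):
        row = "".join("*" if (i, j) in hits else " " for i in range(len(seq1)))
        lines.append(base + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    # The two 24-nt sequences from the low-score BLAST example.
    a = "GTTTTCTTAATATTTATTTAAAAA"
    b = "GTTTTCTTAATATTTAATTAAAAA"
    # Single-nt comparison: the diagonal is buried in background noise.
    print(render_dot_matrix(a, b, dot_matrix(a, b)))
    print()
    # Blocks of 6 nt at 100% identity: much less noise.
    print(render_dot_matrix(a, b, dot_matrix(a, b, window=6, min_identity=1.0)))
```

Running the two calls on this low-complexity pair gives a feel for why blocks are normally compared rather than individual nts: with a window of 1, an A/T-rich sequence produces dots almost everywhere.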
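Worked example: how often is AAGAGGAG expected by chance?

The question posed earlier about the 8-nt sequence AAGAGGAG can be worked through with a few lines of arithmetic. With unbiased base composition (G=C=A=T), each position matches with probability 1/4, so a given 8-mer is expected about once per 4^8 = 65,536 positions on one strand. The sketch below assumes independent positions and counts one strand only; the function name `expected_occurrences` and the purine-rich base frequencies are hypothetical values chosen for illustration, not data from the slides.

```python
def expected_occurrences(motif, seq_length, base_freqs=None):
    """Expected number of exact matches to `motif` on one strand.

    Assumes each position is drawn independently with the given base
    frequencies (default: unbiased, 0.25 each).
    """
    if base_freqs is None:
        base_freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    p_match = 1.0
    for base in motif:
        p_match *= base_freqs[base]
    start_positions = max(seq_length - len(motif) + 1, 0)
    return p_match * start_positions


if __name__ == "__main__":
    motif = "AAGAGGAG"
    # Unbiased composition: about one chance match per 65,536 positions.
    print(expected_occurrences(motif, 65_543))
    # Hypothetical Pu-rich region (A = G = 0.4, C = T = 0.1): the same
    # all-purine motif becomes far more likely to turn up by chance.
    pu_rich = {"A": 0.4, "G": 0.4, "C": 0.1, "T": 0.1}
    print(expected_occurrences(motif, 65_543, pu_rich))
```

The comparison between the two calls is the point of the “sequence complexity” question: a short match inside a region of biased composition is weak evidence of shared ancestry.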
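Worked example: correcting for multiple hits

Pitfall 1 above notes that algorithms can correct for multiple substitutions at a site. The slides do not say which correction is meant; as one standard possibility, the sketch below uses the Jukes-Cantor one-parameter model, d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of differing sites and d the estimated number of substitutions per site. The function names `p_distance` and `jukes_cantor` are chosen for this example only.

```python
import math


def p_distance(aligned1, aligned2):
    """Observed proportion of differing sites between two aligned sequences.

    Columns containing a gap ('-') in either sequence are skipped.
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(aligned1, aligned2):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def jukes_cantor(p):
    """Jukes-Cantor estimate of substitutions per site from p-distance.

    Undefined once p reaches 0.75, the level expected for unrelated
    sequences of unbiased base composition (25% identity by chance).
    """
    if not 0 <= p < 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    for p in (0.01, 0.10, 0.30, 0.60):
        print(f"observed p = {p:.2f}  corrected d = {jukes_cantor(p):.3f}")
```

For closely related sequences the correction is negligible, while for distant ones the corrected value rises well above the observed proportion, which is the “multiple hits” effect in numbers.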