FCH 532 Lecture 7
Chapter 5: DNA sequencing
Chapter 7
2 new HW assignments
Test next Friday

Genome sequencing
• To sequence entire genomes, segments need to be assembled into contigs (contiguous blocks) to establish the correct order of the sequence.
• Chromosome walking may be one way to do so, but it is prohibitively expensive.
• Two methods have been used recently:
• 1. Conventional genome sequencing: low-resolution maps are made by identifying “landmarks” in ~250 kb inserts in YACs.
– Landmarks are 200-300 bp segments, also known as sequence-tagged sites (STSs); two clones with the same STS overlap.
– STS-containing inserts are sheared randomly into ~40 kb segments and cloned into cosmid vectors, which are used to create high-resolution maps.
– The cosmid inserts are fragmented to smaller sizes and sequenced.
– Cosmid inserts are assembled by using the STS sequence overlaps and cosmid walking.
– This approach cannot be used effectively with sequences containing high amounts of repetitive sequence. (Use expressed sequence tags (ESTs).)

Genome sequencing
• 2. Shotgun strategy: a genome library is made by random fragmentation.
– A large number of cloned fragments are sequenced.
– The genome is assembled by identifying overlaps between pairs of fragments.
• The probability that a base is not sequenced is $e^{-c}$, where $c$ is the redundancy of coverage, $c = LN/G$,
– where $L$ is the average length of the cloned inserts in base pairs,
– $N$ is the number of inserts sequenced,
– and $G$ is the length of the genome in base pairs.
• The aggregate length of the gaps between contigs is $G e^{-c}$ and the average gap size is $G/N$.
• Bacterial genomes: the shotgun strategy is straightforward. Gaps are filled in by synthesizing PCR primers and finishing the genome.
• Eukaryotic genomes: because of their larger size, sequencing must be carried out in stages using BACs, identifying ~500 bp sequences from each BAC to yield sequence-tagged connectors (STCs, or BAC ends). This allows assembly via the overlapping of STCs.

Figure 7-17 (page 180): Genome sequencing strategies.
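The shotgun coverage relations above can be sketched numerically. This is a minimal illustration, not part of the lecture: the function name and the example project (a 2 Mb genome sequenced with 40,000 inserts of 500 bp) are hypothetical values chosen only to show the arithmetic.

```python
import math


def shotgun_coverage(insert_len_bp, n_inserts, genome_len_bp):
    """Return (c, p_missing, total_gap_bp, avg_gap_bp) for a shotgun project.

    Uses the relations from the slide:
      c = L*N/G, P(base not sequenced) = e^-c,
      aggregate gap length = G*e^-c, average gap size = G/N.
    """
    c = insert_len_bp * n_inserts / genome_len_bp  # redundancy of coverage
    p_missing = math.exp(-c)                       # chance a base is missed
    total_gap_bp = genome_len_bp * p_missing       # summed length of all gaps
    avg_gap_bp = genome_len_bp / n_inserts         # average gap size
    return c, p_missing, total_gap_bp, avg_gap_bp


# Hypothetical project: L = 500 bp, N = 40,000 inserts, G = 2,000,000 bp.
c, p_missing, total_gap_bp, avg_gap_bp = shotgun_coverage(500, 40_000, 2_000_000)
print(f"coverage c          = {c:.1f}")
print(f"P(base missed)      = {p_missing:.2e}")
print(f"aggregate gap (bp)  = {total_gap_bp:.1f}")
print(f"average gap (bp)    = {avg_gap_bp:.1f}")
```

Because the missed fraction falls as $e^{-c}$, each extra unit of coverage shrinks the unsequenced portion by a factor of about 2.7, which is why a modest increase in redundancy closes most gaps.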
Human genome
• The ~3.2 billion nucleotide sequence is ~90% complete; the remainder is difficult to finish because of highly repetitive sequence.
• About half of the human genome consists of various repeating sequences.
• Only ~28% of the genome is transcribed to RNA.
• Only 1.1% to 1.4% of the genome (~5% of the transcribed RNA) encodes protein.
• Only ~30,000 protein-encoding genes (open reading frames, or ORFs) have been identified. Earlier predictions were 50,000 to 140,000 ORFs.
• Only a small fraction of human protein families are unique to vertebrates; most occur in other life forms.
• Two randomly selected human genomes differ, on average, by only 1 nucleotide per 1250; that is, any 2 people are likely to be >99.9% identical.

Chemical evolution
• Evolutionary aspects of amino acid sequences.
• Changes stem from random mutational events that alter a protein’s primary structure.
• A mutational change must offer a selective advantage or, at least, not decrease fitness.
• Most mutations are deleterious and often lethal, so they are not passed on.
• Sometimes mutations occur that increase the fitness of the host in its natural environment.
• Example: sickle-cell anemia.

Figure 7-18a (page 183): Scanning electron micrograph of human erythrocytes. (a) Normal human erythrocytes revealing their biconcave disklike shape.

Figure 7-18b (page 183): Scanning electron micrograph of human erythrocytes.
(b) Sickled erythrocytes from an individual with sickle-cell anemia.

Figure 7-20 (page 184): A map indicating the regions of the world where malaria caused by P. falciparum was prevalent before 1930.

Chemical evolution
• Pauling and co-workers showed that normal human hemoglobin (HbA) carries a more negative charge than sickle-cell hemoglobin (HbS).
• Sickle-cell anemia is inherited according to the laws of Mendelian genetics.
• Homozygous for HbS: hemoglobin is almost all HbS; phenotype = sickle-cell anemia.
• Heterozygous for HbS: hemoglobin is ~40% HbS; phenotype = sickle-cell trait.
• Homozygous for HbA: normal human hemoglobin.

Mutations in α- or β-globin genes can cause disease states
• Sickle-cell anemia: Glu 6 to Val 6 (E6 to V6) in the β chain.
• The change causes Val 6 to bind to a hydrophobic pocket in deoxy-Hb.
• HbS polymerizes to form long filaments.
• The filaments cause sickling of the cells.
• Sickle-cell trait offers an advantage against malaria.
• Cells sickle under low-oxygen conditions and if infected with Plasmodium falciparum.
• This causes the preferential removal of infected erythrocytes from circulation.

Variations in homologous proteins
• Similar proteins from related species are likely derived from the same ancestor.
• A protein that is well adapted to its function will nevertheless continue to evolve.
• Neutral drift: mutational changes in a protein that accumulate over time without affecting its function.
• Homologous proteins: evolutionarily related proteins.
• Comparison of the primary structures of homologous proteins can be used to identify which residues are essential to function, which are of lesser significance, and which have little functional role.
• Invariant residue: the same side chain at a particular position in the amino acid sequence of related proteins.
• If an invariant residue is observed between related proteins, it is likely necessary to some essential function of the protein.
• Other positions may have less stringent side chain requirements; there, amino acids may be conservatively substituted (replaced with an amino acid with similar properties).
• If many different amino acids are tolerated at a specific position, the position is hypervariable.

Cytochrome c
• Cytochrome c is a nearly universal eukaryotic protein necessary for electron transport.
• Vertebrates have 103-104 residues; other phyla have up to 8 more amino acids.
• Similarities are observed in an alignment.
• 38 of 105 residues are invariant and the others are conservatively substituted.
• 8 positions are hypervariable.
• His 18 and Met 80 form bonds with the redox Fe of the heme group.

Table 7-4a (pages 184-185): Amino acid sequences of cytochromes c from 38 species.

Cytochrome c
• Evolutionary differences between two homologous proteins are determined by counting the amino acid differences between them.
• The order of differences parallels taxonomy and can be put into a table.
• These data can be used to construct a phylogenetic tree: a tree that indicates ancestral relationships among organisms and their proteins.

Figure 7-21 (pages 186-187): Phylogenetic tree of cytochrome c.
• Each branch point indicates a possible common ancestor to everything above it.
• Relative evolutionary distances between neighboring branch points are expressed as the number of amino acid differences per 100 residues of the protein (percentage of accepted point mutations, or PAM units).

Evolutionary rates
• Evolutionary distances between various species can be plotted against the time when the species diverged.
• Each protein has a characteristic rate of change, the unit evolutionary period: the time required for the amino acid sequence of the protein to change by 1% after two species have diverged.
• The acceptance rate of mutations depends on the extent to which the amino acid change affects function.
• Amino acid substitutions in a protein mostly result from single base changes in the gene specifying the protein (point mutations).
• Point mutations in DNA accumulate at a constant rate with time, resulting from random chemical change rather than errors from the replication process.

Figure 7-22: Rates of evolution of four unrelated proteins.
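Counting amino acid differences between homologs, as described in the cytochrome c slides, can be sketched as follows. This is only an illustration: the three 10-residue sequences are made-up toy strings (not real cytochrome c sequences), they are assumed to be already aligned to equal length, and the function names are hypothetical.

```python
from itertools import combinations


def differences_per_100(seq_a, seq_b):
    """Amino acid differences per 100 compared residues for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":  # skip gap positions
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * mismatches / compared


def difference_table(aligned):
    """Pairwise difference table for a dict of name -> aligned sequence."""
    return {
        (name_a, name_b): differences_per_100(aligned[name_a], aligned[name_b])
        for name_a, name_b in combinations(sorted(aligned), 2)
    }


# Toy, hypothetical aligned sequences (for illustration only).
toy = {
    "A": "ACDEFGHIKL",
    "B": "ACDEFGHIKV",
    "C": "ACDQFGHLKV",
}
for pair, diff in difference_table(toy).items():
    print(pair, diff)
```

A table like this is the raw input for tree building: the pair with the fewest differences (here A and B) would be joined first.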
Figure labels (page 188): tolerant of changes; intolerant of changes.

Evolutionary rates
• Point mutations in DNA accumulate at a constant rate with time, resulting from random chemical change rather than errors from the replication process.
• We know this from comparing organisms with different generation times.

Figure 7-23 (page 189): A phylogenetic tree for cytochrome c.

Evolutionary rates
• Protein evolution is not the basis for organismal evolution.
• Rapid divergence is likely due to mutational changes in the DNA that controls gene expression.
• Some proteins in the same organism have extensive sequence similarity resulting from a gene duplication.
• Gene duplication is an efficient mode of evolution because the new gene can evolve a new functionality while the original directs synthesis of the ancestral protein.
• Globins are an example; see Ch. 7.
• Paralogs: homologous proteins in the same organism that arose through gene duplication.
• Orthologs: homologous proteins/genes in different organisms that arose through species divergence.

Bioinformatics
• Biotechnology meets computer science.
• Sequence databases

Sequence alignment
• Sequence similarity of two polypeptides or two DNAs can be quantified by determining the number of aligned residues that are identical.
• Human and dog cytochromes c differ in 11 of 104 residues: $[(104-11)/104] \times 100 = 89\%$ identical.
• Human and yeast are $[(104-45)/104] \times 100 = 57\%$ identical.
• When determining percent identity, the length of the shorter peptide/DNA is, by convention, used in the denominator.
• One must also decide which amino acid residues are considered similar (e.g. Asp and Glu).
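The percent identity arithmetic above can be checked with a short sketch. The function names are hypothetical, and the gapped pair at the end is a made-up toy alignment; only the counts 11, 45, and 104 come from the slide.

```python
def percent_identity_from_counts(n_differences, shorter_length):
    """Percent identity from a difference count and the shorter sequence length."""
    return 100.0 * (shorter_length - n_differences) / shorter_length


def percent_identity(aligned_a, aligned_b):
    """Percent identity of two aligned sequences ('-' marks a gap).

    By the convention on the slide, the denominator is the ungapped
    length of the shorter sequence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    shorter = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    return 100.0 * identical / shorter


# Values from the slide: human vs. dog and human vs. yeast cytochrome c.
print(round(percent_identity_from_counts(11, 104)))  # human vs. dog
print(round(percent_identity_from_counts(45, 104)))  # human vs. yeast

# Toy, hypothetical alignment with one gap.
print(round(percent_identity("ACDE-FG", "ACDQHFG"), 1))
```

Using the shorter length in the denominator keeps a short peptide that matches perfectly inside a longer one at 100% rather than penalizing it for the unmatched overhang.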
Homology of distantly related proteins
• Hypothetical example: assume that we have a 100-residue protein in which all point mutations have equal probability of being accepted and happen at a constant rate.
• At an evolutionary distance of 1 PAM unit, the original and evolved proteins are 99% identical.
• At an evolutionary distance of 2 PAM units, they are $(0.99)^2 \times 100 = 98\%$ identical.
• At 50 PAM units, they are $(0.99)^{50} \times 100 = 60.5\%$ identical.
• The decline is not linear because mutation is a stochastic (random) process: every residue has an equal chance of mutating, so later mutations can hit positions that have already changed.
• These values can be plotted as percent identity vs. evolutionary distance.

Figure 7-25a (page 193): Rate of sequence change in evolving proteins. (a) A protein that is evolving at random and that initially consists of 5% each of the 20 “standard” amino acid residues. Slide note: the curve approaches but never equals 0.

Homology of distantly related proteins
• Real proteins are more complex.
• Certain amino acids are more likely to be accepted than others.
• The distribution of amino acids in proteins is not uniform (on average 9.5% are Leu and only 1.2% are Trp).
• Alignments can also be affected by shifts in the sequence resulting from insertion or deletion of one or more residues within a chain.
• An example is the different lengths of cytochrome c peptides.
• If the amino acid sequence is allowed to shift, the identity of the best alignment will increase.

Homology of distantly related proteins
SQMCILFKAQMNYGH
MFYACRLPMGAHYWL

Unlimited gapping:
SQMCILFKAQMNYGH
--M---F-----YACRLPMGAHYWL

• Unlimited gapping to account for insertions and deletions (indels) cannot be allowed because it will not give the proper alignment.
• At the same time we need to allow for some gapping (as with cytochrome c).
• There must be a penalty for gaps.
• Unrelated proteins will exhibit sequence identities (15-25%) in the same range as distantly related proteins.
• Telling the two apart requires more sophisticated algorithms.

Figure 7-25b: Rate of sequence change in evolving proteins.
(b) A protein of average amino acid composition evolving as is observed in nature. (Page 193.) Figure note: area in which unrelated and distantly related proteins have sequence identity (15%-25%).

Sequence alignments
• Pairwise sequence alignments can be done by using a dot matrix.
• One sequence is plotted horizontally and the other vertically. Wherever there is an identical residue, a dot is placed on the chart.
• A dot plot of a peptide against itself results in a square matrix with a row of dots along the diagonal and scattered dots for chance identities.
• If the peptides are conserved, there are only a few absences along the diagonal.
• Distantly related peptides will have a number of gaps along the diagonal.

Figure 7-26 (page 194): Sequence alignment with dot matrices.

Sequence alignments
• An alignment score (AS) is used to determine whether there is any relationship.
• Add 10 for every identity, except Cys, which scores 20.
• Subtract 25 for every gap.
• The normalized alignment score (NAS) is obtained by dividing the AS by the number of residues in the shorter of the two polypeptides in the alignment and multiplying by 100.
• Example: human hemoglobin and myoglobin.

Figure 7-27 (page 195): The optimal alignment of human myoglobin (Mb, 153 residues) and the human hemoglobin α chain (Hbα, 141 residues).
AS = (number of identities × 10) + (20 for Cys) − (number of gaps × 25)
   = (37 identities × 10) + 20 (Cys) − (1 gap × 25) = 365
NAS = (AS / number of residues in the shorter polypeptide) × 100 = (365/141) × 100 = 259

Figure 7-28 (page 195): A guide to the significance of normalized alignment scores (NAS) in the comparison of peptide sequences.

Alignments are weighted according to the likelihood of substitution
• A realistic way of assigning the probability of occurrence (weight) for a substitution is to look at the physical similarity of the amino acids.
• Dayhoff tallied residue exchanges in closely related proteins and determined the relative frequencies of the $20 \times 19/2 = 190$ different possible residue changes.
• The product $20 \times 19$ is divided by 2 to account for the fact that A → B and B → A are equally likely.
• These data can be used to create a square matrix ($20 \times 20$).
• The elements $M_{ij}$ (20 amino acids per side) indicate the probability that, in a related sequence, amino acid $i$ will replace amino acid $j$ after an evolutionary interval (usually one PAM unit).
• This is the PAM-1 matrix.

PAM matrix
• Mutation probabilities can be determined for other evolutionary distances.
• The PAM-N matrix is made by raising the PAM-1 matrix to the Nth power ($[M]^N$).
• Relatedness odds matrix: $R_{ij} = M_{ij}/f_i$.
• $f_i$ = probability that amino acid $i$ will occur in the second sequence by chance.
• $R_{ij}$ = probability that amino acid $i$ will replace amino acid $j$, or vice versa, every time $i$ or $j$ is encountered in the sequence.
• When two polypeptides are compared with each other, the $R_{ij}$ values for each position are multiplied to give the relatedness odds.
• For example, for A-B-C-D-E-F and P-Q-R-S-T-U, relatedness odds = $R_{AP} \times R_{BQ} \times R_{CR} \times R_{DS} \times R_{ET} \times R_{FU}$.
• The log odds substitution matrix is made by taking the log of the relatedness odds, so that per-position scores are added rather than multiplied.
• The summed log odds need to be maximized to get the best alignment.

Table 7-7 (page 196): The PAM-250 log odds substitution matrix. All elements are multiplied by 10. Each diagonal element indicates the mutability of the corresponding amino acid. Neutral score = 0.

Sequence alignment
• Make a matrix with the log odds values associated with the amino acids at the appropriate positions.
• Example: use a PAM-250 log odds matrix with a 10-residue peptide (horizontal) and an 11-residue peptide (vertical).
• The alignment of these two peptides must have at least one gap, assuming a significant alignment can be found.
• This is called a comparison matrix.

Figure 7-29a (page 197): Use of the Needleman-Wunsch alignment algorithm [alignment of 10-residue peptide (horizontal) with 11-residue peptide (vertical)]. (a) Comparison matrix.

Needleman-Wunsch algorithm
• Needleman and Wunsch constructed an algorithm to find the best alignment between 2 polypeptides.
• Start at the lower right corner of the matrix (C-termini) at position M and N (these correspond to the 10th and 11th amino acid residues) and add the value to the position M-1, N-1 in the matrix.
• Add to each element of the matrix the largest number from the row or column to the lower right of each element, proceeding right to left, bottom to top.

Figure 7-29b (page 197): Use of the Needleman-Wunsch alignment algorithm [alignment of 10-residue peptide (horizontal) with 11-residue peptide (vertical)]. (b) Transforming the matrix.
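The idea of global alignment by dynamic programming can be sketched in a few lines. Note the assumptions: this is the common textbook form that fills the matrix from the N-termini (top left) with a fixed per-position gap penalty, not the bottom-right-first bookkeeping described on the slide, and the scores (match = 1, mismatch = -1, gap = -2) are placeholder values rather than PAM-250 log odds. The function name and test strings are hypothetical.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(seq_a), len(seq_b)

    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # seq_a prefix aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # seq_b prefix aligned against gaps only

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in seq_b
                score[i][j - 1] + gap,      # gap in seq_a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(best)
print(top)
print(bottom)
```

To score peptides as in the slides, the match/mismatch lookup would be replaced by the PAM-250 log odds value for each residue pair; the matrix fill and traceback stay the same.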