Sequence-specific recognition of DNA by proteins.
• Nitrogen and oxygen atoms exposed in the grooves can form hydrogen bonds with proteins.
• Different Watson–Crick base pairs present different patterns of hydrogen-bond donors and acceptors.
[Figure: positions of H-bond acceptors, H-bond donors, hydrogen atoms, and methyl groups in the major and minor grooves for the G·C, A·T, C·G, and T·A base pairs.]

Differences between DNA and RNA:
• T is replaced by U.
• The sugar is ribose, not deoxyribose: it carries an extra –OH group at the 2’ position of the pentose.
• RNA usually does not form a double helix; it makes loops within one strand and often contains modified bases.
• The additional 2’-OH group can form hydrogen bonds (HB), stabilizing tertiary structure.

[Figure: illustration of RNA secondary structures. From M.S. Andronescu.]

DNA/RNA thermodynamics.
Two major types of interactions:
• Base pairing (hydrogen bonds)
• Base stacking of nearest neighbors (π-electron sharing between aromatic rings, plus hydrophobic interactions)
$\Delta G = \Delta G_{init} + \Delta G_{pairing} + \Delta G_{stacking}$

RNA secondary structure prediction.
Assumptions used in predictions:
- The most likely structure is the most stable one.
- The energy of each base pair depends only on the energy of the previous base pair.
- Energy parameters for different types of RNA secondary structures are derived from experiment.
- The structure is formed without knots.

Minimum energy method of RNA secondary structure prediction.
• Self-complementary regions can be found in a dot matrix.
• The energy of each base pair depends only on the energy of the previous base pair.
• Energy parameters for different types of RNA secondary structures are derived from experiment.
• The most energetically favorable conformations are predicted by a method similar to dynamic programming.

Sequence covariation method.
Some positions in sequences from different species can covary because they are involved in pairing.
$f_m(B_1)$: frequency of base $B_1$ in column m; $f_n(B_2)$: frequency of base $B_2$ in column n; $f_{m,n}(B_1, B_2)$: joint frequency of the two nucleotides in the two columns.
$f_{m,n}(B_1, B_2) / (f_m(B_1) \, f_n(B_2))$

Seq 1  ---G------C---
Seq 2  ---G------C---
Seq 3  ---A------T---
Seq 4  ---T------A---

Gene prediction.
Gene: a DNA sequence encoding a protein, rRNA, tRNA …
The gene concept is complicated:
- Introns/exons
- Alternative splicing
- Genes-in-genes
- Multisubunit proteins

Codon usage tables.
- Each amino acid can be encoded by several codons.
- Each organism has a characteristic pattern of codon usage.

Problems arising in gene prediction.
• Distinguishing pseudogenes (former genes that no longer work) from genes.
• Exon/intron structure in eukaryotes; exon flanking regions are not very well conserved.
• Exons can be combined in alternative ways (alternative splicing).
• Genes can overlap each other and occur on different strands of DNA.

Gene identification.
• Homology-based gene prediction
  – Similarity searches (e.g. BLAST, BLAT)
  – ESTs
• Ab initio gene prediction
  – Prokaryotes
    • ORF identification
  – Eukaryotes
    • Promoter prediction
    • PolyA-signal prediction
    • Splice site, start/stop-codon predictions

Ab initio gene prediction.
Predictions are based on the observation that the DNA sequence of a gene is not random:
- A gene-coding sequence has start and stop codons.
- Each species has a characteristic pattern of synonymous codon usage.
- Non-coding ORFs are very short.
- A gene would correspond to the longest ORF.
These methods look for the characteristic features of genes and score them high.

Example of ORFs.
There are six possible reading frames in each sequence: three for each direction of transcription.

Gene preference score: an important indicator of a coding region.
Observation: the frequencies of codons and codon pairs differ between coding and non-coding regions.
Given a sequence of codons $C = C_1 C_2 \ldots C_n$ and assuming independence, the probability of finding it in a coding region is
$P(C) = \prod_{i=1}^{n} f(C_i)$, where $f(C_i)$ is the frequency of codon $C_i$ in coding regions.
The probability of finding sequence “C” in non-coding regions is
$P_0(C) = \prod_{i=1}^{n} f_0(C_i)$, where $f_0(C_i)$ is the frequency of codon $C_i$ in non-coding regions.
The gene preference score:
$GPS = \log \left( \frac{P(C)}{P_0(C)} \right)$

Gene prediction accuracy.
True positives (TP): nucleotides that are correctly predicted to be within the gene.
Actual positives (AP): nucleotides that are located within the actual gene.
Predicted positives (PP): nucleotides that are predicted to be in the gene.
Sensitivity = TP / AP
Specificity = TP / PP

The value of genome sequences lies in their annotation.
• Annotation: characterizing genomic features using computational and experimental methods.
• Genes: levels of annotation
  – Gene prediction: where are the genes?
  – What do they encode?
  – What proteins/pathways are they involved in?
Human Genome Project.

Analysis of gene order (synteny).
Genes with a related function are frequently clustered on the chromosome.
Example: the E. coli genes responsible for the synthesis of Trp are clustered, and their order is conserved between different bacterial species.
Operon: a set of genes transcribed simultaneously, with the same direction of transcription.

Structure and stability of globular proteins.
Native proteins are marginally stable.
Scale of interactions in proteins:
- Interactions weaker than kT ~ 0.6 kcal/mol are neglected.
- ΔG ~ 5–20 kcal/mol
[Figure: free energy G along the reaction coordinate, with the unfolded (U) and folded (F) states separated by ΔG.]
Potential energy = Van der Waals + Electrostatic + …

Hydrophobic effect.
Hydrophobic interaction: the tendency of nonpolar compounds to transfer from an aqueous solution to an organic phase.
- The entropy of water molecules decreases when they make contact with a nonpolar surface (TΔS = -9.6 kcal/mol for cyclohexane).
- The effect is entropic because the energy of HB is very high.
- The hydrophobic effect is proportional to the buried surface area; the energy is ~ 20-25 cal/mol/Å^2.

Hierarchy of protein structure.
1. Amino acid sequence
2. Secondary structure
3. Tertiary structure
4. Quaternary structure
Picture from Branden & Tooze, “Introduction to protein structure”.

Protein secondary structure prediction.
Assumptions:
• There should be a correlation between amino acid sequence and secondary structure (SS). A short amino acid sequence is more likely to form one type of SS than another.
• Local interactions determine SS.
  The SS of a residue is determined by its neighbors (usually a sequence window of 13-17 residues is used).
Exceptions: short identical amino acid sequences can sometimes be found in different SS.
Accuracy: 65% - 75%; the highest accuracy is for the prediction of an α helix.

Methods of SS prediction.
• Chou-Fasman method
• GOR (Garnier, Osguthorpe and Robson)
• Neural network method

PHD: a neural network program that uses multiple sequence alignments.
• A BLAST search of the input sequence is performed and similar sequences are collected.
• A multiple alignment of the similar sequences is used as input to a neural network.
• The sequence pattern is enhanced in a multiple alignment compared with using a single sequence as input.

Protein structure prediction. Fold recognition.
Unsolved problem: direct prediction of protein structure from physicochemical principles.
Solved problem: recognizing which of the known folds are similar to the fold of an unknown protein.
Fold recognition is based on these observations/assumptions:
- The overall number of different protein folds is limited (1000-3000 folds).
- The native protein structure is in its ground state (minimum energy).

Protein structure prediction.
Prediction of the three-dimensional structure of a protein from its sequence.
Different approaches:
- Homology modeling (the predicted structure has a very close homolog in the structure database).
- Fold recognition (the predicted structure has an existing fold).
- Ab initio prediction (the predicted structure has a new fold).

Steps of homology modeling.
1. Template recognition & initial alignment.
2. Backbone generation.
3. Loop modeling.
4. Side-chain modeling.
5. Model optimization.

1. Template recognition.
Recognition of similarity between the target and the template.
Target: a protein with unknown structure.
Template: a protein with known structure.
Main difficulty: deciding which template to pick when there are multiple candidate template structures.
A template structure can be found by searching the PDB with sequence-sequence alignment methods.

Fold recognition.
Goal: to find the protein with known structure that best matches a given sequence.
Since the similarity between the target and its closest template is not high, sequence-sequence alignment methods fail.
Solution: threading, a sequence-structure alignment method.

Threading: a method for structure prediction.
Sequence-structure alignment: the target sequence is compared to all structural templates from the database.
Requires:
- An alignment method (dynamic programming, Monte Carlo, …)
- A scoring function, which yields a relative score for each alternative alignment

Scoring function for threading.
• A contact-based scoring function depends on the amino acid types of two residues and the distance between them.
• A sequence-sequence alignment scoring function does not depend on the distance between two residues.
• If the distance between two nonadjacent residues in the template is less than 8 Å, these residues make a contact.

Threading model validation.
• Correct bond lengths and bond angles >> 3.8 Å
• Correct placement of functionally important sites
• Prediction of global topology, not partial alignment (minimum number of gaps)

Classwork II: Homology modeling.
- Go to NCBI Entrez, search for gi461699
- Do a BLAST search against PDB
- Repeat the same for gi60494508
- Predict functionally important sites

Protein engineering and protein design.
Protein engineering: altering a protein sequence to change protein function or structure.
Protein design: designing a de novo protein that satisfies a given requirement.

Stability of mutants compared to wild-type protein.
Measure of stability: the melting temperature, at which 50% of the enzyme is inactivated during reversible heat denaturation. For wild-type, Tm = 42 °C.
• All mutants were more stable than wild-type.
• The longer the loop between the Cys residues, the larger the effect (the more restricted the unfolded state is).
• The more disulfide bonds were introduced, the more stable the mutant was.
From B. Mathews et al.

Can structural scaffolds be reduced in size while maintaining function?
A. Braisted & J.A. Wells used the Z-domain (58 residues) of bacterial protein A:
• removed the third helix (truncated protein: 38 residues);
• mutated residues in the first and second helices;
• used phage display to select active forms;
• restored the binding of the truncated protein.

Designing an amino acid sequence that will fold into a given structure.
• Inverse protein folding problem: designing a sequence that will fold into a given structure. This is much easier than the folding problem!
• B. Dahiyat & S. Mayo designed a sequence for a zinc finger domain that does not require stabilization by Zn.
• The wild-type protein domain is stabilized by Zn (bound to two Cys and two His); the mutant is stabilized by hydrophobic interactions.

Molecular basis of evolution.
Goal: to reconstruct the evolutionary history of all organisms in the form of phylogenetic trees.
Classical approach: phylogenetic trees were constructed based on comparative morphology and physiology.
Molecular phylogenetics: phylogenetic trees are constructed by comparing DNA/protein sequences between organisms.

Mechanisms of evolution.
- By mutations of genes. Mutations spread through the population via genetic drift and/or natural selection.
- By gene duplication and recombination.

Measures of evolutionary distance between amino acid sequences.
1. P-distance.
Evolutionary distance is usually measured by the number of amino acid substitutions.
$p = n_d / n$
$n_d$: number of amino acid differences between two sequences; $n$: number of aligned amino acids.

Poisson correction for evolutionary distance.
2. PC-distance.
Takes into account multiple substitutions and is therefore proportional to divergence time.
The PC-distance can be expressed through the p-distance:
$d = -\ln(1 - p)$

The concept of evolutionary trees.
- Trees consist of nodes and branches; the topology is the branching pattern.
- The length of each branch represents the number of substitutions occurring between two nodes. If the rate of evolution is constant, branches will have the same length (molecular clock hypothesis).
- The distance along the tree is calculated by summing up all intervening branch lengths.
- Trees can be bifurcating (binary) or multifurcating.
- Trees can be rooted or unrooted. The root is placed by including a taxon which is known to branch off earlier than the others.

Accuracies of phylogenetic trees.
Two types of errors:
- Topological error
- Branch length error
Bootstrap test: resample the alignment columns with replacement, recalculate the tree, and count how many times a given topology occurs (the “bootstrap confidence value”). If it is close to 100%, the topology/interior branch is reliable.

Estimation of evolutionary rates in hemoglobin alpha-chains.

                 P-distance   PC-distance   Gamma-distance
Human/cow          0.121        0.129          0.134
Human/kangaroo     0.186        0.205          0.216
Human/carp         0.486        0.665          0.789

Estimate the evolutionary rate of divergence between human and cow (the time of divergence between these groups is ~90 million years).

1. Distance methods.
Calculating branch lengths from distances.

       A       B       C
A    -----    20      40
B    -----   -----    44
C    -----   -----   -----

[Figure: three-taxon tree with branches a, b, and c leading to A, B, and C.]
a + b = 20; a + c = 40; b + c = 44; hence a = 8; b = 12; c = 32.

Neighbor-joining method.
NJ is based on the minimum evolution principle (the sum of branch lengths should be minimized). Given the distance matrix between all sequences, NJ joins the sequences into a tree so as to give estimates of the branch lengths.
1. Starts with the star tree and calculates the sum of branch lengths.
[Figure: star tree with taxa A, B, C, D, E joined at a single central node by branches a, b, c, d, e.]
$d_{AB} = a + b$; $d_{AC} = a + c$; $d_{AD} = a + d$; $d_{AE} = a + e$;
$S = a + b + c + d + e = (d_{AB} + d_{AC} + d_{AD} + d_{AE} + d_{BC} + d_{BD} + d_{BE} + d_{CD} + d_{CE} + d_{DE}) / (N - 1)$

2.1 Maximum parsimony: definition of informative sites.
Maximum parsimony tree: the tree that requires the smallest number of evolutionary changes to explain the differences between external nodes.
Informative site: a site that favors some trees over the others.

Site      1  2  3  4  5  6  7
Seq 1     A  A  G  A  C  T  G
Seq 2     A  G  C  C  C  T  G
Seq 3     A  G  A  T  T  T  C
Seq 4     A  G  A  G  T  T  C
                      *     *

(* marks an informative site.)
A site is informative (for nucleotide sequences) if there are at least two different kinds of letters at the site, each of which is represented in at least two of the sequences.
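
The definition of an informative site can be checked mechanically. The Python sketch below is only an illustration: the function names `is_informative` and `informative_sites` are hypothetical, and the four example sequences assume that the site table reads as one sequence per row across seven sites.

```python
from collections import Counter


def is_informative(column):
    """True if at least two different letters each occur in at least two sequences."""
    counts = Counter(column)
    letters_seen_twice = sum(1 for count in counts.values() if count >= 2)
    return letters_seen_twice >= 2


def informative_sites(sequences):
    """Return the 1-based positions of informative sites in an alignment."""
    columns = zip(*sequences)  # transpose: one tuple of letters per site
    return [pos for pos, col in enumerate(columns, start=1) if is_informative(col)]


# Assumed reading of the site table: four aligned sequences, seven sites each.
alignment = ["AAGACTG", "AGCCCTG", "AGATTTC", "AGAGTTC"]
print(informative_sites(alignment))  # -> [5, 7]
```

Under that reading, only sites 5 and 7 qualify. Site 1 is invariant, sites 2 and 3 each have only one letter that occurs twice or more, and site 4 has four different letters, so none of those sites favors one tree over another.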