Download TM review

CS177 Review/Summary of the Madej lectures Tom Madej 12.07.05 Overview • Basic biology. • Protein/DNA sequence comparison. • Protein structure comparison/classification. • NCBI databases overview. • Miscellaneous topics. Lodish et al. Molecular Cell Biology, W.H. Freeman 2000 Protein/DNA sequence comparison • What is the meaning of a sequence alignment? • Scoring methods; amino acid substitution matrices, PSSMs. • Basic computational methods; e.g. BLAST. • Know how to run PSI-BLAST, interpret the results. Homology “… whenever statistically significant sequence or structural similarity between proteins or protein domains is observed, this is an indication of their divergent evolution from a common ancestor or, in other words, evidence of homology.” E.V. Koonin and M.Y. Galperin, Sequence – Evolution – Function, Kluwer 2003 A simple phylogenetic tree… Human hemoglobin and more distantly related globins • Human and horse • Human and fish • Human and insect • Human and bacteria Alignment notation: different notations for the same alignment! VISDWNMPN-------MDGLE CILVV----AANDGPMPQTRE VISDWnm---pnMDGLE CILVVaandgpmPQTRE Computing sequence alignments • You must be able to recognize the “answer” (correct alignment) when you see it (scoring system). • You must be able to find the answer; i.e. compute it efficiently. Scoring and computing alignments • “Position independent” amino acid substitution tables; e.g. BLOSUM62. • Global alignment algorithms such as Smith-Waterman (dynamic programming); or fast heuristics such as BLAST. Score this alignment: VISDWnm---pnMDGLE CILVVaandgpmPQTRE Use: BLOSUM62 matrix; gap opening penalty 10; gap extension penalty 1 (-1 + 4 – 2 – 3 – 3) –10 – 1*11 + (-2 + 0 – 2 – 2 + 5) = -27 BLAST (Basic Local Alignment Search Tool) • Extremely fast, can be on the order of 50-100 times faster than Smith-Waterman. • Method of choice for database searches. • Statistical theory for significance of results (extreme value distribution). • Heuristic; does not guarantee optimal results. • Many variants, e.g. PHI-, PSI-, RPS-BLAST. Why database searches? • Gene finding. • Assigning likely function to a gene. • Identifying regulatory elements. • Understanding genome evolution. • Assisting in sequence assembly. • Finding relations between genes. Issues in database searches • Speed. • Relevance of the search results (selectivity). • Recovering all information of interest (sensitivity). – The results depend on the search parameters, e.g. gap penalty, scoring matrix. – Sometimes searches with more than one matrix should be performed. E-values, P-values • E-value, Expectation value; this is the expected number of hits of at least the given score, that you would expect by random chance for the search database. • P-value, Probability value; this is the probability that a hit would attain at least the given score, by random chance for the search database. • E-values are easier to interpret than P-values. • If the E-value is small enough, e.g. no more than 0.10, then it is essentially a P-value. PSI-BLAST • Position Specific Iterated BLAST • As a first step runs a (regular) BLAST. • Hits that cross the threshold are used to construct a position specific score matrix (PSSM). • A new search is done using the PSSM to find more remotely related sequences. • The last two steps are iterated until convergence. PSSM (Position Specific Score Matrix) • One column per residue in the query sequence. • Per-column residue frequencies are computed so that log-odds scores may be assigned to each residue type in each column. • There are difficulties; e.g. pseudo-counts are needed if there are not a lot of sequences, the sequences must be weighted to compensate for redundancy. Two key advantages of PSSMs • More sensitive scoring because of improved estimates of probabilities for a.a.’s at specific positions. • Describes the important motifs that occur in the protein family and therefore enhances the selectivity. Position Specific Substitution Rates Weakly conserved serine Active site serine Position Specific Score Matrix (PSSM) 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 D G V I D S C N G D S G G P L N C Q A A R N 0 -2 0 -2 -1 0 -1 1 -3 -3 3 -3 -2 -5 0 4 -4 -4 -4 -7 -6 -2 0 2 -2 -3 -3 -5 -5 -2 -2 -4 -2 -3 -6 -4 -3 -6 -4 -2 -6 -6 -4 -6 -7 Active -1 -6 0 0 -4 -5 0 1 4 -1 -1 1 D C Q E G H I L K M F P 2 -4 2 4 -4 -3 -5 -4 0 -2 -6 1 -2 -4 -3 -3 6 -4 -5 -5 0 -2 -3 -2 -3 -5 -1 -2 6 -1 -4 -5 1 -5 -6 -4 -4 -6 0 -1 -4 -1 2 -4 6 -2 -5 -5 8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 -7 12 -7 -7 -5 -6 -5 -5 -7 -5 0 -7 Serine scored differently -1 -6 7 0 -2 0 -6 -4 2 0 -2 -5 in-4these -4 -4 -5 7 two -4 -7 positions -7 -5 -4 -4 -6 9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7 9 -7 -5nucleophile -5 -6 -7 0 -1 6 -6 1 0 -6 site -6 -4 -4 -6 -6 -1 3 0 -5 4 -3 -6 -5 10 -2 -5 -5 1 -1 -1 -5 0 -1 -4 2 -5 2 0 0 0 -4 -2 1 0 0 0 3 -4 -1 1 4 -3 -4 -3 -1 -2 -2 -3 S 0 -2 0 -3 1 4 -4 -1 -3 -4 7 -4 -2 -4 -6 -2 -1 -1 0 T -1 -1 -2 0 -3 3 -4 -3 -5 -4 -2 -5 -4 -4 -5 -1 0 -1 -2 W -6 0 -6 -1 -7 -6 -5 -3 -6 -8 -6 -6 -6 -7 -5 -6 -5 -3 -2 Y -4 -6 -4 -4 -5 -5 0 -4 -6 -7 -5 -7 -7 -7 -4 -1 0 -3 -2 V -1 -5 -2 0 -6 -3 -4 -3 -6 -7 -5 -7 -7 -6 0 6 0 -4 -3 PSI-BLAST key points • The first PSSM is constructed from all hits that cross the significance threshold using “standard” BLAST. • The search is then carried out with the PSSM to draw in new significant hits. • If new hits are found then a new PSSM is constructed; these last two steps are iterated. • The computation terminates upon “convergence”, i.e. when no new sequences are found to cross the significance threshold. Protein structure comparison/classification • Protein secondary structure elements. • Supersecondary structures (simple structure motifs). • Folds and domains. • Comparing structures (VAST). • Superfolds. • Fold classification (SCOP). • Conserved Domain Database (CDD). α-helix (3chy) backbone atoms with sidechains Parallel β-strands (3chy) Anti-parallel β-strands (1hbq) Higher level organization • A single protein may consist of multiple domains. Examples: 1liy A, 1bgc A. The domains may or may not perform different functions. • Proteins may form higher-level assemblies. Useful for complicated biochemical processes that require several steps, e.g. processing/synthesis of a molecule. Example: 1l1o chains A, B, C. Supersecondary structures • β-hairpin • α-hairpin • βαβ-unit • β4 Greek key • βα Greek key Supersecondary structure: simple units G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981 Supersecondary structure: Greek key motifs G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981 Protein folds • There is a continuum of similarity! • Fold definition: two folds are similar if they have a similar arrangement of SSEs (architecture) and connectivity (topology). Sometimes a few SSEs may be missing. • Fold classification: To get an idea of the variety of different folds, one must adjust for sequence redundancy and also try to correctly assign homologs that have low sequence identity (e.g. below 25%). Vector Alignment Search Tool (VAST) • Fast structure comparison based on representing SSEs by vectors. • A measure of statistical significance (VAST E-value) is computed (very differently from a BLAST E-value). • VAST structure neighbor lists useful for recognizing structural similarity. Superfolds (Orengo, Jones, Thornton) • Distribution of fold types is highly non-uniform. • There are about 10 types of folds, the superfolds, to which about 30% of the other folds are similar. There are many examples of “isolated” fold types. • Superfolds are characterized by a wide range of sequence diversity and spanning a range of non-similar functions. • It is a research question as to the evolutionary relationships of the superfolds, i.e. do they arise by divergent or convergent evolution? Superfolds and examples • • • • • Globin 1hlm sea cucumber hemoglobin; 1cpcA phycocyanin; 1colA colicin α-up-down 2hmqA hemerythrin; 256bA cytochrome B562; 1lpe apolipoprotein E3 Trefoil 1i1b interleukin-1β; 1aaiB ricin; 1tie erythrina trypsin inhibitor TIM barrel 1timA triosephosphate isomerase; 1ald aldolase; 5rubA rubisco OB fold 1quqA replication protein A 32kDa subunit; 1mjc major coldshock protein; 1bcpD pertussis toxin S5 subunit • • • • • α/β doubly-wound 5p21 Ras p21; 4fxn flavodoxin; 3chy CheY Immunoglobulin 2rhe BenceJones protein; 2cd4 CD4; 1ten tenascin UB αβ roll 1ubq ubiquitin; 1fxiA ferredoxin; 1pgx protein G Jelly roll 2stv tobacco necrosis virus; 1tnfA tumor necrosis factor; 2ltnA pea lectin Plaitfold (Split αβ sandwich) 1aps acylphosphatase; 1fxd ferredoxin; 2hpr histidine-containing phosphocarrier Fold classification (when you have the structure…) • First, look up PubMed abstracts for any relevant papers. E.g. if this is from a PDB file there will be references in it. • Try checking SCOP or CATH. • Look at VAST neighbors. See if the structure in question is highly similar to another structure with a known fold. SCOP (Structural Classification of Proteins) • http://scop.mrc-lmb.cam.ac.uk/scop/ • Levels of the SCOP hierarchy: – Family: clear evolutionary relationship – Superfamily: probable common evolutionary origin – Fold: major structural similarity Bioinformatics databases • Entrez is by far the most useful, because of the links between the individual databases, e.g. literature, sequence, structure, taxonomy, etc. • Other specialty databases available on the internet can also be very useful, of course! Links Between and Within Nodes Word weight Computational PubMed abstracts 3 -D 3-D Structures Structure Taxonomy Phylogeny VAST Computation Genomes Nucleotide BLAST Computationalsequences Protein sequences BLAST Computationa Entrez queries • Be able to formulate queries using index terms (Preview/Index), and limits. Exercises! • How many protein structures are there that include DNA and are from bacteria? • In PubMed, how many articles are there from the journal Science and have “Alzheimer” in the title or abstract, and “amyloid beta” anywhere? How many since the year 2000? • Notice that the results are not 100% accurate! • In 3D Domains, how many domains are there with no more than two helices and 8 to 10 strands and are from the mouse? P53 tumor suppressor protein • Li-Fraumeni syndrome; only one functional copy of p53 predisposes to cancer. • Mutations in p53 are found in most tumor types. • p53 binds to DNA and stimulates another gene to produce p21, which binds to another protein cdk2. This prevents the cell from progressing thru the cell cycle. G. Giglia-Mari, A. Sarasi, Hum. Mutat. (2003) 21 217-228. Exercise! • Use Cn3D to investigate the binding of p53 to DNA. • Formulate a query for Structure that will require the DNA molecules to be present (there are 2 structures like this). Miscellaneous topics • BLAST a sequence against a genome; locate hits on chromosomes with map viewer. • Obtain genomic sequence with map viewer. • Spidey to predict intron/exon structure (but we won’t use spidey on the exam!). • How sequence variations can affect protein structure/function. “EST exercise” summary • BLAST the EST (or other DNA seq) against the genome. • From the BLAST output you can get the genomic coordinates of any nucleotide differences. • Use map viewer to locate the hit on a chromosome; assume the hit is in the region of a gene. • By following the gene link you can get an accession for mRNA. • By using the “dl” link you can get an accession for the genomic sequence. • Use “spidey” with the mRNA and genomic sequence to locate changed residues in the protein. “EST exercise” summary (cont.) • From the gene report you can follow the protein link, and then “Blink”. • From the BLAST link page you can get to CDD and related structures. • Since you know where are the changed residues you can use the structures to study what effect the changes might have on the function of the protein. Gene variants that can affect protein function • Mutation to a stop codon; truncates the protein product! • Insertion/deletion of multiple bases; changes the sequence of amino acid residues. • Single point change could alter folding properties of the protein. • Single point change could affect the active site of the protein. • Single point change could affect an interaction site with another molecule. Important note! • Most diseases (e.g. cancer) are complex and involve multiple factors (not just a single malfunctioning protein!).

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download TM review