Three main topics for this Intro lecture

Simple Rearrangements

Reversals
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
• Blocks represent conserved genes.

Reversals
1, 2, 3, -8, -7, -6, -5, -4, 9, 10
• Blocks represent conserved genes.
• In the course of evolution or in a clinical context, blocks 1,…,10 could be misread as 1, 2, 3, -8, -7, -6, -5, -4, 9, 10.

Types of Rearrangements
• Reversal: 1 2 3 4 5 6 → 1 2 -5 -4 -3 6
• Translocation: (1 2 3) (4 5 6) → (1 2 6) (4 5 3)
• Fusion: (1 2 3) (4 5 6) → (1 2 3 4 5 6)
• Fission: (1 2 3 4 5 6) → (1 2 3) (4 5 6)

Sorting by reversals: 5 steps
Step 0: π   2 -4 -3  5 -8 -7 -6  1
Step 1:     2  3  4  5 -8 -7 -6  1
Step 2:     2  3  4  5  6  7  8  1
Step 3:     2  3  4  5  6  7  8 -1
Step 4:    -8 -7 -6 -5 -4 -3 -2 -1
Step 5: γ   1  2  3  4  5  6  7  8

Sorting by reversals: 4 steps
Step 0: π   2 -4 -3  5 -8 -7 -6  1
Step 1:     2  3  4  5 -8 -7 -6  1
Step 2:    -5 -4 -3 -2 -8 -7 -6  1
Step 3:    -5 -4 -3 -2 -1  6  7  8
Step 4: γ   1  2  3  4  5  6  7  8
• What is the reversal distance for this permutation? Can it be sorted in 3 steps?
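To make the reversal operation concrete, here is a minimal Python sketch that reverses a segment of a signed permutation (order reversed, signs flipped) and replays the 4-step sorting scenario from the slide above. The function names and the 0-based, end-exclusive index convention are illustrative choices, not part of the lecture.

```python
def reverse_segment(perm, i, j):
    """Return a copy of perm with perm[i:j] reversed and every sign flipped.

    Indices are 0-based and j is exclusive (an assumption of this sketch).
    """
    return perm[:i] + [-x for x in reversed(perm[i:j])] + perm[j:]


def apply_reversals(perm, steps):
    """Apply a list of (i, j) reversals in order and return the result."""
    for i, j in steps:
        perm = reverse_segment(perm, i, j)
    return perm


# Blocks 1..10 with blocks 4..8 reversed, as on the "Reversals" slide.
misread = reverse_segment(list(range(1, 11)), 3, 8)

# The 4-step scenario for the permutation 2 -4 -3 5 -8 -7 -6 1.
pi = [2, -4, -3, 5, -8, -7, -6, 1]
four_steps = [(1, 3), (0, 4), (4, 8), (0, 5)]
sorted_pi = apply_reversals(pi, four_steps)

print(misread)
print(sorted_pi)
```

Replaying a scenario like this only checks that a given sequence of reversals sorts the permutation; it does not prove that no shorter scenario exists.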
From Signed to Unsigned Permutation (Continued)
• Construct the breakpoint graph as usual
• Notice the alternating cycles in the graph between every other vertex pair
• Since these cycles came from the same signed vertex, we will not be performing any reversal on both pairs at the same time; therefore, these cycles can be removed from the graph
0 5 6 10 9 15 16 12 11 7 8 14 13 17 18 3 4 1 2 19 20 22 21 23

Reversal Distance with Hurdles
• Hurdles are obstacles in the genome rearrangement problem
• They increase the number of reversals required to transform a permutation into the identity permutation
• Let h(π) be the number of hurdles in permutation π, and c(π) the number of cycles in its breakpoint graph
• Taking hurdles into account, the following formula gives a tighter bound on reversal distance:
  d(π) ≥ n + 1 − c(π) + h(π)

Median Problem
• Goal: find M so that D(A,M) + D(B,M) + D(C,M) is minimized
• NP-hard for most metric distances

Genome Enumeration for Multichromosome Genomes

Genome Enumeration
• For genomes on genes {1,2,3}
[Figure: enumeration tree whose branches are labeled with signed genes and "$"; the leaves shown include ‹ 1, 2, 3 ›, ‹ 1, 2, -3 ›, ‹ 1, 2 › ‹ 3 ›, ‹ 1, 2 › ‹ -3 ›, ...]
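The signed-to-unsigned transformation discussed earlier in this section can be sketched as follows. The sketch assumes the usual convention that +x becomes the pair (2x−1, 2x), −x becomes (2x, 2x−1), and the result is framed by 0 and 2n+1. The signed permutation used in the example is inferred from the vertex sequence on the slide, so treat it as an assumption rather than as the lecturer's stated input.

```python
def signed_to_unsigned(perm):
    """Map a signed permutation of 1..n to an unsigned one on 0..2n+1.

    Assumed convention: +x -> (2x-1, 2x), -x -> (2x, 2x-1),
    with 0 prepended and 2n+1 appended.
    """
    n = len(perm)
    out = [0]
    for x in perm:
        if x > 0:
            out.extend([2 * x - 1, 2 * x])
        else:
            out.extend([-2 * x, -2 * x - 1])
    out.append(2 * n + 1)
    return out


# Signed permutation inferred from the slide's vertex sequence (assumption).
signed = [3, -5, 8, -6, 4, -7, 9, 2, 1, 10, -11]
unsigned = signed_to_unsigned(signed)
print(unsigned)
```

Each signed element contributes two consecutive vertices, which is why the edges inside such a pair never take part in a reversal and can be dropped from the breakpoint graph.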
Rearrangement Phylogeny
Compute A Given Tree (Start)
Compute A Given Tree (First Median)
Compute A Given Tree (Second Median)
Compute A Given Tree (Third Median)
Compute A Given Tree (After 1st Iteration)
Binary Encoding
MLBE Sequences

Experimental Results (Equal Content)
• 80% inversion, 20% transposition

An Example: New Genomes [figure]

Jackknifing Rate

Support Value Threshold - FP
• Up to 90% of false positives (FP) can be identified with 85% as the threshold

Jackknife Properties
• Jackknifing is necessary and useful for gene order phylogeny, and a large number of errors can be identified
• 40% jackknifing rate is reasonable
• 85% is a conservative threshold; 75% can also be used
• Low support branches should be examined in detail

Protein In-silico Biochemistry
• Online servers exist to determine many properties of your protein sequences
  • Molecular weight
  • Extinction coefficients
  • Half-life
• It is also possible to simulate protease digestion
• All these analysis programs are available on www.expasy.ch

Analyzing Local Properties
• Many local properties are important for the function of your protein
  • Hydrophobic regions are potential transmembrane domains
  • Coiled-coil regions are potential protein-interaction domains
  • Hydrophilic stretches are potential loops
• You can discover these regions
  • Using sliding-window techniques (easy)
  • Using prediction methods such as hidden Markov models (more sophisticated)

Sliding-window Techniques
• Ideal for identifying strong signals
• Very simple methods
• Few artifacts
• Not very sensitive
• Use ProtScale on www.expasy.org
• Make the window the same size as the feature you’re looking for
www.expasy.org/cgi-bin/protscale.pl
[Figure: ProtScale hydrophobicity plot]
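The sliding-window idea from the slide above can be sketched in a few lines of Python. This is not ProtScale's code: it averages the Kyte-Doolittle hydropathy scale (swapped in here as a familiar example scale) over a fixed window, and the default window of 19 and the cutoff of 1.6 are assumptions often quoted for transmembrane helices, not values taken from the lecture.

```python
# Kyte-Doolittle hydropathy values, one per standard amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sliding_window_profile(seq, window=19):
    """Return (center_index, mean_score) for every full window in seq.

    center_index is 0-based. Sequences shorter than the window
    yield an empty profile.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in seq]
    profile = []
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        profile.append((start + window // 2, mean))
    return profile


def hydrophobic_centers(seq, window=19, cutoff=1.6):
    """Centers of windows whose mean score exceeds the (assumed) cutoff."""
    return [pos for pos, mean in sliding_window_profile(seq, window) if mean > cutoff]


# Toy sequence: a hydrophobic stretch flanked by charged residues.
toy = "KKKKKKKKKK" + "LLLLLLLLLLLLLLLLLLL" + "DDDDDDDDDD"
print(hydrophobic_centers(toy))
```

Matching the window to the size of the feature matters: a window much shorter than a transmembrane helix gives a noisy profile, while a much longer one smears the signal into the flanking regions.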
Transmembrane Domains
• Discovering a transmembrane domain tells you a lot about your protein
• Many important receptors have 7 transmembrane domains
• Transmembrane segments can be found using ProtScale
• The most accurate predictions come from using TMHMM

Using TMHMM
• TMHMM is the best method for predicting transmembrane domains
• TMHMM uses an HMM
• Its principle is very different from that of ProtScale
• TMHMM output is a prediction

TMHMM vs. ProtScale
>sp|P78588|FREL_CANAX Probable ferric reductase transmembrane component OS=Candida albicans GN=CFL1 PE=3 SV=1
MTESKFHAKYDKIQAEFKTNGTEYAKMTTKSSSGSKTSTSASKSSKSTGSSNASKSSTNA
HGSNSSTSSTSSSSSKSGKGNSGTSTTETITTPLLIDYKKFTPYKDAYQMSNNNFNLSIN
YGSGLLGYWAGILAIAIFANMIKKMFPSLTNNLSGSISNLFRKHLFLPATFRKKKAQEFS
IGVYGFFDGLIPTRLETIIVVIFVVLTGLFSALHIHHVKDNPQYATKNAELGHLIADRTG
ILGTFLIPLLILFGGRNNFLQWLTGWDFATFIMYHRWISRVDVLLIIVHAITFSVSDKAT
GKYKNRMKRDFMIWGTVSTICGGFILFQAMLFFRRKCYEVFFLIHIVLVVFFVVGGYYHL
ESQGYGDFMWAAIAVWAFDRVVRLGRIFFFGARKATVSIKGDDTLKIEVPKPKYWKSVAG
GHAFIHFLKPTLFLQSHPFTFTTTESNDKIVLYAKIKNGITSNIAKYLSPLPGNTATIRV
LVEGPYGEPSSAGRNCKNVVFVAGGNGIPGIYSECVDLAKKSKNQSIKLIWIIRHWKSLS
WFTEELEYLKKTNVQSTIYVTQPQDCSGLECFEHDVSFEKKSDEKDSVESSQYSLISNIK
QGLSHVEFIEGRPDISTQVEQEVKQADGAIGFVTCGHPAMVDELRFAVTQNLNVSKHRVE
YHEQLQTWA
Search with accession number P78588
http://www.uniprot.org/uniprot/
www.cbs.dtu.dk/services/TMHMM-2.0

Predicting Post-translational Modifications
• Post-translational modifications often occur on similar motifs in different proteins
• PROSITE is a database containing a list of known motifs, each associated with a function or a post-translational modification
• You can search PROSITE by looking for each motif it contains in your protein (the server does that for you!)
• PROSITE entries come with extensive documentation on each function of the motif

Searching for PROSITE Patterns
• Search your protein against PROSITE on ExPASy
  • www.expasy.org/tools/scanprosite
• PROSITE motifs are written as patterns
• Short patterns are not very informative by themselves
  • They only indicate a possibility
  • Combine them with other information to draw a conclusion
• Remember: Not everything is in PROSITE!
www.expasy.org/tools/scanprosite
P12259

Interpreting PROSITE Patterns
• Check the pattern function: Is it compatible with the protein?
• Sometimes patterns suggest nonexistent protein features
  • For instance: If you find a myristoylation pattern in a prokaryote, ignore it; prokaryotic proteins have no myristoylation!
• Short patterns are more informative if they are conserved across homologous sequences
  • In that case, you can build a multiple-sequence alignment
  • This slide shows an example

Patterns and Domains
• Patterns are usually the most striking feature of the more general motifs (called domains)
• Domains are less conserved than patterns but usually longer
• In proteins, domain analysis is gradually replacing pattern analysis

Protein Domains
• Proteins are usually made of domains
• A domain is an autonomous folding unit
• Domains are more than 50 amino acids long
• It’s common to find these together:
  • A regulatory domain
  • A binding domain
  • A catalytic domain

Discovering Domains
• Researchers discover domains by
  • Comparing proteins that have similar functions
  • Aligning those proteins
  • Identifying conserved segments
• A domain is a multiple-sequence alignment formulated as a profile
• For each column, a domain indicates which amino acid is more likely to occur

Domain Collections
• Scientists have been discovering and characterizing protein domains for more than 20 years
• 8 collections of domains have been established
  • Manual collections are very precise but small
  • Automatic collections are very extensive but less informative
• These collections
  • Overlap
  • Have been assembled by different scientists
  • Have different strengths and weaknesses
• We recommend using them all!

The Magnificent 8
• Pfam is the most extensive manual collection
• Pfam is often used as a reference

Searching Domain Collections
• Domains in Pfam often include known functions
• A match between your protein and a domain is desirable
  • A match is a potential indication of a function
  • This is VERY informative for further research!
• Three servers exist to compare proteins and domain collections:
  • InterProScan www.ebi.ac.uk/interproscan
  • CD-Search (Conserved Domain Search) www.ncbi.nlm.nih.gov
  • Motif Scan www.ch.embnet.org

Using InterProScan
• InterProScan is the most comprehensive search engine for domain databases
• Makes it possible to compare alternative results on most collections
• Does not provide a statistical score
>sp|P53539|FOSB_HUMAN Protein fosB OS=Homo sapiens GN=FOSB PE=1 SV=1
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTA
ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS
GGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRRRELT
DRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRD
LPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSY
TSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL
www.ebi.ac.uk/InterProScan

The CD-Search Output
• CD-Search is less extensive than InterProScan
• Results come with a statistical evaluation (E-value)
  • 10e-15: low E-value, good match
  • 2.1: high E-value, bad match
www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

Predicting Functions with Domains
• Finding a match with a domain having a catalytic function is good news . . . but what, exactly, does it mean?
• A match indicates that your sequence has the domain structure . . . but does it also have the function?
• You cannot say before looking into these details:
  • Where are the catalytic residues on the domain?
  • Does your sequence have the right residues at these positions?

Looking into the Details
• Catalytic residues are normally highly conserved in domains
• Motif Scan makes it possible to check whether these important residues are conserved in your sequence
  • High bar above 0 = Highly conserved residues
  • Green = Your sequence has an expected residue
  • Red = Your sequence has an unexpected residue

Looking into the Details (cont’d.)
• R (arginine) is highly expected at this position: high bar, potential active site
• If your protein has an arginine at this position, the bar is filled with green: your protein could be active
myhits.isb-sib.ch/cgi-bin/motif_scan

Protein 3D Structure

Primary, Secondary and Tertiary Structures
• Proteins are made of 20 amino acids
• Proteins are on average 400 amino acids long
• Protein structure has 3 levels:
  • The primary structure is the sequence of a protein
  • The secondary structure is the local structure
  • The tertiary structure is the exact position of each atom on a 3D model

Secondary Structures
• Helix
  • Amino-acid chain that twists like a spring
• Beta strand or extended
  • Amino-acid chain that forms a line without twisting
• Random coils
  • Amino-acid chain with a structure neither helical nor extended
  • Amino-acid loops are usually coils

Guessing the Secondary Structure of Your Protein
• Secondary structure predictions are good
• If your protein has enough homologues, expect 80% accuracy
• The most accurate secondary structure prediction server is PSIPRED

PSIPRED Output
• Conf = Confidence
  • 9 is the best, 0 the worst
• Pred = Every amino acid is assigned a letter:
  • C for coils
  • E for extended or beta-strand
  • H for helix
>gi|15892329|ref|NP_360043.1| translocation protein TolB [Rickettsia conorii str. Malish 7]
MRNIIYFILSLLFSVTSYALETINIEHGRADPTPIAVNKFDADNSAADVLGHDMVKVISNDLKLSGLFRP
ISAASFIEEKTGIEYKPLFAAWRQINASLLVNGEVKKLESGKFKVSFILWDTLLEKQLAGEMLEVPKNLW
RRAAHKIADKIYEKITGDAGYFDTKIVYVSESSSLPKIKRIALMDYDGANNKYLTNGKSLVLTPRFARSA
DKIFYVSYATKRRVLVYEKDLKTGKESVVGDFPGISFAPRFSPDGRKAVMSIAKNGSTHIYEIDLATKQL
HKLTDGFGINTSPSYSPDGKKIVYNSDRNGVPQLYIMNSDGSDVQRISFGGGSYAAPSWSPRGDYIAFTK
ITKGDGGKTFNIGIMKACPQDDENSERIITSGYLVESPCWSPNGRVIMFAKGWPSSAKAPGKNKIFAIDL
TGHNEREIMTPADASDPEWSGVLN
bioinf.cs.ucl.ac.uk/psipred//?program=psipred

Predicting Other Secondary Features
• It is also possible to predict these accurately:
  • Transmembrane segments
  • Solvent accessibility
  • Globularity
  • Coiled-coil regions
• All these predictions have an expected accuracy higher than 70%

Servers
• www.predictprotein.org
• cubic.bioc.columbia.edu/predictprotein
• www.sdsc.edu/predicprotein
• www.cbi.pku.edu.cn/predictprotein

Predicting 3D Structures
• Predicting 3D structures from sequences only is almost impossible
• The only reliable way to establish the 3D structure of a protein is to run a real-world experiment using
  • X-ray crystallography
  • Nuclear magnetic resonance (NMR)
• Structures established this way are stored in the PDB database
• “The PDB of my protein” is synonymous with “The structure of my protein”

Retrieving Protein Structures from PDB
• All PDB entries are 4-letter words!
  • 1CRZ, 2BHL . . .
• Sometimes the chain identifier is added:
  • 1CRZA, 1CRZB . . .
• To access all PDB entries, go to www.rcsb.org
• PDB contains 42,000 entries
• PDB contains the structure of 16,000 unique proteins or RNAs
• You can download the coordinates and display the structure
www.rcsb.org

Displaying a PDB Structure
• You can use any of the online viewers to display the structure
• They will let you rotate the structure, zoom in and out, or color it
• PDB files themselves are not human-readable

Predicting the Structure of Your Protein
• The bad news:
  • It is very hard to predict protein 3D structures
• The good news:
  • Similar proteins have similar structures
  • If your favorite protein has a homologue with a known structure . . .
  • You can do homology modeling
• How?
  • Start with a BLAST (more about that in the next slide)
ncbi.nlm.nih.gov/BLAST

BLASTing PDB for Structures
• BLAST your protein against PDB
• If you get a very good hit, it means PDB contains a protein similar to yours
• Your protein and this hit probably have the same structure

Be Careful!
• Sometimes only one of the domains contained in your protein has been characterized
• If that’s the case, the PDB will only contain this domain
• Always check the alignments
  • Red line = full protein in PDB
  • Blue line = one domain only in this entry

Structures and Sequences
• Highly conserved sequences are often important in the structure
• Make a multiple-sequence alignment to identify these important positions
• Highly conserved positions are either in the core or important for protein/protein interactions

3D Predictions
• If you want to predict the structure of your protein automatically, try Swiss Model
• Swiss Model runs the BLAST for you
• The program does a bit of homology modeling
• The process delivers a new structure in PDB format
• You can access it at swissmodel.expasy.org
• Swiss Model gives good results for proteins having homologues in PDB
zhanglab.ccmb.med.umich.edu/I-TASSER/

3D-BLAST
• Use this technique if you have a structure and you want to find other similar structures
• Use VAST or DALI to look for proteins having the same 3D shape as yours
  • www.ebi.ac.uk/dali
  • www.ncbi.nlm.nih.gov/vast

3D Movements
• Most proteins need to move to do their job
• Predicting protein movement is possible using molecular dynamics
• Check out this site: molmovdb.mbb.yale.edu
• Good molecular dynamics requires extremely powerful computers
• Don’t expect miracles from standard online resources
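The "Structures and Sequences" advice (use a multiple-sequence alignment to find highly conserved positions) can be illustrated with a small Python sketch. It scores each alignment column by the fraction of sequences sharing the most common residue. The toy alignment, the function names, and the threshold are all hypothetical, and real analyses typically use substitution-aware conservation scores instead.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences carrying the most common residue in each column.

    alignment is a list of equal-length aligned sequences; '-' marks a gap
    and never counts as a residue.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def conserved_positions(alignment, threshold=1.0):
    """0-based columns whose conservation score reaches the threshold."""
    return [i for i, s in enumerate(column_conservation(alignment)) if s >= threshold]


# Hypothetical toy alignment, for illustration only.
toy_alignment = ["MKTAY", "MRTAF", "MKSAY"]
print(conserved_positions(toy_alignment))
```

Columns flagged this way are only candidates for core or interface residues; mapping them onto a known structure is what shows where they actually sit.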
