Basic Overview of Bioinformatics Tools and Biocomputing Applications I
Dr Tan Tin Wee
Director, Bioinformatics Centre

Software Tools
• Data stored in retrievable forms in database systems
• Data generated by machines, DNA / Protein sequencers, automated systems
[Diagram labels: Automated Machines; Research Labs; Biological Data; Databases; Analytical Tools; New Knowledge]

Common Computational Analyses
• Sequence Assembly
• Simple sequence analysis
  – Translation and reverse complement, ORF
  – Composition statistics (protein & DNA)
  – Molecular mass
  – Total charge and pI; local hydropathy
  – Simple determination of secondary structures
  – Restriction site analysis
  – Internal repeat analysis
• Detection of active sites, functional residues, characteristic structures, substrates, and processing signals

Common Computational Analyses
• Database sequence search
• Multiple alignment
• Secondary (2°) and tertiary (3°) structure prediction; transmembrane helix detection
• Structure modeling
• Docking prediction and design
• Hidden Markov model searches

Sequence Assembly
• Fragmented data from DNA sequencers
• Detection of overlap
• Merging of contigs
• Assembly into continuous sequence
[Figure: fragments assembled into one continuous sequence, labelled 5' to 3']

Sequence Format Interconversion
• DNA/Protein and other sequence data come in different formats
• Annotations
• Different programs use different formats
• Interconversion utility tools
• e.g. READSEQ, TOGCG, TOSTADEN, etc.

Simple Sequence Analysis
1. Linear sequence, e.g. DNA / Protein
2. Open a window: n = 1, n = variable, or n = sliding
3. Calculate based on list of criteria

Some Simple Sequence Analysis Applications
• DNA complementary strand, e.g. COMPLEMENT & REVERSE
  – Open window of size 1
  – A ---> T
  – C ---> G
  – T ---> A
  – G ---> C
  – Slide to next window of 1
  – Proceed to end of sequence
  – Reverse order of complement

  5' ...ATCTCGATACTACTACG...3'
        |||||||||||||||||
  3' ...TAGAGCTATGATGATGC...5'

Some Simple Sequence Analysis Applications
• DNA to Protein sequence translation, e.g. TRANSLATE
  – Open window of 3 bases
  – Look up the codon in the codon table (genetic code)
  – Assign amino acid residue
  – Slide window to next 3 bases
  – Proceed till stop codon detected
  – Repeat whole procedure for six frames

  ATACTACTGAGATCTAGGCTAGTACTGCGTGCG
  Frame 1, Frame 2, Frame 3; Complement: Frames 4-6

Some Simple Sequence Analysis Applications
• Detect Open Reading Frame, e.g. ORF
  – Translate sequence, report long stretches between start and stop codons
• Compositional analysis
  – e.g. Calculate total A, T, G, C
  – e.g. Calculate total molecular mass of protein, analyse percentages of amino acids
  – e.g. Total charge composition, pI

Some Simple Sequence Analysis Applications
• Simple prediction of secondary structure of protein sequence
  – Decide a window size
  – Compute for each window of amino acids the statistical potential to form helix, beta sheet, turn, etc. (Chou-Fasman, GOR, etc. algorithms)
  – Use a statistical potential chart
  – Plot potentials in graphical or pictorial format

Some Simple Sequence Analysis Applications
• Restriction Mapping, e.g. MAP, MAPPLOT, MAPSORT, PLASMIDMAP, etc.
  – Table of restriction enzymes and cut sites, e.g. EcoRI, BamHI, AluI and their cut sites, e.g. GAATTC, AATT
  – Take a DNA sequence
  – Pattern match against the list of cut sites
  – For each match, assign restriction enzyme map
  – Calculate distance between cut sites
  – Display in table, graphical, or restriction map, etc.
[Figure labels: gel; Plasmid]

Some Simple Sequence Analysis Applications
• Protein sequence motif pattern matching, e.g. PROSITEMAP, MOTIFS, BLOCKS, etc.
  – Table/database of sequence patterns/motifs and their signature sequence, e.g. Arg-Gly-Asp (RGD), or consensus sequence (e.g. PROSITE, BLOCKS db)
  – Take protein sequence
  – Pattern match against the list of signature sites
  – For each match, assign potential function according to database
  – Display in table or graphically, or hyperlinked

Some Simple Sequence Analysis Applications
• Peptide Cleavage Maps, e.g. PEPTIDESORT, PEPTIDEMAP
  – Table of protease vs cleavage sites, e.g. trypsin, chymotrypsin, and chemical cleavage sites, e.g. cyanogen bromide
  – Pattern match with entire protein sequence
  – Calculate size of peptide fragments
  – Sort and map, plot as electrophoretic patterns on a log-linear simulated digest
  – Compute partial digest patterns

Some Simple Sequence Analysis Applications
• DOTPLOT - self-comparison
  – Take a window size
  – Compare against entire length of own sequence
  – Report matches above a threshold
  – Plot on graph
  – Slide window, repeat till end of sequence
  – Detection of internal repeats
• Pairwise comparison - detection of homology
[Figure: dot plot, axis label "Sequence A"]

Some Simple Sequence Analysis Applications
• RNA secondary structure analysis
• Mfold, PlotFold, FoldRNA, Squiggles, Circles, Domes, Mountains, StemLoop
• Folding of RNA into stems, loops
• Calculation of energy - prediction of stability of structure
• Display of structure and alternatives
[Figure: stem-loop drawing; base pairs A--U, U--A, G--C, C--G close a loop, flanked by ...AUCGA and AUCUC...]

Database Searching
• Text-based Database Searching: using a text string to match an annotation in a sequence database record, i.e. keyword search
• Sequence-based Database Searching: using a biological sequence to match the whole or parts of its sequence against the sequence of every database record

Text-based Database Searching
• Examples: Entrez, SRS, DBGET, AceDB - common integrated database systems
• Search Concepts
  – Boolean search - AND, OR, NOT
  – Broadening the search
  – Narrowing the search
  – Proximity searching, soundex
  – Wild card, stemming, e.g. Thala* for thalasemia, thalassemia, thalassemic
• Use standard string search algorithms and Boolean operations, vocabulary matches

Text-based Database Searching
• Example: To find the human homolog of the Drosophila per gene
• Procedure
  – Web to Entrez
  – All Fields: enter "human" "per"
  – Hits returned, irrelevant - broaden search
  – "human" "period" - more hits
  – Check every one, find the human RIGUI gene
• Hit and miss, clever guess work; free form or controlled vocabulary (MeSH terms)? Use Boolean searches?

Sequence-based Database Searching
• Homology search
• Global or local sequence alignment
• Needleman-Wunsch algorithm
• Smith-Waterman algorithm
• Lipman-Pearson FASTA
• Altschul's BLAST
• Take a sequence, pairwise comparison with each sequence in the database

Sequence-based Database Searching
• Basic Assumptions:
• Sequences of homologous genes/proteins diverge over time even though structure and/or function change little
• Significant sequence similarity is inferred as potential structural/functional similarity or common evolutionary origin
• Based on a well-characterised protein, infer the function of an unknown sequence at gene or protein sequence level

Sequence-based Database Searching
• Global alignment forces the two input sequences to be aligned over their complete length
• Local alignment looks for local stretches of similarity and tries to align the most similar segments
• Algorithms used may be similar, but the outputs differ; statistics are needed to assess results

Sequence-based Database Searching
• Alignment scoring
• Substitution score and substitution matrix: PAM, BLOSUM
• Affine gap costs / gap penalty and gap scores
• Optimal alignments, dynamic programming: Needleman-Wunsch algorithm, Smith-Waterman algorithm (SSEARCH)
• Additional heuristics - FASTA, BLAST
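
The dynamic programming idea behind the optimal alignment algorithms named above can be sketched in a few lines of Python. This is a minimal illustration, not the SSEARCH implementation. The scoring values (match = 1, mismatch = -1, gap = -2) are assumptions chosen for readability, standing in for a PAM or BLOSUM matrix, and the gap cost is linear rather than affine. The function names are hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    # Row 0: aligning an empty prefix of `a` against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of local segments."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at zero lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
    print(local_alignment_score("GGGACGTCCC", "TTACGTAA"))
```

The two functions share one recurrence. The global version reads its answer from the last cell, while the local version floors every cell at zero and reports the best cell anywhere, which is why the outputs differ even though the algorithms are similar.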
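
Going back to the COMPLEMENT & REVERSE and TRANSLATE procedures in the simple sequence analysis slides, the sliding-window steps can be sketched as follows. This is an illustrative sketch, not the code of those programs. It assumes the standard genetic code, upper-case input containing only A, C, G and T, and hypothetical function names.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Window of size 1: complement each base, then reverse the order."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def translate(dna, stop_at_stop=True):
    """Window of size 3: look up each codon, optionally halting at a stop."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[i:i + 3]]
        if residue == "*" and stop_at_stop:
            break
        protein.append(residue)
    return "".join(protein)


def six_frames(dna):
    """Frames 1-3 on the given strand, frames 4-6 on its reverse complement."""
    strands = (dna, reverse_complement(dna))
    return [
        translate(strand[offset:], stop_at_stop=False)
        for strand in strands
        for offset in range(3)
    ]


if __name__ == "__main__":
    print(reverse_complement("ATCTCGATACTACTACG"))
    for number, frame in enumerate(six_frames("ATACTACTGAGATCTAGGCTAGTACTGCGTGCG"), 1):
        print(number, frame)
```

In `six_frames`, stop codons are kept as `*` rather than ending the translation, so that an ORF search could scan each frame for long stretches between stops.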
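
The restriction mapping procedure from the earlier slide (pattern match a table of sites against a DNA sequence, then compute the distances between matches) can be sketched in the same way. This is a hedged sketch, not the MAP program. The table holds the recognition sequences of the three enzymes named on the slide. Positions are 0-based starts of the recognition site, not the exact cut position within it, and the function names are hypothetical.

```python
# Recognition sequences for the enzymes named on the slide.
RECOGNITION_SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "AluI": "AGCT"}


def find_sites(dna, sites=RECOGNITION_SITES):
    """Return sorted (position, enzyme) pairs for every site occurrence."""
    hits = []
    for enzyme, pattern in sites.items():
        start = dna.find(pattern)
        while start != -1:
            hits.append((start, enzyme))
            # Search again from the next base so overlapping sites are found.
            start = dna.find(pattern, start + 1)
    return sorted(hits)


def distances_between_sites(hits):
    """Distances between consecutive site positions on a linear sequence."""
    positions = [position for position, _ in hits]
    return [b - a for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    hits = find_sites("TTGAATTCAAGGATCCTTAGCTGAATTC")
    for position, enzyme in hits:
        print(position, enzyme)
    print(distances_between_sites(hits))
```

The same loop applies to the protein motif and peptide cleavage slides if the table is swapped for signature patterns or protease cleavage sites, although real motif databases use richer pattern syntax than exact strings.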