Cryptography and Linguistics of Macromolecules
Studying the codes and language of life
George Michaels
April 5, 2004

Cryptography
The practice and study of encryption and decryption: encoding data so that it can only be decoded by specific individuals. A system for encrypting and decrypting data is a cryptosystem. These usually involve an algorithm for combining the original data ("plaintext") with one or more "keys", numbers or strings of characters known only to the sender and/or recipient. The resulting output is known as "ciphertext".

Linguistic analysis
The study of the nature, structure, and variation of language, including phonetics, phonology, morphology, syntax, semantics, sociolinguistics, and pragmatics.

Living systems have context-free and context-dependent patterns of communication.
• Statistics - Frequency - Pattern Analysis
• Lexicon - Predefined concepts
• Linguistic - Syntax - Semantic - Logic
• Concept extraction

Chomsky Hierarchy of Languages
Biological systems

DNA Linguistics
David Searls is probably the first to use sophisticated linguistic methods in the analysis and classification of features in DNA sequences. He also originally suggested the idea that grammars could be used to describe the structure of nonmacromolecules, presenting a very early version of what later became our terminal form. A pleasant introduction to the application of linguistics to DNA sequence analysis can be found in Searls, D. B., 1992. The linguistics of DNA. Am. Scient. 80: 579-591.

DNA & RNA patterns

tRNA grammars

Protein Domain Interactions

Composition analysis
Check for putative sequencing errors
• ORF Finder
• Detection of Frameshift Sequencing Errors - a more sophisticated alignment procedure
Codon usage, composition
• Search Genes and Coding Regions
• codontree: codon usage table, distance matrix and base composition (Pesole, Attimonelli and Liuni).
• codonW: a package for codon usage analysis.
  It was designed to simplify multivariate analysis (MVA) of codon usage (John Peden).
EMBOSS:
• cusp: Create a codon usage table.
• chips: Codon usage statistics.
• codcmp: Codon usage table comparison.
• syco: Synonymous codon usage Gribskov statistic plot.
• wordcount: Counts words of a specified size in a DNA sequence.
• geecee: Calculates the fractional GC content of nucleic acid sequences.

Gene Searching
• Frame Plot Analysis: predicts protein-coding regions of high G+C content bacterial DNA
• ORF Finder: identifies all open reading frames using the standard or alternative genetic codes
• Gene recognition by Spliced Alignment PROCRUSTES: a user-defined exon-intron structure of the genomic sequence is used for computing the quality of prediction.
• Gene Recognition and Assembly Internet Link: Grail Version 1.3
• GENSCAN Web Server: identification of complete gene structures in genomic DNA
• GeneID Web Server: gene identification and gene structure prediction in genomic DNA sequences
• Genie: a gene finder based on generalized hidden Markov models
• Computational Genomics Group of the Sanger Centre: finding genes, promoters, poly-A, splice sites in different organisms
• BCM Gene Finder: splice sites, protein-coding exons and gene model construction, promoter and poly-A region recognition
• MZEF: Human Internal Coding Exon Finder by Michael Zhang
• HMMgene
• Exon Finder
• Splice Site Prediction by Neural Network: finds possible donor and acceptor sites.
• The FINEX project (intron/exon boundary phase and the exon length)
• Pol3scan: searches the eukaryotic polymerase III intragenic control regions A box and B box.
• tRNA analysis: detection of transfer RNAs

Protein 3D structure is determined by the primary amino acid sequence and environment.
3 secondary structural units:
• Coiled
• β sheet
• Random

Pattern Searching
• MEME: motif discovery tool
• Promoter Prediction by Neural Network
• PROMOTER SCAN II: presently limited to mammalian promoter sequences
• WWW Promoter Scan Vers 1.7: predicts promoter regions based on scoring homologies with putative eukaryotic Pol II promoter sequences
• MatInspector V2.0 with TRANSFAC
• PatternSearch with TRANSFAC sites
• Pratt: discovers patterns conserved in sets of unaligned protein sequences.
• FPAT: regular expression searches of sequence databases.
• Search database of consensus sequences from repeat families many Blasts
• RepeatMasker2 Web Server
• fuzznuc: Nucleic acid pattern search (EMBOSS).
• fuzztran: Protein pattern search after translation (EMBOSS).

Alignments
• Fasta: pairwise with gaps
• Smith-Waterman: exhaustive pairwise for optimal alignments
• Multiple Sequence Alignment: CLUSTALW and MSA.
• The USC Sequence Alignment Server
• ToPLign: alignment methods with flexible parameter handling

What is a multiple alignment?
The short answer is this:

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--

What is Multiple Sequence Alignment (MSA)?
MSA is a data mining technique for discovering patterns in a set of comparable sequences.

What kind of information is it sensible to apply MSA to?
• Any kind of information can undergo MSA, as long as it can be modelled as a sequence of symbols of a finite alphabet. The best-known example of such modelling is DNA sequences, whose own physical constitution can be immediately translated to a sequence of letters. Applying MSA techniques to these sequences has resulted in the complete description of the human genome. However, MSA is not limited to DNA sequences.
  Other sequences that can be successfully modelled are proteins, timelines, and many kinds of linguistic sequences.
• Since the purpose of aligning sequences is to discover patterns, it only makes sense to align those kinds of information that can be partitioned into different, comparable sequences, and where recurrent patterns can be found.

What is the outcome of MSA?
• In general terms, an MSA process results in a set of aligned sequences, usually with a calculation of the relative similarity among them, and a model of the alignment, usually with some score of its reliability. This model conveys the recurrences found in the set of sequences, and can be expressed in many forms: as a sequence profile that synthesizes the major commonalities between all sequences of the set, as a lattice that accounts for the possible compositions of sequences, etc.

How does it work?
• MSA is usually implemented as an extension of pairwise alignment. The input to pairwise alignment is two sequences and a similarity criterion or scoring function that describes the similarity between the different symbols that constitute the sequences. The alignment algorithm determines the highest-scoring way to perform insertions, deletions and changes to transform one of the sequences into the other. This results in a new sequence containing the initial pair.

RNA & DNA structure
• MFOLD server (RNA): computes optimal and suboptimal foldings of an RNA sequence. © Copyright 1995, Michael Zuker, Washington University Medical School
• MFOLD server (DNA): computes optimal and suboptimal foldings of a DNA sequence. © Copyright 1995, Michael Zuker, Washington University Medical School
• EFN server: computes the folding energy of an RNA or DNA secondary structure. © Copyright 1995, Michael Zuker, Washington University Medical School

Network circuit analysis

Membrane Receptors
Proposed an atomic level structure for a lattice of serine (Tsr) receptors in coliform bacteria (Shimizu et al., 2000).
The structure is a regular two-dimensional lattice in which the cytoplasmic ends of chemotactic receptor dimers insert into a hexagonal array of CheA and CheW molecules.

Molecular Interaction Space
The Cell is Full of Molecular Machines
Published in Trends in Biochemical Sciences

Reading list
Gusfield, D. (1997) Algorithms on Strings, Trees and Sequences, Cambridge University Press.
Searls, D. B. (1992) The linguistics of DNA. Am. Scient. 80: 579-591.
Bailey, N.T.J. (1964) The Elements of Stochastic Processes with Applications to the Natural Sciences, Wiley.
Taylor, H.M. and Karlin, S. (1994) An Introduction to Stochastic Modelling (Revised Edition), Academic Press.
Renshaw, E. (1991) Modelling Biological Populations in Space and Time, CUP.
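
Worked sketch: pairwise alignment
The "How does it work?" slide describes pairwise alignment as finding the highest-scoring set of insertions, deletions and changes that turns one sequence into the other. A minimal sketch of that idea is the classic Needleman-Wunsch global alignment below. The function name global_alignment and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration; they do not come from the slides, and real tools such as CLUSTALW use substitution matrices and more elaborate gap penalties.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment of two strings.

    Returns (score, aligned_a, aligned_b). The default scores are
    illustrative assumptions, not values taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] against nothing: all gaps
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] against nothing: all gaps

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # match or change
                score[i - 1][j] + gap,      # gap in b (deletion)
                score[i][j - 1] + gap,      # gap in a (insertion)
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, top, bottom = global_alignment("ACGT", "AGT")
    print(s)
    print(top)
    print(bottom)
```

Multiple alignment programs typically build on this step by aligning sequences (or already-aligned groups of sequences) pair by pair, which is the "extension of pairwise alignment" the slide refers to. The Smith-Waterman method listed under Alignments uses the same kind of table but looks for the best local, rather than global, match.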
