Cryptography and Linguistics of Macromolecules
Studying the codes and language of life
George Michaels
April 5, 2004
<cryptography>
The practise and study of encryption and
decryption - encoding data so that it can only be
decoded by specific individuals.
A system for encrypting and decrypting data is a
cryptosystem. These usually involve an algorithm
for combining the original data ("plaintext") with
one or more "keys" - numbers or strings of
characters known only to
the sender and/or recipient. The resulting output is
known as "ciphertext".
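As a minimal sketch of the plaintext / key / ciphertext relationship described above, the toy cryptosystem below XORs the data with a repeating key. The function name `xor_cipher` and the sample values are illustrative assumptions, and a repeating-key XOR is not secure; it only shows that the same key maps plaintext to ciphertext and back.

```python
from itertools import cycle


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the repeating key.

    Applying the function twice with the same key restores the input,
    so it serves as both the encryption and the decryption algorithm.
    """
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


plaintext = b"ATGGCCATTGTAATG"   # the original data
key = b"secret"                  # known only to sender and recipient
ciphertext = xor_cipher(plaintext, key)
recovered = xor_cipher(ciphertext, key)
```

Without the key, `ciphertext` is an opaque byte string; with it, `recovered` equals `plaintext`.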
Linguistic analysis
The study of the nature, structure, and variation
of language, including phonetics, phonology,
morphology, syntax, semantics, sociolinguistics,
and pragmatics.
Living systems have context-free and context-dependent patterns of
communication.
Statistics
- Frequency
- Pattern Analysis
Lexicon
- Predefined concepts
Linguistic
- Syntax
- Semantic
- Logic
Concept extraction
Chomsky Hierarchy of Languages
Biological systems ∈
DNA Linguistics
David Searls is probably the first to use
sophisticated linguistic methods in the analysis and
classification of features in DNA sequences. He
also originally suggested the idea that grammars
could be used to describe the structure of nonmacromolecules, presenting a very early version of
what later became our terminal form.
A pleasant introduction to the application of
linguistics to DNA sequence analysis can be found
in Searls, D. B., 1992. The linguistics of DNA. Am.
Scient. 80: 579-591.
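To illustrate why grammars are a natural fit for sequence features, consider a stem-loop (hairpin): its two arms are reverse complements of each other, a nested dependency that a context-free rule such as S -> x S x' | loop captures directly. The sketch below is an illustrative recognizer for that one rule, not Searls's formalism; the function names and the `min_stem` / `min_loop` thresholds are assumptions chosen for the example.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def stem_length(seq: str, min_loop: int = 3) -> int:
    """Count nested complementary pairs peeled from both ends.

    Each peeled pair is one application of the rule S -> x S x';
    the unpaired bases left in the middle are the loop.
    """
    i, j = 0, len(seq) - 1
    pairs = 0
    # Only pair positions i and j if at least min_loop bases stay between them.
    while j - i - 1 >= min_loop and COMPLEMENT.get(seq[i]) == seq[j]:
        i, j = i + 1, j - 1
        pairs += 1
    return pairs


def is_stem_loop(seq: str, min_stem: int = 3, min_loop: int = 3) -> bool:
    """True if seq is a stem of >= min_stem pairs around a loop of >= min_loop bases."""
    return stem_length(seq.upper(), min_loop) >= min_stem
```

For example, `is_stem_loop("GCGCTTTTGCGC")` accepts a four-pair stem around a TTTT loop, while `is_stem_loop("GCGCTTTTAAAA")` rejects a sequence whose ends are not complementary.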
DNA & RNA patterns
tRNA grammars
Protein Domain Interactions
Composition analysis
Check for Putative Sequencing Errors
- ORF Finder
- Detection of Frameshift Sequencing Errors - a more sophisticated alignment procedure
Codon usage, composition
- Search Genes and Coding Regions
- codontree: codon usage table, distance matrix and bases composition (Pesole, Attimonelli and Liuni).
- codonW: a package for codon usage analysis. It was designed to simplify Multivariate Analysis (MVA) of codon usage (John Peden).
- EMBOSS:
  - cusp: Create a codon usage table.
  - chips: Codon usage statistics.
  - codcmp: Codon usage table comparison.
  - syco: Synonymous codon usage Gribskov statistic plot.
  - wordcount: Counts words of a specified size in a DNA sequence.
  - geecee: Calculates the fractional GC content of nucleic acid sequences.
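The composition statistics that these tools report can be sketched in a few lines. The functions below are illustrative stand-ins written for this example (their names and signatures are assumptions, not the EMBOSS or codonW interfaces): a fractional GC content, a codon count table for one reading frame, and a count of overlapping words of a given size.

```python
from collections import Counter


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_counts(seq: str, frame: int = 0) -> Counter:
    """Count complete codons read in the given frame (0, 1 or 2)."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))


def word_counts(seq: str, size: int) -> Counter:
    """Count every overlapping word of the given size."""
    seq = seq.upper()
    return Counter(seq[i:i + size] for i in range(len(seq) - size + 1))
```

For instance, `codon_counts("ATGGCCATGTAA")` counts ATG twice and GCC and TAA once each, which is the raw material for a codon usage table.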
Gene Searching
- Frame Plot Analysis predicts protein-coding region of high G+C content bacterial DNA
- ORF Finder identifies all open reading frames using the standard or alternative genetic codes
- Gene recognition by Spliced Alignment PROCRUSTES - User-defined exon-intron structure of genomic sequence is used for computing the quality of prediction.
- Gene Recognition and Assembly Internet Link Grail Version 1.3
- GENSCAN Web Server Identification of complete gene structures in genomic DNA
- GeneID Web Server Gene Identification and Gene Structure Prediction in Genomic DNA Sequences
- Genie: A Gene Finder Based on Generalized Hidden Markov Models
- Computational Genomics Group of the Sanger Centre Finding genes, promoters, poly-A, splice sites in different organisms
- BCM Gene Finder Splice sites, Protein coding exons and Gene model construction, Promoter and poly-A region recognition
- MZEF Human Internal Coding Exon Finder by Michael Zhang
- HMMgene Exon Finder
- Splice Site Prediction by Neural Network finds possible donor and acceptor sites.
- The FINEX project (intron/exon boundary phase and the exon length)
- Pol3scan searches the eukaryotic polymerase III intragenic control regions A box and B box.
- tRNA analysis detection of transfer RNAs
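The simplest of these searches, open reading frame detection, can be sketched as follows. This is a minimal illustration assuming the standard genetic code (ATG start; TAA, TAG, TGA stops) and hypothetical function names; it is not the implementation used by any of the servers listed above, and it reports "-" strand coordinates relative to the reverse-complemented sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}          # standard genetic code stop codons
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int = 2):
    """Return (frame, start, end) for each ATG...stop ORF on one strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts
    the codons before the stop, including the ATG.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a reading frame
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    found.append((frame, start, i + 3))
                start = None                    # close it at the stop
    return found


def find_orfs(seq: str, min_codons: int = 2):
    """Scan all six frames; '-' coordinates refer to the reverse complement."""
    seq = seq.upper()
    result = [("+", f, s, e) for f, s, e in orfs_in_strand(seq, min_codons)]
    result += [("-", f, s, e)
               for f, s, e in orfs_in_strand(reverse_complement(seq), min_codons)]
    return result
```

Calling `find_orfs("CCATGAAATTTTAGCC")` reports one ORF on the "+" strand in frame 2, spanning ATG AAA TTT TAG.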
Protein 3D structure is determined by the primary amino acid sequence and
environment.
3 secondary structural units:
• Coiled
• β sheet
• Random
Pattern Searching
- MEME Motif discovery tool
- Promoter Prediction by Neural Network
- PROMOTER SCAN II Presently it is limited to mammalian promoter sequences
- WWW Promoter Scan Vers 1.7 predicts Promoter regions based on scoring homologies with putative eukaryotic Pol II promoter sequences
- MatInspector V2.0 with TRANSFAC
- PatternSearch with TRANSFAC sites
- Pratt discovers patterns conserved in sets of unaligned protein sequences.
- FPAT Regular Expression Searches of Sequence Databases.
- Search database of consensus sequences from repeat families
- many Blasts
- RepeatMasker2 Web Server
- fuzznuc: Nucleic acid pattern search (EMBOSS).
- fuzztran: Protein pattern search after translation (EMBOSS)
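Regular-expression style pattern searches of the kind listed above can be sketched with Python's `re` module. The example assumes patterns written in IUPAC nucleotide ambiguity codes; the helper names `iupac_to_regex` and `find_pattern` are hypothetical and do not reproduce the pattern syntax of fuzznuc or FPAT.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(pattern: str) -> str:
    """Translate an IUPAC pattern such as 'TATAWA' into a regex string."""
    return "".join(IUPAC[ch] for ch in pattern.upper())


def find_pattern(seq: str, pattern: str):
    """Return (position, matched text) for every hit, including overlaps."""
    # A lookahead lets the scan restart one base after each hit,
    # so overlapping occurrences are not skipped.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(seq.upper())]
```

For example, `find_pattern("GGTATAAAGGTATATAT", "TATAWA")` finds TATAAA at position 2 and TATATA at position 10.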
Alignments
- FASTA: pairwise with gaps
- Smith-Waterman: exhaustive pairwise for optimal alignments
- Multiple Sequence Alignment: CLUSTALW and MSA.
- The USC Sequence Alignment Server
- ToPLign: alignment methods with flexible parameter handling
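The exhaustive local alignment named above can be sketched as a small dynamic program. This is a textbook Smith-Waterman with a linear gap penalty; the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are arbitrary assumptions for illustration, and production tools use substitution matrices and affine gap costs instead.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero: a local alignment may start anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

With these scores, `smith_waterman("TTTACGTAAA", "GGGACGTCCC")` picks out the shared ACGT core and ignores the unrelated flanks.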
What is a multiple alignment ?
The short answer is this –
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--
What is Multiple Sequence Alignment (MSA)?
MSA is a data mining technique for discovering patterns in a set of comparable
sequences.
What kind of information is it sensible to apply MSA to?
Any kind of information can undergo MSA, as long as it can be modelled as a sequence of
symbols from a finite alphabet. The best-known example of such modelling is DNA, whose
physical constitution translates directly into a sequence of letters. Applying MSA techniques
to these sequences has contributed to the description of the human genome. However, MSA is
not limited to DNA sequences. Other sequences that can be successfully modelled include
proteins, timelines, and many kinds of linguistic sequences.
Since the purpose of aligning sequences is to discover patterns, it only makes sense to align
those kinds of information that can be partitioned into different, comparable sequences, and
where recurrent patterns can be found.
What is the outcome of MSA?
In general terms, an MSA process results in a set of aligned sequences, usually with a calculation
of the relative similarity among them, and a model of the alignment, usually with some score of
its reliability. This model conveys the recurrences found in the set of sequences and can be
expressed in many forms: as a sequence profile that synthesizes the major commonalities
between all sequences of the set, as a lattice that accounts for the possible compositions of
sequences, etc.
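One of the model forms mentioned above, the sequence profile, can be sketched as a per-column symbol count over an existing alignment. The helper names and the toy alignment are assumptions made for this example, and the consensus rule (most frequent non-gap symbol per column, first seen wins ties) is one simple choice among many.

```python
from collections import Counter


def build_profile(alignment):
    """Return one Counter of symbols per alignment column."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(row[c] for row in alignment) for c in range(width)]


def consensus(profile, gap: str = "-") -> str:
    """Most frequent non-gap symbol per column; a gap if the column is all gaps."""
    out = []
    for column in profile:
        residues = {sym: n for sym, n in column.items() if sym != gap}
        out.append(max(residues, key=residues.get) if residues else gap)
    return "".join(out)


toy_alignment = ["ACG-T", "ACGAT", "TCG-T"]   # illustrative data only
profile = build_profile(toy_alignment)
summary = consensus(profile)
```

Here `summary` is "ACGAT": the profile keeps the full column counts, while the consensus collapses each column to its dominant symbol.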
How does it work?
MSA is usually implemented as an extension of pairwise alignment. The input to pairwise
alignment is a pair of sequences and a similarity criterion or scoring function that describes the
similarity between the different symbols that constitute the sequences. The alignment algorithm
determines the highest-scoring way to perform insertions, deletions and changes to transform one
of the sequences into the other. This results in a new sequence containing the initial pair.
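The pairwise step described above can be sketched as a global alignment by dynamic programming (the Needleman-Wunsch scheme, named here as one standard way to do it; the text does not say which algorithm is meant). The scoring function and gap penalty are assumptions for illustration: +1 for identical symbols, -1 for a change, -1 per inserted or deleted symbol.

```python
def needleman_wunsch(a: str, b: str, score=lambda x, y: 1 if x == y else -1, gap: int = -1):
    """Return (alignment score, a with gaps, b with gaps) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap                 # a prefix of a aligned to nothing
    for j in range(1, cols):
        F[0][j] = j * gap                 # a prefix of b aligned to nothing
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # match or change
                F[i - 1][j] + gap,                            # deletion from a
                F[i][j - 1] + gap,                            # insertion into a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Aligning "GATTACA" with "GATACA" yields a score of 5: six matched symbols and a single gap. Several gap placements tie for that score, so the exact gapped strings depend on the traceback order.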
RNA & DNA structure
MFOLD server (RNA) © Copyright 1995, Michael Zuker, Washington University Medical School - Computes optimal and suboptimal foldings of an RNA sequence
MFOLD server (DNA) © Copyright 1995, Michael Zuker, Washington University Medical School - Computes optimal and suboptimal foldings of a DNA sequence
EFN server © Copyright 1995, Michael Zuker, Washington University Medical School - Computes the folding energy of an RNA or DNA secondary structure
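To give a feel for how secondary-structure prediction works, the sketch below uses the much simpler Nussinov scheme, which maximises the number of nested base pairs. This is a deliberate substitution: mfold minimises free energy using thermodynamic parameters, which this example does not attempt. The allowed pairs (Watson-Crick plus G-U wobble), the `min_loop` of 3 unpaired bases, and the function name are assumptions for illustration.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs (Nussinov-style dynamic program)."""
    rna = rna.upper().replace("T", "U")
    n = len(rna)
    if n == 0:
        return 0
    # N[i][j] holds the best pair count for the subsequence rna[i..j].
    N = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = N[i][j - 1]                      # case 1: base j stays unpaired
            for k in range(i, j - min_loop):        # case 2: base j pairs with k
                if (rna[k], rna[j]) in PAIRS:
                    left = N[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + N[k + 1][j - 1])
            N[i][j] = best
    return N[0][n - 1]
```

For the hairpin "GGGAAACCC" the function finds three G-C pairs closing an AAA loop; a sequence too short to form a loop, such as "GC", scores zero.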
Network circuit analysis
Membrane Receptors
Shimizu et al. (2000) proposed an atomic-level structure for a lattice of serine (Tsr) receptors in
coliform bacteria. The structure is a regular two-dimensional lattice in which the cytoplasmic ends
of chemotactic receptor dimers insert into a hexagonal array of CheA and CheW molecules.
Molecular Interaction Space
The Cell is Full of Molecular Machines
Published in Trends in Biochemical Sciences
Reading list
Gusfield, D. (1997) Algorithms on Strings, Trees and Sequences, Cambridge University Press.
Searls, D. B. (1992) The linguistics of DNA. Am. Scient. 80: 579-591.
Bailey, N.T.J. (1964) The Elements of Stochastic Processes with Applications to the Natural Sciences, Wiley.
Taylor, H.M. and Karlin, S. (1994) An Introduction to Stochastic Modelling (Revised Edition), Academic Press.
Renshaw, E. (1991) Modelling Biological Populations in Space and Time, CUP.