Cryptography and Linguistics of
Macromolecules
Studying the codes and language of life
George Michaels
April 5, 2004
<cryptography>
The practice and study of encryption and
decryption - encoding data so that it can only be
decoded by specific individuals.
A system for encrypting and decrypting data is a
cryptosystem. These usually involve an algorithm
for combining the original data ("plaintext") with
one or more "keys" - numbers or strings of
characters known only to the sender and/or
recipient. The resulting output is known as
"ciphertext".
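As a rough illustration of the definition above, the sketch below combines a plaintext with a key to produce a ciphertext, and then reverses the step with the same key. It uses a repeating-key XOR, chosen only because it is short; it is a toy and not a secure cipher. The function name `xor_cipher`, the sample DNA string, and the key `helix` are all made up for this example.

```python
from itertools import cycle


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Combine data with a repeating key using XOR (toy cipher, not secure)."""
    if not key:
        raise ValueError("key must not be empty")
    # XOR is its own inverse, so the same call encrypts and decrypts.
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


plaintext = b"ATGGCCATTGTAATGGGCCGC"  # illustrative message
key = b"helix"                        # hypothetical shared secret

ciphertext = xor_cipher(plaintext, key)
recovered = xor_cipher(ciphertext, key)
print(ciphertext.hex())
print(recovered.decode())
```

Anyone holding the key can recover the plaintext; without it, the ciphertext is just bytes, although a repeating key this short would fall quickly to the frequency statistics discussed later in these slides.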
Linguistic analysis
The study of the nature, structure, and variation
of language, including phonetics, phonology,
morphology, syntax, semantics, sociolinguistics,
and pragmatics.
Living systems have context-free and context-dependent
patterns of communication.
• Statistics
  - Frequency
  - Pattern Analysis
• Lexicon
  - Predefined concepts
• Linguistic
  - Syntax
  - Semantic
  - Logic
• Concept extraction
Chomsky Hierarchy of Languages
[Figure: the Chomsky hierarchy of languages, with biological systems marked]
DNA Linguistics
David Searls was probably the first to use
sophisticated linguistic methods in the analysis and
classification of features in DNA sequences. He
also originally suggested the idea that grammars
could be used to describe the structure of
nonmacromolecules, presenting a very early version of
what later became our terminal form.
A pleasant introduction to the application of
linguistics to DNA sequence analysis can be found
in Searls, D. B., 1992. The linguistics of DNA. Am.
Scient. 80: 579-591.
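To make the grammar idea concrete, here is a minimal sketch of a recognizer for one small context-free rule set, S -> x S x' | loop, where x' is the base complementary to x. The rule describes a stem-loop (hairpin): each stem base must match a partner on the far side of the loop. A regular expression cannot express this nested dependency for stems of arbitrary length. The rule set, the function names and the thresholds are assumptions made for this example and are not taken from Searls' paper.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def stem_length(seq: str, min_loop: int = 3) -> int:
    """Count how many times the rule S -> x S x' applies from the outside in."""
    pairs = 0
    i, j = 0, len(seq) - 1
    # Keep pairing the outermost bases while they are complementary and
    # at least min_loop bases would remain between them for the loop.
    while j - i - 1 >= min_loop and PAIR.get(seq[i]) == seq[j]:
        pairs += 1
        i += 1
        j -= 1
    return pairs


def is_hairpin(seq: str, min_stem: int = 3, min_loop: int = 3) -> bool:
    """True if seq derives from the grammar with a stem of at least min_stem pairs."""
    return stem_length(seq.upper(), min_loop) >= min_stem


print(is_hairpin("GGGCAAAAGCCC"))  # stem GGGC pairs with GCCC around loop AAAA
print(is_hairpin("GGGCAAAAGGGC"))  # second position breaks the pairing
```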
DNA & RNA patterns
tRNA grammars
Protein Domain Interactions
Composition analysis
• Check for Putative Sequencing Errors
  - ORF Finder
  - Detection of Frameshift Sequencing Errors - a more sophisticated alignment procedure
• Codon usage, composition
  - Search Genes and Coding Regions
  - codontree: codon usage table, distance matrix and bases composition (Pesole, Attimonelli and Liuni).
  - codonW: a package for codon usage analysis. It was designed to simplify Multivariate Analysis (MVA) of codon usage (John Peden).
  - EMBOSS:
    - cusp: Create a codon usage table.
    - chips: Codon usage statistics.
    - codcmp: Codon usage table comparison.
    - syco: Synonymous codon usage Gribskov statistic plot.
    - wordcount: Counts words of a specified size in a DNA sequence.
    - geecee: Calculates the fractional GC content of nucleic acid sequences.
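The composition statistics these tools report are simple to compute by hand. The sketch below shows the underlying arithmetic for three of them: fractional GC content, a codon count table, and counts of fixed-size words. It does not reproduce the EMBOSS programs, their options or their output formats; the function names and the sample coding sequence are invented for illustration.

```python
from collections import Counter


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(cds: str) -> Counter:
    """Count codons in frame 0, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def word_count(seq: str, size: int) -> Counter:
    """Count every overlapping word of the given size."""
    seq = seq.upper()
    return Counter(seq[i:i + size] for i in range(len(seq) - size + 1))


cds = "ATGGCCGCCTAA"  # illustrative coding sequence: ATG GCC GCC TAA
print(gc_fraction(cds))
print(codon_usage(cds).most_common())
print(word_count(cds, 2).most_common(3))
```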
Gene Searching
• Frame Plot Analysis: predicts protein-coding regions of high G+C content bacterial DNA
• ORF Finder: identifies all open reading frames using the standard or alternative genetic codes
• Gene recognition by Spliced Alignment (PROCRUSTES): user-defined exon-intron structure of genomic sequence is used for computing the quality of prediction
• Gene Recognition and Assembly Internet Link (Grail), Version 1.3
• GENSCAN Web Server: identification of complete gene structures in genomic DNA
• GeneID Web Server: gene identification and gene structure prediction in genomic DNA sequences
• Genie: a gene finder based on generalized hidden Markov models
• Computational Genomics Group of the Sanger Centre: finding genes, promoters, poly-A and splice sites in different organisms
• BCM Gene Finder: splice sites, protein-coding exons and gene model construction, promoter and poly-A region recognition
• MZEF: human internal coding exon finder by Michael Zhang
• HMMgene Exon Finder
• Splice Site Prediction by Neural Network: finds possible donor and acceptor sites
• The FINEX project (intron/exon boundary phase and the exon length)
• Pol3scan: searches the eukaryotic polymerase III intragenic control regions A box and B box
• tRNA analysis: detection of transfer RNAs
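The simplest idea in this list, finding open reading frames, can be sketched in a few lines. The example below scans the three frames of each strand for an ATG followed in frame by a stop codon of the standard genetic code. It is a bare-bones illustration and not the algorithm of the ORF Finder service: it ignores alternative genetic codes and alternative start codons, and all names and the `min_codons` threshold are assumptions for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int = 2):
    """Yield (frame, start, end) for each ATG...stop run in one strand.

    start and end are 0-based, end-exclusive, and include the stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield frame, start, i + 3
                start = None


def find_orfs(seq: str, min_codons: int = 2):
    """Search both strands; '-' coordinates refer to the reverse complement."""
    seq = seq.upper()
    found = [("+", *hit) for hit in orfs_in_strand(seq, min_codons)]
    found += [("-", *hit)
              for hit in orfs_in_strand(reverse_complement(seq), min_codons)]
    return found


print(find_orfs("CCATGAAATTTTAGGG"))
```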
Protein 3D structure is determined by the primary amino
acid sequence and environment.
3 secondary structural units:
• Coiled
• β sheet
• Random
Pattern Searching
• MEME: motif discovery tool
• Promoter Prediction by Neural Network
• PROMOTER SCAN II: presently limited to mammalian promoter sequences
• WWW Promoter Scan Vers 1.7: predicts promoter regions based on scoring homologies with putative eukaryotic Pol II promoter sequences
• MatInspector V2.0 with TRANSFAC
• PatternSearch with TRANSFAC sites
• Pratt: discovers patterns conserved in sets of unaligned protein sequences
• FPAT: regular expression searches of sequence databases
• Search database of consensus sequences from repeat families
• Many BLASTs
• RepeatMasker2 Web Server
• fuzznuc: nucleic acid pattern search (EMBOSS)
• fuzztran: protein pattern search after translation (EMBOSS)
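Several of these tools come down to matching a degenerate pattern against a sequence. A minimal sketch of that idea, assuming the pattern is written in IUPAC nucleotide ambiguity codes, is to translate it into a regular expression and scan with Python's `re` module. This is not how fuzznuc or FPAT are implemented, and it does not support mismatches; the function names and the example motif are illustrative.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern: str) -> str:
    """Translate an IUPAC pattern such as TATAWAW into a regex string."""
    return "".join(IUPAC[ch] for ch in pattern.upper())


def find_motif(pattern: str, seq: str) -> list:
    """Return 0-based start positions of all matches, overlapping ones included."""
    # Wrapping the pattern in a lookahead makes each match zero-width,
    # so the scan can report hits that overlap one another.
    regex = re.compile("(?=" + iupac_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(seq.upper())]


print(iupac_to_regex("TATAWAW"))
print(find_motif("TATAWAW", "GGCTATAAAAGGCTATATATGC"))
```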
Alignments
• FASTA: pairwise with gaps
• Smith-Waterman: exhaustive pairwise for optimal alignments
• Multiple Sequence Alignment: CLUSTALW and MSA
• The USC Sequence Alignment Server
• ToPLign: alignment methods with flexible parameter handling
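For reference, the core of the Smith-Waterman method fits in a short dynamic-programming loop. The sketch below fills the score matrix with a linear gap penalty and reports the best local score and where that alignment ends. The scoring values are arbitrary choices for this example; production tools use substitution matrices and affine gap penalties, and also trace back to recover the aligned region.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -2):
    """Return (best_score, end_in_a, end_in_b) of the best local alignment.

    end_in_a and end_in_b are 1-based positions of the last aligned symbols.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = (0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 is what makes the alignment local:
            # a poor prefix is dropped instead of dragging the score down.
            score[i][j] = max(0, diag,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best[0]:
                best = (score[i][j], i, j)
    return best


print(smith_waterman("TTTACGTAAA", "GGACGTCC"))
```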
What is a multiple alignment?
The short answer is this:
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--
What is Multiple Sequence Alignment (MSA)?
MSA is a data mining technique for discovering patterns in a set of comparable
sequences.
What kind of information is it sensible to apply MSA to?
Any kind of information can undergo MSA, as long as it can be modelled as a sequence of
symbols of a finite alphabet. The best-known examples of such modelling are DNA sequences,
whose own physical constitution can be immediately translated to a sequence of letters. Applying
alignment techniques to these sequences has contributed to the complete description of the human
genome. However, MSA is not limited to DNA sequences. Other sequences that can be
successfully modelled include proteins, timelines, and many kinds of linguistic sequences.
Since the purpose of aligning sequences is to discover patterns, it only makes sense to align
those kinds of information that can be partitioned into different, comparable sequences, and where
recurrent patterns can be found.
What is the outcome of MSA?
In general terms, an MSA process results in a set of aligned sequences, usually with a calculation
of the relative similarity among them, and a model of the alignment, usually with some score of
its reliability. This model conveys the recurrences found in the set of sequences, and can be
expressed in many forms: as a sequence profile that synthesizes the major commonalities
between all sequences of the set, as a lattice that accounts for the possible compositions of
sequences, etc.
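One of the simplest models of this kind is a column-by-column profile. The sketch below, which assumes the sequences are already aligned to equal length, counts the symbols in each column and derives a consensus string from those counts. The function names, the `threshold` parameter and the use of `x` for columns without a clear majority are choices made for this example; the four rows are short immunoglobulin-like aligned fragments.

```python
from collections import Counter


def column_profile(alignment):
    """Return one Counter of symbols per alignment column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return [Counter(column) for column in zip(*alignment)]


def consensus(alignment, threshold: float = 0.5) -> str:
    """Most common symbol per column, or 'x' where none reaches the threshold."""
    out = []
    for counts in column_profile(alignment):
        symbol, n = counts.most_common(1)[0]
        out.append(symbol if n / len(alignment) >= threshold else "x")
    return "".join(out)


rows = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
]

profile = column_profile(rows)
print(profile[4])        # the conserved cysteine column
print(consensus(rows))
```

Fully conserved columns, such as the cysteine and the tryptophan, stand out in the profile immediately, which is the kind of pattern the alignment is meant to expose.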
How does it work?
MSA is usually implemented as an extension of pairwise alignment. The inputs to pairwise
alignment are two sequences and a similarity criterion or scoring function that describes the
similarity between the different symbols that constitute the sequences. The alignment algorithm
determines the highest-scoring way to perform insertions, deletions and changes to transform one
of the sequences into the other. This results in a new sequence containing the initial pair.
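The pairwise step described here can be sketched with the Needleman-Wunsch global alignment algorithm, named here as one standard way to find the highest-scoring set of insertions, deletions and changes; the slides do not say which algorithm a given MSA tool uses. The scoring function below is a plain match/mismatch/gap scheme picked for illustration, and the function name is hypothetical.

```python
def global_align(a: str, b: str, match: int = 1,
                 mismatch: int = -1, gap: int = -1):
    """Needleman-Wunsch: return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a prefix aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # change or match
                              score[i - 1][j] + gap,       # deletion from a
                              score[i][j - 1] + gap)       # insertion into a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = global_align("GATTACA", "GATACA")
print(total)
print(top)
print(bottom)
```

The two returned rows have equal length and, read column by column, form the combined representation of the initial pair that a progressive MSA method would go on to align against further sequences.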
RNA & DNA structure
• MFOLD server (RNA), © Copyright 1995, Michael Zuker, Washington University Medical School - Computes optimal and suboptimal foldings of an RNA sequence
• MFOLD server (DNA), © Copyright 1995, Michael Zuker, Washington University Medical School - Computes optimal and suboptimal foldings of a DNA sequence
• EFN server, © Copyright 1995, Michael Zuker, Washington University Medical School - Computes the folding energy of an RNA or DNA secondary structure
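To give a feel for how secondary structure prediction works, the sketch below implements the Nussinov algorithm, which maximizes the number of nested base pairs. This is a deliberately simpler stand-in: the MFOLD servers minimize a thermodynamic free energy, which this example does not attempt. The pairing rules (Watson-Crick plus G-U wobble), the `min_loop` constraint and the function names are assumptions for illustration.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(x: str, y: str) -> bool:
    return (x, y) in PAIRS


def nussinov(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in seq.

    min_loop is the minimum number of unpaired bases enclosed by any pair.
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: base j stays unpaired.
            score = best[i][j - 1]
            # Case 2: base j pairs with some k, splitting the problem in two.
            for k in range(i, j - min_loop):
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(nussinov("GGGAAAUCCC"))
```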
Network circuit analysis
Membrane Receptors
Shimizu et al. (2000) proposed an atomic-level structure for a lattice of serine (Tsr) receptors in
coliform bacteria. The structure is a regular two-dimensional lattice in which the cytoplasmic ends of
chemotactic receptor dimers insert into a hexagonal array of CheA and CheW molecules.
Molecular Interaction Space
The Cell is Full of Molecular Machines
Published in Trends in Biochemical Sciences
Reading list
Gusfield, D. (1997) Algorithms on Strings, Trees and Sequences. Cambridge University Press.
Searls, D. B. (1992) The linguistics of DNA. Am. Scient. 80: 579-591.
Bailey, N.T.J. (1964) The Elements of Stochastic Processes with Applications to the Natural Sciences. Wiley.
Taylor, H.M. and Karlin, S. (1994) An Introduction to Stochastic Modelling (Revised Edition). Academic Press.
Renshaw, E. (1991) Modelling Biological Populations in Space and Time. CUP.