Hierarchical Cluster Structures
and Symmetries
in Genomic Sequences
Andrei Zinovyev
Institut des Hautes Études Scientifiques
Math@Bio group of M.Gromov
Plan of the talk
Genomic sequences:
geometric approach, clustering
Genomic sequence as text
Basic 7-cluster structure
Global structure of codon frequencies
Internal structure of codon frequencies
Applications
Introduction
Frequency dictionaries
Genomic sequence
as a text in unknown language
..cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc…
frequency dictionaries:
N = 4 = 4^1
N = 16 = 4^2
N = 64 = 4^3
N = 256 = 4^4
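A frequency dictionary of word length k can be built by counting all overlapping k-letter words in the text. The sketch below is only an illustration: the function name `frequency_dictionary` is hypothetical, and the choice to skip words containing ambiguous letters (such as `r`) is an assumption, not something the slides specify.

```python
from collections import Counter

def frequency_dictionary(text, k):
    """Relative frequencies of overlapping words of length k in text."""
    words = (text[i:i + k] for i in range(len(text) - k + 1))
    # Assumption: ignore words that contain letters other than a, c, g, t.
    counts = Counter(w for w in words if set(w) <= set("acgt"))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

text = "cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacg"
for k in (1, 2, 3, 4):
    d = frequency_dictionary(text, k)
    # At most 4**k distinct words can occur; a short text covers only some.
    print(k, 4 ** k, len(d))
```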
From text to geometry
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
10^7
length~300-400
cgtggtgagctgatgctagggrcgcac
ggtgagctgatgctagggrcgcacact
tgagctgatgctagggrcgcacaattc
gtgagctgatgctagggrcgcacggtg
……
gagctgatgctagggrcgcacaagtga
3000-4000 fragments
R^N
Method of visualization
principal components analysis
R^N
PCA plot
R^2
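The path from text to a PCA plot can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: the window width (300) and step (100) are assumed values in the spirit of the fragment length on the slide, triplets are counted without overlap starting at the first letter of each window, and PCA is done with a plain SVD. All function names are hypothetical.

```python
import itertools
import numpy as np

TRIPLETS = ["".join(p) for p in itertools.product("acgt", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}

def window_vector(fragment):
    """Frequencies of the 64 non-overlapping triplets in one fragment."""
    v = np.zeros(len(TRIPLETS))
    for i in range(0, len(fragment) - 2, 3):
        j = INDEX.get(fragment[i:i + 3])  # None for ambiguous letters
        if j is not None:
            v[j] += 1
    s = v.sum()
    return v / s if s > 0 else v

def fragments_to_points(sequence, width=300, step=100):
    """One point in R^64 per sliding window (width and step are assumed)."""
    starts = range(0, len(sequence) - width + 1, step)
    return np.array([window_vector(sequence[s:s + width]) for s in starts])

def pca_2d(points):
    """Project centered points onto their first two principal components."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Usage on a random placeholder sequence (not real genomic data).
rng = np.random.default_rng(0)
sequence = "".join(rng.choice(list("acgt"), size=2000))
points = fragments_to_points(sequence)
plane = pca_2d(points)
print(points.shape, plane.shape)
```

A random sequence like the placeholder above should show no cluster structure; the structure discussed in the talk appears only for real genomes.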
Chapter 1
Basic 7-cluster structure
(level 1 of non-randomness)
Caulobacter crescentus
singles
N=4
doublets
N=16
triplets
N=64
quadruplets
N=256
!!!
the information in genomic sequence is encoded
by non-overlapping triplets
First explanation
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
Basic 7-cluster structure
gtgagctgatgctagggrcgcacgtggtgagc
gct gat gct agg grc gca cgt
ctg atg cta ggg rcg cac gtg
tga tgc tag ggr cgc acg tgg
gtgaatcggtgggtgagtgtgctgctatgagc
atc ggt ggg tga gtg tgc tgc
tcg gtg ggt gag tgt gct gct
cgg tgg gtg agt gtg ctg ctg
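The three rows of triplets above are the same text read in three different phases. A small sketch of that splitting follows; the function name `split_into_frames` is hypothetical, and the input is the stretch of the first example sequence that the triplet rows cover.

```python
def split_into_frames(fragment):
    """Read a fragment as non-overlapping triplets in each of 3 phases."""
    frames = []
    for shift in range(3):
        triplets = [fragment[i:i + 3]
                    for i in range(shift, len(fragment) - 2, 3)]
        frames.append(triplets)
    return frames

fragment = "gctgatgctagggrcgcacgtgg"
for triplets in split_into_frames(fragment):
    print(" ".join(triplets))
# gct gat gct agg grc gca cgt
# ctg atg cta ggg rcg cac gtg
# tga tgc tag ggr cgc acg tgg
```

If only one of the three phases matches the true codon boundaries, the triplet frequencies of a window depend on which phase the window happens to start in.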
Non-coding parts
Point mutations:
insertions, deletions
a
gtgagctgatgctagggr cgcacgaat
Mean-field approximation
for triplet frequencies
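$F_{IJK}$: frequency of triplet $IJK$ ($I, J, K \in \{A, C, G, T\}$)
$F_{AAA}, F_{AAT}, F_{AAC}, \ldots, F_{GGC}, F_{GGG}$: 64 numbers
letter frequency + correlations
$F_{IJK} \approx P^1_I \, P^2_J \, P^3_K$: 12 numbers
$P^i_J$: frequency of letter $J$ at position $i$ of the triplet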
Why hexagonal symmetry?
GC-content = $P_C + P_G$
Hexagon vertex labels: -+0, 0+-, +-0, +0-, 0-+, -0+
Chapter 2
Global structure of
codon frequencies
(143 complete bacterial
genomes)
Genome codon usage
and mean-field approximation
correct frameshift
…
ggtgaATG gat gct agg … gtc gca cgc TAAtgagct
64 frequencies $F_{IJK}$
…
ggtgaATG gat gct agg … gtc gca cgc TAAtgagct
12 frequencies $P^1_I$, $P^2_J$, $P^3_K$
Global structure
of codon frequencies
$P^i_J$ are linear functions of GC-content
Four symmetry types
of the basic 7-cluster structure
parallel
triangles
perpendicular
triangles
degenerated
flower-like
Chapter 3
Internal structure of
codon frequencies
(level 2 of non-randomness)
Second level of hierarchy
?
Distribution of genes
function2
function1
function3
R^64
Fast-growing bacteria
I
III
II
IV
Genes of class I
(most genes)
Genes of class II
(highly expressed)
Genes of class III
(unusual)
Genes of class IV
(hydrophobic
proteins)
Escherichia coli
Genes of class I
(most genes)
Genes of class II
(highly expressed)
Genes of class III
(unusual)
Genes of class IV
(hydrophobic
proteins)
Chapter 4
Applications
Computational gene
prediction
Accuracy >90%
Protein expression
optimization
gene sequence S,
protein A
I
III
II
IV
gene sequence S’,
same protein A,
higher expression
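One naive reading of this slide is synonymous recoding: each codon of S is replaced by the codon for the same amino acid that is most frequent in a reference set of genes, for example highly expressed ones. The sketch below is an assumption-laden illustration, not the procedure used in the talk; the function names and the tiny reference gene are hypothetical, and only the standard genetic code table is taken as given.

```python
import itertools
from collections import Counter

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): a
        for c, a in zip(itertools.product(BASES, repeat=3), AMINO)}

def codons_of(seq):
    """Non-overlapping codons, read from the first letter."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def translate(seq):
    return "".join(CODE[c] for c in codons_of(seq))

def preferred_codons(reference_genes):
    """Most frequent codon per amino acid in the reference genes."""
    counts = Counter(c for g in reference_genes
                     for c in codons_of(g) if c in CODE)
    best = {}
    for codon, aa in CODE.items():
        if aa not in best or counts[codon] > counts[best[aa]]:
            best[aa] = codon
    return best

def recode(seq, best):
    """Gene S' with the same protein but preferred codons."""
    return "".join(best[CODE[c]] for c in codons_of(seq))

reference = ["atggctgctgcaaaaaaataa"]  # hypothetical reference gene
best = preferred_codons(reference)
s = "atggcaaagtaa"
s_prime = recode(s, best)
print(s_prime, translate(s) == translate(s_prime))
```

Whether such a recoded gene is actually expressed at a higher level has to be established experimentally; the sketch only guarantees that the protein sequence is unchanged.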
Web-site
cluster structures in genomic sequences
http://www.ihes.fr/~zinovyev/7clusters
Papers
Gorban A, Popova T, Zinovyev A
Four basic symmetry types in the universal 7-cluster
structure of 143 complete bacterial genomic sequences.
2004. arXiv e-print.
Gorban A, Zinovyev A, Popova T
Seven clusters in genomic triplet distributions.
2003. In Silico Biology. V.3, 0039.
Zinovyev A, Gorban A, Popova T
Self-Organizing Approach
for Automated Gene Identification.
2003. Open Systems and Information Dynamics 10 (4).
People
Dr. Tanya Popova
Institute of
Computational Modeling
Russia
Professor
Alexander Gorban
University of Leicester
UK