Studying gene expression with genomic data and Codon Adaptation Index
The FAMiCOD Analyser Package
M. Ramazzotti, G. Manao, G. Ramponi and D. Degl’Innocenti
Dipartimento di Scienze Biochimiche, Università degli Studi di Firenze, viale Morgagni 50 50134 Firenze, Italy.
[email protected]
www.unifi.it/unifi/scibio/bioinfo/FAMiCOD_Project/famicod_man.html
Introduction: All organisms studied so far use synonymous codons very differently in genes expressed at different
levels. This variability appears to depend on cellular tRNA abundance, and therefore on the regulation of the
transcription and activity of tRNAs and aminoacyl-tRNA synthetases (Ikemura T. 1981 J.Mol.Biol. 146:1-21). Codon usage
should not be regarded as an evolutionary constraint, since large differences have been found among closely related
organisms. As a result, highly expressed proteins tend to be encoded by species-specific "optimized" coding sequences,
composed of the codons recognized by the most abundant “anticodons”. This behaviour minimizes the risk of tRNA
depletion during intense translation and of amino acid misincorporation at rare codons. The analysis of the codons used
in the coding sequences of proteins may therefore provide an index of protein expression, mirroring the selective
pressure of strong promoters. The simplest sufficiently reliable method proposed to estimate codon bias is the Codon
Adaptation Index (CAI), which measures how closely the codon usage of a gene matches that of a reference set of genes
(Sharp P.M., Li W.H. 1987 Nucleic Acids Res. 15:1281-95).
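The CAI described above can be sketched in a few lines. The Python code below is only an illustration, not the FAMiCOD implementation (which is written in Perl): each codon receives a weight equal to its count in the reference set divided by the count of the most used synonymous codon, and the CAI of a gene is the geometric mean of the weights of its codons. The function names, the exclusion of Met, Trp and stop codons, and the pseudo-count of 0.5 for unseen codons are assumptions made for this sketch.

```python
import math
from collections import Counter

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

# Group synonymous codons; stop codons and single-codon amino acids (Met, Trp)
# carry no information about codon choice and are left out (an assumption).
FAMILIES = {}
for codon, aa in CODE.items():
    FAMILIES.setdefault(aa, []).append(codon)
FAMILIES = {aa: fam for aa, fam in FAMILIES.items()
            if aa != "*" and len(fam) > 1}


def split_codons(cds):
    """Cut a coding sequence into complete, in-frame codons."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def relative_adaptiveness(reference_cds):
    """Weight = codon count / count of the most used synonymous codon."""
    counts = Counter(c for seq in reference_cds for c in split_codons(seq))
    weights = {}
    for family in FAMILIES.values():
        best = max(counts[c] for c in family)
        if best == 0:
            continue  # amino acid absent from the reference set
        for c in family:
            # Unseen codons get a pseudo-count of 0.5 so no weight is zero.
            weights[c] = max(counts[c], 0.5) / best
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of the scored codons of one gene."""
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0
```

A gene made only of the preferred codons of the reference set scores 1, and every rare codon pulls the geometric mean down.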
[Scheme: FAMiCOD workflow. Boxes: Data collection and reorganization; Automatic or manual retrieval of highly biased genes; Codon Usage Tables creation; CAI values; Randomization.]
Development: The Family Codon (FAMiCOD) Analyser Package is a set of computer programs (Perl scripts for the Linux environment) dedicated to codon usage analysis and, in particular, to the retrieval and use of highly expressed genes
from whole-genome CDS data without the need for experimental resources. As summarized in the scheme above, the first step is to collect the data from the NCBI FTP genome database. Some reorganization tools are needed to fit the data
to the FAMiCOD Package. An automated CAI-based approach then extracts from the whole-genome dataset the main set of highly biased coding sequences (which we call the “refset”). Next, a dedicated tool prepares the Codon Usage Tables (CUTs)
from the various datasets (e.g. whole genome, partial refsets and others). Randomization procedures may be applied to the datasets to evaluate the consistency and robustness of the results. Finally, the core CAI calculator
applies the CUTs to the coding sequences, assigning to each of them a value that correlates with codon bias and possibly with gene expression. Several satellite programs are designed to speed up the targeted retrieval of coding sequences and
the cluster analysis of the results. Of particular interest, the results may be clustered by protein functional role according to the COG (Clusters of Orthologous Groups) database: in brief, we used the COG information to build a database of proteins
against which local BLAST calls and result parsing are automated. Another useful feature is a set of sliding-window algorithms that run along the “chromosomes” and highlight local CAI similarities (e.g. co-transcriptional units, operons).
Name   CDS  chr  set   avg   std  go2st  go1.65st
Ape   1841    1   19  0.40  0.12     49       116
Aae   1560    2   16  0.50  0.06     60       140
Afu   2420    1   24  0.57  0.08     34       105
Mja   1785    3   18  0.56  0.06     44       122
Mka   1687    2   17  0.44  0.12     71       157
Pae   1895    1   19  0.46  0.06     87       120
Pab   2605    1   26  0.47  0.07     60       121
Pho   1955    1   20  0.57  0.05     46       177
Sso   2976    1   30  0.54  0.08     30        63
Tma   1858    1   19  0.57  0.07     43        59
Data from ten completely sequenced hyperthermophilic organisms. These
organisms are included in the COG database, and annotations are available for
most of their genes. Where needed, chromosomes and plasmids were merged
to take whole-organism data into account (see the column labeled 'chr',
chromosome number), then checked and subjected to an automatic search for
high codon bias. The resulting set, whose number of genes is listed in the
column labeled 'set', was used to determine the main codon usage table (green).
Some control tables were also built, based on the whole genome (blue), on a
randomized version of the genome (yellow) and on a randomized version of the
refset (red). Each table was used to calculate the CAI values of all the genes
of the genome.
The bar charts show the frequency of CAI values within intervals of 0.05. The
mean and standard deviation of the CAI values (columns 'avg' and 'std'),
together with the number of genes that lie above 2 or 1.65 standard deviations
(columns 'go2st' and 'go1.65st'), are also reported. Each chart shows that when
datasets other than the main one (green) are used for CAI calculation, the
genes display aberrantly high CAI values.
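The summary statistics reported in the table can be sketched as follows. This Python snippet is an illustration only: it assumes that 'go2st' and 'go1.65st' count the genes whose CAI exceeds the mean plus 2 or 1.65 standard deviations, and that the population standard deviation is used; the poster states neither explicitly. The function name and return layout are hypothetical.

```python
import statistics


def summarize_cai(cai_values, bin_width=0.05):
    """Mean, standard deviation, outlier counts and histogram of CAI values."""
    mean = statistics.mean(cai_values)
    std = statistics.pstdev(cai_values)  # population form (assumption)

    # Genes lying more than k standard deviations above the mean (assumption).
    above = {k: sum(v > mean + k * std for v in cai_values)
             for k in (2.0, 1.65)}

    # Frequency of CAI values within intervals of bin_width over [0, 1].
    n_bins = round(1 / bin_width)
    histogram = [0] * n_bins
    for v in cai_values:
        # round() guards against float artefacts such as 0.15 / 0.05 < 3.
        index = int(round(v / bin_width, 9))
        histogram[min(index, n_bins - 1)] += 1  # CAI = 1 goes in the last bin
    return mean, std, above, histogram
```

The 1.65 and 2 thresholds correspond roughly to the upper 5% and 2.5% tails of a normal distribution, which is consistent with the normal-like shape the bar charts are meant to show for the main table.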
The pie charts describe the reference set of highly biased genes, indicating
its functional composition according to the COG classification (see below).
J: Translation
O: Posttranslational modification, protein turnover
C: Energy production and conversion
R: General function prediction only
P: Inorganic ion transport and metabolism
K: Transcription
L: Replication, recombination and repair
G: Carbohydrate transport and metabolism
S: Function unknown
A: RNA processing and modification
B: Chromatin structure and dynamics
D: Cell cycle control, mitosis and meiosis
Y: Nuclear structure
V: Defence mechanisms
T: Signal transduction mechanisms
M: Cell wall/membrane biogenesis
N: Cell motility
Z: Cytoskeleton
W: Extracellular structures
U: Intracellular trafficking and secretion
E: Amino acid transport and metabolism
F: Nucleotide transport and metabolism
H: Coenzyme metabolism
I: Lipid transport and metabolism
Q: Secondary metabolites biosynthesis, transport and catabolism
Output of the ChromoScan filter. The filter “runs” along the chromosome and
performs an operation on the CAI values of a defined number of consecutive
genes (the window). By varying the window size it is possible to locate islets
of common expression (or, more properly, of common CAI values) and to study the
chromosome topology. In this case the CAI values within each window are
multiplied together along the whole chromosome of Pyrococcus abyssi. Each
product is then multiplied by 10^window to scale the result. In (a), a window
of 3 clearly locates a group of three ribosomal proteins. In (b), a window of 5
genes also locates a second ribosomal complex on the left and the pyruvate
dehydrogenase complex (together with ketovalerate oxidoreductase) on the right.
With a window of 9 (c), the left ribosomal complex of the previous graph is
still present and a new large ribosomal complex appears: the genome annotations
show a series of at least 20 genes belonging to the J class (according to COG,
this class contains translation-associated proteins). On the right, the main
peak still corresponds to the pyruvate dehydrogenase complex (class S), and its
sharpness seems to indicate that the complex is surrounded by other highly
expressed elements.
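The window operation described in this caption can be sketched as below. This Python snippet is an illustration, not the ChromoScan code: it takes the CAI values of the genes in chromosomal order, multiplies them within each window, and scales each product by 10^window. The function names are hypothetical, and the chromosome is treated as linear (no wrap-around at the ends), which is an assumption.

```python
import math


def chromoscan(cai_values, window):
    """Scaled product of CAI values over each run of `window` adjacent genes.

    cai_values must follow the gene order along the chromosome.
    """
    scale = 10 ** window  # compensates for products of values below 1
    return [math.prod(cai_values[i:i + window]) * scale
            for i in range(len(cai_values) - window + 1)]


def peak_start(scores):
    """Index of the first gene of the highest-scoring window."""
    return max(range(len(scores)), key=scores.__getitem__)
```

Because a product of values below 1 shrinks geometrically with the window size, the 10^window factor keeps the profiles for different windows on a readable scale. A single low-CAI gene inside a window pulls the product down sharply, so peaks mark runs of uniformly high CAI values.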
[Figure: for each of the ten organisms (Archaeoglobus fulgidus, Methanopyrus kandleri, Methanococcus jannaschii, Pyrococcus abyssi, Pyrococcus horikoshii, Thermotoga maritima, Sulfolobus solfataricus, Pyrobaculum aerophilum, Aquifex aeolicus, Aeropyrum pernix), a bar chart of CAI value frequencies obtained with the refset, rrefset, genome and random tables, and a pie chart of the COG functional classification of the reference set. Panels (a), (b) and (c): ChromoScan output for Pyrococcus abyssi.]
Validation: We validated our method by comparing our reference datasets with others obtained with
various methods. In particular, the system was validated for Escherichia coli, Bacillus subtilis and
Haemophilus influenzae, whose datasets were produced with computational methods by A. Carbone
(Carbone A. et al. 2003, Bioinformatics 19:2005-15) and compared with microarray data, obtaining a
strong positive correlation.
Results: Our work is not specifically based on hyperthermophiles, which are presented here as an example of how
an ecologically homogeneous group may differ in terms of gene expression. Only when a correct dataset
is used for “inference” on gene expression are the genes distributed according to a normal-like function
with a reasonably low average value. When randomized datasets are used instead (both genomic
randomization and reference set randomization), the genes present constant and high CAI values that
only partially correlate with the underlying codon usage bias. This is due to the unbiased codon usage
in the codon weight tables: since there is no preference, codons display a homogeneously high weight,
leading to high CAI values. When a correct dataset generated with our automatic method is used, some
elements of the J group, involved in protein production (e.g. ribosomal proteins and translation
factors), are always present and generally predominant. This codon disparity, called “translational
bias”, is supported by a body of experimental evidence for fast-growing organisms, but it is poorly
characterized for organisms whose genomic data are the sole source of information. The presence of the
J class in reference sets therefore gives additional meaning to the other highly biased genes, and the
very different ecological pressures (apart from temperature) among organisms may explain the
non-uniform distribution of the other COG classes. Particular attention is required for the S and K COG
classes, which contain poorly characterized or uncharacterized proteins: their presence in highly
biased sets should be considered a limitation in global analysis.
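The effect of randomized tables on CAI values can be reproduced with a small sketch. The poster does not say how the randomization is performed; the Python code below assumes that each codon is replaced by a synonymous codon drawn uniformly at random, which preserves the protein sequence and removes any codon preference. All names are hypothetical, and the weight calculation follows the usual Sharp and Li definition with a 0.5 pseudo-count for unseen codons.

```python
import math
import random
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
FAMILIES = {}
for codon, aa in CODE.items():
    FAMILIES.setdefault(aa, []).append(codon)


def split_codons(cds):
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def randomize_synonymous(cds, rng):
    """Replace every codon by a uniformly chosen synonymous codon."""
    return "".join(rng.choice(FAMILIES[CODE[c]]) if c in CODE else c
                   for c in split_codons(cds))


def weights_from(sequences):
    counts = Counter(c for seq in sequences for c in split_codons(seq))
    weights = {}
    for aa, family in FAMILIES.items():
        if aa == "*" or len(family) == 1:
            continue  # stops, Met and Trp are not scored
        best = max(counts[c] for c in family)
        if best == 0:
            continue
        for c in family:
            weights[c] = max(counts[c], 0.5) / best
    return weights


def cai(cds, weights):
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0


def compare_tables(gene, refset, seed=0):
    """CAI of one gene under the real and the randomized reference table."""
    rng = random.Random(seed)
    randomized = [randomize_synonymous(seq, rng) for seq in refset]
    return cai(gene, weights_from(refset)), cai(gene, weights_from(randomized))
```

With a randomized table, all synonymous codons occur at nearly the same frequency, so every weight is close to 1 and any gene, whatever its codon choice, receives a CAI close to 1. With a biased table, only genes that share the bias score high, which is what allows the genes to spread over a range of values.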