Download Studying gene expression with genomic data and Codon Adaptation

Studying gene expression with genomic data and Codon Adaptation Index
The FAMiCOD Analyser Package

M. Ramazzotti, G. Manao, G. Ramponi and D. Degl’Innocenti
Dipartimento di Scienze Biochimiche, Università degli Studi di Firenze, viale Morgagni 50, 50134 Firenze, Italy.
[email protected]
www.unifi.it/unifi/scibio/bioinfo/FAMiCOD_Project/famicod_man.html

Introduction: All the organisms studied so far show markedly different usage of synonymous codons in genes expressed at different levels. The variability seems to be due to cellular tRNA abundance, and therefore to a different regulation of tRNA and aminoacyl-tRNA synthetase transcription and activity (Ikemura T. 1981 J.Mol.Biol. 146:1-21). Codon usage should not be regarded as an evolutionary constraint, since large differences have been found among closely related organisms. As a result, highly expressed proteins tend to be encoded by species-specific "optimized" coding sequences composed of the codons that match the most abundant “anticodons”. The basic purpose of this behaviour is to minimize the risk of tRNA depletion during intense translation and of amino acid misincorporation at rare codons. The analysis of the codons used in the coding sequences of proteins may therefore serve as an index of protein expression, mirroring the selective pressure of strong promoters. The simplest sufficiently reliable method proposed to estimate codon bias is the Codon Adaptation Index (CAI), which measures how closely the codon usage of a gene matches that of a reference set of genes (Sharp P.M., Li W.H. 1987 Nucleic Acids Res. 15:1281-95).

[Scheme: Automatic or manual highly biased genes retrieval; Data collection and reorganization; Codon Usage Tables creation; CAI values; Randomization]

Development: The Family Codon (FAMiCOD) Analyser Package is a set of computer programs (Perl scripts, in a Linux environment) dedicated to codon usage analysis and, in particular, to the retrieval and use of highly expressed genes from whole-genome CDS data without the need for experimental resources. As summarized in the scheme above, the first step is to collect the data from the NCBI FTP genome database. Some reorganization tools are needed to fit the data to the FAMiCOD Package. An automated CAI-based approach then extracts from the whole-genome dataset the main set of highly biased coding sequences (we call it the “refset”). Next, a dedicated tool prepares the Codon Usage Tables (CUTs) from the various datasets (i.e. whole genome, partial refsets and others). Some randomization procedures may be applied to the datasets in order to evaluate the consistency and robustness of the results. Finally, the core CAI calculator applies the CUTs to the coding sequences, assigning to each of them a value that correlates with codon bias and possibly with gene expression. Several other satellite programs are designed to speed up the targeted retrieval of coding sequences and the cluster analysis of the results. Of particular interest, the results may be clustered by protein functional role according to the COG (Clusters of Orthologous Groups) database: in brief, we used the COG information to collect a database of proteins against which local BLAST calls and the parsing of their results are automated. Another useful possibility is given by sliding algorithms that run along the “chromosomes”, highlighting local CAI similarities (e.g. co-transcriptional units, operons).
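The two central steps described above, building a codon weight table from a reference set and scoring a coding sequence with it, can be sketched as follows. This is a minimal Python illustration of the CAI as defined by Sharp and Li (relative adaptiveness of each codon within its synonymous family, then a geometric mean over the gene), not the FAMiCOD Perl code. The function names, the toy reference set and the use of a 0.5 pseudocount for unobserved codons are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def codons(cds):
    """Split a coding sequence into codons, dropping an incomplete tail."""
    cds = cds.upper().replace("U", "T")
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def codon_weights(reference_cds, pseudocount=0.5):
    """Relative adaptiveness of each codon within its synonymous family.

    The pseudocount for unobserved codons is an assumption of this sketch.
    """
    counts = Counter(c for cds in reference_cds
                     for c in codons(cds) if c in CODON_TABLE)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    weights = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # stop codons, Met and Trp carry no synonymous choice
        best = max(max(counts[c], pseudocount) for c in family)
        for c in family:
            weights[c] = max(counts[c], pseudocount) / best
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of the informative codons of a gene."""
    logs = [math.log(weights[c]) for c in codons(cds) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference set (hypothetical): alanine prefers GCT over GCC.
toy_weights = codon_weights(["GCTGCTGCTGCC"])
print(cai("GCTGCT", toy_weights), cai("GCTGCC", toy_weights))
```

A gene built only from the preferred codons of the reference set scores 1, and every rarer synonymous codon lowers the geometric mean.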
Name   CDS   chr  set  avg   std   go2st  go1.65st
Ape    1841  1    19   0.40  0.12  49     116
Aae    1560  2    16   0.50  0.06  60     140
Afu    2420  1    24   0.57  0.08  34     105
Mja    1785  3    18   0.56  0.06  44     122
Mka    1687  2    17   0.44  0.12  71     157
Pae    1895  1    19   0.46  0.06  87     120
Pab    2605  1    26   0.47  0.07  60     121
Pho    1955  1    20   0.57  0.05  46     177
Sso    2976  1    30   0.54  0.08  30     63
Tma    1858  1    19   0.57  0.07  43     59

Data from ten completely sequenced hyperthermophilic organisms. These organisms are included in the COG database and annotations are available for most of their genes. Where needed, chromosomes and plasmids were merged to take into account whole-organism data (see the column labeled 'chr', number of chromosomes), then checked and subjected to an automatic search for high codon bias. The resulting set was used to determine the main codon usage table (green) and contains the number of genes listed in the column labeled 'set'. Some control tables are also provided, based on the whole genome (blue), on a randomized version of the genome (yellow) and on a randomized version of the refset (red). Each table was used to calculate the CAI values of all the genes of the genome. The bar charts show the frequency of CAI values within intervals of 0.05. The mean ('avg') and standard deviation ('std') of the CAI values are also reported, together with the number of genes that lie more than 2 or 1.65 standard deviations above the mean (columns 'go2st' and 'go1.65st'). Each chart makes clear that when datasets other than the main one (green) are used for CAI calculation, the genes show aberrantly high CAI values. The pie charts describe the reference set of highly biased genes, indicating its functional composition according to the COG classification (see below).
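The 'go2st' and 'go1.65st' columns can be read as counts of genes whose CAI exceeds the mean by more than 2 or 1.65 standard deviations. A minimal Python sketch of that count is given below; the function name is hypothetical, and the choice of the population standard deviation (rather than the sample one) is an assumption, since the poster does not say which was used.

```python
import statistics


def count_above(cai_values, k):
    """Number of genes with CAI greater than mean + k * standard deviation."""
    mean = statistics.mean(cai_values)
    std = statistics.pstdev(cai_values)  # population SD: an assumption
    threshold = mean + k * std
    return sum(1 for value in cai_values if value > threshold)


# Toy CAI values (hypothetical): nine ordinary genes and one highly biased gene.
toy_cai = [0.4] * 9 + [0.9]
print(count_above(toy_cai, 2), count_above(toy_cai, 1.65))
```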
J: Translation
O: Posttranslational modification, protein turnover
C: Energy production and conversion
R: General function prediction only
P: Inorganic ion transport and metabolism
K: Transcription
L: Replication, recombination and repair
G: Carbohydrate transport and metabolism
S: Function unknown
A: RNA processing and modification
B: Chromatin structure and dynamics
D: Cell cycle control, mitosis and meiosis
Y: Nuclear structure
V: Defence mechanisms
T: Signal transduction mechanisms
M: Cell wall/membrane biogenesis
N: Cell motility
Z: Cytoskeleton
W: Extracellular structures
U: Intracellular trafficking and secretion
E: Amino acid transport and metabolism
F: Nucleotide transport and metabolism
H: Coenzyme metabolism
I: Lipid transport and metabolism
Q: Secondary metabolites biosynthesis, transport and catabolism

Output of the ChromoScan filter. The filter “runs” along the chromosome and performs an operation on the CAI values of a defined number of consecutive genes (the window). By varying the window size it is possible to locate islets of common expression (or, more properly, of common CAI values) and to study the chromosome topology. In this case the CAI values within each window are multiplied together along the whole chromosome of Pyrococcus abyssi. Each product is then multiplied by 10^window to scale the result. In (a), a window of 3 clearly locates a group of three ribosomal proteins. In (b), a window of 5 genes also locates a second ribosomal complex on the left and the pyruvate dehydrogenase complex (together with ketovalerate oxidoreductase) on the right. With a window of 9 (c), the left ribosomal complex of the previous graph is still present and a new, large ribosomal complex appears: the genome annotations show a series of at least 20 genes belonging to the J class (according to COG, this class contains translation-associated proteins).
On the right, the main peak still corresponds to the pyruvate dehydrogenase complex (class S), and its sharpness seems to indicate that the complex is surrounded by some other highly expressed elements.

[Figure: for each organism (Archaeoglobus fulgidus, Methanopyrus kandleri, Methanococcus jannaschii, Pyrococcus abyssi, Pyrococcus horikoshii, Thermotoga maritima, Sulfolobus solfataricus, Pyrobaculum aerophilum, Aquifex aeolicus, Aeropyrum pernix), a bar chart of CAI value frequencies obtained with the refset, rrefset, genome and random tables, and a pie chart of the COG functional classification of the refset.]

Validation: We validated our method by comparing our reference dataset with others obtained by various methods. In particular, the system was validated for Escherichia coli, Bacillus subtilis and Haemophilus influenzae, whose datasets were produced with computational methods by A. Carbone (Carbone A. et al. 2003, Bioinformatics 19:2005-15) and compared with microarray data, obtaining a strong positive correlation.
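The window operation described in the ChromoScan caption (product of the CAI values of consecutive genes, scaled by 10^window) can be sketched in a few lines of Python. The function name and the toy values are hypothetical, and the sketch assumes the window advances one gene at a time, which the caption does not state explicitly.

```python
import math


def chromoscan(cai_values, window):
    """Scaled product of CAI values over each run of `window` consecutive genes.

    Assumes genes are listed in chromosomal order and the window slides by one.
    """
    scale = 10 ** window
    return [math.prod(cai_values[i:i + window]) * scale
            for i in range(len(cai_values) - window + 1)]


# Toy chromosome (hypothetical): three adjacent high-CAI genes among ordinary ones.
toy_chromosome = [0.4, 0.4, 0.8, 0.8, 0.8, 0.4, 0.4]
for position, score in enumerate(chromoscan(toy_chromosome, 3)):
    print(position, round(score, 1))
```

Because a product is dominated by its smallest factor, a window scores high only when all of its genes have high CAI values, which is why clusters such as ribosomal operons stand out as sharp peaks.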
Results: Our work is not specifically focused on hyperthermophiles, which are presented here as an example of how an ecologically homogeneous group may differ in terms of gene expression. One can notice that only when a correct dataset is used for “inference” on gene expression are the genes distributed according to a normal-like function with a reasonably low average value. In fact, when randomized datasets are used (both genomic randomization and reference set randomization), the genes present constant and high CAI values that only partially correlate with the underlying codon usage bias. This is due to the non-biased codon usage in the codon weight tables: since there is no preference, codons display a homogeneously high weight, leading to high CAI values. When a correct dataset is used, generated with our automatic method, some elements of the J group, involved in protein production (e.g. ribosomal proteins and transcription factors), are always present and generally predominant. This codon disparity, called “translational bias”, is supported by a considerable body of experimental evidence for fast-growing organisms, but it is poorly characterized for organisms whose genomic data are the sole source of information. The presence of the J class in reference sets therefore gives additional meaning to the other highly biased genes, and the very different ecological pressures (apart from temperature) among organisms may explain the non-uniform distribution of the other COG classes. Particular attention is required for the S and K COG classes, containing poorly characterized or uncharacterized proteins: their presence in highly biased sets should be considered a limiting factor in global analysis.
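The effect of a non-biased weight table on CAI values can be illustrated with a toy Python example. It is a sketch under strong simplifications: only two synonymous families are considered, the codon counts are invented, and the "flat" table is perfectly uniform, whereas a table from a randomized dataset would only approximate this.

```python
import math

# Two synonymous families only (alanine and lysine), to keep the example small.
FAMILIES = {
    "Ala": ["GCT", "GCC", "GCA", "GCG"],
    "Lys": ["AAA", "AAG"],
}


def weights_from_counts(counts):
    """Weight of each codon relative to the most used codon of its family."""
    weights = {}
    for family in FAMILIES.values():
        best = max(counts[c] for c in family)
        for c in family:
            weights[c] = counts[c] / best
    return weights


def cai(codon_list, weights):
    """Geometric mean of the codon weights of a gene."""
    return math.exp(sum(math.log(weights[c]) for c in codon_list) / len(codon_list))


# Invented counts: a biased reference set and a table with no codon preference.
biased = {"GCT": 80, "GCC": 10, "GCA": 5, "GCG": 5, "AAA": 90, "AAG": 10}
flat = {"GCT": 25, "GCC": 25, "GCA": 25, "GCG": 25, "AAA": 50, "AAG": 50}

gene = ["GCC", "AAG", "GCA", "AAA"]  # a gene using mostly non-preferred codons
cai_biased = cai(gene, weights_from_counts(biased))
cai_flat = cai(gene, weights_from_counts(flat))
print(cai_biased, cai_flat)
```

With the flat table every weight equals 1, so any gene reaches the maximum CAI regardless of its codon choices; with the biased table the same gene scores low, which is the discrimination the refset is meant to provide.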
