
The Complete Arabidopsis Transcriptome MicroArray (CATMA) Project

P. Hilson, T. Altmann, S. Aubourg, J. Beynon, F. Bitton, M. Caboche, M. Crowe, P. Dehais, H. Eickhoff, E. Kuhn, S. May, W. Nietfeld, J. Paz-Ares, W. Rensink, P. Reymond, P. Rouzé, U. Schneider, C. Serizet, A. Tabrett, V. Thareau, M. Trick, G. van den Ackerveken, P. Van Hummelen, P. Weisbeek, M. Zabeau

http://jic-bioinfo.bbsrc.ac.uk/CATMA/

Introduction

Most cDNA clones included in DNA arrays are identified by an EST covering only a portion of their length. The complete clone sequence is generally unknown and is not selected to yield hybridisation results specific to a single gene. In addition, ESTs represent only about half the genes identified in model eukaryote genomes. To bypass these shortcomings, we are constructing a collection of high quality Gene Specific Tags (GSTs) representing most Arabidopsis genes for use in microarray transcriptome analyses and in other functional genomic approaches.

1. Gene structural annotation

The identification of each gene in the Arabidopsis genome is at the root of any genome-wide effort to study their expression. Since the structure of only a minority of Arabidopsis genes has been determined experimentally so far, annotation still relies on gene prediction to identify the boundaries of transcription units and of the exon(s) within them (The Arabidopsis Genome Initiative (AGI), 2000). Using the AGI nuclear genome sequence, we have generated an updated structural annotation of all 5 Arabidopsis chromosomes. The annotation process has been automated. It uses the EuGène software (Schiex et al., 2001) with a single set of parameters and algorithms applied to all chromosome regions (Figure 1A). Its prediction quality has been tested by matching results against a set of experimentally defined full-length cDNAs as described by Rouzé and collaborators (Pavy et al., 1999). Quality assessment parameters for chromosome 2 annotation are shown in Table 1.

EuGène identifies 29,804 genes in the Arabidopsis nuclear genome, which is higher than the 25,470 identified by the AGI (Figure 2). The detailed comparative analysis of the EuGène and AGI annotations is currently underway. Preliminary observations indicate that EuGène’s higher number results from the combination of several factors: EuGène can predict two genes where AGI annotates one, it predicts genes where none is annotated by AGI (3,369) more often than the contrary (1,533), and it seems biased towards overprediction in pericentromeric regions rich in repeated sequences.

Table 1. Assessment of EuGène prediction results

                         PlantGene     Araset
actual genes             238           51
correct gene models      182 (76%)     37 (67%)
partial gene models      50 (21%)      14 (27%)
split genes              5 (2%)        0
missing genes            1 (0.5%)      0
actual exons             1639          254
missing exons            51 (3%)       15 (6%)
missing exons in 5’      33 (2%)       8 (3%)
missing central exons    12 (0.7%)     5 (2%)
missing exons in 3’      6 (0.4%)      2 (0.8%)
wrong exons              1 (0.06%)     1 (0.4%)

[Figure 2 plot: one panel per chromosome (I to V); x-axis in Mb; series AGI, EuGène and GST.]

2. Automated design of GSTs

The Specific Primer & Amplicon Design Software (SPADS) selects specific regions within genes and designs primer pairs picked to amplify such regions (Figure 1B; Thareau et al., 2001). The procedure is summarised in the four following steps:

1. Search for the most specific region within each gene. Each exon is tested with BLASTn against the whole genome sequence and segments with hits are removed. Primer pairs are designed in the remaining regions. If none are detected, the mismatch parameter of BLASTn is decreased and only segments with stringent hits are subtracted, thus enlarging the specific remaining regions for primer design.
2. Primer design. The specific regions are used as input for the Primer3 software.
3. Selection of specific primer pairs. Oligonucleotides designed by Primer3 are tested for specificity with BLASTn against a 2 Mb segment containing the gene and are excluded if matches indicate potential unwanted PCR amplification.
4. Analysis of amplicon specificity. Each successive amplicon is tested with BLASTn to determine its specificity. If the identity with a putative paralogous sequence is over 70%, the amplicon is removed and the next one is processed. GSTs are searched from 3’ to 5’ until one is found.

Our GST design is based on expressed sequences (EST or cDNA) or on coding regions predicted by EuGène (i.e. excluding UTRs not represented in EST or cDNA). The GST lengths range between 150 and 500 bp, which is sufficient to yield reproducible microarray signal for transcriptome analysis (Figure 3). Because of the inherent duplicated nature of the Arabidopsis genome, not all genes will be represented by perfect GSTs. Rejecting candidate sequences that show over 70% identity with another sequence in the Arabidopsis nuclear genome, our process has so far identified a GST for 21,420 (72.0%) genes out of 29,775 identified on all 5 chromosomes (Figure 2).

Figure 1. Gene identification and GST selection
[A: EuGène applied to a genomic fragment; labelled components: Netstart, NetGene2, SplicePredictor, RepeatMasker, Blastn, Blastx. B: SPADS; labelled components: gene sequence, exon coordinates, Primer3, Blastn (primer specificity), Blastn (GST specificity), GST.]

Figure 4. GST characteristics
[A. Distribution of GST lengths (number of genes per length in bp): 150-200 bp: 42%; 200-300 bp: 36%; 300-500 bp: 22%. B. Position of GSTs relative to the gene (labels: 5’, ATG, center CDS, stop, 3’ UTR; counts shown: 5115 (24%), 3267 (16%), 12701 (60%)).]

3. Structure of the GST collection

[Picture: 384-well plate layout with 24 columns (U5’ column primers) and 16 rows (U3’ row primers). Figure 5 diagram labels: genomic BAC DNA; first round PCR with specific primer pairs (S5’, S3’); primary amplicon; second round PCR with universal primer pairs (U5’, U3’).]

Each primer designed to synthesize a GST carries a gene-specific 3’ domain corresponding to the sequence selected by SPADS (18-25 nt) and a 5’ extension (17 nt) added to allow for reamplification of the GSTs with a limited set of universal primers. A set of 40 extensions has been designed so that each sample in a 384-well plate can be amplified with the unique combination of one row primer and one column primer, hence avoiding the cross-contamination which often plagues the storage and dissemination of large-scale clone collections. The primary amplicons obtained from BAC DNA templates in large excess can be conveniently reamplified and distributed. Also, amplicon production using BACs increases the quality of the GSTs and the fraction of successful PCR amplifications by reducing the complexity of the templates (Figure 5). All GSTs are oriented with regard to transcription, with column primers at the 5’ end (see above picture).
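The row and column primer scheme can be illustrated with a minimal sketch. It assumes the 40 extensions split into 16 row primers and 24 column primers, matching the 16 x 24 geometry of a 384-well plate; the primer labels and the function name are hypothetical and are not CATMA identifiers.

```python
from itertools import product
from string import ascii_uppercase

ROWS = ascii_uppercase[:16]   # rows A-P of a 384-well plate
COLUMNS = range(1, 25)        # columns 1-24


def universal_primer_layout():
    """Map each well (e.g. 'A1') to a (column primer, row primer) pair.

    Labels such as 'U5_col01' are hypothetical placeholders; the only
    property taken from the text is that column primers sit at the 5' end
    and that each well gets one row primer and one column primer.
    """
    layout = {}
    for row, col in product(ROWS, COLUMNS):
        layout[f"{row}{col}"] = (f"U5_col{col:02d}", f"U3_row{row}")
    return layout


layout = universal_primer_layout()
distinct_primers = {primer for pair in layout.values() for primer in pair}

# wells, unique primer pairs, distinct universal primers
print(len(layout), len(set(layout.values())), len(distinct_primers))
# prints: 384 384 40
```

Under these assumptions, 40 universal primers are enough to give every one of the 384 wells its own primer pair, so a product reamplified with the wrong pair would point to a misplaced or contaminated sample.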
As of 26 September 2001, the Consortium had PCR amplified 16,280 GSTs.

Figure 2. Gene density according to the EuGène and AGI annotations

Figure 3. Transcription profiling with a test set of GSTs
[A. Hybridization (Cy5 cDNA); samples: flower bud, leaf, root, plantlet. B. Signal according to probe length (bp); probe classes: predicted gene GST, known gene GST, intergenic region GST, highly expressed cDNA, negative control.]

Figure 5. Two-step GST amplification

Conclusion

The project is based on a novel, complete and unified annotation of the Arabidopsis nuclear genome, generated with our upgraded EuGène software, from which GSTs are selected with SPADS. We are currently studying how best to complement the current GST collection to minimize the presence of non-specific probes that allow hybridisation with transcripts from non-cognate genes. Given the structure of the GST collection, it can be adapted to a variety of microarray protocols and procedures. It can also serve as a key resource for other large-scale functional genomic endeavours based on specific nucleic acid hybridisations, such as systematic Arabidopsis RNAi programmes.

References

The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796-815.

Pavy N, Rombauts S, Déhais P, Mathé C, Ramana DV, Leroy P and Rouzé P (1999) Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics 15, 887-899.

Schiex T, Moisan A and Rouzé P (2001) EuGène: an eukaryotic gene finder that combines several sources of evidence. In “JOBIM 2000, LNCS 2066”, O. Gascuel, M.F. Sagot (Eds.), pp 111-125.

Thareau V, Déhais P, Rouzé P and Aubourg S (2001) Automatic design of gene specific tags for transcriptome studies. Proc. of JOBIM'2001 (Journées Ouvertes Biologie Informatique Mathématiques). Toulouse, France.
