BIOINFORMATICS Editorial
Vol. 17 no. 6 2001, Pages 485–486
© Oxford University Press 2001

ExScript: AN ‘EX’-CENTRIC APPROACH TO THE DESCRIPTION OF TRANSCRIPT DIVERSITY

What is the relationship between exon structure and the diversity of gene expression? Publication of the human genome sequence has provided fewer genes than expected. In contrast, transcript-to-genome comparisons are beginning to show that at least half of all genes produce more than one transcript. In addition, we have been presented with an unexpectedly high diversity of transcript forms. In order to explore and characterise diversity in the context of the expression state under which it has been captured, we need to be able to describe the variation of gene expression products in a robust manner, with reference to the structure of the underlying genes and the state under which the product was expressed.

If we accept that there is significant diversity in expression forms of genes, there is a need to define the isoform of expression of a gene, the boundaries of the exons that make up that isoform, and the expression state that was captured during manufacture of the transcripts that helped define the exons that make up the gene. By knowing these, it is possible to begin to describe the diversity of expressed gene structures, and hence develop an understanding of the biology of gene expression and the relationship between genotype and phenotype.

Initial results of comparisons of transcripts to human genome sequence suggest that some exon boundaries show evidence of variation. The variation is not yet well characterised but creates a broader paradigm for the concept of ‘gene’ and also highlights the role of gene expression products in developing understanding of gene structure and expression. As the genome sequences of human and model organisms are completed, it is reasonable to expect annotation of entries to include reference to the expression products of genes. It would be useful to translate genome sequence databases directly into expression products in order to reduce significantly the complexity and time of searching and, in turn, increase sensitivity of the result.

Development of a descriptive structure for gene expression products from annotated genome sequences has enormous implications in terms of our understanding of the dynamic interplay between expressed gene products, as it forms the basis upon which a context of expression can be built for each expressed gene. Complete descriptions of transcription products can represent an ‘index’ of an expression state, directly relating the underlying genome sequence to the exons represented in the resulting transcripts. The set of descriptions can be for a particular organism, for combined events in that organism’s development or descriptions of expression of that organism’s tissues.

The description will require a computer readable format, so that the set of transcribed products during an expression state of any gene can truly be captured, described and understood. The expressed state of genes will increasingly concentrate on available array information, and these in turn will rely on a correct or complete exon-level representation of the gene for which expression is being measured. ENSEMBL already assigns unique accessions to each exon, simplifying description of each isoform, as long as each is described as a set of exon boundaries. But no effort is yet made to link these exons with the expression state (when known) of the transcript that was used to confirm their boundaries.

A mature transcription product is made up of transcribed DNA that has been spliced to produce exons in a certain order. The position of a transcription initiation start site, a polyadenylation signal or a splice site signal and (perhaps) its strength, define the boundaries of exons within a gene.
The combined set of boundary coordinates for each transcript thus defines the final mature transcription product.

Database records such as EMBL, DDBJ and GenBank currently reflect the boundaries of a set of exons by providing the paired locations of the splice-site boundaries, together with the sites of initiation of the first exon and termination of the last exon. Only one such record is usually provided in genomic locus entries. How can records now usefully reflect the diversity of transcript isoforms in the context of sets of expressed genes and the various forms found within these sets?

A simple first step is to require the genome database community and public database curators to include annotations derived from gene expression. The simplest form of that annotation would be to include, in each entry for a genomic gene sequence, the boundary co-ordinates of every exon and the expression states that were used to define the boundaries of the exons described, with each co-ordinate set being given a unique sub-accession. The expression state defines the conditions under which expression of the exon has been observed, be they developmental stage, anatomical location, cell type, etc. The MGED consortium (http://www.mged.org) has been making considerable progress in formal definitions of expression conditions, and controlled vocabularies are beginning to emerge that can be applied to expression states. Each of the unique sub-accessions would represent a particular submission that has been derived from a particular expression state. A master accession would ideally contain a ‘reference’ exon boundary set.

Carefully implemented, such a development will allow the accurate description of any expression product of any gene or gene set in any of the known states from which it was characterised. The suggestions outlined above give an indication of the minimum steps that need to be taken by database curators to respond to the challenges of transcript diversity.
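To make the proposed annotation concrete, the following is a minimal sketch in Python of how such an entry might be represented. It is an illustration under stated assumptions, not a description of any existing database schema: the class names (ExonBoundarySet, GeneEntry), the accession strings, the sub-accession numbering scheme, the expression-state keys and the toy sequence are all hypothetical, and co-ordinates are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Every name here is hypothetical and chosen only for illustration.
Exon = Tuple[int, int]  # assumed 1-based, inclusive (start, end) on the genomic entry


@dataclass
class ExonBoundarySet:
    """One isoform: its exon boundaries and the expression state it was seen in."""
    sub_accession: str
    exons: Tuple[Exon, ...]
    expression_state: Dict[str, str] = field(default_factory=dict)


@dataclass
class GeneEntry:
    """A genomic gene entry: a reference boundary set plus observed isoforms."""
    master_accession: str
    reference: ExonBoundarySet
    observed: List[ExonBoundarySet] = field(default_factory=list)

    def add_observation(self, exons, expression_state) -> ExonBoundarySet:
        # Assumed numbering scheme: master accession plus a running counter.
        sub = f"{self.master_accession}.{len(self.observed) + 1}"
        record = ExonBoundarySet(sub, tuple(exons), dict(expression_state))
        self.observed.append(record)
        return record

    def isoforms_in_state(self, **conditions) -> List[ExonBoundarySet]:
        # Keep the boundary sets whose expression state matches every condition.
        return [r for r in self.observed
                if all(r.expression_state.get(k) == v
                       for k, v in conditions.items())]


def mature_transcript(genomic: str, boundary_set: ExonBoundarySet) -> str:
    """Join the exon substrings, in order, to give the spliced product."""
    return "".join(genomic[start - 1:end] for start, end in boundary_set.exons)


# Toy data: an 18-nucleotide 'gene' with three candidate exons.
genomic = "AAACCCGGGTTTAAACCC"
entry = GeneEntry(
    "GENE0001",
    ExonBoundarySet("GENE0001.ref", ((1, 3), (7, 9), (13, 15))),
)
liver = entry.add_observation([(1, 3), (7, 9), (13, 15)],
                              {"tissue": "liver", "stage": "adult"})
brain = entry.add_observation([(1, 3), (13, 15)],
                              {"tissue": "brain", "stage": "adult"})

print(liver.sub_accession, mature_transcript(genomic, liver))
print(brain.sub_accession, mature_transcript(genomic, brain))
```

In this sketch each observed co-ordinate set carries its own sub-accession and expression state, while the master accession holds the reference set, so a query such as entry.isoforms_in_state(tissue="brain") returns only the isoforms captured under that condition. In practice the free-text keys would presumably be replaced by terms from a controlled vocabulary.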
But should we not really be thinking about formulating a new language to describe expressed gene products in their expression context... ExScript?

Janet Kelso, Vladimir Babenko, Cathal Seoighe, Tania Hide, Chris Stoekert, Michael Ashburner, Suzanna Lewis and others at the GO Consortium mailing list contributed discussion during the development of these ideas.

Win Hide
South African National Bioinformatics Institute (SANBI)
University of the Western Cape
