BIOINFORMATICS
Editorial
ExScript: AN ‘EX’-CENTRIC APPROACH TO
THE DESCRIPTION OF TRANSCRIPT
DIVERSITY
What is the relationship between exon structure and
the diversity of gene expression? Publication of the
human genome sequence has revealed fewer genes than
expected. In contrast, transcript-to-genome comparisons
are beginning to show that at least half of all genes
produce more than one transcript. In addition, we have
been presented with an unexpectedly high diversity of
transcript forms. In order to explore and characterise
diversity in the context of the expression state under which
it has been captured, we need to be able to describe the
variation of gene expression products in a robust manner,
with reference to the structure of the underlying genes and
the state under which the product was expressed.
If we accept that there is significant diversity in expression forms of genes, there is a need to define the isoform
of expression of a gene, the boundaries of the exons that
make up that isoform, and the expression state that was
captured during manufacture of the transcripts that helped
define the exons that make up the gene. By knowing these,
it is possible to begin to describe the diversity of expressed
gene structures, and hence develop an understanding of the
biology of gene expression and the relationship between
genotype and phenotype.
Initial results of comparisons of transcripts to human
genome sequence suggest that some exon boundaries
show evidence of variation. The variation is not yet
well characterised but creates a broader paradigm for the
concept of ‘gene’ and also highlights the role of gene
expression products in developing understanding of gene
structure and expression.
As the genome sequences of human and model organisms are completed, it is reasonable to expect annotation
of entries to include reference to the expression products
of genes. It would be useful to translate genome sequence
databases directly into expression products in order to reduce significantly the complexity and time of searching
and, in turn, increase sensitivity of the result. Development of a descriptive structure for gene expression products from annotated genome sequences has enormous implications in terms of our understanding of the dynamic
interplay between expressed gene products, as it forms the
basis upon which a context of expression can be built for
each expressed gene.
© Oxford University Press 2001
Vol. 17 no. 6 2001
Pages 485–486
Complete descriptions of transcription products can represent an ‘index’ of an expression state, directly relating
the underlying genome sequence to the exons represented
in the resulting transcripts. The set of descriptions can be
for a particular organism, for combined events in that organism’s development or descriptions of expression of that
organism’s tissues. The description will require a computer readable format, so that the set of transcribed products during an expression state of any gene can truly be
captured, described and understood. Characterisation of the
expressed state of genes will increasingly draw on available array
information, which in turn relies on a correct or
complete exon-level representation of the gene for which
expression is being measured.
ENSEMBL already assigns unique accessions to each
exon, simplifying description of each isoform, as long as
each is described as a set of exon boundaries. But no effort
is yet made to link these exons with the expression state
(when known) of the transcript that was used to confirm
their boundaries. A mature transcription product is made
up of transcribed DNA that has been spliced to produce
exons in a certain order. The positions of the transcription
start site, the polyadenylation signal and the splice-site
signals (and perhaps their strength) define the boundaries
of exons within a gene. The combined set of boundary coordinates for each transcript thus defines the final mature
transcription product. Records in databases such as EMBL,
DDBJ and GenBank currently reflect the boundaries of a
set of exons by providing the paired locations of the splice-site boundaries, together with the sites of initiation of the
first exon and termination of the last exon. Only one such
record is usually provided in genomic locus entries. How
can records now usefully reflect the diversity of transcript
isoforms in the context of sets of expressed genes and the
various forms found within these sets?
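As a minimal sketch of the representation described above, a transcript isoform can be held as an ordered set of exon boundary pairs and rendered in the join(...) location style used in the feature tables of records such as GenBank. The coordinates, the Exon type and the helper names below are hypothetical, introduced only for illustration, and only the forward strand is considered.

```python
from typing import List, NamedTuple


class Exon(NamedTuple):
    """One exon as 1-based, inclusive genomic coordinates (forward strand only)."""
    start: int
    end: int


def to_join_location(exons: List[Exon]) -> str:
    """Render exon boundaries as a feature-table style location string."""
    ordered = sorted(exons)
    spans = [f"{e.start}..{e.end}" for e in ordered]
    # A single exon is written as a plain span; several exons are joined.
    if len(spans) == 1:
        return spans[0]
    return "join(" + ",".join(spans) + ")"


def spliced_length(exons: List[Exon]) -> int:
    """Length of the mature transcript implied by the exon boundaries."""
    return sum(e.end - e.start + 1 for e in exons)


# Two hypothetical isoforms of one gene: isoform_b skips the middle exon.
isoform_a = [Exon(101, 250), Exon(401, 520), Exon(801, 950)]
isoform_b = [Exon(101, 250), Exon(801, 950)]

print(to_join_location(isoform_a))
print(to_join_location(isoform_b))
print(spliced_length(isoform_a), spliced_length(isoform_b))
```

A single location string of this kind captures one isoform only, which is the limitation the question above points to: the two isoforms need two boundary sets, and nothing in either string records the state in which each was observed.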
A simple first step is to require the genome database
community and public database curators to include annotations derived from gene expression. The simplest form
of that annotation would be to include in each entry for
a genomic gene sequence, the boundary co-ordinates of
every exon and the expression states that were used to
define the boundaries of the exons described, with each
co-ordinate set being given a unique sub-accession. The
expression state defines the conditions under which expression of the exon has been observed, be they developmental stage, anatomical location, cell type, etc. The
MGED consortium (http://www.mged.org) has been making considerable progress in formal definitions of expression conditions, and controlled vocabularies are beginning
to emerge that can be applied to expression states. Each
of the unique sub-accessions would represent a particular submission that has been derived from a particular expression
state. A master accession would ideally contain a ‘reference’ exon boundary set.
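The annotation proposed above might be sketched as follows: a master accession holding a reference exon boundary set, plus one sub-accession per submitted boundary set, each tied to the expression state in which it was observed. This is an illustration under stated assumptions, not an existing database schema. The class names, the three expression-state fields, the accession 'GENE0001' and the '-S<n>' sub-accession format are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# An exon boundary set: ordered (start, end) pairs, 1-based and inclusive.
ExonBoundaries = Tuple[Tuple[int, int], ...]


@dataclass(frozen=True)
class ExpressionState:
    """Conditions under which expression was observed (hypothetical fields)."""
    developmental_stage: str
    anatomical_location: str
    cell_type: str


@dataclass(frozen=True)
class BoundarySet:
    """One submitted exon boundary set with its own sub-accession."""
    sub_accession: str
    exons: ExonBoundaries
    state: ExpressionState


@dataclass
class GeneEntry:
    """Master accession with a reference boundary set and its submissions."""
    master_accession: str
    reference_exons: ExonBoundaries
    submissions: List[BoundarySet] = field(default_factory=list)

    def add_submission(self, exons, state: ExpressionState) -> str:
        # Hypothetical sub-accession format: master accession plus "-S<n>".
        sub_accession = f"{self.master_accession}-S{len(self.submissions) + 1}"
        self.submissions.append(BoundarySet(sub_accession, tuple(exons), state))
        return sub_accession

    def states_for_exon(self, exon: Tuple[int, int]) -> List[ExpressionState]:
        """Expression states in which this exact exon boundary was observed."""
        return [s.state for s in self.submissions if exon in s.exons]


entry = GeneEntry("GENE0001", ((101, 250), (401, 520), (801, 950)))
liver = ExpressionState("adult", "liver", "hepatocyte")
brain = ExpressionState("fetal", "brain", "neuron")
first = entry.add_submission([(101, 250), (401, 520), (801, 950)], liver)
second = entry.add_submission([(101, 250), (801, 950)], brain)

print(first, second)
print(entry.states_for_exon((401, 520)))
```

Keeping each boundary set as its own record, rather than merging exons into a single gene model, is what allows a query such as states_for_exon to report that the middle exon was seen in only one of the two states.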
Carefully implemented, such a development will allow
the accurate description of any expression product of any
gene or gene set in any of the known states from which it
was characterised.
The suggestions outlined above give an indication of the
minimum steps that need to be taken by database curators
to respond to the challenges of transcript diversity. But
should we not really be thinking about formulating a new
language to describe expressed gene products in their
expression context. . . ExScript?
Janet Kelso, Vladimir Babenko, Cathal Seoighe, Tania
Hide, Chris Stoeckert, Michael Ashburner, Suzanna Lewis
and others at the GO Consortium mailing list contributed
discussion during the development of these ideas.
Win Hide
South African National Bioinformatics Institute (SANBI)
University of the Western Cape