Rachel Drysdale
has been a genetic curator with
FlyBase since the project began
in 1992. FlyBase is an
international consortium, with
members at Harvard
University, University of
California-Berkeley, University
of Bloomington-Indiana and
University of Cambridge.
Keywords: phenotypic
analysis, genetic analysis,
Drosophila, functional
genomics, database searching,
controlled vocabulary,
literature curation
Phenotypic data in FlyBase
Rachel Drysdale
Date received (in revised form): 3rd November 2000
Abstract
Phenotypic analysis combined with molecular genetics is a powerful tool for mapping gene
function onto the genome. Phenotypic data are, by their nature, descriptive, and as varied as
the range of mutant phenotypes that can be presented by the organism under study. This paper
discusses the mechanisms FlyBase has implemented to systematise published phenotypic data
about Drosophila, and provides an introduction to the query tools available for the mining of
the data. Though FlyBase is specific to Drosophila, the issues faced in devising protocols for
capturing, storing and reporting data are the same issues faced by any database with an interest
in using phenotypic data to maximise the potential of genomic analysis.
INTRODUCTION
Rachel Drysdale,
FlyBase (Cambridge),
Department of Genetics,
University of Cambridge,
Downing Street,
Cambridge CB2 3EH, UK
Tel: +44 (0) 1223-333963
Fax: +44 (0) 1223-333992
E-mail: [email protected]
A major challenge for bioinformaticians is
to learn to annotate the output of large-scale genome-sequencing projects
accurately and meaningfully. Current
efforts to define gene function are focused
on protein coding regions. Methods of
assigning function have relied on
sequence comparisons of coding regions
and extrapolation of function from a well-defined gene to a less-defined, but
sequence-related, second gene of the
same or a different organism. Although
this is a powerful technique it falls down
when faced with proteins as yet unrelated
to any other protein, and it does not allow
new functions to be defined for proteins
related to, but distinct from, previously
analysed proteins.
Phenotypic analysis of mutant alleles
can extend the understanding of gene
function beyond predicted functions, and
reveals the involvement of gene products
in processes without constraint to lists of
`candidate genes' about which something
is already known. Different alleles, with
distinct molecular lesions in or around the
transcription unit, may have different
mutant phenotypes which inform our
understanding of how that gene and its
product function within the context of
the whole organism. In particular,
phenotypic analysis provides the
opportunity to define regulatory regions
such as complex promoters, enhancers
and silencers, elements that are as yet
difficult to identify computationally and
may map many kilobases from the protein
coding region. If we are to learn to use
phenotypic data to exploit computational
genome analysis to its full potential, and
enhance the strength of functional
genomics, we must establish repositories
of phenotypic information that can be
mined and used as starting points for
developing more sophisticated genome
annotation tools. This review will discuss
phenotypic data in FlyBase.
FlyBase (home page shown in Figure 1)
is the principal database resource for
genetic and molecular information about
Drosophila. The primary FlyBase server¹ is
at Indiana University and is mirrored
nightly to eight other sites around the
world; see ref. 2 for a complete listing of
mirror sites. The Data Classes covered by
FlyBase are Maps (genomic and genetic),
Genes (gene and allele listings and
descriptions, gene annotations resulting
from the genome sequencing project,
gene product information), Sequences
(genome project resources), Stocks
(mutant strains available from public stock
centres), Transposons (transgene
information), Aberrations (information
about chromosomal rearrangements),
Anatomy and Images (including browsers
for phenotype and gene expression data),
References (the bibliography of
Drosophila publications) and People (an
index of researchers working with
Drosophila). Ancillary sections house
Documents (including the FlyBase
Reference Manual³), News, Meeting
listings and cross-links to other, related,
databases. FlyBase has developed a suite of
search tools to enable users to access the
data (see ref. 4 for a complete listing).
Phenotypic data in FlyBase are stored
within the `Genes' data section.
Figure 1: The FlyBase home page. Hyperlinking is represented by underlining in this black
and white view
© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 2. NO 1. 68–80. MARCH 2001
The majority of FlyBase information
pertains to Drosophila melanogaster which
has been an important model organism for
the study of eukaryotic genetics and
biology since the early years of the last
century (for an introductory review see
ref. 5). A combination of genetic utility,
based on the classic tradition of
chromosome mechanics, with the
techniques of molecular biology means
that researchers of D. melanogaster can
custom-design alleles of choice and
incorporate these into the genome for
subsequent analysis as well as study
mutations in the endogenous
chromosomal copy of the gene. The last
decade of the 20th century saw the
notable technical development of multi-component over-expression systems such
as the GAL4/UAS expression system.⁶
Using such systems, together with
information from the genome-sequencing
project,⁷ researchers can now generate
loss of function and/or over-expression
mutant alleles for any identified gene, and
drive its expression in a wide range of
tissues or developmental stages of D.
melanogaster.
1999 saw the publication of more than
a thousand papers describing phenotypic
analysis in Drosophila. FlyBase captures,
houses and reports data generated by
phenotypic analysis to enable
straightforward and comprehensive access
to information that may have been
reported in a wide variety of journals
using a variety of nomenclatures.
Phenotypic data are gathered from the
published literature by literature curators.
The curation process uses a system of
attribution such that information arriving
in FlyBase is stored and reported in
association with the article where the
information was published. FlyBase aims
to facilitate the identification of
everything that has been published about
a particular process, mutant phenotype or
gene.
THE STRUCTURE OF
PHENOTYPIC DATA IN
FLYBASE
Phenotypic information is tied
to named, distinct alleles
Much of the power of phenotypic analysis
lies in the distinction between phenotypes
of different alleles of the same gene. The
association of the distinct phenotypic
characteristics with the distinct molecular
natures of the mutant alleles is far more
informative than a collation of all the
phenotypic data for all the alleles of a gene
into one data table. For example, the
information that a particular allele of
Kettin, Ket 14, causes both recessive
lethality and dominant flightlessness is
more useful than knowing that different
alleles of Kettin could be recessively lethal
or dominantly flightless. Consequently,
FlyBase strictly partitions phenotype data
about distinct mutant alleles of each gene
into distinct allele records. FlyBase defines
genes as objects and gives each a unique
identifier of the form FBgnxxxxxxx
where each x is a digit; for example Ket has
the identifier FBgn0010396. Alleles are
defined as objects with a unique identifier
of the form FBalxxxxxxx; for example
Ket 14 has the identifier FBal0009669.
Since the Drosophila field is subject to a
degree of nomenclature anarchy, the
FlyBase curators must track the gene and
allele symbols used in each publication
with respect to previously published gene
and allele symbols for the same entities,
already defined as objects within the
FlyBase data tables. Figure 2 shows a
further example, for the Epidermal growth
factor receptor gene, Egfr (FBgn0003731),
which has been independently named in
the literature by several groups. Genes and
alleles are defined as objects within the
FlyBase data structure, and assigned
Unique Identifier Numbers prefixed with
FBgn for genes and FBal for alleles. An
example using the Epidermal growth
factor receptor gene, Egfr, and three of the
many Egfr alleles, is shown in Figure 2.
One symbol for each gene (Egfr) or allele
(Egfr CA27, Egfr E1 and Egfr f1) is considered
to be valid (larger font) and is used in
reports throughout FlyBase to represent
that gene or allele. Alternative symbols
used in the literature to denote each entity
(such as DER, Elp, flb for the gene, and
DER CA27, Elp E1 and flb 1F26 for the alleles)
are stored within FlyBase as synonyms
(`synonym' being the Drosophila
community term for an `alias'). Each gene
is related to its alleles in a one-to-many
relationship (denoted by diverging
arrows). Phenotypic data statements are
stored as attributes of each allele. Other
Allele data classes (not shown) are class (for
example amorph, hypomorph,
neomorph), complementation data,
relationship to causative transposon
insertion (important for placing the allele,
and therefore the gene, on the genomic
sequence), relationship to causative
chromosomal rearrangement (important
for placing the allele, and therefore the
gene, on the cytological map), molecular
description of the causative lesion,
relationship to the progenitor
chromosome, and references to the
published literature where that allele is
discussed (see ref. 8 for a full description of
allele data fields). Genetic information in
FlyBase is housed in a Sybase relational
database management system.
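The identifier scheme and the one-gene-to-many-alleles relationship described above can be sketched in a few lines of Python. This is an illustrative model only, not the FlyBase Sybase schema: the class names, field names and the `resolve` helper are assumptions made for the example, while the identifiers and symbols are those quoted in the text.

```python
import re
from dataclasses import dataclass, field

# Identifier patterns as described in the text: a prefix plus seven digits.
GENE_ID = re.compile(r"FBgn[0-9]{7}")
ALLELE_ID = re.compile(r"FBal[0-9]{7}")


@dataclass
class Allele:
    fbal: str                                        # e.g. FBal0009669
    symbol: str                                      # valid symbol used in reports
    synonyms: set = field(default_factory=set)       # alternative published symbols
    phenotypes: list = field(default_factory=list)   # statements attach to the allele

    def __post_init__(self):
        if not ALLELE_ID.fullmatch(self.fbal):
            raise ValueError(f"not an allele identifier: {self.fbal}")


@dataclass
class Gene:
    fbgn: str
    symbol: str
    synonyms: set = field(default_factory=set)
    alleles: list = field(default_factory=list)      # one gene, many alleles

    def __post_init__(self):
        if not GENE_ID.fullmatch(self.fbgn):
            raise ValueError(f"not a gene identifier: {self.fbgn}")


def resolve(genes, name):
    """Return the gene whose valid symbol or synonym equals name, else None."""
    for gene in genes:
        if name == gene.symbol or name in gene.synonyms:
            return gene
    return None


egfr = Gene("FBgn0003731", "Egfr", synonyms={"DER", "Elp", "flb"})
ket = Gene("FBgn0010396", "Ket")
# The superscript allele designation is written inline, as in the text.
ket.alleles.append(
    Allele("FBal0009669", "Ket 14",
           phenotypes=["recessive lethal", "dominant flightless"])
)

print(resolve([egfr, ket], "DER").symbol)
```

The comparison in `resolve` is deliberately case-sensitive, since Drosophila gene symbols that differ only in case can denote different genes.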
FlyBase relates genes to alleles in a
hierarchical one-gene-to-many-alleles
relationship, encoded in the data structure
(see Figure 2). Once each allele has been defined
as a distinct object within the data tables,
data statements are captured and
stored as attributes of each allele, with the
relationship to the gene being retained
through the gene–allele relationship.
Phenotypic data in FlyBase are
captured and reported with a combination
of controlled vocabularies and explanatory
free text. The vocabularies are controlled
both with respect to content and syntax,
as described below. The free text
explanations are segregated into distinct
data fields. For example, free text about
phenotypes caused by a single mutant
allele is kept partitioned from the
statements about multiple mutant
combinations that reveal genetic
interactions. Within all free text fields,
cross-references to other named items in
FlyBase are marked so that these items are
tagged (to facilitate updates when a
symbol changes), and hyperlinked in
reports (to allow the user to traverse
different reports and data sections within
FlyBase easily). The different fields are
labelled distinctly in reports to aid the user
in interpreting the data.
Figure 2: Genes, alleles and phenotypic data in FlyBase: the data
structure. Full details of this database schema are available on request
from ref. 9
THE APPLICATION OF
CONTROLLED
VOCABULARIES TO
PHENOTYPIC DATA
The usefulness and manageability of large
bodies of information can be dramatically
enhanced by the use of controlled
vocabularies, or systems of key words, as
an indexing tool. Not only does this
impose a systematisation on the data, so
that commonalities between
fundamentally similar effects are revealed,
even though they may have been
described in different terms in the
literature, but searches performed using
controlled vocabulary terms are more
efficient than those acting on phrases from
free text information, and are therefore
more effective. FlyBase applies two
controlled vocabularies to index
phenotypic data:
· The `Phenotypic Class' controlled
vocabulary presents a classification of
phenotype in terms of its pathology or
the effect on the whole organism.
· The `Anatomy' controlled vocabulary
describes the body parts affected in the
mutant phenotype.
Some mutant phenotypes, such as
behavioural phenotypes, lend themselves
to description in terms of Phenotypic
Class, but not (at least not initially) the
anatomical focus of the effect. Other
mutant phenotypes, such as the
developmental mutations that disrupt the
embryo, are described principally in terms
of their mutant Anatomy; the majority are
lethal at the embryonic stage and so to
describe them in terms of this effect on
the organism would be insufficiently
discriminating. Applying the two
categorisations of mutant description
means that the vast majority of
phenotypes reported in the literature can
be represented at some level in controlled
terms. The controlled vocabularies are
implemented with a defined syntax to
describe, in controlled language, aspects
of the mutant phenotype of each allele,
and any genetic interactions it might have
with mutant alleles of another locus.
The Phenotypic Class
controlled vocabulary
The Phenotypic Class controlled
vocabulary includes 81 terms, which
range from the general, such as `lethal',
through more specific terms that denote a
certain pathology, such as `tumorigenic',
down to the very specific, such as `song
defective'. See ref. 10 for a full listing of
the Phenotypic Class terms.
FlyBase uses qualifiers to add further
explanation to the Phenotypic Classes.
Examples include `dominant',
`conditional temperature sensitive',
`maternal effect', and stage of
development terms, for example
`embryonic', `adult'. A vertical bar
separates the term from the qualifier. See
ref. 11 for a full listing of the Phenotypic
Class qualifiers.
Example
Goodwin et al.¹² wrote about the
behaviour of male flies mutant for the
fru sat allele (FBal0031286) of the fruitless
locus:
fru sat males, when paired with either
another fru sat male or a wild-type
female, exhibited similar low levels of
courtship . . .. Hemizygous fru sat males
showed levels of courtship comparable
to those of fru sat homozygotes . . ..
Despite this paucity of courtship,
sufficient behaviour by fru sat males was
observed to reveal that the residual
courtship was not normal. During
wing extensions, fru sat males produced
only extremely brief sine-song bouts;
no song pulses were generated . . .
fru sat males, when grouped together,
displayed the typical fruitless behaviour
of courtship chaining . . . fru sat males
were completely sterile when
heterozygous with the deletions fru w24
(n = 20) or P14 (n = 16).
This paragraph has been indexed with
the following terms from the Phenotypic
Class controlled vocabulary:
Phenotypic class: viable
Phenotypic class: sterile | recessive
Phenotypic class: courtship defective | recessive
Phenotypic class: song defective | recessive
Phenotypic class: mating defective | recessive
There is no limit to the number of terms
that can be used to describe a mutant
phenotype.
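Because the term and its qualifier are separated by a vertical bar, statements of this kind are straightforward to parse mechanically. The following is a minimal sketch, assuming the statements are available as plain text lines in the form shown above; the function name `parse_statement` is hypothetical, and allowing more than one qualifier per term is an assumption of the example rather than something stated in the text.

```python
def parse_statement(line):
    """Split 'Field: term | qualifier | ...' into (field, term, qualifiers)."""
    field_name, _, value = line.partition(":")
    parts = [part.strip() for part in value.split("|")]
    return field_name.strip(), parts[0], parts[1:]


# The five statements recorded for the fru sat example.
statements = [
    "Phenotypic class: viable",
    "Phenotypic class: sterile | recessive",
    "Phenotypic class: courtship defective | recessive",
    "Phenotypic class: song defective | recessive",
    "Phenotypic class: mating defective | recessive",
]

parsed = [parse_statement(s) for s in statements]

# Collect every class term that carries the 'recessive' qualifier.
recessive_terms = [term for _, term, quals in parsed if "recessive" in quals]
print(recessive_terms)
```

Keeping the qualifiers as a list attached to each term, rather than as a separate statement, preserves the association between a class and the condition under which it was observed.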
The Anatomy controlled
vocabulary
FlyBase uses a vocabulary of over 5,400
terms, compiled by Michael Ashburner
(FlyBase-Cambridge) drawing on many
sources, to index Anatomy data. See ref.
13 for a full listing of the Anatomy terms.
The terms range from the gross anatomical
(such as `nervous system') through the
single cell level (such as `centripetally
migrating follicle cell') to the subcellular
level (such as `lysosome'). The terms in the
Anatomy vocabulary are related to each
other in three ways, with the relationships
represented by a directed acyclic graph.
There are `part of' relationships, so that the
`tarsus' is part of the `leg'. There are
`instance/collective' relationships; thus
`adult alary muscle 1' is an instance of the
`adult alary muscle' collective term.
Finally, there are progenitor and
descendent relationships: the `aCC
neuron' develops from `neuroblast NB1-1'. Importantly, the controlled vocabulary
also houses synonyms. So, for example, the
valid term `stomodeal ganglion' has the
following synonyms: pharyngeal ganglion,
oesophageal ganglion, esophageal
ganglion, hypocerebral ganglion and
ventricular ganglion.
The mutant alleles are related to their
anatomical terms.
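The three relationship types and the synonym handling described above can be illustrated with a small directed graph. This is a sketch under stated assumptions: the relationship labels (`part_of`, `instance_of`, `develops_from`), the edge-list representation and the helper names are invented for the example, and only the terms quoted in the text are used.

```python
from collections import defaultdict

# (child, relationship, parent) edges taken from the examples in the text.
EDGES = [
    ("tarsus", "part_of", "leg"),
    ("adult alary muscle 1", "instance_of", "adult alary muscle"),
    ("aCC neuron", "develops_from", "neuroblast NB1-1"),
]

# Valid term -> synonyms, as given for the stomodeal ganglion.
SYNONYMS = {
    "stomodeal ganglion": [
        "pharyngeal ganglion",
        "oesophageal ganglion",
        "esophageal ganglion",
        "hypocerebral ganglion",
        "ventricular ganglion",
    ],
}

parents = defaultdict(list)
for child, relationship, parent in EDGES:
    parents[child].append((relationship, parent))

to_valid = {syn: valid for valid, syns in SYNONYMS.items() for syn in syns}


def valid_term(name):
    """Map a synonym to its valid term; valid terms map to themselves."""
    return to_valid.get(name, name)


def ancestors(term):
    """All terms reachable from term by following any relationship upwards."""
    found = []
    stack = [term]
    while stack:
        for _, parent in parents.get(stack.pop(), []):
            if parent not in found:
                found.append(parent)
                stack.append(parent)
    return found


print(valid_term("pharyngeal ganglion"))
print(ancestors("tarsus"))
```

A graph rather than a strict tree is needed because a term may have more than one parent, for example being part of one structure while developing from another.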
Example
Bour et al.¹⁴ described the phenotype of a
sticks and stones allele, sns A3:24
(FBal0117350):
The muscle phenotype of embryos
homozygous for the original sns A3:24
allele includes a large number of
unfused myosin-expressing cells and a
corresponding absence of differentiated
muscle fibres.
Using the Anatomy controlled
vocabulary, FlyBase recorded:
Phenotype manifest in: embryonic myoblast
Phenotype manifest in: embryonic/larval somatic muscle
FlyBase uses qualifiers in combination
with the body part terms, such as `dorsal',
`precursor', `ectopic', which can be used
further to specify anatomical curation. See
ref. 15 for a full listing of these qualifiers.
Example
Cadavid et al.¹⁶ described the phenotype
of a liquid facets allele, lqf FDD9
(FBal0104483):
. . . additional photoreceptors in lqf
mutants arise from specific precursor
cells (M-cells) present early during eye
development (Fig. 2N). . .
Using the Anatomy controlled vocabulary
FlyBase recorded:
Phenotype manifest in: photoreceptor cell | ectopic
Having the data indexed in this controlled
way allows users to search FlyBase and
expect to find all that FlyBase has
pertaining to, for example, the `midline
glial cells' or the `chordotonal organs',
which in publications may sometimes
have been referred to as `median glial
cells' or `stretch receptors', respectively.
At the time of writing, 862 anatomical
terms had at least one alternative name;
without the use of the controlled
vocabulary a user would need to know all
possible alternative nomenclatures when
searching FlyBase data for specific mutant
phenotypes.
A short note about genetic
engineering
FlyBase considers engineered versions of
genes, generated by in vitro mutagenesis
and assayed in transgenic organisms, or in
transient assays, to be alleles. Examples are
rescue constructs, promoter fusion mis-expression constructs, UAS (Upstream
Activating Sequences) directed mis-expression constructs and RNA
interference (RNAi) constructs. FlyBase
can thus use the same rigour in capturing
data about the phenotypic analysis of
these transgenes as it does capturing the
analysis of traditional mutant alleles.
FlyBase (September 2000) had 48,434
allele records for D. melanogaster genes, of
which 39,293 were `traditional' and 9,141
were of the `in vitro construct' type.
One powerful application of genetically
engineered alleles is in determining the
amount of genomic DNA within and
around the transcription unit that is
required for full function of the gene. In
these experiments different constructs,
generally including the gene's coding
region but varying extents of flanking 5′
and 3′ or other potentially regulatory
DNA, are assayed for their capacity to
rescue the mutant phenotype for that
gene. The difference between those
constructs that partially rescue and those
that fully rescue a mutant phenotype
provides particularly important
information about the location of those
regulatory sequences responsible for the
fine tuning required to give full and
proper expression of that gene.
Example
Bloor and Brown¹⁷ reported that a
genomic rescue fragment including the
inflated transcription unit, represented in
FlyBase by an in vitro mutagenesis allele,
if +tBa (FBal0089217), rescues the
amorphic if B4 allele (FBal0039417),
leading to the following statement being
captured for if +tBa:
Rescues: if B4
A derivative construct with intronic
sequences removed, but coding region
unchanged, represented by FlyBase allele
if minigene (FBal0089180), only partially
rescues if B4, leading to the following
statement for if minigene:
Partially rescues: if B4
Thus careful attention to details of mutant
phenotypes of allelic combinations can be
translated into genome annotation that
extends beyond the open reading frame,
in this case to an intronic regulatory
sequence for inflated.
In an extension of this approach,
FlyBase defines `foreign' genes and alleles
to accommodate experiments where an
allele of a gene from a non-drosophilid
has been engineered into Drosophila. For
example the p35 gene of Baculovirus
Autographa californica was introduced into
D. melanogaster to manipulate the cell
death pathway,¹⁸ giving rise to a gene
record for BacA\p35 (FBgn0014459) and
associated alleles in FlyBase. (Non-drosophilid genes are flagged with a
species prefix followed by \; see ref. 19
for a complete list.)
Conditional genotypes
The genetic tractability of Drosophila has
encouraged the development of bipartite
mis-expression systems such as the GAL4/
UAS system.⁶ In this system the
Saccharomyces cerevisiae GAL4 transcription
factor gene (Scer\GAL4: FBgn00014445
in FlyBase) is used to drive the expression
of a gene by putting that gene under the
cis-regulation of the corresponding
regulatory S. cerevisiae UAS. A bank of
Scer\GAL4 driving lines (over 950 as of
September 2000, represented in FlyBase
as alleles of Scer\GAL4) provides a wide
diversity of expression patterns that can be
used to drive the UAS lines. For such
targeted mis-expression experiments, the
phenotypic class or affected body part
depends critically upon the expression
pattern driving the UAS allele. FlyBase
has developed a controlled syntax in order
to guide the user through this complexity;
the relevant genetic background is
recorded (in a hyperlinked form in
reports) within braces, { }, following the
controlled vocabulary term.
Example
When the expression of
Ras85D V12:S35:Scer\UAS (FBal0085936,
named by FlyBase for a UAS construct
encoding a constitutively active form of
Ras85D) was driven in the expression
pattern of dpp by Scer\GAL4 dpp:blk1
(FBal0040480), as reported by Karim and
Rubin,²⁰ the result was hyperplastic wing
discs, giving rise to the following lines in
the Ras85D V12:S35:Scer\UAS allele record
and report:
Phenotypic class: hyperplastic {Scer\GAL4 dpp:blk1}
Phenotype manifest in: dorsal mesothoracic disc {Scer\GAL4 dpp:blk1}
Note that the allele of Scer\GAL4 which
drives the expression of the UAS
construct is recorded within { } braces.
However, when expression of the same
Ras85D V12:S35:Scer\UAS allele is driven in
the pattern of sevenless by Scer\GAL4 sev:EP
(FBal0102834), as reported by Therrien et
al.,²¹ a different phenotype is observed:
Phenotypic class: visible {Scer\GAL4 sev:EP}
Phenotype manifest in: eye {Scer\GAL4 sev:EP}
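The conditional genotype syntax, with the genetic background recorded in braces after the controlled term, can be read back into structured form. The sketch below assumes plain text statements like those shown above; the function name `parse_conditional` and the grouping by driver are illustrative choices, not a description of FlyBase software.

```python
def parse_conditional(line):
    """Split 'Field: term {background}' into (field, term, background).

    The background is None when no braces are present.
    """
    field_name, _, rest = line.partition(":")
    background = None
    if "{" in rest and rest.rstrip().endswith("}"):
        rest, _, braces = rest.rpartition("{")
        background = braces.rstrip().rstrip("}").strip()
    return field_name.strip(), rest.strip(), background


# Statements from the record of the UAS allele discussed in the text.
record = [
    r"Phenotypic class: hyperplastic {Scer\GAL4 dpp:blk1}",
    r"Phenotype manifest in: dorsal mesothoracic disc {Scer\GAL4 dpp:blk1}",
    r"Phenotypic class: visible {Scer\GAL4 sev:EP}",
    r"Phenotype manifest in: eye {Scer\GAL4 sev:EP}",
]

# Group the statements by the GAL4 driver under which each was observed.
by_driver = {}
for line in record:
    field_name, term, background = parse_conditional(line)
    by_driver.setdefault(background, []).append((field_name, term))

for driver, observations in by_driver.items():
    print(driver, observations)
```

Grouping by the content of the braces makes explicit the point of the example: the same UAS allele yields different phenotypes depending on the driver.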
Genetic interactions
A major application of phenotypic analysis
in D. melanogaster is the study of multiple
mutant genotypes in an effort to dissect
genetic interactions and biological
pathways. FlyBase uses the `conditional
genotype' device superimposed on the
Phenotypic Class and Anatomy
vocabularies to represent such data, which
are stored in a data field distinct from
single mutant phenotypic data, and
reported in allele records prefixed with
the string `Genetic interaction'.
Example
Nagel et al.²² described the partial
suppression of the dominant H2
(FBal0005293) wing vein phenotype by
dx 24 (FBal0045558). This observation is
encoded in the following controlled
vocabulary statements:
For dx 24:
Genetic interaction (effect, class): suppressor | partially, visible {H2/+}
Genetic interaction (effect, anatomy): suppressor | partially, wing vein {H2/+}
For H2:
Genetic interaction (anatomy, effect): wing vein, suppressible | partially {dx 24}
Genetic interaction (class, effect): visible | dominant, suppressible | partially {dx 24}
Thus the effect is evident in the records
for both alleles participating in the
interaction and, because the allele symbols
within the braces, { }, are hyperlinked,
the user can easily move between the two
sides of the interaction to find out more
about the participating alleles. These
interactions are reflected at the Gene level
in the Gene Reports by a statement for
H (FBgn0001169):
Interacts genetically with: dx
and a corresponding statement for dx
(FBgn0000524):
Interacts genetically with: H
A short note about functional
complementation
The FlyBase commitment to capturing
phenotypic data for genetically
engineered alleles facilitates the
recognition of experiments that test
functional equivalence between genes of
Drosophila and other organisms, for
example between chicken (Gallus gallus)
GLI3 and dac (dachshund) mutants of D.
melanogaster. Data captured at the allele
level from Heanue et al.²³ led to the
following gene-level statements:
dac (FBgn0005677) is
Functionally complemented by: Ggal\Dach2
and conversely that
Ggal\Dach2 (FBgn0029169)
Functionally complements: dac
The records for the `foreign' genes in
FlyBase include cross-links to the non-drosophilid gene record in the genome
database specific to the non-drosophilid in
question, or the DNA sequence accession
record. For example for chicken Dach2
the cross-links are to EMBL:AF198349.
Having these cross-references stated
explicitly in FlyBase aids the extrapolation
of conclusions about gene function and
homology relationships across species.
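In both the genetic interaction and the functional complementation examples, a statement in one record is mirrored by a converse statement in the record of the other participant. One way to keep the two sides consistent is to derive both from a single observation, as in the following sketch. The table of converse relations and the function names are assumptions made for illustration; the text does not describe how FlyBase generates these statements.

```python
# Relation -> converse relation, using the statement labels quoted in the text.
RECIPROCAL = {
    "Functionally complemented by": "Functionally complements",
    "Functionally complements": "Functionally complemented by",
    "Interacts genetically with": "Interacts genetically with",
}


def reciprocal(subject, relation, obj):
    """Return the statement to store in the record of the other participant."""
    return obj, RECIPROCAL[relation], subject


def add_with_reciprocal(records, subject, relation, obj):
    """Store a statement together with its converse."""
    for s, r, o in ((subject, relation, obj), reciprocal(subject, relation, obj)):
        records.setdefault(s, []).append(f"{r}: {o}")


records = {}
add_with_reciprocal(records, "H", "Interacts genetically with", "dx")
add_with_reciprocal(records, "dac", "Functionally complemented by", r"Ggal\Dach2")

for symbol, lines in records.items():
    print(symbol, lines)
```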
Free text can be useful
While controlled vocabularies index
many aspects of the data, FlyBase also
makes use of free text information to
describe further details of mutant
phenotypes. Although free text
constitutes a problem for databases, in that
it proliferates and is difficult to manage or
search meaningfully, it is useful to both
database users and curators. For example,
the inclusion of an Anatomy term does
not describe what sort of change has
occurred, simply that the tissue is affected.
A reduced wing, an absent wing or an
unfurled wing will all be indexed with
Phenotype manifest in: wing
and the user might be grateful to see a
description of precisely how the wing is
affected.
Any decision to alter a controlled
vocabulary to reflect data more accurately,
and thereby obviate the need for some
free text, must be balanced against the
substantial work incurred `retrofitting'
that change to older data in the database.
Although gratuitous changes are to be
avoided, change must be permitted to
improve the value of the vocabularies, for
example when a new class of phenotype
begins to be assayed in Drosophila. On
these occasions the free text in FlyBase is
invaluable for those curators to whom the
task of retrofitting the new term falls.
Phenotypic data are subtle data; the
parameters are dependent upon the assay
system chosen and the scope traverses a
vast range of detail. The archiving of
phenotypic data in databases is in its
infancy. Perhaps, eventually, it will be
possible to describe phenotypes entirely in
controlled terms and controlled syntax.
Perhaps we can develop pan-biological
phenotypic descriptions analogous to the
pan-biological vocabularies being used to
describe protein families. But until we do,
free text will inevitably remain with us.
ACCESSING PHENOTYPIC
DATA
There are three ways of targeting
investigations about phenotypes within
FlyBase: searching using Allele Search,
browsing using the Expression Summary
and browsing based on the Image Views.
The following examples use results
obtained from the FlyBase server in
September 2000.
The Allele Search route to
phenotypic data
The Allele Search form (see Figure 3) can
be accessed via links from the FlyBase
home page (Figure 1), the Genes data
directory²⁵ and the All Searches page.⁴
The Allele query form allows `type in'
searches where the field to be searched
can be speci®ed (set to `Symbol/
synonym' and `Body part' in Figure 3).
Since the Anatomy vocabulary (described
above) is large and contains terms with
multiple names the user is provided with a
reference ®le (Body Parts Help) which
lists valid terms and synonyms.
Example
Entering `photoreceptor cell R7' in the
`Body part' field returns a table of over
192 alleles which explicitly include the
statement
Phenotype manifest in: photoreceptor
cell R7
The Allele Search form also provides
scrolling menus for selecting search terms
from the controlled vocabularies used in
classifying allele data, including the
Phenotypic Class vocabulary described
above. Terms can be chosen individually
or in combinations, allowing the user to
build appropriate queries.
24
Figure 3: The Allele Search form. The two boxes in the top right
section of the form allow `type in' searches. To the left of these two
boxes are pull-down menus, which limit the data field to be searched. In
this example `Symbol/synonym' and `Body part' have been selected (11
other options are provided, including `Any field'). Scrolling menus (Allele
class, Mutant phenotype class, Mutant phenotype class qualifiers, Origin
of mutant alleles) in the lower panels allow the user to build queries using
defined controlled vocabulary terms only. In this example the Phenotypic
Class `paralytic' has been selected
Example
A query for `paralytic' alleles (selected in
the `Mutant phenotype class' menu; see
Figure 3) retrieves a hit list of over 90
alleles which include in their record the
statement
Phenotypic class: paralytic
Refining this query by selecting the
`recessive' option for `Mutant phenotypic
class qualifier' narrows the results to the
60 or so alleles where the paralysis has
been explicitly described as recessive:
Phenotypic class: paralytic | recessive
The scrolling menu query boxes can be
used in conjunction with terms typed into
one or both of the query boxes at the top
of the form to build more specific queries.
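The way the form combines a Phenotypic Class, a qualifier and a typed body part can be sketched as a simple filter over allele records. Everything here is hypothetical: the `AlleleRecord` class, the `search` function and the three placeholder alleles are invented for illustration and do not reflect the FlyBase query implementation or its data.

```python
from dataclasses import dataclass, field


@dataclass
class AlleleRecord:
    symbol: str
    classes: list = field(default_factory=list)     # (term, [qualifiers]) pairs
    body_parts: list = field(default_factory=list)  # Anatomy terms


def search(records, phenotypic_class=None, qualifier=None, body_part=None):
    """Return symbols of records matching every criterion that is given."""
    hits = []
    for rec in records:
        if phenotypic_class is not None or qualifier is not None:
            # Class and qualifier must match within the same statement.
            matched = any(
                (phenotypic_class is None or term == phenotypic_class)
                and (qualifier is None or qualifier in quals)
                for term, quals in rec.classes
            )
            if not matched:
                continue
        if body_part is not None and body_part not in rec.body_parts:
            continue
        hits.append(rec.symbol)
    return hits


# Placeholder records; the symbols are not real alleles.
records = [
    AlleleRecord("allele-A", classes=[("paralytic", ["recessive"])]),
    AlleleRecord("allele-B", classes=[("paralytic", [])]),
    AlleleRecord("allele-C", classes=[("lethal", ["recessive"])],
                 body_parts=["photoreceptor cell R7"]),
]

print(search(records, phenotypic_class="paralytic"))
print(search(records, phenotypic_class="paralytic", qualifier="recessive"))
```

Requiring the qualifier to sit on the same statement as the class mirrors the refinement described above, where only alleles whose paralysis was explicitly described as recessive are returned.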
The Expression Summary route
to phenotypic data
The Expression Summary browsing tool
provides a tabular interface, built on the
Anatomy controlled vocabulary, which
allows the user to navigate through the
Anatomy data to a specific tissue or
structure, and then, having found that
structure, return lists of FlyBase objects
that have been indexed with the term
corresponding to that structure.
(Expression pattern data of transcripts,
polypeptides and reporter genes in
FlyBase are expressed using the same
Anatomy controlled vocabulary as is used
for the mutant phenotypes.)
The Expression Summary entry level
table (shown in Figure 4, background) is
linked to the FlyBase home page, via the
`Gene Expression' hyperlink in the
Selected Searches and Tools column of
the Genes data class (see Figure 1), or
from the All Searches page (ref. 4). The table
displays (leftmost column, `Term') the top
level terms of the Anatomy controlled
vocabulary (germlayer, tagmata, organ
system (partly shown) and developmental
stage (not shown)). These terms are
hyperlinked to their corresponding
sections further down the Anatomy
hierarchy. The terms representing the
next level in the hierarchy are stated
explicitly, with indentations, in the table.
The columns in the main body of the
table list the numbers of Genes, Mutant
Alleles, Reporters, Transcripts and
Polypeptides which have been indexed
with that term. These hyperlinks lead to
listings of the relevant objects and then to
full reports for each of these objects.
Descending the hierarchy to a more
specific level (see Figure 4, foreground),
for example to the expression summary
for the tracheal system, shows the
subdivisions of that system in more
detail. Thus 22 Mutant Allele
records include the statement
Phenotype manifest in: embryonic/larval
tracheal pit
and the list of the 22 alleles, and thence
their reports, can be accessed using the
hyperlinked `22' in the Mutant Alleles
column.
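The structure of such a table can be sketched in Python as a walk over a term hierarchy, reporting for each term the number of objects indexed with it. The sketch is illustrative only: the hierarchy fragment, the object identifiers and the function name `summary_rows` are hypothetical, and the counts shown are for objects indexed directly with each term, which is an assumption about how the real table is computed.

```python
# Hypothetical fragment of an anatomy hierarchy: parent term -> child terms.
CHILDREN = {
    "organ system": ["tracheal system"],
    "tracheal system": ["embryonic/larval tracheal pit"],
    "embryonic/larval tracheal pit": [],
}

# Hypothetical index: term -> object class -> identifiers indexed with that term.
INDEX = {
    "tracheal system": {"Mutant Alleles": ["a1", "a2", "a3"], "Genes": ["g1"]},
    "embryonic/larval tracheal pit": {"Mutant Alleles": ["a4", "a5"]},
}

COLUMNS = ["Genes", "Mutant Alleles"]

def summary_rows(term, depth=0):
    """Yield (indented term, counts per column) for a term and its descendants."""
    counts = {col: len(INDEX.get(term, {}).get(col, [])) for col in COLUMNS}
    yield ("  " * depth + term, counts)
    for child in CHILDREN.get(term, []):
        yield from summary_rows(child, depth + 1)

rows = list(summary_rows("tracheal system"))
for label, counts in rows:
    print(label, counts)
```

Each count in such a table stands in for a list of objects, so following a count is equivalent to looking up `INDEX[term][column]` in this sketch.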
The Image Browser route to
phenotypic data
FlyBase has recently begun offering an
image-based route to mutant phenotype
and expression pattern data which, like
the Expression Summary, is based upon
the Anatomy controlled vocabulary.
Selecting `Images' (ref. 28) from the
`Anatomy' data section of the FlyBase
home page reveals a panel of over 70
thumbnail sketches of the external
anatomy of D. melanogaster adults and
embryos, each hyperlinked to a full page
image, one of which is shown in Figure
5(a). The images are annotated with
relevant terms from the Anatomy
controlled vocabulary, listed beside each
image. These images are interactive;
moving the cursor over an anatomical
term highlights the corresponding
structure on the image, and moving the
cursor over the image highlights the
relevant anatomical term. Each term is
hyperlinked to a `Term Report', an
example of which is shown in Figure 5(b).
The Term Report lists the various
FlyBase objects that have been indexed
with the Term, `embryonic maxillary
segment' in Figure 5(b), and each item in
the list is hyperlinked to the complete
report for that item. In this example, two
mutant alleles, esc2 and esc6, have
phenotypic data recorded in terms of
`embryonic maxillary segment'. In
addition the Term Report provides a link
to the Expression Summary table (see
Figure 4) for the term in question,
allowing further exploration of the
anatomy data without having to begin
again at a high level.
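A Term Report can be thought of as one entry in an inverted index from anatomy terms to the objects annotated with them. The Python sketch below illustrates that idea only and is not the FlyBase implementation: the two esc alleles and the term `embryonic maxillary segment' are taken from the example above, while the gene row, its term and the function name `build_term_reports` are hypothetical.

```python
from collections import defaultdict

# (object class, object name, anatomy term) annotations.
# The two esc alleles follow the example in the text; the gene row is hypothetical.
ANNOTATIONS = [
    ("Alleles", "esc2", "embryonic maxillary segment"),
    ("Alleles", "esc6", "embryonic maxillary segment"),
    ("Genes", "geneX", "embryonic labial segment"),
]

def build_term_reports(annotations):
    """Invert the annotations: term -> object class -> list of object names."""
    reports = defaultdict(lambda: defaultdict(list))
    for object_class, name, term in annotations:
        reports[term][object_class].append(name)
    return reports

reports = build_term_reports(ANNOTATIONS)
report = reports["embryonic maxillary segment"]
print({object_class: names for object_class, names in report.items()})
```

Building the index once from the annotations means that every term, at whatever level of the vocabulary, can serve as an entry point without starting again from the top of the hierarchy.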
Figure 4: Background panel – The Expression Summary entry page (ref. 26). Foreground panel –
detail from the Expression Summary for the tracheal system (ref. 27), showing only the top section
of the page and the `All Genes' and `Mutant Alleles' columns. The `Component of' statement at
the top of the page provides a trail through the Anatomy controlled vocabulary to the highest
level term currently on display in the Expression Summary table. Hyperlinking is represented
by underlining, in this black and white view
CONCLUDING REMARKS
Phenotypic data are complex data, but
they have enormous potential to enrich
our understanding of genes and their
functions. During the last quarter of a
century molecular genetic analysis has
revealed commonalities across widely
diverse organisms; we now understand
that we can exploit model organisms to
learn lessons that will be generally
applicable in biology. It is therefore
imperative that we learn to use
phenotypic data to inform the genome
annotation of the model organisms so that
meaningful extrapolations of function can
be made to species where the genetic
analysis is less tractable, as is the case for
the human. This paper has described how
FlyBase is addressing this issue, though no
doubt this is only a beginning. In 1921 H.
J. Muller, one of the founders of
Drosophila research, asked `Must we
geneticists become bacteriologists,
physiological chemists and physicists,
simultaneously with being zoologists and
botanists? Let us hope so' (quoted in ref.
31). Were he alive today he would
perhaps add `and bioinformaticians, too'.
Figure 5: The Image Browser. (a) An example of an image that can be accessed from the
Image Browser thumbnail sketch page (ref. 28). This example (ref. 29) shows a lateral view of an embryo
approximately mid-way through embryogenesis. The cursor has been moved to the
`embryonic maxillary segment' on the image and the corresponding term has consequently
become highlighted in the body parts list to the right of the image. This highlighting of the term
(and the corresponding structure on the image) indicates hyperlinking which can be followed
to the Term Report for that body part. (b) The Term Report for the `embryonic maxillary
segment' (ref. 30). The Term Report provides listings of Genes, Alleles, Proteins, Transcripts and
Reporters that have been indexed with the term in question. A link to the Expression
Summary for the term is also provided (see Figure 4 for an example of an Expression
Summary). Not shown, but included in the Term Report, are a series of hyperlinks to other
images annotated with that term, and a statement describing the position of the term in the
Anatomy controlled vocabulary. Hyperlinking is represented by underlining, in this black and
white view
Acknowledgements
The phenotypic data in FlyBase are curated at
FlyBase-Cambridge by Gillian Millburn, Chihiro
Yamada and the author. Phenotypic data
processing and report generation have been carried
out by Aubrey de Grey (FlyBase-Cambridge), Joe
Lemaire and David Emmert (FlyBase-Harvard),
and Don Gilbert and Victor Strelets (FlyBase-Indiana).
Don Gilbert and Gary Grumbling
(FlyBase-Indiana) are responsible for the
development of the Expression Summary and
Image Browsing tools. For a full listing of FlyBase
consortium members see ref. 32. FlyBase is
supported by grants from the National Institutes of
Health, USA, and the Medical Research Council,
UK. The author thanks her colleagues at
FlyBase-Cambridge and Thomas Weaver, Incyte
Genomics, for comments on the manuscript.
References
1. http://flybase.bio.indiana.edu
2. http://flybase.bio.indiana.edu/alt-views/mirrors/
3. http://flybase.bio.indiana.edu/data/docs/refman/refman-sections.html
4. http://flybase.bio.indiana.edu/search/
5. Rubin, G. M. and Lewis, E. B. (2000), `A brief
history of Drosophila's contributions to genome
research', Science, Vol. 287, pp. 2216–2218.
6. Brand, A. H. and Perrimon, N. (1993),
`Targeted gene expression as a means of
altering cell fates and generating dominant
phenotypes', Development, Vol. 118, pp.
401–415.
7. Adams, M. D. et al. (2000), `The genome
sequence of Drosophila melanogaster', Science,
Vol. 287(5461), pp. 2185–2195.
8. http://flybase.bio.indiana.edu/data/docs/refman/refman-B.html#B.1.
9. fl[email protected]
10. http://flybase.bio.indiana.edu/docs/LK/controlled-vocabularies.txt
(section headed PHENOTYPIC CLASS).
11. http://flybase.bio.indiana.edu/docs/LK/controlled-vocabularies.txt
(section headed PHENOTYPIC CLASS QUALIFIER).
12. Goodwin, S. F., Taylor, B. J., Villella, A. et al.
(2000), `Aberrant splicing and altered spatial
expression patterns in fruitless mutants of
Drosophila melanogaster', Genetics, Vol. 154(2),
pp. 725–745.
13. http://flybase.bio.indiana.edu/docs/LK/controlled-vocabularies.txt
(section headed BODY PART).
14. Bour, B. A., Chakravarti, M., West, J. M. and
Abmayr, S. M. (2000), `Drosophila SNS, a
member of the immunoglobulin superfamily
that is essential for myoblast fusion', Genes
Dev., Vol. 14, pp. 1498–1511.
15. http://flybase.bio.indiana.edu/docs/LK/controlled-vocabularies.txt
(section headed BODY PART QUALIFIER).
16. Cadavid, A. L., Ginzel, A. and Fischer, J. A.
(2000), `The function of the Drosophila fat
facets deubiquitinating enzyme in limiting
photoreceptor cell number is intimately
associated with endocytosis', Development, Vol.
127, pp. 1727–1736.
17. Bloor, J. W. and Brown, N. H. (1998),
`Genetic analysis of the Drosophila alphaPS2
integrin subunit reveals discrete adhesive,
morphogenetic and sarcomeric functions',
Genetics, Vol. 148, pp. 1127–1142.
18. Hay, B. A., Wolff, T. and Rubin, G. M.
(1994), `Expression of baculovirus P35
prevents cell death in Drosophila', Development,
Vol. 120, pp. 2121–2129.
19. http://flybase.bio.indiana.edu/docs/nomenclature/LK/species-abbreviations.txt
20. Karim, F. D. and Rubin, G. M. (1998),
`Ectopic expression of activated Ras1 induces
hyperplastic growth and increased cell death in
Drosophila imaginal tissues', Development, Vol.
125, pp. 1–9.
21. Therrien, M., Wong, A. M., Kwan, E. and
Rubin, G. M. (1999), `Functional analysis of
CNK in RAS signaling', Proc. Natl Acad. Sci.
USA, Vol. 96, pp. 13259–13263.
22. Nagel, A. C., Maier, D. and Preiss, A. (2000),
`Su(H)-independent activity of Hairless during
mechano-sensory organ formation in
Drosophila', Mech. Dev., Vol. 94, pp. 3–12.
23. Heanue, T. A., et al. (1999), `Synergistic
regulation of vertebrate muscle development
by Dach2, Eya2, and Six1, homologs of genes
required for Drosophila eye formation', Genes
Dev., Vol. 13, pp. 3231–3243.
24. http://flybase.bio.indiana.edu/genes/fbalquery.hform/
25. http://flybase.bio.indiana.edu/genes/
26. http://flybase.bio.indiana.edu/.bin/expat/
27. http://flybase.bio.indiana.edu/.bin/cvreport.html?term=%22tracheal+system%22command=expat
28. http://flybase.bio.indiana.edu/.bin/fbimage/
29. http://flybase.bio.indiana.edu/.bin/fbibq.html?FBim2715464
30. http://flybase.bio.indiana.edu/.bin/cvreport.html?%27embryonic+maxillary+segment%27/
31. Carlson, E. A. (1972), `H. J. Muller (1890–1967)',
Genetics, Vol. 70, pp. 1–30.
32. http://flybase.bio.indiana.edu/.data/docs/refman/refman-I.html#I.4
© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 2. NO 1. 68–80. MARCH 2001