Phenotypic data in FlyBase

Rachel Drysdale

Date received (in revised form): 3rd November 2000

© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 2. NO 1. 68–80. MARCH 2001

Rachel Drysdale has been a genetic curator with FlyBase since the project began in 1992. FlyBase is an international consortium, with members at Harvard University, University of California-Berkeley, Indiana University-Bloomington and University of Cambridge.

Keywords: phenotypic analysis, genetic analysis, Drosophila, functional genomics, database searching, controlled vocabulary, literature curation

Rachel Drysdale, FlyBase (Cambridge), Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK
Tel: 44 (0) 1223-333963
Fax: 44 (0) 1223-333992
E-mail: [email protected]

Abstract

Phenotypic analysis combined with molecular genetics is a powerful tool for mapping gene function onto the genome. Phenotypic data are, by their nature, descriptive, and as varied as the range of mutant phenotypes that can be presented by the organism under study. This paper discusses the mechanisms FlyBase has implemented to systematise published phenotypic data about Drosophila, and provides an introduction to the query tools available for the mining of the data. Though FlyBase is specific to Drosophila, the issues faced in devising protocols for capturing, storing and reporting data are the same issues faced by any database with an interest in using phenotypic data to maximise the potential of genomic analysis.

INTRODUCTION

A major challenge for bioinformaticians is to learn to annotate the output of large-scale genome-sequencing projects accurately and meaningfully. Current efforts to define gene function are focused on protein coding regions. Methods of assigning function have relied on sequence comparisons of coding regions and extrapolation of function from a well-defined gene to a less-defined, but sequence-related, second gene of the same or a different organism.
Although this is a powerful technique, it falls down when faced with proteins as yet unrelated to any other protein, and it does not allow new functions to be defined for proteins related to, but distinct from, previously analysed proteins. Phenotypic analysis of mutant alleles can extend the understanding of gene function beyond predicted functions, and reveals the involvement of gene products in processes without constraint to lists of 'candidate genes' about which something is already known. Different alleles, with distinct molecular lesions in or around the transcription unit, may have different mutant phenotypes which inform our understanding of how that gene and its product function within the context of the whole organism. In particular, phenotypic analysis provides the opportunity to define regulatory regions such as complex promoters, enhancers and silencers, elements that are as yet difficult to identify computationally and may map many kilobases from the protein coding region. If we are to learn to use phenotypic data to exploit computational genome analysis to its full potential, and enhance the strength of functional genomics, we must establish repositories of phenotypic information that can be mined and used as starting points for developing more sophisticated genome annotation tools.

This review will discuss phenotypic data in FlyBase. FlyBase (home page shown in Figure 1) is the principal database resource for genetic and molecular information about Drosophila. The primary FlyBase server (ref. 1) is at Indiana University and is mirrored nightly to eight other sites around the world; see ref. 2 for a complete listing of mirror sites.

The Data Classes covered by FlyBase are Maps (genomic and genetic), Genes (gene and allele listings and descriptions, gene annotations resulting from the genome sequencing project, gene product information), Sequences (genome project resources), Stocks (mutant strains available from public stock centres), Transposons (transgene information), Aberrations (information about chromosomal rearrangements), Anatomy and Images (including browsers for phenotype and gene expression data), References (the bibliography of Drosophila publications) and People (an index of researchers working with Drosophila). Ancillary sections house Documents (including the FlyBase Reference Manual; ref. 3), News, Meeting listings and cross-links to other, related, databases. FlyBase has developed a suite of search tools to enable users to access the data (see ref. 4 for a complete listing). Phenotypic data in FlyBase are stored within the 'Genes' data section.

Figure 1: The FlyBase home page. Hyperlinking is represented by underlining, in this black and white view

The majority of FlyBase information pertains to Drosophila melanogaster, which has been an important model organism for the study of eukaryotic genetics and biology since the early years of the last century (for an introductory review see ref. 5). A combination of genetic utility, based on the classic tradition of chromosome mechanics, with the techniques of molecular biology means that researchers of D. melanogaster can custom-design alleles of choice and incorporate these into the genome for subsequent analysis as well as study mutations in the endogenous chromosomal copy of the gene.
The last decade of the 20th century saw the notable technical development of multicomponent over-expression systems such as the GAL4/UAS expression system (ref. 6). Using such systems, together with information from the genome-sequencing project (ref. 7), researchers can now generate loss of function and/or over-expression mutant alleles for any identified gene, and drive its expression in a wide range of tissues or developmental stages of D. melanogaster.

1999 saw the publication of more than a thousand papers describing phenotypic analysis in Drosophila. FlyBase captures, houses and reports data generated by phenotypic analysis to enable straightforward and comprehensive access to information that may have been reported in a wide variety of journals using a variety of nomenclatures. Phenotypic data are gathered from the published literature by literature curators. The curation process uses a system of attribution such that information arriving in FlyBase is stored and reported in association with the article where the information was published. FlyBase aims to facilitate the identification of everything that has been published about a particular process, mutant phenotype or gene.

THE STRUCTURE OF PHENOTYPIC DATA IN FLYBASE

Phenotypic information is tied to named, distinct alleles

Much of the power of phenotypic analysis lies in the distinction between phenotypes of different alleles of the same gene. The association of the distinct phenotypic characteristics with the distinct molecular natures of the mutant alleles is far more informative than a collation of all the phenotypic data for all the alleles of a gene into one data table.
For example, the information that a particular allele of Kettin, Ket^14, causes both recessive lethality and dominant flightlessness is more useful than knowing that different alleles of Kettin could be recessively lethal or dominantly flightless. Consequently, FlyBase strictly partitions phenotype data about distinct mutant alleles of each gene into distinct allele records.

FlyBase defines genes as objects and gives each a unique identifier of the form FBgnxxxxxxx, where x is an integer; for example Ket has the identifier FBgn0010396. Alleles are defined as objects with a unique identifier of the form FBalxxxxxxx; for example Ket^14 has the identifier FBal0009669. Since the Drosophila field is subject to a degree of nomenclature anarchy, the FlyBase curators must track the gene and allele symbols used in each publication with respect to previously published gene and allele symbols for the same entities, already defined as objects within the FlyBase data tables. Figure 2 shows a further example, for the Epidermal growth factor receptor gene, Egfr (FBgn0003731), which has been independently named in the literature by several groups.

Figure 2: Genes, alleles and phenotypic data in FlyBase: the data structure. Genes and alleles are defined as objects within the FlyBase data structure, and assigned Unique Identifier Numbers prefixed with FBgn for genes and FBal for alleles. An example using the Epidermal growth factor receptor gene, Egfr, and three of the many Egfr alleles, is shown in Figure 2. One symbol for each gene (Egfr) or allele (Egfr^CA27, Egfr^E1 and Egfr^f1) is considered to be valid (larger font) and is used in reports throughout FlyBase to represent that gene or allele. Alternative symbols used in the literature to denote each entity (such as DER, Elp, flb for the gene, and DER^CA27, Elp^E1 and flb^1F26 for the alleles) are stored within FlyBase as synonyms ('synonym' being the Drosophila community term for an 'alias'). Each gene is related to its alleles in a one-to-many relationship (denoted by diverging arrows). Phenotypic data statements are stored as attributes of each allele. Other Allele data classes (not shown) are class (for example amorph, hypomorph, neomorph), complementation data, relationship to causative transposon insertion (important for placing the allele, and therefore the gene, on the genomic sequence), relationship to causative chromosomal rearrangement (important for placing the allele, and therefore the gene, on the cytological map), molecular description of the causative lesion, relationship to the progenitor chromosome, and references to the published literature where that allele is discussed (see ref. 8 for a full description of allele data fields). Genetic information in FlyBase is housed in a Sybase relational database management system. Full details of this database schema are available on request from ref. 9

FlyBase relates genes to alleles in a hierarchical one-gene-to-many-alleles relationship, encoded in the data structure (see Figure 2). Once each allele has been defined as a distinct object within the data tables, data statements are captured and stored as attributes of that allele, with the relationship to the gene being retained through the gene-allele relationship.

Phenotypic data in FlyBase are captured and reported with a combination of controlled vocabularies and explanatory free text. The vocabularies are controlled both with respect to content and syntax, as described below. The free text explanations are segregated into distinct data fields. For example, free text about phenotypes caused by a single mutant allele is kept partitioned from the statements about multiple mutant combinations that reveal genetic interactions.
Within all free text fields, cross-references to other named items in FlyBase are marked so that these items are tagged (to facilitate updates when a symbol changes), and hyperlinked in reports (to allow the user to traverse different reports and data sections within FlyBase easily). The different fields are labelled distinctly in reports to aid the user in interpreting the data.

THE APPLICATION OF CONTROLLED VOCABULARIES TO PHENOTYPIC DATA

The usefulness and manageability of large bodies of information can be dramatically enhanced by the use of controlled vocabularies, or systems of key words, as an indexing tool. Controlled vocabularies impose a systematisation on the data, so that commonalities between fundamentally similar effects are revealed even though they may have been described in different terms in the literature. In addition, searches performed using controlled vocabulary terms are more efficient than those acting on phrases from free text information, and are therefore more effective. FlyBase applies two controlled vocabularies to index phenotypic data:

• The 'Phenotypic Class' controlled vocabulary presents a classification of phenotype in terms of its pathology or the effect on the whole organism.
• The 'Anatomy' controlled vocabulary describes the body parts affected in the mutant phenotype.

Some mutant phenotypes, such as behavioural phenotypes, lend themselves to description in terms of Phenotypic Class, but not (at least not initially) the anatomical focus of the effect. Other mutant phenotypes, such as the developmental mutations that disrupt the embryo, are described principally in terms of their mutant Anatomy; the majority are lethal at the embryonic stage and so to describe them in terms of this effect on the organism would be insufficiently discriminating.
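The gene-to-allele structure and identifier scheme described above can be illustrated with a minimal Python sketch. It is not FlyBase's Sybase schema, which is not reproduced in this paper; the class names, field names and the resolve_gene helper are hypothetical. The identifiers, symbols and synonyms are those quoted in the text, superscripted allele designations are written with a caret (Ket^14), and the two phenotype statements for Ket^14 are only an illustrative rendering of 'recessive lethality and dominant flightlessness'.

```python
import re
from dataclasses import dataclass, field

# Identifier formats quoted in the text: FBgnxxxxxxx and FBalxxxxxxx (seven digits).
GENE_ID = re.compile(r"FBgn\d{7}")
ALLELE_ID = re.compile(r"FBal\d{7}")


@dataclass
class Allele:
    allele_id: str
    symbol: str                                   # the single valid symbol
    synonyms: set = field(default_factory=set)    # other symbols used in the literature
    phenotypic_class: list = field(default_factory=list)
    manifest_in: list = field(default_factory=list)

    def __post_init__(self):
        if not ALLELE_ID.fullmatch(self.allele_id):
            raise ValueError(f"not an FBal identifier: {self.allele_id}")


@dataclass
class Gene:
    gene_id: str
    symbol: str
    synonyms: set = field(default_factory=set)
    alleles: dict = field(default_factory=dict)   # one gene -> many alleles

    def __post_init__(self):
        if not GENE_ID.fullmatch(self.gene_id):
            raise ValueError(f"not an FBgn identifier: {self.gene_id}")

    def add_allele(self, allele):
        # Phenotype statements live on the allele; the gene only holds the link.
        self.alleles[allele.allele_id] = allele
        return allele


def resolve_gene(genes, name):
    """Return the gene whose valid symbol or synonym equals name, else None."""
    for gene in genes:
        if name == gene.symbol or name in gene.synonyms:
            return gene
    return None


egfr = Gene("FBgn0003731", "Egfr", synonyms={"DER", "Elp", "flb"})
ket = Gene("FBgn0010396", "Ket")
ket14 = ket.add_allele(Allele("FBal0009669", "Ket^14"))
# Illustrative statements only; the exact vocabulary terms are an assumption.
ket14.phenotypic_class += ["lethal | recessive", "flightless | dominant"]
```

With these records, resolve_gene([egfr, ket], "DER") returns the Egfr record, which is the behaviour the synonym tracking is meant to provide: a symbol used by any publication leads to the one valid object.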
Applying the two categorisations of mutant description means that the vast majority of phenotypes reported in the literature can be represented at some level in controlled terms. The controlled vocabularies are implemented with a defined syntax to describe, in controlled language, aspects of the mutant phenotype of each allele, and any genetic interactions it might have with mutant alleles of another locus.

The Phenotypic Class controlled vocabulary

The Phenotypic Class controlled vocabulary includes 81 terms, which range from the general, such as 'lethal', through more specific terms that denote a certain pathology, such as 'tumorigenic', down to the very specific, such as 'song defective'. See ref. 10 for a full listing of the Phenotypic Class terms.

FlyBase uses qualifiers to add further explanation to the Phenotypic Classes. Examples include 'dominant', 'conditional temperature sensitive', 'maternal effect', and stage of development terms, for example 'embryonic', 'adult'. A vertical bar separates the term from the qualifier. See ref. 11 for a full listing of the Phenotypic Class qualifiers.

Example

Goodwin et al. (ref. 12) wrote about the behaviour of male flies mutant for the fru^sat allele (FBal0031286) of the fruitless locus:

    fru^sat males, when paired with either another fru^sat male or a wild-type female, exhibited similar low levels of courtship . . . Hemizygous fru^sat males showed levels of courtship comparable to those of fru^sat homozygotes . . . Despite this paucity of courtship, sufficient behaviour by fru^sat males was observed to reveal that the residual courtship was not normal. During wing extensions, fru^sat males produced only extremely brief sine-song bouts; no song pulses were generated . . .
    fru^sat males, when grouped together, displayed the typical fruitless behaviour of courtship chaining . . . fru^sat males were completely sterile when heterozygous with the deletions fru^w24 (n = 20) or P14 (n = 16).

This paragraph has been indexed with the following terms from the Phenotypic Class controlled vocabulary:

    Phenotypic class: viable
    Phenotypic class: sterile | recessive
    Phenotypic class: courtship defective | recessive
    Phenotypic class: song defective | recessive
    Phenotypic class: mating defective | recessive

There is no limit to the number of terms that can be used to describe a mutant phenotype.

The Anatomy controlled vocabulary

FlyBase uses a vocabulary of over 5,400 terms, compiled by Michael Ashburner (FlyBase-Cambridge) drawing on many sources, to index Anatomy data. See ref. 13 for a full listing of the Anatomy terms. The terms range from the gross anatomical (such as 'nervous system') through the single cell level (such as 'centripetally migrating follicle cell') to the subcellular level (such as 'lysosome').

The terms in the Anatomy vocabulary are related to each other in three ways, with the relationships represented by a directed acyclic graph. There are 'part of' relationships, so that the 'tarsus' is part of the 'leg'. There are 'instance/collective' relationships; thus 'adult alary muscle 1' is an instance of the 'adult alary muscle' collective term. Finally, there are progenitor and descendant relationships: the 'aCC neuron' develops from 'neuroblast NB1-1'.

Importantly, the controlled vocabulary also houses synonyms. So, for example, the valid term 'stomodeal ganglion' has the following synonyms: pharyngeal ganglion, oesophageal ganglion, esophageal ganglion, hypocerebral ganglion and ventricular ganglion. The mutant alleles are related to their anatomical terms.

Example

Bour et al. (ref. 14) described the phenotype of a sticks and stones allele, sns^A3.24 (FBal0117350):

    The muscle phenotype of embryos homozygous for the original sns^A3.24 allele includes a large number of unfused myosin-expressing cells and a corresponding absence of differentiated muscle fibres.

Using the Anatomy controlled vocabulary, FlyBase recorded:

    Phenotype manifest in: embryonic myoblast
    Phenotype manifest in: embryonic/larval somatic muscle

FlyBase uses qualifiers in combination with the body part terms, such as 'dorsal', 'precursor', 'ectopic', which further specify the anatomical curation. See ref. 15 for a full listing of these qualifiers.

Example

Cadavid et al. (ref. 16) described the phenotype of a liquid facets allele, lqf^FDD9 (FBal0104483):

    . . . additional photoreceptors in lqf mutants arise from specific precursor cells (M-cells) present early during eye development (Fig. 2N) . . .

Using the Anatomy controlled vocabulary, FlyBase recorded:

    Phenotype manifest in: photoreceptor cell | ectopic

Having the data indexed in this controlled way allows users to search FlyBase and expect to find all that FlyBase has pertaining to, for example, the 'midline glial cells' or the 'chordotonal organs', which in publications may sometimes have been referred to as 'median glial cells' or 'stretch receptors', respectively. At the time of writing, 862 anatomical terms had at least one alternative name; without the use of the controlled vocabulary a user would need to know all possible alternative nomenclatures when searching FlyBase data for specific mutant phenotypes.

A short note about genetic engineering

FlyBase considers engineered versions of genes, generated by in vitro mutagenesis and assayed in transgenic organisms, or in transient assays, to be alleles. Examples are rescue constructs, promoter fusion misexpression constructs, UAS (Upstream Activating Sequences) directed misexpression constructs and RNA interference (RNAi) constructs.
FlyBase can thus use the same rigour in capturing data about the phenotypic analysis of these transgenes as it does capturing the analysis of traditional mutant alleles. FlyBase (September 2000) had 48,434 allele records for D. melanogaster genes, of which 39,293 were 'traditional' and 9,141 were of the 'in vitro construct' type.

One powerful application of genetically engineered alleles is in determining the amount of genomic DNA within and around the transcription unit that is required for full function of the gene. In these experiments different constructs, generally including the gene's coding region but varying extents of flanking 5′ and 3′ or other potentially regulatory DNA, are assayed for their capacity to rescue the mutant phenotype for that gene. The difference between those constructs that partially rescue and those that fully rescue a mutant phenotype provides particularly important information about the location of those regulatory sequences responsible for the fine tuning required to give full and proper expression of that gene.

Example

Bloor and Brown (ref. 17) reported that a genomic rescue fragment including the inflated transcription unit, represented in FlyBase by an in vitro mutagenesis allele, if^tBa (FBal0089217), rescues the amorphic if^B4 allele (FBal0039417), leading to the following statement being captured for if^tBa:

    Rescues: if^B4

A derivative construct with intronic sequences removed, but coding region unchanged, represented by FlyBase allele if^minigene (FBal0089180), only partially rescues if^B4, leading to the following statement for if^minigene:

    Partially rescues: if^B4

Thus careful attention to details of mutant phenotypes of allelic combinations can be translated into genome annotation that extends beyond the open reading frame, in this case to an intronic regulatory sequence for inflated.
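The inference in the inflated example can be sketched as a set comparison: whatever a fully rescuing construct carries that a partially rescuing construct lacks is a candidate for regulatory function. The sketch below is only an illustration of that reasoning. The region labels ('coding region', 'introns') are a deliberate simplification of the constructs described, the function name is hypothetical, and the allele symbols (written with a caret for superscripts) and rescue statements are those quoted in the example.

```python
# Each construct is described by the regions it carries (simplified labels,
# an assumption for illustration) and the rescue statement recorded for it.
constructs = {
    "if^tBa": {
        "regions": {"coding region", "introns"},
        "statement": "Rescues: if^B4",
    },
    "if^minigene": {
        "regions": {"coding region"},
        "statement": "Partially rescues: if^B4",
    },
}


def candidate_regulatory_regions(constructs):
    """Regions present in a fully rescuing construct but absent from a
    partially rescuing one."""
    full = [c["regions"] for c in constructs.values()
            if c["statement"].startswith("Rescues:")]
    partial = [c["regions"] for c in constructs.values()
               if c["statement"].startswith("Partially rescues:")]
    candidates = set()
    for full_regions in full:
        for partial_regions in partial:
            candidates |= full_regions - partial_regions
    return candidates


print(candidate_regulatory_regions(constructs))
```

For the two inflated constructs the only difference is the intronic sequence, which matches the conclusion drawn in the text.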
In an extension of this approach, FlyBase defines 'foreign' genes and alleles to accommodate experiments where an allele of a gene from a non-drosophilid has been engineered into Drosophila. For example, the p35 gene of Baculovirus Autographa californica was introduced into D. melanogaster to manipulate the cell death pathway (ref. 18), giving rise to a gene record for BacA\p35 (FBgn0014459) and associated alleles in FlyBase. (Non-drosophilid genes are flagged with a species prefix followed by \; see ref. 19 for a complete list.)

Conditional genotypes

The genetic tractability of Drosophila has encouraged the development of bipartite mis-expression systems such as the GAL4/UAS system (ref. 6). In this system the Saccharomyces cerevisiae GAL4 transcription factor gene (Scer\GAL4, FBgn0014445 in FlyBase) is used to drive the expression of a gene by putting that gene under the cis-regulation of the corresponding regulatory S. cerevisiae UAS. A bank of Scer\GAL4 driving lines (over 950 as of September 2000, represented in FlyBase as alleles of Scer\GAL4) provides a wide diversity of expression patterns that can be used to drive the UAS lines. For such targeted mis-expression experiments, the phenotypic class or affected body part depends critically upon the expression pattern driving the UAS allele. FlyBase has developed a controlled syntax in order to guide the user through this complexity; the relevant genetic background is recorded (in a hyperlinked form in reports) within braces, { }, following the controlled vocabulary term.
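The controlled syntax (a field label, a vocabulary term, an optional qualifier after a vertical bar, and an optional genetic background in braces) lends itself to mechanical parsing. The following Python sketch is a hypothetical parser, not FlyBase code: the regular expression and the function name are assumptions, it handles single-term statements only (not the comma-separated genetic interaction statements), and allele superscripts are written with a caret.

```python
import re

# label: term [| qualifier] [{genetic background}]
STATEMENT = re.compile(
    r"^(?P<field>[^:]+):\s*"                        # e.g. 'Phenotypic class'
    r"(?P<term>[^|{]+?)\s*"                         # controlled vocabulary term
    r"(?:\|\s*(?P<qualifier>[^{]+?)\s*)?"           # optional qualifier after '|'
    r"(?:\{\s*(?P<background>[^}]*?)\s*\})?\s*$"    # optional background in braces
)


def parse_statement(line):
    """Split one controlled-syntax statement into its parts.
    Missing parts are returned as None."""
    match = STATEMENT.match(line.strip())
    if match is None:
        raise ValueError(f"not a controlled-syntax statement: {line!r}")
    return match.groupdict()


# Raw strings keep the backslash of the species prefix (Scer\) intact.
examples = [
    r"Phenotypic class: sterile | recessive",
    r"Phenotype manifest in: photoreceptor cell | ectopic",
    r"Phenotypic class: hyperplastic {Scer\GAL4^dpp.blk1}",
    r"Phenotype manifest in: eye {Scer\GAL4^sev.EP}",
]
for line in examples:
    print(parse_statement(line))
```

Keeping the background as a separate part means the same term can be reported differently for each driver, which is the point of the braces: the phenotype belongs to the combination, not to the UAS allele alone.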
Example

When the expression of Ras85D^V12.S35.Scer\UAS (FBal0085936, named by FlyBase for a UAS construct encoding a constitutively active form of Ras85D) is driven in the expression pattern of dpp by Scer\GAL4^dpp.blk1 (FBal0040480), as reported by Karim and Rubin (ref. 20), the result was hyperplastic wing discs, giving rise to the following lines in the Ras85D^V12.S35.Scer\UAS allele record and report:

    Phenotypic class: hyperplastic {Scer\GAL4^dpp.blk1}
    Phenotype manifest in: dorsal mesothoracic disc {Scer\GAL4^dpp.blk1}

Note that the allele of Scer\GAL4 which drives the expression of the UAS construct is recorded within { } braces. However, when expression of the same Ras85D^V12.S35.Scer\UAS allele is driven in the pattern of sevenless by Scer\GAL4^sev.EP (FBal0102834), as reported by Therrien et al. (ref. 21), a different phenotype is observed:

    Phenotypic class: visible {Scer\GAL4^sev.EP}
    Phenotype manifest in: eye {Scer\GAL4^sev.EP}

Genetic interactions

A major application of phenotypic analysis in D. melanogaster is the study of multiple mutant genotypes in an effort to dissect genetic interactions and biological pathways. FlyBase uses the 'conditional genotype' device superimposed on the Phenotypic Class and Anatomy vocabularies to represent such data, which are stored in a data field distinct from single mutant phenotypic data, and reported in allele records prefixed with the string 'Genetic interaction'.

Example

Nagel et al. (ref. 22) described the partial suppression of the dominant H^2 (FBal0005293) wing vein phenotype by dx^24 (FBal0045558).
This observation is encoded in the following controlled vocabulary statements:

For dx^24:

    Genetic interaction (effect, class): suppressor | partially, visible {H^2/+}
    Genetic interaction (effect, anatomy): suppressor | partially, wing vein {H^2/+}

For H^2:

    Genetic interaction (anatomy, effect): wing vein, suppressible | partially {dx^24}
    Genetic interaction (class, effect): visible | dominant, suppressible | partially {dx^24}

Thus the effect is evident in the records for both alleles participating in the interaction and, because the allele symbols within the braces, { }, are hyperlinked, the user can easily move between the two sides of the interaction to find out more about the participating alleles. These interactions are reflected at the Gene level in the Gene Reports by a statement for H (FBgn0001169):

    Interacts genetically with: dx

and a corresponding statement for dx (FBgn0000524):

    Interacts genetically with: H

A short note about functional complementation

The FlyBase commitment to capturing phenotypic data for genetically engineered alleles facilitates the recognition of experiments that test functional equivalence between genes of Drosophila and other organisms, for example between chicken (Gallus gallus) Dach2 and dac (dachshund) mutants of D. melanogaster. Data captured at the allele level from Heanue et al. (ref. 23) led to the following gene-level statements: that dac (FBgn0005677) is

    Functionally complemented by: Ggal\Dach2

and conversely that Ggal\Dach2 (FBgn0029169)

    Functionally complements: dac

The records for the 'foreign' genes in FlyBase include cross-links to the non-drosophilid gene record in the genome database specific to the non-drosophilid in question, or the DNA sequence accession record. For example, for chicken Dach2 the cross-links are to EMBL:AF198349. Having these cross-references stated explicitly in FlyBase aids the extrapolation of conclusions about gene function and homology relationships across species.
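The reciprocal bookkeeping described in this section, in which one observed genetic interaction yields statements in both allele records and in both gene reports, can be sketched as follows. This is a hypothetical illustration rather than FlyBase's curation software: the Interaction class, its field names and the two functions are assumptions, only the suppressor/suppressible pair that appears in the text is modelled, and allele superscripts are written with a caret.

```python
from dataclasses import dataclass

# Only the pair shown in the text; other effects would need their own entries.
INVERSE_EFFECT = {"suppressor": "suppressible"}


@dataclass(frozen=True)
class Interaction:
    modifier_allele: str       # allele that modifies the phenotype, e.g. dx^24
    modifier_gene: str
    target_allele: str         # allele whose phenotype is modified, e.g. H^2
    target_gene: str
    target_genotype: str       # background recorded in braces for the modifier
    effect: str                # e.g. 'suppressor'
    degree: str                # e.g. 'partially'
    phenotypic_class: str      # e.g. 'visible'
    class_qualifier: str       # e.g. 'dominant'
    anatomy: str               # e.g. 'wing vein'


def allele_statements(i):
    """Statements for the records of both alleles taking part in the interaction."""
    inverse = INVERSE_EFFECT[i.effect]
    return {
        i.modifier_allele: [
            f"Genetic interaction (effect, class): {i.effect} | {i.degree}, "
            f"{i.phenotypic_class} {{{i.target_genotype}}}",
            f"Genetic interaction (effect, anatomy): {i.effect} | {i.degree}, "
            f"{i.anatomy} {{{i.target_genotype}}}",
        ],
        i.target_allele: [
            f"Genetic interaction (anatomy, effect): {i.anatomy}, "
            f"{inverse} | {i.degree} {{{i.modifier_allele}}}",
            f"Genetic interaction (class, effect): {i.phenotypic_class} | "
            f"{i.class_qualifier}, {inverse} | {i.degree} {{{i.modifier_allele}}}",
        ],
    }


def gene_statements(i):
    """Gene-level roll-up: each gene report names the other gene."""
    return {
        i.target_gene: f"Interacts genetically with: {i.modifier_gene}",
        i.modifier_gene: f"Interacts genetically with: {i.target_gene}",
    }


nagel = Interaction("dx^24", "dx", "H^2", "H", "H^2/+",
                    "suppressor", "partially", "visible", "dominant", "wing vein")
for allele, lines in allele_statements(nagel).items():
    print(allele, lines)
print(gene_statements(nagel))
```

Generating both sides from one record, rather than curating each side by hand, is one way to guarantee that the two allele reports never disagree; whether FlyBase does this internally is not stated in the paper.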
Free text can be useful

While controlled vocabularies index many aspects of the data, FlyBase also makes use of free text information to describe further details of mutant phenotypes. Although free text constitutes a problem for databases, in that it proliferates and is difficult to manage or search meaningfully, it is useful to both database users and curators. For example, the inclusion of an Anatomy term does not describe what sort of change has occurred, simply that the tissue is affected. A reduced wing, an absent wing or an unfurled wing will all be indexed with

    Phenotype manifest in: wing

and the user might be grateful to see a description of precisely how the wing is affected.

Any decision to alter a controlled vocabulary to reflect data more accurately, and thereby obviate the need for some free text, must be balanced against the substantial work incurred 'retrofitting' that change to older data in the database. Although gratuitous changes are to be avoided, change must be permitted to improve the value of the vocabularies, for example when a new class of phenotype begins to be assayed in Drosophila. On these occasions the free text in FlyBase is invaluable for those curators to whom the task of retrofitting the new term falls.

Phenotypic data are subtle data; the parameters are dependent upon the assay system chosen and the scope traverses a vast range of detail. The archiving of phenotypic data in databases is in its infancy. Perhaps, eventually, it will be possible to describe phenotypes entirely in controlled terms and controlled syntax. Perhaps we can develop pan-biological phenotypic descriptions analogous to the pan-biological vocabularies being used to describe protein families. But until we do, free text will inevitably remain with us.
ACCESSING PHENOTYPIC DATA

There are three ways of targeting investigations about phenotypes within FlyBase: searching using Allele Search, browsing using the Expression Summary and browsing based on the Image Views. The following examples use results obtained from the FlyBase server in September 2000.

The Allele Search route to phenotypic data

The Allele Search form (see Figure 3) can be accessed via links from the FlyBase home page (Figure 1), the Genes data directory (ref. 25) and the All Searches page (ref. 4). The Allele query form allows 'type in' searches where the field to be searched can be specified (set to 'Symbol/synonym' and 'Body part' in Figure 3). Since the Anatomy vocabulary (described above) is large and contains terms with multiple names, the user is provided with a reference file (Body Parts Help) which lists valid terms and synonyms.

Example

Entering 'photoreceptor cell R7' in the 'Body part' field returns a table of over 192 alleles which explicitly include the statement

    Phenotype manifest in: photoreceptor cell R7

The Allele Search form also provides scrolling menus for selecting search terms from the controlled vocabularies used in classifying allele data, including the Phenotypic Class vocabulary described above. Terms can be chosen individually or in combinations, allowing the user to build appropriate queries.

Figure 3: The Allele Search form (ref. 24). The two boxes in the top right section of the form allow 'type in' searches. To the left of these two boxes are pull-down menus, which limit the data field to be searched. In this example 'Symbol/synonym' and 'Body part' have been selected (11 other options are provided, including 'Any field'). Scrolling menus (Allele class, Mutant phenotype class, Mutant phenotype class qualifiers, Origin of mutant alleles) in the lower panels allow the user to build queries using defined controlled vocabulary terms only. In this example the Phenotypic Class 'paralytic' has been selected

Example

A query for 'paralytic' alleles (selected in the 'Mutant phenotype class' menu; see Figure 3) retrieves a hit list of over 90 alleles which include in their record the statement

    Phenotypic class: paralytic

Refining this query by selecting the 'recessive' option for 'Mutant phenotypic class qualifier' narrows the results to the 60 or so alleles where the paralysis has been explicitly described as recessive:

    Phenotypic class: paralytic | recessive

The scrolling menu query boxes can be used in conjunction with terms typed into one or both of the query boxes at the top of the form to build more specific queries.

The Expression Summary route to phenotypic data

The Expression Summary browsing tool provides a tabular interface, built on the Anatomy controlled vocabulary, which allows the user to navigate through the Anatomy data to a specific tissue or structure, and then, having found that structure, return lists of FlyBase objects that have been indexed with the term corresponding to that structure. (Expression pattern data of transcripts, polypeptides and reporter genes in FlyBase are expressed using the same Anatomy controlled vocabulary as is used for the mutant phenotypes.)

The Expression Summary entry level table (shown in Figure 4, background) is linked to the FlyBase home page, via the 'Gene Expression' hyperlink in the Selected Searches and Tools column of the Genes data class (see Figure 1), or from the All Searches page (ref. 4). The table displays (leftmost column, 'Term') the top level terms of the Anatomy controlled vocabulary (germlayer, tagmata, organ system (partly shown) and developmental stage (not shown)). These terms are hyperlinked to their corresponding sections further down the Anatomy hierarchy.
In the Expression Summary table, the terms representing the next level in the hierarchy are stated explicitly, with indentations. The columns in the main body of the table list the numbers of Genes, Mutant Alleles, Reporters, Transcripts and Polypeptides which have been indexed with each term. These numbers are hyperlinked to listings of the relevant objects and thence to full reports for each of these objects. Descending the hierarchy to a more specific level (see Figure 4, foreground), for example to the Expression Summary for the tracheal system, shows more detail, in this example the subdivisions of the tracheal system. Thus 22 Mutant Allele records include the statement

    Phenotype manifest in: embryonic/larval tracheal pit

and the list of the 22 alleles, and thence their reports, can be accessed using the hyperlinked '22' in the Mutant Alleles column.

The Image Browser route to phenotypic data

FlyBase has recently begun offering an image-based route to mutant phenotype and expression pattern data which, like the Expression Summary, is based upon the Anatomy controlled vocabulary. Selecting 'Images' (ref. 28) from the 'Anatomy' data section of the FlyBase home page reveals a panel of over 70 thumbnail sketches of the external anatomy of D. melanogaster adults and embryos, each hyperlinked to a full page image, one of which is shown in Figure 5(a). The images are annotated with relevant terms from the Anatomy controlled vocabulary, listed beside each image. These images are interactive: moving the cursor over an anatomical term highlights the corresponding structure on the image, and moving the cursor over the image highlights the relevant anatomical term. Each term is hyperlinked to a 'Term Report', an example of which is shown in Figure 5(b). The Term Report lists the various FlyBase objects that have been indexed with the term ('embryonic maxillary segment' in Figure 5(b)), and each item in the list is hyperlinked to the complete report for that item.
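The indexing that underlies both the Expression Summary counts and the Term Report listings can be pictured as a mapping from each Anatomy term to the objects annotated with it, grouped by object class. The following Python sketch is a hedged illustration only: the `TermIndex` class, the `component_of` helper and the parent links in `PARENT` are assumptions made for the example, and counts here cover objects indexed directly with a term, with no roll-up from more specific terms. The alleles esc2 and esc6 are those listed for 'embryonic maxillary segment' in the Figure 5(b) Term Report.

```python
from collections import defaultdict

# Hypothetical fragment of the Anatomy hierarchy, stored as child -> parent.
# The exact parent links are assumed for illustration.
PARENT = {
    "tracheal system": "organ system",
    "embryonic/larval tracheal pit": "tracheal system",
}


def component_of(term):
    """Return the trail from the top level term down to `term`."""
    trail = [term]
    while trail[-1] in PARENT:
        trail.append(PARENT[trail[-1]])
    return list(reversed(trail))


class TermIndex:
    """Maps an anatomy term to the objects indexed with it, by object class."""

    def __init__(self):
        self._index = defaultdict(lambda: defaultdict(set))

    def add(self, term, object_class, symbol):
        self._index[term][object_class].add(symbol)

    def term_report(self, term):
        """Sorted listings per object class, as in a Term Report."""
        classes = self._index.get(term, {})
        return {cls: sorted(symbols) for cls, symbols in classes.items()}

    def summary_row(self, term):
        """Counts per object class, as in one Expression Summary row."""
        classes = self._index.get(term, {})
        return {cls: len(symbols) for cls, symbols in classes.items()}


index = TermIndex()
index.add("embryonic maxillary segment", "Mutant Alleles", "esc2")
index.add("embryonic maxillary segment", "Mutant Alleles", "esc6")

print(index.term_report("embryonic maxillary segment"))
# {'Mutant Alleles': ['esc2', 'esc6']}
print(index.summary_row("embryonic maxillary segment"))
# {'Mutant Alleles': 2}
print(component_of("embryonic/larval tracheal pit"))
# ['organ system', 'tracheal system', 'embryonic/larval tracheal pit']
```

Keeping the listings and the counts in one structure reflects the behaviour described in the text, where each number in the table is simply a link to the list it counts.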
In the Figure 5(b) example, two mutant alleles, esc2 and esc6, have phenotypic data recorded in terms of 'embryonic maxillary segment'. In addition, the Term Report provides a link to the Expression Summary table (see Figure 4) for the term in question, allowing further exploration of the anatomy data without having to begin again at a high level.

Figure 4: Background panel: the Expression Summary entry page (ref. 26). Foreground panel: detail from the Expression Summary for the tracheal system (ref. 27), showing only the top section of the page and the 'All Genes' and 'Mutant Alleles' columns. The 'Component of' statement at the top of the page provides a trail through the Anatomy controlled vocabulary to the highest level term currently on display in the Expression Summary table. Hyperlinking is represented by underlining, in this black and white view.

CONCLUDING REMARKS

Phenotypic data are complex data, but they have enormous potential to enrich our understanding of genes and their functions. During the last quarter of a century molecular genetic analysis has revealed commonalities across widely diverse organisms; we now understand that we can exploit model organisms to learn lessons that will be generally applicable in biology. It is therefore imperative that we learn to use phenotypic data to inform the genome annotation of the model organisms so that meaningful extrapolations of function can be made to species where genetic analysis is less tractable, as is the case for humans. This paper has described how FlyBase is addressing this issue, though no doubt this is only a beginning. In 1921 H. J. Muller, one of the founders of Drosophila research, asked 'Must we geneticists become bacteriologists, physiological chemists and physicists, simultaneously with being zoologists and botanists? Let us hope so' (quoted in ref. 31).
Were he alive today he would perhaps add 'and bioinformaticians, too'.

Figure 5: The Image Browser. (a) An example of an image (ref. 29) that can be accessed from the Image Browser thumbnail sketch page (ref. 28). This example shows a lateral view of an embryo approximately mid-way through embryogenesis. The cursor has been moved to the 'embryonic maxillary segment' on the image and the corresponding term has consequently become highlighted in the body parts list to the right of the image. This highlighting of the term (and the corresponding structure on the image) indicates hyperlinking which can be followed to the Term Report for that body part. (b) The Term Report (ref. 30) for the 'embryonic maxillary segment'. The Term Report provides listings of Genes, Alleles, Proteins, Transcripts and Reporters that have been indexed with the term in question. A link to the Expression Summary for the term is also provided (see Figure 4 for an example of an Expression Summary). Not shown, but included in the Term Report, are a series of hyperlinks to other images annotated with that term, and a statement describing the position of the term in the Anatomy controlled vocabulary. Hyperlinking is represented by underlining, in this black and white view.

Acknowledgements

The phenotypic data in FlyBase are curated at FlyBase-Cambridge by Gillian Millburn, Chihiro Yamada and the author. Phenotypic data processing and report generation have been carried out by Aubrey de Grey (FlyBase-Cambridge), Joe Lemaire and David Emmert (FlyBase-Harvard), and Don Gilbert and Victor Strelets (FlyBase-Indiana).
Don Gilbert and Gary Grumbling (FlyBase-Indiana) are responsible for the development of the Expression Summary and Image Browsing tools. For a full listing of FlyBase consortium members see ref. 32. FlyBase is supported by grants from the National Institutes of Health, USA, and the Medical Research Council, UK. The author thanks her colleagues at FlyBase-Cambridge and Thomas Weaver, Incyte Genomics, for comments on the manuscript.

References

1. http://flybase.bio.indiana.edu
2. http://flybase.bio.indiana.edu/alt-views/mirrors/
3. http://flybase.bio.indiana.edu/data/docs/refman/refman-sections.html
4. http://flybase.bio.indiana.edu/search/
5. Rubin, G. M. and Lewis, E. B. (2000), 'A brief history of Drosophila's contributions to genome research', Science, Vol. 287, pp. 2216–2218.
6. Brand, A. H. and Perrimon, N. (1993), 'Targeted gene expression as a means of altering cell fates and generating dominant phenotypes', Development, Vol. 118, pp. 401–415.
7. Adams, M. D. et al. (2000), 'The genome sequence of Drosophila melanogaster', Science, Vol. 287(5461), pp. 2185–2195.
8. http://flybase.bio.indiana.edu/data/docs/refman/refman-B.html#B.1.
9. fl[email protected]
10. http://flybase.bio.indiana.edu/docs/LK/controlled-vocabularies.txt (section headed PHENOTYPIC CLASS).
11. http://flybase.bio.indiana.edu/docs/LK/controlled-vocabularies.txt (section headed PHENOTYPIC CLASS QUALIFIER).
12. Goodwin, S. F., Taylor, B. J., Villella, A. et al. (2000), 'Aberrant splicing and altered spatial expression patterns in fruitless mutants of Drosophila melanogaster', Genetics, Vol. 154(2), pp. 725–745.
13. http://flybase.bio.indiana.edu/docs/LK/controlled-vocabularies.txt (section headed BODY PART).
14. Bour, B. A., Chakravarti, M., West, J. M. and Abmayr, S. M. (2000), 'Drosophila SNS, a member of the immunoglobulin superfamily that is essential for myoblast fusion', Genes Dev., Vol. 14, pp. 1498–1511.
15.
http://flybase.bio.indiana.edu/docs/LK/controlled-vocabularies.txt (section headed BODY PART QUALIFIER).
16. Cadavid, A. L., Ginzel, A. and Fischer, J. A. (2000), 'The function of the Drosophila fat facets deubiquitinating enzyme in limiting photoreceptor cell number is intimately associated with endocytosis', Development, Vol. 127, pp. 1727–1736.
17. Bloor, J. W. and Brown, N. H. (1998), 'Genetic analysis of the Drosophila alphaPS2 integrin subunit reveals discrete adhesive, morphogenetic and sarcomeric functions', Genetics, Vol. 148, pp. 1127–1142.
18. Hay, B. A., Wolff, T. and Rubin, G. M. (1994), 'Expression of baculovirus P35 prevents cell death in Drosophila', Development, Vol. 120, pp. 2121–2129.
19. http://flybase.bio.indiana.edu/docs/nomenclature/LK/species-abbreviations.txt
20. Karim, F. D. and Rubin, G. M. (1998), 'Ectopic expression of activated Ras1 induces hyperplastic growth and increased cell death in Drosophila imaginal tissues', Development, Vol. 125, pp. 1–9.
21. Therrien, M., Wong, A. M., Kwan, E. and Rubin, G. M. (1999), 'Functional analysis of CNK in RAS signaling', Proc. Natl Acad. Sci. USA, Vol. 96, pp. 13259–13263.
22. Nagel, A. C., Maier, D. and Preiss, A. (2000), 'Su(H)-independent activity of Hairless during mechano-sensory organ formation in Drosophila', Mech. Dev., Vol. 94, pp. 3–12.
23. Heanue, T. A. et al. (1999), 'Synergistic regulation of vertebrate muscle development by Dach2, Eya2, and Six1, homologs of genes required for Drosophila eye formation', Genes Dev., Vol. 13, pp. 3231–3243.
24. http://flybase.bio.indiana.edu/genes/fbalquery.hform/
25. http://flybase.bio.indiana.edu/genes/
26. http://flybase.bio.indiana.edu/.bin/expat/
27. http://flybase.bio.indiana.edu/.bin/cvreport.html?term%22tracheal system%22commandexpat
28. http://flybase.bio.indiana.edu/.bin/fbimage/
29. http://flybase.bio.indiana.edu/.bin/fbibq.html?FBim2715464
30. http://flybase.bio.indiana.edu/.bin/cvreport.html?%27embryonicmaxillary segment%27/
31. Carlson, E.
A. (1972), 'H. J. Muller (1890–1967)', Genetics, Vol. 70, pp. 1–30.
32. http://flybase.bio.indiana.edu/.data/docs/refman/refman-I.html#I.4

© Henry Stewart Publications 1467-5463. Briefings in Bioinformatics. Vol 2. No 1. 68–80. March 2001