GUS: The Genomics Unified Schema
A Platform for Genomics Databases

V. Babenko, B. Brunk, J. Crabtree, S. Diskin, S. Fischer, G. Grant, Y. Kondrahkin, L. Li, J. Liu, J. Mazzarelli, D. Pinney, A. Pizarro, E. Manduchi, S. McWeeney, J. Schug, C. Stoeckert
Center for Bioinformatics, University of Pennsylvania
stevef,[email protected]

Overview

Abstract

The Genomics Unified Schema (GUS) is a strongly typed relational database schema and accompanying portable object-based software platform used for integration, analysis, curation, mining and presentation of sequence-based genomics information. The schema is organized into five domains: a detailed model of the central dogma (gene, RNA, protein) including DNA, assembled RNA, and protein sequence, and a diversity of sequence annotation (DoTS); an MGED-compliant warehouse of transcript expression experiments (RAD); a catalogue of grammars describing regulatory regions (TESS); a wide range of controlled vocabularies and ontologies (SRES); and a detailed representation of data provenance (CORE). (A sixth domain for protein expression is in progress.) GUS’s normalized relational structure and extent of integrated data enable powerful queries not viable in many other genomics systems. The platform facilitates maintenance of the warehouse and its utilization in web and data mining applications.

Goals of GUS

• Generic platform for model organism or disease specific databases
• Freely available at www.gusdev.org and www.cbil.upenn.edu
• Integration of genome, transcript and protein data, including:
    • Sequence
    • Function
    • Expression
    • Interaction
    • Regulation
    • Orthologs and paralogs
• Support for:
    • automated annotation and integration
    • manual curation
    • data mining/analysis and sophisticated queries
    • web access

GUS Powers Multiple Genomics DBs

[Figure: AllGenes, PlasmoDB, EPConDB, other sites and other projects; Java Servlets; DoTS, RAD, TESS, SRES, Core; Oracle RDBMS; Object Layer for Data Loading.]

Components of GUS

• Relational database schema
• Lightweight object layer
• Application frameworks
    • Data access
    • Pipeline/workflow
    • Web (servlets)
• Applications
    • Annotator’s interface
    • Parsers and exporters (using standards)
    • Annotation and analysis programs
    • Schema browser
• Utilizes Oracle 9i

Architecture of GUS

[Figure: architecture diagram. Labels: GenBank, InterPro, GO, etc; Genomic Sequence; GSSs & ESTs; Mapping Data; microarray & SAGE Experiments; QTL, POP, SNP, Clinical; Automated Analysis & Integration; Annotation; Annotator’s Interface; Object Layer; Oracle/SQL; DoTS, TESS, RAD, Core, SRes; Java Servlets & Perl CGI; WWW queries, browsing, & download; Mining Applications.]

Usage of GUS

• Annotation
    • Of genomes: gene models, sequence features
    • Of genes: function, expression, regulation
• Integration
    • From sequence to expression
    • Map identifiers to/from external databases
• Data mining, creating curated datasets
    • Algorithm-based: GO function prediction
    • Genome-wide querying: find all pancreas-specific transcripts
    • PANCchip: non-redundant genes expressed in pancreas found using ESTs, microarrays and cDNA libraries

GUS Schema

Schema features

• Extensive integrated genomics schema (300 tables)
• Divided into 5 distinct domains
• Highly normalized
• Strongly typed
• Subclassing
    • Use views of superclass to define subclasses
    • Useful for mapping into the object layer
• Warehousing
    • Controlled vocabularies used extensively
    • Avoid using name-value pairs
    • Include databases such as GenBank, GO terms, ProDom, CDD
    • Facilitates management of value-added annotation across updates
    • Cross references to external databases
    • Tracking and versioning

Five domains

GUS is divided into 5 domains* (separate name spaces)

Namespace                        Domain                    Highlights
Core                             Data Provenance           Evidence
SRes (Shared Resources)          Shared Resources          Ontologies
DoTS (DB of Transcribed Seqs)    Sequence and annotation   Central dogma
RAD (RNA Abundance DB)           Gene expression           MIAME/MAGE
TESS (Trans Elem Search Site)    Gene regulation           Grammars

* Protein interaction domain underway

Querying across the domains

• Core: Data Provenance
    • Ownership
    • Protection
    • Algorithms
    • Versioning
    • Workflows
• DoTS: Genomic Sequence
    • Genes, gene models
    • STSs, repeats, etc
    • Cross-species analysis
• SRes: Ontologies
    • GO
    • Species
    • Anatomy/Tissue
    • Developmental stage
    • Disease state
• DoTS: Transcribed Sequence
    • Characterize transcripts
    • RH mapping
    • Library analysis
    • Cross-species analysis
    • DoTS assemblies
• DoTS: Protein Sequence
    • Domains
    • Function
    • Structure
    • Cross-species analysis
• RAD: Transcript Expression
    • Arrays
    • SAGE
    • Conditions
• TESS: Gene Regulation
    • Binding Sites
    • Patterns
    • Grammars

Example query spanning SRes, RAD, Core, TESS and DoTS: "Transcription factors upregulated in acute myeloid leukemia with sequence similarity to c-fos and common promoter motifs"

DoTS central dogma schema

Entity     Instance            Feature                             Sequence
Gene       Gene Instance       Gene Feature (isa NA Feature)       Genomic Sequence (isa NA Sequence)
RNA        RNA Instance        RNA Feature (isa NA Feature)        RNA Sequence (isa NA Sequence)
Protein    Protein Instance    Protein Feature (isa AA Feature)    Protein Sequence (isa AA Sequence)

RAD schema uses MAGE/MIAME

MAGE: Experiment, Array, BioMaterial, BioAssay, BioAssayData, Protocol, Descr.,
HigherLevelAnalysis
MIAME: Experimental Design, Array design, Samples, Hybridization, Measure, Normalization

[Figure: RAD schema diagram. Tables: StudyAssay, Array, Assay, Study, StudyDesignAssay, ArrayAnnotation, StudyDesign, Control, ElementAnnotation, BioMaterialCharacteristic, BioMaterialImp, ElementImp, StudyFactor, StudyDesignDescription, StudyFactorValue, AssayLabeledExtract, Channel, CompositeElementImp, BioMaterialMeasurement, Acquisition, LabelMethod, RelatedAcquisition, CompositeElementAnnotation, OntologyEntry, Treatment, AcquisitionParam, ElementResultImp, CompositeElementResultImp, ProcessResult, Quantification, MAGEDocumentation, RelatedQuantification, ProtocolParam, ProcessIO, MAGE_ML, QuantificationParam, Protocol, AnalysisInput, ProcessInvocation, ProcessInvocationParam, ProcessImplementationParam, AnalysisInvocation, AnalysisInvocationParam, AnalysisOutput, ProcessImplementation, Analysis, AnalysisImplementation, AnalysisImplementationParam.]

TESS schema

Tables shown in the TESS schema diagram: TESS.Moiety, Moiety, MoietyHeterodimer, MoietyMultimer, MoietyComplex, DoTS.NaFeature, TESS.Activity, ActivityProteinDnaBinding, BindingSite, TESS.FootprintInstance, Promoter, ActivityTissueSpecificity, ...,
TESS.TrainingSet, TESS.Model, ModelString, DoTS.NaSequence, TESS.ParameterGroup, ModelConsensusString, ModelPositionalWeightMatrix, TESS.Note, ModelGrammar

Ontologies and vocabularies

Ontologies
• Gene Ontology (GO)
• Sequence Ontology (SO) (sequence features)
• Phenotype and Trait Ontology (PATO)
• Taxon (NCBI)
• Anatomy (Penn)
• Disease (ICD9)
• Developmental stage (multiple sources)

And vocabularies
• External database names
• Genetic codes
• Review status

Evidence trail

Evidence and tracking
• Data tables have columns for user, date, project, algorithm invocation
• Tables dedicated to algorithm, algorithm version and parameters
• 176 algorithms, including public and in-house
• Tracks automated and manual annotation, similarity and integration

Versioning
• All updated or deleted rows are copied to version table

Sophisticated queries

Sample queries from three projects that utilize GUS’s data integration and analysis:

• www.allgenes.org: “Is my cDNA similar to any mouse genes that are predicted to encode transcription factors and have been localized to mouse chromosome 5?”
• http://plasmodb.org: “List all genes whose proteins are predicted to contain a signal peptide and for which there is evidence that they are expressed in Plasmodium falciparum’s late schizont stage”
• www.cbil.upenn.edu/EPConDB: “Which genes on chromosome 2 are expressed in pancreas and are involved in signal transduction based on GO function assignments?”

Application Frameworks

GUS Object layer
• Lightweight Perl implementation
• Java on the way
• One object per table
• Parent/child relationships
• Cascading delete

Data input

The GusApplication program manages inserts and updates to GUS, handling tracking and versioning. Specific tasks are implemented as plugins. Plugins use either GUS objects or SQL access. Low-level database access is provided by DBI classes.
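
The tracking and versioning behaviour described above (tracking columns on each data table, and a copy of every updated or deleted row kept in a version table) can be illustrated with a minimal sketch. It uses Python's standard sqlite3 module rather than GUS's Perl object layer and Oracle, and the table names (gene, gene_ver), the column names and the helper functions are all hypothetical, chosen only for illustration; they are not the GUS schema or the GusApplication API.

```python
import sqlite3

# Hypothetical miniature schema: one data table carrying tracking columns
# (user, project, algorithm invocation, date) and a matching version table.
# All names here are illustrative assumptions, not taken from GUS.
SCHEMA = """
CREATE TABLE gene (
    gene_id            INTEGER PRIMARY KEY,
    name               TEXT NOT NULL,
    row_user           TEXT NOT NULL,
    row_project        TEXT NOT NULL,
    row_alg_invocation INTEGER NOT NULL,
    modification_date  TEXT NOT NULL
);
CREATE TABLE gene_ver (
    gene_id            INTEGER,
    name               TEXT,
    row_user           TEXT,
    row_project        TEXT,
    row_alg_invocation INTEGER,
    modification_date  TEXT,
    version_date       TEXT NOT NULL
);
"""

COLUMNS = ("gene_id, name, row_user, row_project, "
           "row_alg_invocation, modification_date")


def archive_row(conn, gene_id, when):
    """Copy the current state of a row into the version table."""
    conn.execute(
        f"INSERT INTO gene_ver ({COLUMNS}, version_date) "
        f"SELECT {COLUMNS}, ? FROM gene WHERE gene_id = ?",
        (when, gene_id),
    )


def insert_gene(conn, name, user, project, invocation, when):
    """Insert a new row, recording who/what created it and when."""
    with conn:
        cur = conn.execute(
            "INSERT INTO gene (name, row_user, row_project, "
            "row_alg_invocation, modification_date) VALUES (?, ?, ?, ?, ?)",
            (name, user, project, invocation, when),
        )
    return cur.lastrowid


def rename_gene(conn, gene_id, new_name, user, project, invocation, when):
    """Archive the old row, then update it, in a single transaction."""
    with conn:
        archive_row(conn, gene_id, when)
        conn.execute(
            "UPDATE gene SET name = ?, row_user = ?, row_project = ?, "
            "row_alg_invocation = ?, modification_date = ? WHERE gene_id = ?",
            (new_name, user, project, invocation, when, gene_id),
        )


def delete_gene(conn, gene_id, when):
    """Archive the row, then delete it, in a single transaction."""
    with conn:
        archive_row(conn, gene_id, when)
        conn.execute("DELETE FROM gene WHERE gene_id = ?", (gene_id,))


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    gene_id = insert_gene(conn, "c-fos", "annotator1", "demo", 1, "2002-01-01")
    rename_gene(conn, gene_id, "Fos", "annotator2", "demo", 2, "2002-02-01")
    delete_gene(conn, gene_id, "2002-03-01")
    history = conn.execute(
        "SELECT name, row_user, version_date FROM gene_ver "
        "ORDER BY version_date"
    ).fetchall()
    remaining = conn.execute("SELECT COUNT(*) FROM gene").fetchone()[0]
    conn.close()
    return history, remaining


if __name__ == "__main__":
    print(demo())
```

In this sketch the archive step and the change share one transaction, so the version table cannot miss a change that was actually committed. Whether GUS enforces this in application code or in the database is not stated here; the sketch simply does it in application code.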

[Figure: data input stack. Labels: GusApplication; Plugin; Objects; SuperClasses; SQL; DBI; Core, SRes, DoTS, RAD, TESS.]

Pipeline
• Perl API for defining annotation pipelines
• Supports sequential protocols
• Distributes compute-intensive work to compute cluster
• Used for 90-stage pipeline to build DoTS transcript index

Web
• Servlets and CGI-based design (JSP on the way)
• Automatic generation of HTML FORMs
• Automated input checking
• Integrated help features
• INPUT elements populated from the database
• Query history facility
• Boolean queries (AND, OR, SUBTRACT)
• Declarative configuration file
• Base system is relatively independent of GUS

Provided Applications

Annotator’s interface
• Assign Gene Name/Symbol
• Assign Gene Description
• Assign Gene Synonym(s)
• Evidence

Parsing & exporting

Parsing
• Sequence DBs: GenBank (main, dbEST, NRDB), SWISS-PROT, TIGR
• Protein Motifs: CDD, ProDom, InterPro
• Expression: MAGE
• Ontologies: GO, SO, PATO
• Mapping data: RH maps
• Gene predictors: GLIMMER, Genscan, PHAT, GeneFinder
• Similarity: BLAST, BLAT, Sim4
• CAP4

Exporting
• FASTA
• MAGE
• Table dumps
• DoTS Assemblies

Analysis & annotation
• GO functional assignment
• Expression analysis (PaGE)
• Anatomy classification
• Library distribution
• Genes from BLAT of DoTS against genome

DoTS assembly and annotation
• Refresh warehouse
• Cluster and assemble mRNAs/ESTs into putative transcripts
• Annotate transcripts through similarity, GO function and markers
• Integrate previously existing manual curation

DoTS Pipeline

[Figure: DoTS pipeline flowchart. Labels: Genomic Sequence; Gene predictions (GenScan/HMMer, PHAT); mRNA/EST Sequence; SIM4 or BLAT; Predicted Genes; Merge Genes; Clustering and Assembly; DoTS consensus Sequences; Gene/RNA cluster assignment; Annotate DoTS; Manual Annotation Tasks; RNAs; BLASTX; Other computed annotation (EPCR, AssemblyAnatomyPercent, Index Key Words, SNP analysis); BLAST Similarities; Gene Index; framefinder translation; BLASTP; Functional predictions; GO Functions; Proteins; PFAM, Smart, ProDom; Protein Motifs.]

References & Acknowledgements

References

Scearce, L. Marie, Brestelli, John E., McWeeney, Shannon K., Lee, Catherine S., Mazzarelli, Joan, Pinney, Deborah F., Pizarro, Angel, Stoeckert, C. J. Jr., Clifton, Sandra, Permutt, M. Alan, Brown, Juliana, Melton, Douglas A., Kaestner, Klaus H. (2002) Functional Genomics of the Endocrine Pancreas: The Pancreas Clone Set and PancChip, New Resources for Diabetes Research. Diabetes 51: 1997-2004.

Schug, J., Diskin, S., Mazzarelli, J., Brunk, Brian P., Stoeckert, C.J. (2002) Predicting Gene Ontology Functions from ProDom and CDD Protein Domains. Genome Res. 12: 648-655.

Bahl, A., Brunk, B., Coppel, R.L., Crabtree, J., Diskin, S.J., Fraunholz, M.J., Grant, G.R., Gupta, D., Huestis, R.L., Kissinger, J.C., Labo, P., Li, L., McWeeney, S.K., Milgram, A.J., Roos, D.S., Schug, J., Stoeckert, C.J. (2002) PlasmoDB: The Plasmodium Genome Resource. An integrated database providing tools for accessing and analyzing mapping, expression and sequence data (both finished and unfinished). Nucleic Acids Res. 30: 87-90.

Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C.P., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., Vingron, M. (2001) Minimum Information About a Microarray Experiment (MIAME): Toward Standards for Microarray Data. Nature Genetics 29: 365-371.

Manduchi, E., Pizarro, A., Stoeckert, C. (2001) RAD (RNA Abundance Database): an infrastructure for array data analysis. Proc. SPIE, vol. 4266, pp. 68-78.

Davidson, S.B., Crabtree, J., Brunk, Brian P., Schug, J., Tannen, V., Overton, G.C., Stoeckert, C.J. Jr. (2001) K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources. IBM Systems Journal 40(2): 512-531.

Crabtree, J., Wiltshire, T., Brunk, B., Zhao, S., Schug, J., Stoeckert, C., Bucan, M. (2001) High-resolution BAC-based Map of the Central Portion of Mouse Chromosome 5. Genome Res. 11: 1746-1757.

Acknowledgements

• NIH grant RO1-HG-01539-03
• DOE grant DE-FG02-00ER62893
• Burroughs Wellcome Fund
• NIDDK 56947 and 56954 with cosponsorship from the JDFI

Related posters

• 114A. Web-Based Biological Discovery using the GUS Integrated Database.
• 170A. TESS-II: Describing and Finding Gene Regulatory Sequences with Grammars.
• 148A. Integrating Eukaryotic Genomes by Orthologous Groups: What is Unique about Apicomplexan Parasites?