An overview of GUS - University of Georgia
A Brief Introduction on GUS
Ed Robinson
November, 2004
[email protected]
Jessica Kissinger Lab
Center for Tropical and Emerging Global Diseases
University of Georgia
Athens, Georgia, 30602, USA
http://mango.ctegd.uga.edu/jkissingLab/

Part I: Introduction to GUS

Resources:

Websites:
http://www.gusdb.org
http://www.gusdb.org/wdk
http://www.cs.uga.edu/~gao/project/gus/gus.htm

Wikis:
GUS: https://www.gusdb.org/wiki
GUS WDK: https://www.gusdb.org/wiki/index.php/GusWdk
APIDB: https://www.cbil.upenn.edu/apiwiki/
CBIL: (there is a CBIL wiki, but it is for CBIL only)

Mailing Lists:
GUS: http://lists.sourceforge.net/lists/listinfo/gusdev-gusdev [found on the GUS home page]
WDKDEV: https://mail.pcbi.upenn.edu/mailman/listinfo/wdkdev [found on the GUS WDK home page]
APIDB: https://mail.pcbi.upenn.edu/mailman/listinfo/apidb
GUSDBA: https://mail.pcbi.upenn.edu/mailman/listinfo/gusdba [found on the GUS home page (or will be very soon...)]

CVS:
GUS: [GUS and WDK] http://cvsweb.sanger.ac.uk/ (public access) :ext:cvs.sanger.ac.uk:/cvsroot/GUS (write permission)
CBIL: [CBIL, DoTS, DJob, RAD] http://cvs.cbil.upenn.edu :pserver:[email protected]:/CBIL
APIDB: [TOXO, PLASMO, CRYPTO, APIDB] :pserver:cvs.cbil.upenn.edu:/apidb

1. General Introduction

[Figure. Surviving labels: Java Servlets, Core]

Three Broad Components:

1) Perl Business Objects. There is a very complicated Perl application, GA, which controls putting data into GUS. This application is object-oriented, and part of its component hierarchy reflects the structure of the GUS database schema. You will primarily be concerned with a class of objects called Plugins, which load data into the GUS database. These are the most important objects for you to learn to use.

2) Database Schema. The GUS schema is the real heart of the GUS application. It allows us to store all the different kinds of data related to a whole genome project in a single relational database.

3) WDK. The WDK is a Java/JSP/Struts-based application for building interactive views of GUS data.
The WDK allows you to visualize and interact with your Genome project. Additional aspects of the WDK that are being developed include a web-services layer, which will allow programmatic access to the WDK, and a GBrowse export application. (Note: Do not use the WDK classic, which is an unsupported, deprecated application.)

2. GUS Database Schema

There are 7 schemas in the GUS system, only 5 of which you need to worry about and only 3 of which we will work with today.

Core: Application Tables. Tables of stuff you need to run the application.
SRes: Shared Resources. Shared resources to define projects and structure data in the other schemas. Mostly lookup tables.
DoTS: Transcript and Feature Tables (Sequence and Annotation/Central Dogma). Primary tables for storing transcripts, proteins, features, genes and all related annotations (including controlled vocabularies such as GO and Pfam, ProSite or other Protein Family Libraries).
Rad: RNA abundance. Gene Expression Data (schema for microarray and related data).
Tess: Transcription Element Search System. Gene Regulation Data (transcription factors, binding sites, etc.).
Lims: Lab Information Management System. A new schema.
Prot: A new schema.

[Figure. Surviving labels: Your data (GenBank, NRDB, dbEST, SNPs, Genetraps, MicroArrays, Phenotypes, Pathways, Orthologs, Taxonomy, GO, SO, EC, More…); Pipeline API; Plugins (data loaders); Data Load API; Web Development Kit; Perl Object Layer; Queries and analysis; Warehouse (Oracle or PostgreSQL)]

The schema you will be most concerned with is the DoTS schema, which is organized along the lines of the central dogma of molecular genetics.

Guiding Principles of the GUS Schema Design

Strong Typing: All of these schemas are strongly typed at the level of genetic description, which is to say that the tables represent specific things in genetic theory. The structure of the database looks like the things and relationships that geneticists talk about.
This means that the database is not well normalized. Nor are there very many lookup tables, since data is categorized according to the tables it is found in (e.g., there isn't a single SEQUENCE table with the differences between sequences tied to a SEQUENCE_TYPE in a lookup table; rather, there are multiple SEQUENCE_IMP tables for Protein, RNA and DNA).

[Figure. Surviving labels: Gene, NA, Feature]

Sub-classing: Of course, there are further distinctions that may be made between, for instance, types of NA sequences. But all NA sequences are in the same table, because this is their type. To make such distinctions, GUS utilizes sub-classing. Data of the same genetic type (e.g. NASequence) is stored in the same implementation table (NASequenceIMP), and different sub-types of data in this table are given different sub-class views. Thus, NASequenceIMP has multiple views, including ExternalNASequence. Views are used to make specific objects out of more generic central dogma elements. In the second figure, all four green rectangles, which are the four primary objects of the central dogma, have their own implementation tables along with numerous sub-class views of the base table.

Object-like Modeling: The effect of sub-classing is to present the database in an object-like way to the Business Layers. Thus, the plugins insert into a sub-class view rather than into an implementation table. These sub-class views then define what other information has to be inserted along with the generic information. For instance, all classes of an NASequence must come with a sequence, sequence id, etc. But the ExternalNASequence requires some other information, including external source ids.

Course suggestion: relate objects in the model/schema directories to the GUS schema.

SRes and Project Tracking

Data inserted into GUS is associated with a number of fields that allow projects, data sets and data versions to be correctly organized.
These include ids for external databases, data releases, projects, users and data status. Most of these data sets reside in the GUS SRes (Shared Resources) schema. In order to load any data into GUS, you must have the most important of these shared resource data sets loaded, so the newly inserted data can be correctly associated with them. The plugins expect to find entries already in the database for projects, data sources and taxons in SRes when you load data. (Additionally, note the entries that track each bit of data entered into a table, as well as how even algorithms are stored and registered in the GUS system.)

It is very important to note that the information in SRes is not part of the installation. Yet, despite this fact, most plugins make hardcoded assumptions about SRes lookup values. There are bootstrapping scripts on the wiki, but these are from non-central sources. Further, the hardcoded programming most plugins use to handle these values is not consistent from plugin to plugin. I cannot stress enough that, when you build your production systems, you must decide what you want in the tables these scripts provide data for. These are lookup tables: you should have only one value per lookup item, and these values must be reflected in your plugins. However, the parsers were written by different people, so they often have different values. (Two examples are review_status and sequence_type.)

4. The Perl Object Layer(s)

The Architecture: The GUS system is a multi-layered, object-oriented Perl application. You will interact with GUS mainly through a set of objects called Plugins.

ga and Plugins: Plugins are the interface with the GUS object layer that you will use to put data into GUS. Plugins provide command line interfaces giving you access to the Perl object layer. All plugins invoke the objects which were created to represent the GUS data model, as well as other objects such as those for making a database connection.
One important set of objects the plugins also invoke is the CBIL library. Future plugins will also be using BioPerl objects. All interaction with GUS, on the data loading side, should be through the command line interfaces provided by the plugins. It is important that you restrict your access to the database to these plugins to ensure that you do not corrupt the data in GUS. The object layer is responsible for making sure that all of the necessary steps are taken to maintain data integrity within the application.

All plugins are called via ga, the GUS application.

Syntax:
    ga [<mode>] [<plugin_class_name>] [<plugin_class_options>] [<ga_options>]

<mode>
    +meta     creates Core.Algorithm and Core.AlgorithmImplementation entries
    +create   creates Core.Algorithm, AlgorithmImplementation, AlgorithmParamKey
    +update   creates AlgorithmImplementation and AlgorithmParamKey
    +history  lists invocations
    +run      runs the called plugin (default value)

<plugin_class_name>
    hierarchical namespace designation of a particular plugin, e.g. GUS::Common::Plugin::SubmitRow

<plugin_class_options>
    plugin-specific arguments

<ga_options>
    generic options for any plugin called via ga. All of them have defaults.
    --commit --project --debug --comment --verbose --algoinvo --veryVerbose --gusconfigfile --sqlVerbose --help --user --helpHTML --group

To use a plugin, you must register it with the GUS application. This means you must call it with ga +create:
    ga +create GUS::Common::Plugin::SubmitRow --commit

The plugins themselves are located in a directory that mirrors the object hierarchy, as follows:
    $GUS_HOME/lib/perl/GUS::Common::Plugin::SubmitRow
(Note: $GUS_HOME/lib/perl should be in your Perl path.)

Once a plugin is created, any changes to the plugin require that you update its entry. In order to run an update, however, you must make sure to advance the version number of the plugin.
Early in the code of the plugin, you should check for something similar to the following line:
    cvsRevision => '$Revision: 1.877 $', # keyword filled in by cvs
Advance your version number here, so you can load the updated plugin.

How to get the syntax back from them

Some important notes on plugins:

When running plugins with attribute/value lists, generally the attribute list is comma separated (e.g. 'name,place') while the value list is separated by a string of three ^'s (e.g. 'Kissinger Lab^^^Athens, GA').

Generally speaking, most plugins expect you to know your Project_id, External_Database_Release_Id and possibly a taxon name. Make sure you have these values in hand when you plan to run a plugin. It is a good idea to keep a written list of them so you don't have to look them up over and over in the database.

In most cases, the plugins handle dates BACKWARDS from Oracle: 24-JUL-2004 in Oracle time is 04-JUL-24 in GUS time.

(As a final note to everyone, talk about the new parsers and how everyone should wait for them, given how much hardcoding and non-standard stuff is in the old plugins.)
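
The attribute/value list convention and the date order described in the plugin notes can be sketched in a few lines of Python. This is only an illustration, not part of GUS: the helper names `pair_attributes` and `oracle_to_gus_date` are hypothetical. The date rule (two-digit year first, then month, then day) is inferred from the single example in the notes, so check it against the plugin you actually run.

```python
# Illustrative helpers only; these names are hypothetical and not part of GUS.

VALUE_SEPARATOR = "^^^"  # value lists are separated by three carets


def pair_attributes(attribute_list: str, value_list: str) -> dict:
    """Pair a comma-separated attribute list with a ^^^-separated value list."""
    attributes = [name.strip() for name in attribute_list.split(",")]
    # Values are split on ^^^ only, so a value may itself contain commas.
    values = value_list.split(VALUE_SEPARATOR)
    if len(attributes) != len(values):
        raise ValueError(
            f"{len(attributes)} attributes but {len(values)} values"
        )
    return dict(zip(attributes, values))


def oracle_to_gus_date(oracle_date: str) -> str:
    """Reorder DD-MON-YYYY into YY-MON-DD (assumed from the one example given)."""
    parts = oracle_date.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected DD-MON-YYYY, got {oracle_date!r}")
    day, month, year = parts
    return f"{year[-2:]}-{month.upper()}-{day}"


if __name__ == "__main__":
    print(pair_attributes("name,place", "Kissinger Lab^^^Athens, GA"))
    print(oracle_to_gus_date("24-JUL-2004"))
```

Splitting values on the three-caret separator rather than on commas is what lets a value such as 'Athens, GA' pass through intact, which is presumably why the plugins use a different separator for the value list.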