A Brief Introduction on GUS
Ed Robinson
November, 2004
[email protected]
Jessica Kissinger Lab
Center for Tropical and Emerging Global Diseases
University of Georgia
Athens, Georgia, 30602, USA
http://mango.ctegd.uga.edu/jkissingLab/
Part I: Introduction to GUS
Resources:
Websites:
http://www.gusdb.org
http://www.gusdb.org/wdk
http://www.cs.uga.edu/~gao/project/gus/gus.htm
Wikis:
GUS: https://www.gusdb.org/wiki
GUS WDK: https://www.gusdb.org/wiki/index.php/GusWdk
APIDB: https://www.cbil.upenn.edu/apiwiki/
CBIL: (there is a cbil wiki, but it is for cbil only)
Mailing Lists:
GUS: http://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
[found on the GUS home page]
WDKDEV: https://mail.pcbi.upenn.edu/mailman/listinfo/wdkdev
[found on the GUS WDK home page]
APIDB: https://mail.pcbi.upenn.edu/mailman/listinfo/apidb
GUSDBA: https://mail.pcbi.upenn.edu/mailman/listinfo/gusdba
[found on the GUS home page (or will be very soon...)]
CVS
GUS: [GUS and WDK]
http://cvsweb.sanger.ac.uk/ (public access)
:ext:cvs.sanger.ac.uk:/cvsroot/GUS (write permission)
CBIL: [CBIL, DoTS, DJob, RAD]
http://cvs.cbil.upenn.edu
:pserver:[email protected]:/CBIL
APIDB: [TOXO, PLASMO, CRYPTO, APIDB]
:pserver:cvs.cbil.upenn.edu:/apidb
1. General Introduction
Three Broad Components:
1) Perl Business Objects
A fairly complicated Perl application, ga, controls putting data into GUS.
This application is object-oriented, and part of its class hierarchy reflects
the structure of the GUS database schema. You will primarily be concerned with
a class of objects called Plugins, which load data into the GUS database. These
are the most important objects for you to learn to use.
2) Database Schema
The GUS schema is the real heart of the GUS application. It allows us to store
all the different kinds of data related to a whole genome project in a single
relational database.
3) WDK
The WDK is a Java/JSP/Struts-based application for building interactive views
of GUS data. The WDK allows you to visualize and interact with your genome
project. Additional aspects of the WDK that are being developed include a
web-services layer, which will allow programmatic access to the WDK, and a
GBrowse export application. (Note: do not use the WDK classic, which is an
unsupported, deprecated application.)
2. GUS Database Schema
There are 7 schemas in the GUS system, only 5 of which you need to worry
about and only 3 of which we will work with today.
Core    Application Tables
        Tables of stuff you need to run the application.
SRes    Shared Resources
        Shared resources to define projects and structure data in the other
        schemas. Mostly lookup tables.
DoTS    Transcript and Feature Tables (Sequence and Annotation/Central Dogma)
        Primary tables for storing transcripts, proteins, features, genes and
        all related annotations (including controlled vocabularies such as GO
        and Pfam, ProSite or other protein family libraries).
RAD     RNA Abundance
        Gene expression data (schema for microarray and related data).
Tess    Transcription Element Search System
        Gene regulation data (transcription factors, binding sites, etc.).
Lims    Lab Information Management System
        A new schema.
Prot    A new schema.

[Figure: GUS architecture. Labels: Your data (GenBank, NRDB, dbEST, SNPs,
Genetraps, MicroArrays, Phenotypes, Pathways, Orthologs, Taxonomy, GO, SO, EC,
more…); Perl Object Layer (Pipeline API, Plugins (data loaders), Data Load
API); Warehouse (Oracle or PostgreSQL); Web Development Kit (queries and
analysis).]
The schema you will be most concerned with is the DoTS schema, which is
organized along the lines of the central dogma of molecular genetics.
Guiding Principles of the GUS Schema Design
Strong Typing:
All of these schemas are strongly typed at the level of genetic description,
which is to say that the tables represent specific things in genetic theory. The
structure of the database looks like the things and relationships that
geneticists talk about.
[Figure: central dogma objects. Surviving labels: Gene, NA, Feature.]
This means that the database is not well normalized. Nor are there very many
lookup tables, since data is categorized according to the table it is found in
(e.g., there isn't a single SEQUENCE table with the differences between
sequences tied to a SEQUENCE_TYPE in a lookup table. Rather, there are multiple
SEQUENCE_IMP tables for Protein, RNA and DNA).
Sub-classing:
Of course, further distinctions may be made between, for instance, types of NA
sequences. But all NA sequences are in the same table, because this is their
type. To make such distinctions, GUS uses sub-classing. Data of the same genetic
type (e.g. NASequence) is stored in the same implementation table
(NASequenceIMP), and the different sub-types of data in this table are given
different sub-class views. Thus, NASequenceIMP has multiple views, including
ExternalNASequence. Views are used to make specific objects out of more generic
central dogma elements.
In the second figure, all four green rectangles, which are the four primary
objects of the central dogma, have their own implementation tables along with
numerous subclass views of the base table.
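The implementation-table-plus-views pattern described above can be sketched with an in-memory SQLite database. This is only an illustration of the idea, not the real GUS DDL: the column names, the subclass_view discriminator column as used here, and the OtherNASequence view are assumptions made for the example, and the rows are inserted into the base table directly because plain SQLite views are read-only.

```python
import sqlite3

# Hypothetical, heavily simplified columns; the real GUS tables have many more.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NASequenceIMP (
    na_sequence_id INTEGER PRIMARY KEY,
    subclass_view  TEXT NOT NULL,  -- names the sub-class this row belongs to
    sequence       TEXT NOT NULL,  -- required for every NA sequence
    source_id      TEXT            -- only used by some sub-classes
);
CREATE VIEW ExternalNASequence AS
    SELECT na_sequence_id, sequence, source_id
    FROM NASequenceIMP
    WHERE subclass_view = 'ExternalNASequence';
-- OtherNASequence is a made-up second sub-class, for contrast only.
CREATE VIEW OtherNASequence AS
    SELECT na_sequence_id, sequence
    FROM NASequenceIMP
    WHERE subclass_view = 'OtherNASequence';
""")

conn.executemany(
    "INSERT INTO NASequenceIMP (subclass_view, sequence, source_id) VALUES (?, ?, ?)",
    [("ExternalNASequence", "ATGC", "X1"), ("OtherNASequence", "GGCC", None)],
)


def rows_in(view):
    """Return the sequences visible through one sub-class view."""
    return conn.execute(f"SELECT sequence FROM {view}").fetchall()


print(rows_in("ExternalNASequence"))
print(rows_in("OtherNASequence"))
```

Both rows live in one physical table, but each view exposes only the rows, and only the columns, that make sense for its sub-class.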
Object-like Modeling:
The effect of sub-classing is to present the database in an object-like way to
the business layers. Thus, the plugins insert into a sub-class view rather than
into an implementation table. These sub-class views then define what other
information has to be inserted along with the generic information. For instance,
every class of NASequence must come with a sequence, a sequence id, etc., but
ExternalNASequence requires some other information, including external source
ids.
Course suggestion: relate objects in the model/schema directories to the GUS
schema.
SRes and Project Tracking
Data inserted into GUS is associated with a number of fields that allow
projects, data sets and data versions to be correctly organized. These include
ids for external databases, data releases, projects, users and data status. Most
of these data sets reside in the GUS SRes (Shared Resources) schema. In order to
load any data into GUS, you must have the most important of these shared
resource data sets loaded, so that the newly inserted data can be correctly
associated with them. The plugins expect to find entries already in the database
for projects, data sources and taxons in SRes when you load data. (Additionally,
note the entries that track each bit of data entered into a table, as well as
how even algorithms are stored and registered in the GUS system.)
It is very important to note that the information in SRes is not part of the
installation. Yet, despite this fact, most plugins make hardcoded assumptions
about SRes lookup values. There are bootstrapping scripts on the wiki, but these
are from non-central sources. Further, the hardcoded programming most plugins
use to handle these values is not consistent from plugin to plugin.
I cannot stress enough that, when you build your production systems, you must
decide what you want in the tables these scripts provide data for. These are
lookup tables, so you should have only one value per lookup item, and these
values must be reflected in your plugins. However, the parsers were written by
different people, so they often use different values. (Two examples are
review_status and sequence_type.)
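One way to keep track of this is to write down the lookup values you have settled on and compare them against what each plugin hardcodes. The sketch below is a hypothetical helper, not part of GUS: the plugin names and every lookup value in it are invented for illustration, and you would fill them in by reading your own SRes tables and plugin code.

```python
def find_lookup_conflicts(canonical, plugin_values):
    """List (plugin, lookup, value) triples whose value is not in the canonical set."""
    conflicts = []
    for plugin, lookups in plugin_values.items():
        for lookup, value in lookups.items():
            if value not in canonical.get(lookup, set()):
                conflicts.append((plugin, lookup, value))
    return sorted(conflicts)


# Invented values: the single set of lookup values chosen for production.
canonical = {
    "review_status": {"unreviewed", "reviewed"},
    "sequence_type": {"DNA", "mRNA"},
}

# Invented values: what two imaginary plugins hardcode for the same lookups.
plugin_values = {
    "PluginA": {"review_status": "unreviewed", "sequence_type": "DNA"},
    "PluginB": {"review_status": "not reviewed", "sequence_type": "mRNA"},
}

conflicts = find_lookup_conflicts(canonical, plugin_values)
print(conflicts)
```

Any triple reported here marks a plugin that would need to be edited, or a lookup row that would need to be added, before loading data.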
4. The Perl Object Layer(s)
The Architecture:
The GUS system is a multi-layered, object-oriented Perl application. You will
mainly interact with GUS through a set of objects called Plugins.
ga and Plugins:
Plugins are the interface to the GUS object layer that you will use to put data
into GUS. Plugins provide command line interfaces giving you access to the Perl
object layer. All plugins invoke the objects which were created to represent the
GUS data model, as well as other objects such as those for making a database
connection.
One important set of objects the plugins also invoke is the CBIL library. Future
plugins will also be using BioPerl objects.
All interaction with GUS, on the data loading side, should be through the
command line interfaces provided by the plugins. It is important that you
restrict your access to the database to these plugins to ensure that you do not
corrupt the data in GUS. The object layer is responsible for making sure that
all of the necessary steps are taken to maintain data integrity within the
application.
All plugins are called via ga, the GUS application.
Syntax:
ga [<mode>] [<plugin_class_name>] [<plugin_class_options>] [<ga_options>]
<mode>
  +meta     creates Core.Algorithm and Core.AlgorithmImplementation entries
  +create   creates Core.Algorithm, AlgorithmImplementation and
            AlgorithmParamKey entries
  +update   creates AlgorithmImplementation and AlgorithmParamKey entries
  +history  lists invocations
  +run      runs the called plugin (default value)
<plugin_class_name> - hierarchical namespace designation of a particular
plugin, e.g. GUS::Common::Plugin::SubmitRow
<plugin_class_options> - plugin-specific arguments
<ga_options> - generic options for any plugin called via ga. All of them have
defaults.
  --commit        --project
  --debug         --comment
  --verbose       --algoinvo
  --veryVerbose   --gusconfigfile
  --sqlVerbose    --help
  --user          --helpHTML
  --group
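The ga syntax above can be assembled programmatically, which helps when the same plugin is run many times. The following is a minimal sketch, not part of GUS: build_ga_command is a hypothetical helper, and it assumes that options taking a value are passed as "--name value" and that a value of True marks a bare flag such as --commit. It only builds the command line; it does not run ga.

```python
import shlex

# The five modes listed in the ga syntax description.
GA_MODES = ("+meta", "+create", "+update", "+history", "+run")


def build_ga_command(plugin, mode="+run", plugin_options=None, ga_options=None):
    """Return the argument list: ga <mode> <plugin> <plugin options> <ga options>."""
    if mode not in GA_MODES:
        raise ValueError(f"unknown ga mode: {mode}")
    argv = ["ga", mode, plugin]
    for options in (plugin_options or {}, ga_options or {}):
        for name, value in options.items():
            argv.append(f"--{name}")
            if value is not True:  # True marks a bare flag such as --commit
                argv.append(str(value))
    return argv


cmd = build_ga_command(
    "GUS::Common::Plugin::SubmitRow",
    mode="+create",
    ga_options={"commit": True},
)
print(shlex.join(cmd))  # ga +create GUS::Common::Plugin::SubmitRow --commit
```

Returning a list rather than a single string keeps values that contain spaces or commas intact if the list is later handed to subprocess.run.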
To use a plugin, you must register it with the GUS application. This means
you must call it with ga +create:
ga +create GUS::Common::Plugin::SubmitRow --commit
The plugins themselves are located in a directory tree that mirrors the object
hierarchy, for example:
$GUS_HOME/lib/perl/GUS/Common/Plugin/SubmitRow.pm
(note: $GUS_HOME/lib/perl should be in your Perl library path.)
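To find the source file for a plugin, its class name can be mapped onto the directory tree. This sketch assumes the standard Perl module layout, where each "::" becomes a directory separator and the file ends in ".pm"; plugin_path and the /opt/gus location are made up for the example.

```python
import posixpath


def plugin_path(class_name, gus_home):
    """Map a Perl class name such as A::B::C to <gus_home>/lib/perl/A/B/C.pm."""
    parts = class_name.split("::")
    return posixpath.join(gus_home, "lib", "perl", *parts) + ".pm"


# /opt/gus stands in for whatever $GUS_HOME is on your system.
path = plugin_path("GUS::Common::Plugin::SubmitRow", "/opt/gus")
print(path)  # /opt/gus/lib/perl/GUS/Common/Plugin/SubmitRow.pm
```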
Once a plugin is created, any change to the plugin requires that you update its
entry with ga +update. In order to run an update, however, you must make sure to
advance the version number of the plugin. Early in the code of the plugin, look
for a line similar to the following:
cvsRevision => '$Revision: 1.877 $', # keyword filled in by cvs
Advance your version number here, so you can load the updated plugin.
How to get the syntax back from them
Some important notes on plugins:
When running Plugins with Attribute/value lists, generally the attribute list is
comma separated (e.g. 'name,place') while the value list is separated by a
string of three ^'s (e.g. 'Kissinger Lab^^^Athens, GA')
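The two list formats can be generated from a single record so that attributes and values always stay in the same order. This is a minimal sketch with a hypothetical helper name (to_attr_value_lists); it assumes only the separators described above.

```python
VALUE_SEPARATOR = "^^^"


def to_attr_value_lists(record):
    """Return (attribute list, value list) strings for one record."""
    attrs = list(record)
    values = [str(record[attr]) for attr in attrs]
    for value in values:
        if VALUE_SEPARATOR in value:
            raise ValueError(f"value contains the separator: {value!r}")
    return ",".join(attrs), VALUE_SEPARATOR.join(values)


attr_list, value_list = to_attr_value_lists(
    {"name": "Kissinger Lab", "place": "Athens, GA"}
)
print(attr_list)   # name,place
print(value_list)  # Kissinger Lab^^^Athens, GA
```

The example also shows why the value list needs its own separator: "Athens, GA" contains a comma, so a comma-separated value list would be split in the wrong place.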
Generally speaking, most plugins expect you to know your Project_id,
External_Database_Release_Id and possibly a taxon name. Make sure you have these
values in hand when you plan to run a plugin. It is a good idea to keep a
written list of them so you don't have to look them up over and over in the
database.
In most cases, the plugins handle dates backwards from Oracle: 24-JUL-2004 in
Oracle time is 04-JUL-24 in GUS time.
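The date reversal can be done with a small helper before values are passed to a plugin. This sketch generalizes from the single example given above, so the output pattern (two-digit year, month abbreviation, two-digit day) is an assumption, and oracle_to_gus_date is a hypothetical name.

```python
def oracle_to_gus_date(oracle_date):
    """Convert DD-MON-YYYY (e.g. 24-JUL-2004) to YY-MON-DD (e.g. 04-JUL-24)."""
    day, month, year = oracle_date.split("-")
    if len(year) != 4 or not year.isdigit() or not day.isdigit():
        raise ValueError(f"expected DD-MON-YYYY, got {oracle_date!r}")
    return f"{year[2:]}-{month.upper()}-{int(day):02d}"


print(oracle_to_gus_date("24-JUL-2004"))  # 04-JUL-24
```

Check the result against the plugin you are running, since the text above says this holds in most cases, not all.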
(As a final note to everyone, talk about the new parsers and how everyone should
wait for them, given how much hardcoding and non-standard stuff is in the old
plugins.)