Generic Model Organism Database
Lavanya Rishishwar
Outline
• Purpose
• Genome database
• Basics of webserver & database
• GMOD
4/7/2016
Generic Model Organism Database
2
Presentation Assumption
What do we understand:
• Sequencing and computational genomics process
• The output from different groups
What we do not understand:
• The end goal
• Database (DBMS) and web service technologies
Purpose
• By the time the functional annotation group finishes its part, you will have a lot of data to deal with
• For a biologist, this is more information than they will ever care for
• It doesn’t mean what you did was useless. The data now requires presentation.
• This is where you have to start working like a real bioinformatician – balancing the biology and the computer science
Purpose
What is needed:
• An intuitive platform that can help a biologist pose questions, inspect the
data and make inferences
• Doesn’t mean it has to be fancy
• But it does mean that it needs to have all the data in a neatly organized
fashion that doesn’t overwhelm the end user
• This is not 2007, when you did everything from scratch
• The field has matured a lot; explore before developing
• Do not reinvent the wheel
Purpose
• You want to create something that is useful
• Not something purely ornamental that no one will ever use
• Whatever you create should have a purpose
• What is the purpose here?
• What is your client (CDC/Dr. Xin Wang) looking for?
• How can you give the client exactly what they want and in a way that they
will understand?
Database
• Organized collection of data
• Can range from flat files to more sophisticated systems
• What was wrong with flat files? Why did we require advanced systems?
Genome Database
• A database specializing in genomic information
• What is different from other databases:
• What is stored?
• Who uses it?
• How do you query it?
• The natural relationships between the datasets
• Linkouts to existing databases
• Information presentation manner – less text, more visual
Q: Okay, so how do I make a database?
What do you require?
• Data (and knowledge of what that data is)
• A Database Management System (DBMS)
• A database developer/administrator
• A frontend designer
What do you require?
• Data (and knowledge of what that data is)
Output from different groups
• A Database Management System (DBMS)
MySQL hosted at compgenomics server
• A database developer/administrator
Developer – one of you; Administrator – Troy Hilley
• A frontend designer
One of you
Database Management System (DBMS)
Q: What is a DBMS?
A: Collection of tools to help in creating, storing, modifying and extracting
information from a database
Q: Why not something like Excel?
A: Issues with scalability, consistency, redundancy, simultaneous access and within-data connectivity
Database Management System (DBMS)
• Scalability
Spreadsheets/flat files store data in a single file. As the data grows, even basic operations become impractical.
• Consistency
The data needs to be in the same format for pattern searching. This is difficult to enforce in spreadsheets, and flat files offer no built-in way to enforce it.
Database Management System (DBMS)
• Redundancy
Redundant information needlessly increases data size and may interfere with pattern searching.
• Simultaneous access
Spreadsheets and flat files allow only limited simultaneous access.
Database Management System (DBMS)
• Within-data connectivity
Difficult to maintain in spreadsheets and flat files.
• E.g. hypothetical database for outbreaks. Three types of data:
• Strain information
• Hospital information
• CDC personnel information
• If any piece of data needs an update, every copy of it has to be updated!
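A minimal sketch of how a relational DBMS avoids this problem, using Python's built-in sqlite3 module as a stand-in for the class MySQL server. The table names, column names and sample rows are hypothetical and only illustrate the outbreak example: each hospital is stored once, strains point to it by ID, so a single update is seen by every strain.

```python
import sqlite3

# In-memory database used only for illustration (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hospital (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE personnel (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE strain (
    id INTEGER PRIMARY KEY,
    label TEXT,
    hospital_id INTEGER REFERENCES hospital(id),
    personnel_id INTEGER REFERENCES personnel(id)
);
INSERT INTO hospital VALUES (1, 'General Hospital', 'Atlanta');
INSERT INTO personnel VALUES (1, 'Investigator A');
INSERT INTO strain VALUES (1, 'strain-001', 1, 1);
INSERT INTO strain VALUES (2, 'strain-002', 1, 1);
""")

# The hospital name lives in exactly one row, so one UPDATE is enough.
conn.execute("UPDATE hospital SET name = ? WHERE id = ?", ("Renamed Hospital", 1))

# Every strain sees the new name through the join.
rows = conn.execute("""
SELECT strain.label, hospital.name, personnel.name
FROM strain
JOIN hospital ON hospital.id = strain.hospital_id
JOIN personnel ON personnel.id = strain.personnel_id
ORDER BY strain.id
""").fetchall()

for row in rows:
    print(row)
```

In a spreadsheet, the hospital name would typically be repeated on every strain row and each copy would need editing.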
Database Management System (DBMS)
DBMSs were designed specifically to resolve these issues
Q: What are my options for a DBMS?
A: Many!
Relational DBMS: MySQL, Oracle, Microsoft Access, PostgreSQL
Non-relational: MongoDB, CouchDB, Google Spanner
Relational DBMS
Figure from http://www.ibm.com/
Relational DBMS
• Are defined and queried using Structured Query Language (SQL)
• Looks like this:
CREATE TABLE STATION (ID INTEGER PRIMARY KEY, CITY CHAR(20), STATE CHAR(2), LAT_N REAL, LONG_W REAL);
INSERT INTO STATION VALUES (13, 'Phoenix', 'AZ', 33, 112);
• Why do I point this out?
Ideally, you will be implementing dynamic browsers that fire these queries
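A hedged sketch of what "firing a query" from a dynamic page can look like. It reuses the STATION statements from the slide but runs them through Python's built-in sqlite3 module rather than the class MySQL server; the function name stations_in_state is an illustrative assumption. The user-supplied value is passed as a parameter instead of being pasted into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same statements as on the slide.
conn.execute(
    "CREATE TABLE STATION (ID INTEGER PRIMARY KEY, CITY CHAR(20), "
    "STATE CHAR(2), LAT_N REAL, LONG_W REAL)"
)
conn.execute("INSERT INTO STATION VALUES (13, 'Phoenix', 'AZ', 33, 112)")


def stations_in_state(connection, state):
    """Return (CITY, STATE) rows for one state (hypothetical helper).

    The '?' placeholder lets the driver handle quoting, so text typed by a
    user cannot change the structure of the query.
    """
    cursor = connection.execute(
        "SELECT CITY, STATE FROM STATION WHERE STATE = ?", (state,)
    )
    return cursor.fetchall()


print(stations_in_state(conn, "AZ"))
# A classic injection attempt is treated as plain text and matches nothing.
print(stations_in_state(conn, "AZ' OR '1'='1"))
```

Building the SQL string by concatenating user input is the usual way a browser ends up with a security hole, which is why the placeholder form is used here.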
Database Basics
[Figure: DBMS (backend) and Graphical User Interface (GUI)]
Webserver Basics
• Webserver – any computer connected to the internet that provides some sort of service
• These services are provided through specific protocols, e.g. HTTP (Hypertext Transfer Protocol) and FTP (File Transfer Protocol)
• Special software on the server handles this communication (answers the requests) – HTTPD (HTTP daemon), i.e., the web server
• One of the most widely used web servers – Apache
• This will already be set up for you. You will most probably need specific changes to the setup once your browser is ready
Security
Two levels:
• Security at access to the server
Responsible: Sys-admin and frontend designer/developers
• Security at access to the database
Responsible: Database administrators
• How much should you care?
Do not create a security hole in the server. This will not only open your browser to attacks but will also open the whole Georgia Tech system to attacks.
• What if you (un)intentionally create a security hole?
Troy has some fool-proofing in place. If you bypass it somehow, Georgia Tech’s IT staff monitors the network constantly; they typically detect an attack within an hour and block it.
And then they will find you.
And they will make sure you never do such an act ever again.
Frontend
• Your normal webpage
• HTML – Hypertext markup language
• Styling – CSS (Cascading style sheets)
• More library functions – JavaScript/jQuery
• Fancy templates – Bootstrap
Server-side scripting
• A special class of programming languages
• Executes on the web server, not in the visitor’s browser
• E.g. PHP (originally: Personal Home Page; now: PHP: Hypertext Preprocessor. Don’t ask how.), JSP, Perl via CGI, R, Python
• I recommend PHP for you – easy to pick up, constructs very similar to Perl/Java/C, GMOD uses it, this is a class project
Who is this crazy looking guy?
William James (Jim) Kent. Know that name. Respect this guy.
He is one of the greatest, perhaps the greatest, bioinformatics programmers ever.
He was deeply involved in the assembly of the public human genome project.
Along with his PI and a cluster of about 100 PCs, this guy raced with Celera (J. Craig Venter’s company, which was using one of the most powerful civilian supercomputers) and finished the draft human genome 3 days before them. Bam!
If you were in the fall class, you compiled the Jim Kent source tree. Almost all of it is his.
He speaks nothing but the truth.
He knows what a genome browser should be
“Genome browsers facilitate genomic analysis by presenting alignment,
experimental and annotation data in the context of genomic DNA
sequences.”
Melissa S Cline & W James Kent, 2009
Genome browsers aggregate data
The UCSC Genome Browser
Clicking on any of these takes you to a page full of details (gene shown: CDKN2A)
Tracks don’t have to be genes
They can be many kinds of genomic information
• Homologous sequences
• Conservation of genomic sequences
• Protein domains
• Signal peptides
• CRISPR elements
• Regulatory elements
• RNAs
• Transcription factor binding sites
• Especially important today are tracks displaying *-seq data (e.g. RNA-seq, ChIP-seq)
What’s good about the UCSC GB?
• Arguably the most advanced genome browser, it is much more than a tool
for looking at genomes
• It integrates a huge amount of data for each gene it displays.
• UCSC also has a graphical front end for downloading data from its huge backend database
This UCSC browser does so much more
• It hosts the ENCODE project, one of the largest, probably the largest,
assemblies of functional genomic data.
• It lets you jump between orthologous regions in different genomes (e.g. CDKN2A)
• It has a massive database backend of over 6500 tables.
So why aren’t there dozens of UCSC implementations?
It’s really, really, really hard to install.
It’s impossible to understand unless you’ve tried to do it.
The UCSC genome browser works so well for the genomes that it has
because it is so very, very specialized for those genomes.
Each track in the UCSC browser has been lovingly crafted.
Browser choices
There are a number of genome browsers out there, but there are really just two big ones:
UCSC
GMOD & GBrowse/JBrowse
We already discussed why you don’t use the UCSC browser for projects
GMOD
• Model Organism Database (MOD) groups recognized this issue and developed the Generic Model Organism Database (GMOD)
• Generic => Yes, valid for any type of organism
• Model Organism => That is a lie; it is not limited to model organisms
• Database => Core purpose
GMOD
• GMOD is open-source software
• Has a number of tools that fall into three main components:
• Data Management
• Annotation
• Visualization
• The next set of slides discusses the different utilities in it
GMOD – Data Management
[Five figure-only slides; images not reproduced in this transcript]
GMOD – Annotation
[Four figure-only slides; images not reproduced in this transcript]
GMOD – Visualization
[Four figure-only slides; images not reproduced in this transcript]
Genome Browser
[Figure-only slide; image not reproduced in this transcript]
You can use colors for information
• This shows genes that we thought were horizontally transferred
• Darker genes were flagged as horizontally transferred by more prediction programs
You can also have specialized tracks
• We had a track of virulence factors in the first year
• Clicking on any of them took you to details for the gene, a link to VFDB,
etc.
This goes beyond colors
You can alter how tracks are shown in other ways
Add and remove tracks, or change the link that appears over a feature in the genome.
You can do even more customization
You can make a transcriptome browser. It doesn’t have to be a genome.
Not super exciting from this view: just the predicted coding region of an assembled contig (mRNA)
All of this is in the conf
Configuration file (.conf)
Making a new Track
### TRACK CONFIGURATION ###
[ExampleFeatures]
feature = remark
glyph = generic
stranded = 1
bgcolor = orange
height = 10
key = Example Features
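The track stanza above is a named section followed by key = value options. As a rough illustration of that structure only, the sketch below reads the same stanza with Python's standard configparser module. This is an assumption made for demonstration: GBrowse reads its .conf files with its own code, and real GBrowse configuration files can contain syntax that configparser does not handle. The helper name read_tracks is hypothetical.

```python
import configparser

# The stanza from the slide, embedded as a string for the example.
TRACK_CONF = """
### TRACK CONFIGURATION ###
[ExampleFeatures]
feature = remark
glyph = generic
stranded = 1
bgcolor = orange
height = 10
key = Example Features
"""


def read_tracks(conf_text):
    """Return {track_name: {option: value}} for each [section] in the text."""
    parser = configparser.ConfigParser()
    # Lines starting with '#' are skipped as comments by default.
    parser.read_string(conf_text)
    return {name: dict(parser[name]) for name in parser.sections()}


tracks = read_tracks(TRACK_CONF)
for name, options in tracks.items():
    print(name, options)
```

All values come back as strings, so numeric options such as height would still need converting before use.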
Genome Browser
• The images are not created beforehand – they are generated at runtime, when a request arrives
• The server does the processing and sends the images out to the user
1. The user types in the URL: gbrowse2012.biology.gatech.edu
2. The browser interprets the URL and sends the request to the HTTP server
3. The web server receives the request and “serves” the client, i.e., starts GBrowse
4. On success, the relevant hypertext and images are generated by accessing the database
5. The output traverses the same path back
6. The whole process repeats when the user interacts with the browser
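The request cycle can be sketched in a few lines. This is an illustration in Python using the standard WSGI interface, not how GBrowse itself is implemented; the function names, the region query parameter and the page content are all hypothetical. The point is that the page is built when the request arrives, and that user input is escaped before it goes back out.

```python
import html
from urllib.parse import parse_qs
from wsgiref.util import setup_testing_defaults


def render_region_page(region):
    """Build the HTML for one request (a real browser would query the database here)."""
    safe_region = html.escape(region)  # never echo raw user input into a page
    return f"<html><body><h1>Region {safe_region}</h1></body></html>"


def app(environ, start_response):
    """WSGI application: the web server calls this once per incoming request."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    region = query.get("region", ["unknown"])[0]
    body = render_region_page(region).encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def simulate_request(query_string):
    """Call the application the way a server would, without opening a socket."""
    environ = {}
    setup_testing_defaults(environ)
    environ["QUERY_STRING"] = query_string
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body.decode("utf-8")


status, page = simulate_request("region=chr1%3A100-200")
print(status)
print(page)
```

To try it behind a real HTTP server, the same app function can be passed to wsgiref.simple_server.make_server from the standard library.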
GBrowse & JBrowse
• The process just described is GBrowse’s implementation
• JBrowse is a reimplementation of the browser in JavaScript; most of the rendering happens in the user’s web browser
• Pros: Reduced server load, smoother UI, better images
• Cons: Still lacks some of the features of GBrowse
Populating a Genome Database
• Most genome browsers (including GBrowse/JBrowse) expect input files in a specific format known as GFF
• GFF => General Feature Format
• Text-based, table-like format
• Flexible enough to describe almost any type of sequence feature
General Feature Format (GFF)
Format description: GFF
General Feature Format is a tab-separated format used for describing different omics features
The format has 9 mandatory columns, in this order:
Seqname | Source | Feature | Start | End | Score | Strand | Frame | Attribute
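A minimal sketch of reading one GFF line into the nine columns listed on this slide. The example record, its identifiers and the helper name parse_gff_line are made up for illustration; the key=value;key=value layout of the last column follows the GFF3 convention, which is an assumption about which GFF flavor is in use.

```python
from collections import namedtuple

# The nine mandatory columns, in file order.
GFF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]
GffRecord = namedtuple("GffRecord", GFF_COLUMNS)


def parse_gff_line(line):
    """Split one tab-separated GFF line into a GffRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF_COLUMNS):
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    record = GffRecord(*fields)
    # Start and end are coordinates, so store them as integers.
    return record._replace(start=int(record.start), end=int(record.end))


# Hypothetical record; '.' marks a column with no value.
example = "contig1\tprediction\tgene\t100\t900\t.\t+\t.\tID=gene0001;Name=hypothetical"
record = parse_gff_line(example)
print(record.seqname, record.feature, record.start, record.end, record.strand)
```

Real GFF files also contain comment and directive lines starting with #, which a loader would need to skip before calling a function like this.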
Conclusion of GBrowse/JBrowse
Will it ever be the greatest genome browser?
No. That will always be the UCSC browser
Will it remain the easiest to install for some time?
Probably – JBrowse seems more promising though
Will you get the best return on time spent?
Yep
Conclusion of the talk
• A database is an efficient and organized method of storing data
• It allows faster mining of patterns from the data
• Two component system: backend storage and frontend interface
• Most widely used option: GMOD
• Web based genome browser options: GBrowse/JBrowse, UCSC
Questions?