Genome Browser
The Plot
Deepak Purushotham
Hamid Reza Hassanzadeh
Haozheng Tian
Juliette Zerick
Lavanya Rishishwar
Piyush Ranjan
Lu Wang
The Outline
• The Need & The Requirement
• The Options
• The Chosen One
• The New Age
Why one should develop a Genome Browser
THE NEED
Why A Genome Browser?
I want to
analyze this
organism
Metabolic
Pathways
What is expected out of a Genome Browser
THE REQUIREMENT
A Genome Browser?
I want
something
manageable
A Genome Browser!
The Genome Browser
“Genome browsers facilitate genomic analysis
by presenting alignment, experimental and
annotation data in the context of genomic
DNA sequences.”
Melissa S Cline & James W Kent, 2009
Genome browsers aggregate data
Taken From Andy Conley’s slides without permission
A Short Survey of the Available Genome Browser Modules
THE OPTIONS
A Brief Time Travel
• FlyBase, SGD, MGD, and WormBase
• Setting up an MOD is expensive and time-consuming.
• The four MODs agreed in the fall of 2000 to pool their
resources and to make reusable components available
to the community free of charge under an open source
license.
• The goal of this NIH-funded project, christened GMOD,
is
“…to generate a model organism database
construction set that would allow a new model
organism to be assembled by mixing and matching
various components.”
GMOD
Who uses GMOD?
GMOD Components
VISUALIZATION
• GBrowse
• JBrowse
• GBrowse Synteny
• CMap
DATA MANAGEMENT
• Chado
• Tripal (http://www.cacaogenomedb.org/)
• TableEdit
• BioMart
• InterMine
ANNOTATION
• MAKER
• DIYA
• Galaxy
• Ergatis
• Apollo
REALLY EXCITING OPTION!
JBrowse
• Smooth, fast navigation
(think Google Maps for genomes )
• Supports BED, GFF, Bio::DB::*, Chado, WIG, BAM, UCSC
(intron/exon structure, name lookups, quantitative
plots)
• Relies on pre-indexing to minimize security exposure
and runtime bandwidth/CPU load on the server (future
versions more likely to do some server work at runtime)
• Has an API for customized track/glyph extensions
• Is stably funded by NHGRI, with many interesting
innovations implemented & pending integration
Smoother UI
Most Genome browsers
How is JBrowse different?
First look: Live Demo
A couple of JBrowses around the web
• http://intron.ccam.uchc.edu/JBrowse/Dmel/
• http://jbrowse.org/ucsc/hg19/
Types of Tracks
Pros
• Fast and smooth!
• User Friendly
• Works nicely on an iPad/iPhone too
Cons
• No user-uploaded data support
• Slow for large numbers of reference sequences (e.g.
5,000 annotated contigs)
• Few glyph options; feature tracks are limited
by what HTML <div> elements can draw
What to pick?
Tried and tested, or fancy concept?
GBrowse and its Features
THE CHOSEN ONE
GBrowse
• Most popular web-based genome browser
• Visualize genome features along a reference
sequence
• Open Source
• Highly customizable
• Excellent usability
• Rich set of “glyphs”
– Genome features
– Quantitative Data
– Sequence Alignments
GBrowse
Header
Main Browser Window
Track Menu
Under The Hood
• Client-Server
Architecture
• GBrowse Architecture
• Installation Issues
• Input Data
• Configuration File
• Customization
Client Server Architecture
1. The user types in the URL:
browser2012.biology.gatech.edu
2. The browser interprets the URL and sends the request
to the HTTP server
3. The web server receives the request and
“serves” the client, i.e., starts GBrowse
4. On success, the relevant hypertext and
multimedia are generated by accessing the
database
5. The output traverses the same path back
6. The whole process repeats when the
user interacts with the browser
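The request/response cycle above can be sketched in a few lines of Python. This is only an illustration using the standard library, not GBrowse's actual code: the handler class `TrackPageHandler`, the function `fetch_once`, and the example path are hypothetical names chosen for the demo.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class TrackPageHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the web server plus GBrowse (steps 3 and 4)."""

    def do_GET(self):
        # Step 4: generate the hypertext for the requested path.
        body = f"<html><body>Region requested: {self.path}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        # Step 5: the output travels back to the client.
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_once(path):
    """Run one request/response cycle against a throwaway local server."""
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), TrackPageHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # Steps 1 and 2: the client sends an HTTP GET request.
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
        conn.request("GET", path)
        response = conn.getresponse()
        html = response.read().decode()
        conn.close()
        return response.status, html
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, html = fetch_once("/gbrowse?name=contig1")
    print(status, html)
```

Step 6 corresponds to calling `fetch_once` again with a new path each time the user scrolls or zooms.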
How you see what you see
Juxtaposed images
How are so many images generated?
Multimedia files + hypertext files
GBrowse Architecture
Stein L D et al. Genome Res. 2002;12:1599-1610
©2002 by Cold Spring Harbor Laboratory Press
The Bio::DB::SeqFeature database schema
[Figure: entity-relationship diagram of the tables Feature, Name, Attribute, Attribute List, Type List, Location List and Parent2Child, linked by 1-to-n relationships]
Data file (.gff3)
• Reference Sequence (Chr/Clone/Contig)
• Source (e.g. Prodigal/Glimmer)
• Type (Sequence Ontology (SO) terms)
• Start
• End
• Score (e.g. E-value)
• Strand
• Phase (0/1/2)
• Attributes (format: tag=value)
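As a rough sketch of how the nine tab-separated columns map onto a record, the Python below parses a single GFF3 line. The `GffRecord` type, the `parse_gff3_line` function, and the sample line are illustrative assumptions, not part of GBrowse or of our data.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqid: str               # reference sequence (chromosome/clone/contig)
    source: str              # e.g. Prodigal or Glimmer
    type: str                # Sequence Ontology term
    start: int
    end: int
    score: Optional[float]   # e.g. an E-value; "." means no score
    strand: str              # "+", "-" or "."
    phase: Optional[int]     # 0, 1 or 2; "." when not applicable
    attributes: str          # raw tag=value pairs


def parse_gff3_line(line: str) -> GffRecord:
    """Split one GFF3 feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    return GffRecord(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attrs,
    )


if __name__ == "__main__":
    # Hypothetical feature line, for illustration only.
    sample = "contig1\tProdigal\tCDS\t101\t400\t1e-20\t+\t0\tID=cds1;Name=exampleGene"
    print(parse_gff3_line(sample))
```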
Attributes (Data file)
Different tags have predefined meanings:
• ID: Gives the feature a unique identifier. Useful when grouping features
together (such as all the exons in a transcript).
• Name: Display name for the feature. This is the name to be displayed to
the user.
• Alias: A secondary name for the feature. It is suggested that this tag be
used whenever a secondary identifier for the feature is needed, such as
locus names and accession numbers.
• Note: A descriptive note to be attached to the feature. This will be
displayed as the feature's description.
Alias and Note fields can have multiple values separated by commas. For
example: Alias=M19211,gna-12,GAMMA-GLOBULIN
• Other good stuff can go into the attributes field.
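A minimal sketch of reading the attributes column in Python, assuming semicolon-separated tag=value pairs with comma-separated multiple values as described above. The function name `parse_attributes` and the `ID` and `Name` values in the example are made up; only the Alias values come from the slide.

```python
from urllib.parse import unquote


def parse_attributes(column: str) -> dict:
    """Turn 'tag=value;tag=v1,v2' into {'tag': ['value'], 'tag': ['v1', 'v2']}."""
    attributes = {}
    for pair in column.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # Commas separate multiple values; percent-escapes are decoded per value.
        attributes[tag] = [unquote(v) for v in value.split(",")]
    return attributes


if __name__ == "__main__":
    print(parse_attributes("ID=gene01;Name=gna-12;Alias=M19211,gna-12,GAMMA-GLOBULIN"))
```

Every value is returned as a list so that single-valued tags such as ID and multi-valued tags such as Alias can be handled the same way.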
GBrowse Configuration File
• Global Website Settings
• Additional HTML Pages
• JavaScript
• jQuery
• Global Database Settings
• Data Source Definitions
• Customizations
Configuration file (.conf)
Making a new Track
### TRACK CONFIGURATION ###
[ExampleFeatures]
feature = remark
glyph = generic
stranded = 1
bgcolor = orange
height = 10
key = Example Features
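Because a simple track stanza like this one is INI-like, it can be inspected with Python's standard `configparser`. This is only a sketch for checking a stanza outside GBrowse: real GBrowse configuration files can contain constructs that `configparser` does not understand, and `read_tracks` is a hypothetical helper name.

```python
import configparser

STANZA = """\
### TRACK CONFIGURATION ###
[ExampleFeatures]
feature = remark
glyph = generic
stranded = 1
bgcolor = orange
height = 10
key = Example Features
"""


def read_tracks(text: str) -> dict:
    """Return {track name: {option: value}} for each [section] in the text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)  # lines starting with '#' are treated as comments
    return {name: dict(parser[name]) for name in parser.sections()}


if __name__ == "__main__":
    for track, options in read_tracks(STANZA).items():
        print(track, options)
```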
Adding Multiple Tracks
Data:
Configuration:
Searchable
Links
Result UI:
Popup balloons
with links
Searching for Features
• Gene symbols
• Gene IDs
• Sequence IDs
• Genetic markers
• Relative nucleotide coordinates
• Absolute nucleotide coordinates
• etc...
Viewing Multiple Tracks
Low Magnification
Viewing Multiple Tracks
High Magnification
In short…
• Main features (Determination of protein
coding and non-coding,…)
• Quantitative data (E-value, Identity
percentage)
• Other evidence (InterPro, COGs, etc.)
• GC content and other useful measurements
• Protein and DNA sequences
Value-Added Additions
THE NEW AGE
What’s New
RICHER ANNOTATION
Richer Annotation
INCREASED ANNOTATION INFO
[Figure: bar chart with series Total Genes, Pangenome Hits and UniProt; y-axis from 0 to 3000; x-axis: M19107, M19501, M21127, M21621, M21639, M21709]
Richer Annotation
INTEGRATED QUALITY SCORE
Origin of Database Matches
A color code was used for matches originating from different databases
Quality Value Integration
It distinguishes between different databases…
However, for matches from the same database…
Quality Scores
Origin of Database Matches
A color code will also be used for matches of different quality…
Different E-values are shown in different
shades of a color
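One possible way to turn an E-value into a shade is sketched below. The logarithmic scale, the cap of 100, the blue base color, and the function name `evalue_to_shade` are all assumptions made for illustration; the slides do not specify the actual mapping used in the browser.

```python
import math


def evalue_to_shade(evalue: float, base_rgb=(0, 0, 255), cap: float = 100.0) -> str:
    """Map an E-value to a hex color: white for weak hits, base color for strong ones."""
    if evalue <= 0:
        strength = 1.0  # an E-value of 0 is the strongest possible hit
    else:
        # -log10(E) grows as the match improves; clamp it to [0, cap].
        strength = min(max(-math.log10(evalue), 0.0), cap) / cap
    # Blend linearly from white (strength 0) to the base color (strength 1).
    r, g, b = (round(255 + (c - 255) * strength) for c in base_rgb)
    return f"#{r:02x}{g:02x}{b:02x}"


if __name__ == "__main__":
    for e in (1.0, 1e-10, 1e-50, 1e-100):
        print(e, evalue_to_shade(e))
```

Using a different `base_rgb` per source database would keep the database color code while still letting the shade carry the quality information.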
What’s New
MORE LINK-OUTS
COGs
KEGG ID
What’s New
PATHWAYS
KEGG ID
KEGG Compound
KEGG Genes
KEGG Pathway
Synthesis!
ORGANISM SPECIFIC PAGES
Organism Summary Page
• At this point of the course, we have gathered a
lot of information for the strains we are
dealing with
• Not all of this information could be
represented inside the genome browser
• We propose a separate section in the browser
containing strain-wise summarized
information
Organism Summary Page
• Conceptually, the page could contain:
– Biological information
– Assembly information:
Genome Size, Number of contigs, N50, Sequencing
platform
– Gene Prediction information:
Number of protein-coding and non-protein-coding genes,
links to 16S rRNA gene
– Annotation information:
Percent annotation, function distribution pie chart
– Comparative information:
Unique protein clusters, etc.
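The assembly figures on such a page are straightforward to compute from the contig lengths. The sketch below uses the usual definition of N50 (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size); `assembly_summary` and the example lengths are illustrative, not values from our strains.

```python
def assembly_summary(contig_lengths):
    """Return genome size, number of contigs and N50 for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # N50 is the first contig length at which half the assembly is covered.
        if running * 2 >= total:
            n50 = length
            break
    return {"genome_size": total, "num_contigs": len(lengths), "n50": n50}


if __name__ == "__main__":
    # Made-up contig lengths, for illustration only.
    print(assembly_summary([100, 80, 60, 40, 20]))
```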
Organism Summary Page
Adding more values
OPERONS
Operons
• Operon
“…is a functioning unit of genomic DNA
containing a cluster of genes under the control of
a single regulatory signal or promoter”
• ~70% of the genes have been assigned a unique
OperonID
• OperonID will provide an additional browsing
mechanism for biologists, connecting co-transcribed and co-regulated genes.
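Browsing by OperonID amounts to grouping genes that share the same identifier. A minimal sketch, assuming each gene is available as a (gene ID, OperonID or None) pair; the function name and the example identifiers are hypothetical.

```python
from collections import defaultdict


def group_by_operon(genes):
    """Group gene IDs by OperonID and report the fraction of genes assigned."""
    operons = defaultdict(list)
    total = 0
    assigned = 0
    for gene_id, operon_id in genes:
        total += 1
        if operon_id is not None:
            assigned += 1
            operons[operon_id].append(gene_id)
    fraction = assigned / total if total else 0.0
    return dict(operons), fraction


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    example = [("gene1", "op1"), ("gene2", "op1"), ("gene3", None), ("gene4", "op2")]
    operons, fraction = group_by_operon(example)
    print(operons, fraction)
```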
Operons
Incorporating Operon Information
More with Comparison
BRIG PATTERN
BRIG Patterns
• Concept:
Either generate BRIG images at run time
or load static images when the user requests
a BRIG pattern between two species
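The two options can be combined: generate an image the first time a pair is requested and serve the stored file afterwards. The sketch below shows only that caching logic. The `render` callable is a hypothetical placeholder for whatever actually runs BRIG, and the file-naming scheme is an assumption.

```python
from pathlib import Path
from typing import Callable


def brig_image_path(query: str, reference: str, cache_dir,
                    render: Callable[[str, str, Path], None]) -> Path:
    """Return the image for a pair, rendering it only if no static copy exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{query}_vs_{reference}.png"
    if not target.exists():
        # Run-time generation; `render` must write the image to `target`.
        render(query, reference, target)
    return target


if __name__ == "__main__":
    import tempfile

    def fake_render(query, reference, target):
        # Stand-in for a real BRIG run: just writes a placeholder file.
        target.write_bytes(b"placeholder image")

    with tempfile.TemporaryDirectory() as tmp:
        print(brig_image_path("M19107", "M19501", tmp, fake_render))
```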
BRIG Patterns
That’s All Folks!
• Questions?
• Comments?
• Concerns?
• If you have any suggestions, we would love to
hear from you! (There is a page on the Wiki for it!)