Download It’s not ‘just’ about BIGDATA from BIGDATA

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
Transcript
It’s not ‘just’ about BIGDATA
How to get to actionable clinical knowledge
from BIGDATA
Subha Madhavan, Ph.D.
Innovation Center for Biomedical Informatics
Georgetown University
Scale
• The world’s current sequencing capacity is
estimated to be 13 quadrillion DNA bases a year
requiring 3 or more exabytes of storage
• The NIH-funded 1000 Genomes Project deposited
200 terabytes of data into the GenBank archive
during the project’s first 6 months of operation,
twice as much as had been deposited into all of
GenBank for the entire 30 years preceding
• Million Genome projects??
Motivating Scenarios
• BIGDATA for the Clinician
– Personalized Genomic Medicine
Filter by pre-existing
association;
gene gene set, or
region
By importance based
on conservation;
PhastCons, Phylop
By Gene Location;
Protein coding/nonsynonymous/promoter
Exclude likely errors;
quality scores,
Pseudogenes
By expected effect on
protein;
Polyphen, Splice,
premature stop
By existing diseasephenotype
association;
HGMD, OMIM
By novelty or expected
allele frequency;
dbSNP, other SNP DBs
By anticipated mode
of inheritance,
zygosity, autosomal
Short list of Variants
report annotated for
clinical interpretation
Real Cost of Sequencing
Sboner et al. Genome Biology 2011, 12:125
Motivating Scenarios (2)
• BIGDATA for the Epidemiologist – Public
Health
– Monitoring disease incidence and pathogenic
outbreaks
– Real time microbial detection for controlling
infectious diseases
Levels of data
MegaGenomes Data Hierarchy
Level 0
• Unaligned raw reads
– Single
– Paired end/mate pair
– Complex sub-reads
• Quality scores
– Per-base position
– Homopolymer flow intensity
• Access pattern:
– Sequential over reads (parallel over sub-reads)
– Write-once, read rarely
• Storage:
– Not necessary, if level 1 is a strict superset
Level 1
• Aligned reads
– Primary and secondary alignments
– Alignment quality scoring
• Access patterns:
– Write-once, read infrequently
– Interval queries, sequential
– Mate/subread lookup, quasi-local random access
• Storage:
– Highly compressed application-specific binary, 100
GB/genome
• BAM, H5BAM
– Can also be re-aligned to new references if unaligned
sequences are stored, removing need to preserve level 0 data
Level 2
• Called consensus sequence
– One or more nucleotide sequences that represent the
consensus of the read data
– Aligned to reference coordinates, variable ploidy
– Allele-specific ploidy estimates
– Missing and reference nucleotides represented
– Consensus quality scores
• Structure graph
– Based on chimeric reads and clustered sub-reads
• Access pattern:
– Write-once, read-infrequently
– Interval queries
• Storage
– Highly compressed application-specific binary
Level 3
• Extensible, annotated, multi-subject “variant
file”
– VCF-like, but variable-ploidy, preserving missing and
reference calls and quality scores
• Access pattern
– Monolithic offline updates, read-often
– Interval and subject subset queries
– Filters by annotation
Level 4 (actionable data)
• Reduced, hypothesis-specific, integrated
dataset suitable for mining
– Genomic data orders of magnitude smaller than
levels 0-2
Some practical solutions we are
leveraging
Elastic storage and computes on the cloud
•
•
Data Delivery
Fully aligned, mapped and called genomic data
De-identified clinical data
PC
User
Genome Data
Amazon S3 Bucket
•
•
Encrypted at rest
99.999999999% durability
Amazon EC2 Instances
•
•
Creation of multi-sample VCF file
Using GLU toolkit to annotate VCF file
with additional databases
G-CODE Amazon EC2 Instances
•
Web interface for ad-hoc variant
query, clinical data query and
data analysis
Clinical Data
Amazon Glacier Vault
•
•
•
•
Archive all data in “cold storage”.
Remove raw reads and mappings from S3 •
1/10th the cost of S3
Oracle RDS Instance
MongoDB EC2 Instances
Load VCF file to MongoDB NoSQL database
Allows for ad-hoc query of variation data
•
Holds clinical data and study
metadata
NoSQL Databases
Smart Data Visualization
It’s not “just” about BIGDATA
BIGDATA
•
•
•
•
Actionable Knowledge
Need better visualization techniques
Need smart/parallel processing
Policy needs to catch up with technology
Cost is an issue
Innovation Center for Biomedical
Informatics (ICBI)
http://informatics.georgetown.edu/