Poxviruses, Biodefense and
Bioinformatics
Working towards a better understanding of
viral pathogenesis and evolution
PBR
Bioinformatics
 Managing Complexity
– Technology development
 Enhancing Understanding
– Research
Managing Complexity
 Data
– Acquisition
– Storage
– Manipulation
– Retrieval
Managing Complexity…
 Data Analysis
– Development and Utilization of
• Analytical tools
• Visualization tools
Enhancing Understanding
What distinguishes one organism from another?
 Sequence
 Molecular Biology
 Physiology
 Pathogenesis
 Epidemiology
 Evolution
Will the genomic sequence provide an explanation for the
differences?
What is Bioinformatics?
 Computer-aided analysis of biological information
 Discerning the characteristic (repeatable) patterns
in biological information that help to explain the
properties and interactions of biological systems.
 Caveat:
– In the end, bioinformatics (a.k.a. computers) can only
help in making inferences concerning biological
processes.
– These inferences (or hypotheses) have to be tested in
the laboratory
The Poxvirus Bioinformatic Resource
www.poxvirus.org
PBR Collaborators
 UAB
– Elliot Lefkowitz
 St. Louis University
– Mark Buller
 University of Victoria
– Chris Upton
 ATCC
– Charles Buck
 Medical College of Wisconsin
– Paula Traktman
The UAB MGBF Contingent
Molecular and Genetic Bioinformatics Facility
 Programmers
– Jim Moon
– Don Dempsey
– Uma Dave
– Bei Hu
 Students
– Chunlin Wang
 Fellows
– Shankar Changayil
– Xiaosi Han
Poxviruses
 Large dsDNA genome
– 150,000 – 300,000 base pairs
– 150 – 260 genes
 Complex virion morphology
 Cytoplasmic replication
 Array of immunoevasion strategies.
 Human pathogens
– Molluscum contagiosum
– Variola
– Monkeypox
The PBR is Designed to Support
Basic and applied research on Poxviruses
including the development of new:
 Environmental Detectors
 Diagnostic Reagents
 Animal Models
 Vaccines
 Antiviral Compounds
PBR Design Philosophy
 Useful and Used
 Supporting all poxvirus investigators
– UAB PBR Web-based application requirements
• Web Browser
• Java plugin
 In-depth analyses
– UVic analytical tools
BLAST
 Search a sequence database for primary sequence
similarities to some query sequence
 Provides a measure of the significance of the
similarity
 Does not necessarily imply common evolutionary
origin
 Developed at NCBI
– Altschul, S.F., Gish, W., Miller, W., Myers, E.W. &
Lipman, D.J. (1990) "Basic local alignment search
tool." J. Mol. Biol. 215:403-410.
18 Genomes; 563 genes = Avg. 31 genes/genome
PBR Knowledge Database
 Mini review of available structure-function information
– Human-curated database based on the literature
 Bibliographic information
 Available scientific resources
• clones, mutants, and antibodies
 Empirically-derived properties
– MW, pI . . .
– Post-translational modifications
– Expression
 Functional Assignments
– Gene Ontology controlled vocabulary
• Molecular function
• Biological Process
• Cellular component
– Virulence Ontology
Molecular Evolution and Genomic
Analyses of Poxviruses
Objectives
 To better understand the role individual
genes and groups of genes (or other genetic
elements) play in poxvirus (especially
smallpox) host range and virulence
 To describe and understand poxvirus
diversity via reconstruction of the family's
evolutionary history
Orthopoxvirus Phylogeny
[Figure: two phylogenetic trees, one for DNA Polymerase and one for Nucleoside triphosphatase. Taxa in both trees: MPXV-ZAI, VACV-COP, CMPV-M96, CPXV-BR, VARV-BSH, VMNV-GAR, ECTV-MOS. Node values range from 59 to 100. Scale bar: 10 nucleotide changes.]
Orthopoxvirus Phylogeny
 132-gene tree possible
 65-gene tree possible for Chordopoxviruses
Horizontal Gene Transfer
 The acquisition of genetic material from another
organism that becomes a “permanent” addition to
the recipient’s genome
 Many poxvirus genes involved in immune evasion
may have been acquired through HGT
 Detection of HGT
– Alternative base composition
– Alternative codon usage pattern
– Alternative evolutionary inheritance pattern
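The base-composition check in the list above can be sketched as a sliding-window GC scan. This is a minimal illustration only: it uses a plain fixed-width window rather than the wavelet smoothing mentioned elsewhere in the talk, and the function names and the window, step, and threshold defaults are assumptions chosen for the example, not PBR settings.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window=1000, step=500):
    """Return (start, gc_fraction) for each sliding window along the sequence."""
    last_start = max(len(seq) - window, 0)
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, last_start + 1, step)
    ]


def flag_deviant_windows(seq, window=1000, step=500, threshold=0.10):
    """Return windows whose GC fraction differs from the genome-wide value
    by more than `threshold` (an arbitrary cutoff for illustration)."""
    genome_gc = gc_fraction(seq)
    return [
        (start, gc)
        for start, gc in gc_windows(seq, window, step)
        if abs(gc - genome_gc) > threshold
    ]
```

Windows flagged this way are only candidates for horizontally transferred regions; codon usage and inheritance pattern would still need to be examined.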
Detecting HTGs by plotting codon usage
GC distribution of Molluscum Contagiosum
[Figure: GC distribution in the Molluscum contagiosum genome, smoothed using a wavelet technique. Blue numbers give the position in the genome. Green bars mark significant deviations, and a putative gene is marked at each one (labeled MOCV-SB1_011, MOCV-SB1_055, MOCV-SB1_132).]
VARV Proteins with Similarity to Human Proteins
 3-beta-hydroxysteroid dehydrogenase
 Ankyrin
 CD47 antigen
 Carbonic Anhydrase
 Casein kinase 1
 Complement control protein
 DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide
 DNA ligase
 Glutaredoxin
 Hypothetical protein
 JNK-stimulating phosphatase
 Kelch-like protein
 Lymphocyte activation-associated protein
 Makorin zinc-finger protein
 Myosin heavy chain
 Plasminogen activator inhibitor
 Profilin
 RNA polymerase
 Ribonucleotide reductase M2
 SNF2 transcription activator
 Serine proteinase inhibitor
 Squamous cell carcinoma antigen
 Superoxide dismutase
 Thymidine kinase
 Tumor necrosis factor receptor
Ribonucleotide Reductase Homolog Evolution
TNF Receptor Homolog Evolution
TNF Receptor GenBank nr Hits
VARV B22R BLASTN Results
Genome Comparison: Variola major vs. minor
Genome vs. Gene Phylogeny
Molecular Evolution and Genomic
Analyses of Poxviruses
We have a problem…
Poxvirus Gene Prediction
 Little consistency from one genome to
another
 Methods employed
– Minimum ORF size
– Similarity with previously described proteins
Consistently predict and annotate the
gene set for all Poxvirus genomes
 Development of a comprehensive gene
prediction tool
– Discovery of new or “missed” genes
– Removal of “pseudo” genes
 As an added bonus:
– Computational annotation of each predicted
gene
What is a gene?
 Does it look like a gene?
– Open Reading Frame
– Base composition
– Codon usage
 Is it expressed?
– Regulatory signals
– Transcription
– Translation
 Has it been previously recognized?
– Similarity searching
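The first criterion above, an open reading frame of some minimum size, can be sketched as a six-frame scan. This is a minimal illustration under stated assumptions: ORFs are taken to run from the first ATG in a frame to the next in-frame stop codon, the standard stop codons are used, and the name find_orfs and its min_codons default are hypothetical, not the PBR tool's settings.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (non-ACGT characters are kept as-is)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=50):
    """Return (strand, start, end, n_codons) for each ATG..stop ORF.

    Coordinates are 0-based on the scanned strand, `end` is exclusive and
    includes the stop codon, and `n_codons` excludes the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon in this open stretch
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, start, i + 3, n_codons))
                    start = None  # resume scanning after the stop codon
    return orfs
```

A size cutoff alone is a weak criterion, which is why it is combined with composition, expression signals, and similarity searching in the list above.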
Proposed gene-finding tool
 Combination of a series of complementary gene prediction algorithms
 DNA Signals
– ORF detection
– Base composition
– Codon preference
– HMM gene models
 Similarity searching
– BLAST similarity searches
– Similarity to identified poxvirus protein domains using an HMM-based
domain database
 Promoter detection
– Neural Network promoter detection tool
 Patterns of amino acid sequence conservation
– Biodictionary-based analysis
 Knowledge-based integration of all predictive methods
– Computational conclusions
– Visualization tool for human inspection
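The slide does not say how the integration step weighs the individual methods. As one hedged possibility, the sketch below uses simple voting: each method reports the candidate genes it supports, and a candidate is called when enough independent methods agree. The function name, the min_support default, and the "needs review" label are all assumptions made for illustration.

```python
from collections import defaultdict


def integrate_predictions(evidence, min_support=2):
    """Combine per-method candidate lists into one report.

    `evidence` maps a method name to the candidate gene identifiers it
    supports. A candidate is called "predicted" when at least `min_support`
    methods agree; otherwise it is left for human inspection.
    """
    support = defaultdict(set)
    for method, candidates in evidence.items():
        for candidate in candidates:
            support[candidate].add(method)

    report = {}
    for candidate, methods in support.items():
        call = "predicted" if len(methods) >= min_support else "needs review"
        report[candidate] = {"methods": sorted(methods), "call": call}
    return report
```

Candidates left as "needs review" are the ones a visualization tool would most usefully surface for inspection.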
Using High Performance Computing
to Speed Up Bioinformatic
Applications
Features to consider in porting an
application to a cluster environment
 Balancing the processing workload among nodes is critical
to successful implementation
 A computational method with a lower percentage load
imbalance (PLIB) is more efficient than one with a higher
PLIB. The workload is perfectly balanced if PLIB is equal
to zero.
 Similarity searching workload can be difficult to estimate
– Dependent on the nature of both the database and query sequences
• sequence length
• number of sequences
• complexity of the sequences
$$\mathrm{PLIB} = \frac{\mathrm{LargestLoad} - \mathrm{SmallestLoad}}{\mathrm{LargestLoad}} \times 100$$
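The PLIB definition can be checked with a few lines of code. This is a small sketch: the function name plib and the choice to return 0 when every node is idle are assumptions, and the load values in the usage lines are made up for illustration.

```python
def plib(loads):
    """Percentage load imbalance for a list of per-node workloads.

    Returns 0.0 for a perfectly balanced run. An all-zero workload is
    treated as balanced to avoid dividing by zero (an assumption).
    """
    largest = max(loads)
    smallest = min(loads)
    if largest == 0:
        return 0.0
    return 100 * (largest - smallest) / largest


# Illustrative values only: three nodes with equal work, then two uneven nodes.
balanced = plib([100, 100, 100])
uneven = plib([100, 50])
```

Here `balanced` is 0 and `uneven` is 50, meaning the least-loaded node did half the work of the most-loaded one.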


Data Segmentation
 Database Sequences
– Utilize when the database size is larger than physical memory of
each computational node
– Results need to be combined and statistics recalculated
– Not possible with some applications (PSI-BLAST)
 Query Sequences
– Flexible and allows for better balancing of the workload
– Statistics remain valid
– Database remains intact
– Best performance when the database can be fully loaded into
available memory
Work Flow for Database Segmentation
 Database is split evenly and formatted
 Database fragments are sent to each node
 Query file is distributed to all nodes
 The search is initiated
 Output is collected for merging and formatting
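The first step of this workflow, splitting the database evenly, can be sketched as follows. The slides do not say how "evenly" is measured; this sketch assumes it means roughly equal total residues per fragment and uses a greedy longest-first assignment. The function name and the (identifier, sequence) record layout are hypothetical.

```python
def split_database(records, n_fragments):
    """Split (identifier, sequence) records into fragments of similar size.

    Records are placed longest first, each into the fragment that currently
    holds the fewest residues. Returns the fragments and their residue totals.
    """
    fragments = [[] for _ in range(n_fragments)]
    sizes = [0] * n_fragments
    for rec_id, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        target = sizes.index(min(sizes))  # emptiest fragment so far
        fragments[target].append((rec_id, seq))
        sizes[target] += len(seq)
    return fragments, sizes
```

Each fragment would then be formatted and sent to its own node, and the per-fragment results merged with statistics recalculated for the full database size.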
Work Flow for Query
Segmentation
 Database is distributed to all nodes
 90% of the query sequences are split into bins and
distributed among the available nodes
– Balanced for sequence length and number
 The remaining 10% of the query sequences
are delivered to nodes as they finish the initial
search
 Individual results are merged and reported
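The two-phase scheme above can be simulated in a few lines. This is a sketch under stated assumptions: search cost is taken to be proportional to query length, the reserved 10% is simply the tail of the input list, and the initial bins are filled greedily. The function name and the reserve_percent parameter are hypothetical.

```python
import heapq


def schedule_queries(lengths, n_nodes, reserve_percent=10):
    """Simulate query segmentation with a static and a dynamic phase.

    Returns (assignment, loads): the query indices given to each node and
    each node's total estimated cost (sum of query lengths).
    """
    n_reserve = len(lengths) * reserve_percent // 100
    n_initial = len(lengths) - n_reserve
    initial = sorted(range(n_initial), key=lambda q: lengths[q], reverse=True)
    reserve = list(range(n_initial, len(lengths)))

    # Static phase: bin the initial queries, longest first, into the
    # least-loaded node so that bins are balanced for length and number.
    assignment = [[] for _ in range(n_nodes)]
    loads = [0] * n_nodes
    for q in initial:
        node = loads.index(min(loads))
        assignment[node].append(q)
        loads[node] += lengths[q]

    # Dynamic phase: the node that finishes first receives the next
    # reserved query.
    heap = [(load, node) for node, load in enumerate(loads)]
    heapq.heapify(heap)
    for q in reserve:
        load, node = heapq.heappop(heap)
        assignment[node].append(q)
        heapq.heappush(heap, (load + lengths[q], node))

    for load, node in heap:
        loads[node] = load
    return assignment, loads
```

Holding back a small reserve lets the faster nodes absorb leftover work, which is one way to keep the load imbalance low when per-query cost is hard to predict.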
Implementation
 Utilizes the LAM/MPI Message Passing Interface package from Indiana
University
 The application executables are not altered
– The implementation wraps the executable and data and sends them to each node
– Easily accommodates application updates
– Easily extends to similar applications
 Currently have implemented two wrappers
– BLAST
– HMMPFAM
• Sean Eddy, Washington University School of Medicine, St. Louis, Missouri
 Benchmarks performed on the UAB School of Engineering Linux cluster
– 2 storage servers (IBM x345)
– One compile node and 64 compute nodes (IBM x335)
• 2 x 2.4 GHz Xeon processors per node
• 2-4 GB of RAM per node
• 18 GB SCSI hard drive
• Connected via Gigabit Ethernet to a Cisco 4006 switch
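The wrapper idea, running an unaltered executable on a piece of the data and collecting its output, can be sketched as below. This is not the LAM/MPI implementation: local threads stand in for cluster nodes, and the wrapped command is assumed to accept the input file path as its last argument. The names run_wrapped and run_all are hypothetical, and no BLAST or HMMPFAM options are shown.

```python
import os
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor


def run_wrapped(command, query_chunk):
    """Run an unmodified executable on one chunk of query data.

    The chunk is written to a temporary file whose path is appended to
    `command` (assumed calling convention). Returns the program's stdout.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as fh:
        fh.write(query_chunk)
        path = fh.name
    try:
        result = subprocess.run(
            command + [path], capture_output=True, text=True, check=True
        )
        return result.stdout
    finally:
        os.remove(path)


def run_all(command, chunks, workers=4):
    """Run every chunk through the wrapper and return outputs in chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda chunk: run_wrapped(command, chunk), chunks))
```

Because the wrapper only builds a command line and gathers output, swapping in a newer version of the wrapped program requires no change to the wrapper itself.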
MPI-BLAST (query segmentation)
[Figure: total time (sec) and speedup plotted against number of processors (3, 7, 15, 31, 63).]
MPI-BLAST (database segmentation)
[Figure: total time (sec) and speedup plotted against number of processors (4, 8, 16, 32, 64).]
PLIB for BLAST in query segmentation
[Figure: PLIB plotted against number of processors (3, 7, 15, 31, 63).]
PLIB for BLAST in database segmentation
[Figure: PLIB plotted against number of processors (4, 8, 16, 32, 64).]
MPI-HMMPFAM (query segmentation)
[Figure: total time (sec) and speedup plotted against number of processors (3, 7, 15, 31, 63).]
MPI-HMMPFAM (database segmentation)
[Figure: total time (sec) and speedup plotted against number of processors (2, 4, 8, 16, 32, 64).]
Comparison of gene finding methods
 DNA signal sensor
– Pros: Based on empirically derived, statistical evidence distinguishing biological signals.
– Cons: Difficult to distinguish background noise from real signals. Frequently not sensitive enough.
 Content sensor (Glimmer)
– Pros: Dependent on having a reasonable gene model.
– Cons: Short genes and genes present due to HGT are more difficult to detect.
 Similarity searching (BLAST, HMM)
– Pros: Relies on accumulated pre-existing biological data. Clearly detects highly relevant matches.
– Cons: Limited to pre-existing biological data; sensitive to database errors; difficult to detect more distant relationships.
 Promoter detection
– Pros: Reflects actual poxvirus biology (gene expression).
– Cons: Weak signals difficult to detect.
 Bio-dictionaries
– Pros: Useful for detecting novel genes.
– Cons: Difficult to implement; no biological evidence.
Gene prediction: Putting it all together
[Figure: prediction tracks for ORFs, similarity searching, Glimmer, Bio-Dictionary, promoter detector, and G/C plotting across genome positions 32000 to 40000.]
Now the real work can begin:
 More rigorous comparative analysis
– Shared and unique sets of gene composition
– SNP analysis of gene differences
 Whole genome phylogenetic prediction
 Individual gene phylogenetic prediction
 Unique patterns of evolutionary inheritance
 “Clustering” of evolutionary inheritance
with pathogenesis