Poxviruses, Biodefense and Bioinformatics
Working towards a better understanding of viral pathogenesis and evolution

Bioinformatics
• Managing Complexity
  – Technology development
• Enhancing Understanding
  – Research

Managing Complexity
• Data
  – Acquisition
  – Storage
  – Manipulation
  – Retrieval

Managing Complexity…
• Data Analysis
  – Development and Utilization of
    • Analytical tools
    • Visualization tools

Enhancing Understanding
What distinguishes one organism from another?
• Sequence
• Molecular Biology
• Physiology
• Pathogenesis
• Epidemiology
• Evolution
Will the genomic sequence provide an explanation for the differences?

What is Bioinformatics?
• Computer-aided analysis of biological information
• Discerning the characteristic (repeatable) patterns in biological information that help to explain the properties and interactions of biological systems.
• Caveat:
  – In the end, bioinformatics (a.k.a. computers) can only help in making inferences concerning biological processes.
  – These inferences (or hypotheses) have to be tested in the laboratory

The Poxvirus Bioinformatic Resource (PBR)
www.poxvirus.org

PBR Collaborators
• UAB
  – Elliot Lefkowitz
• St. Louis University
  – Mark Buller
• University of Victoria
  – Chris Upton
• ATCC
  – Charles Buck
• Medical College of Wisconsin
  – Paula Traktman

The UAB MGBF Contingent
Molecular and Genetic Bioinformatics Facility
• Programmers
  – Jim Moon
  – Don Dempsey
  – Uma Dave
  – Bei Hu
• Students
  – Chunlin Wang
• Fellows
  – Shankar Changayil
  – Xiaosi Han

Poxviruses
• Large dsDNA genome
  – 150,000 to 300,000 base pairs
  – 150 to 260 genes
• Complex virion morphology
• Cytoplasmic replication
• Array of immunoevasion strategies.
• Human pathogens
  – Molluscum contagiosum
  – Variola
  – Monkeypox

The PBR is Designed to Support
Basic and applied research on poxviruses, including the development of new:
• Environmental Detectors
• Diagnostic Reagents
• Animal Models
• Vaccines
• Antiviral Compounds

PBR Design Philosophy
• Useful and Used
• Supporting all poxvirus investigators
  – UAB PBR Web-based application requirements
    • Web Browser
    • Java plugin
• In-depth analyses
  – UVic analytical tools

BLAST
• Search a sequence database for primary sequence similarities to some query sequence
• Provides a measure of the significance of the similarity
• Does not necessarily imply common evolutionary origin
• Developed at NCBI
  – Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.

18 Genomes; 563 genes = Avg. 31 genes/genome

PBR Knowledge Database
• Mini review of available structure-function information
  – Human-curated database based on the literature
• Bibliographic information
• Available scientific resources
  – Clones, mutants, and antibodies
• Empirically-derived properties
  – MW, pI . . .
  – Post-translational modifications
  – Expression
• Functional Assignments
  – Gene Ontology controlled vocabulary
    • Molecular function
    • Biological Process
    • Cellular component
  – Virulence Ontology

Molecular Evolution and Genomic Analyses of Poxviruses

Objectives
• To better understand the role individual genes and groups of genes (or other genetic elements) play in poxvirus (especially smallpox) host range and virulence
• To describe and understand poxvirus diversity via reconstruction of the family's evolutionary history

Orthopoxvirus Phylogeny
[Figure: two gene trees, DNA Polymerase and Nucleoside triphosphatase, for MPXV-ZAI, VACV-COP, CMPV-M96, CPXV-BR, VARV-BSH, VMNV-GAR and ECTV-MOS. Scale bar: 10 nucleotide changes.]

Orthopoxvirus Phylogeny
• 132 gene tree possible
• 65 gene tree possible for Chordopoxviruses

Horizontal Gene Transfer
• The acquisition of genetic material from another organism that becomes a “permanent” addition to the recipient’s genome
• Many poxvirus genes involved in immune evasion may have been acquired through HGT
• Detection of HGT
  – Alternative base composition
  – Alternative codon usage pattern
  – Alternative evolutionary inheritance pattern

Detecting HTGs by plotting codon usage

GC distribution of Molluscum Contagiosum
[Figure: GC distribution plot with labeled genes MOCV-SB1_011, MOCV-SB1_055 and MOCV-SB1_132.]
GC distribution in the Molluscum contagiosum genome, smoothed with a wavelet technique. The blue numbers give the position in the genome. The green bars mark significant deviations, and the putative gene at each deviation is labeled.
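
The base-composition check for HGT can be illustrated with a small sketch. The plot on the slide was smoothed with a wavelet technique; the version below swaps in a plain sliding window and a z-score cutoff, which is simpler and is not the method behind the figure. The function names, window size, step and cutoff are all illustrative assumptions.

```python
from statistics import mean, pstdev


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(genome: str, window: int = 1000, step: int = 500):
    """Return (start, gc_fraction) for each full window along the genome."""
    return [
        (start, gc_fraction(genome[start:start + window]))
        for start in range(0, len(genome) - window + 1, step)
    ]


def deviating_windows(genome, window=1000, step=500, z_cutoff=2.0):
    """Windows whose GC fraction lies more than z_cutoff standard
    deviations from the mean over all windows (assumed criterion)."""
    windows = gc_windows(genome, window, step)
    values = [gc for _, gc in windows]
    if len(values) < 2:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # uniform composition: nothing stands out
    return [(start, gc) for start, gc in windows if abs(gc - mu) / sigma > z_cutoff]


if __name__ == "__main__":
    # Synthetic example: a GC-rich background with an AT-rich insert.
    genome = "GC" * 5000 + "AT" * 500 + "GC" * 5000
    for start, gc in deviating_windows(genome):
        print(start, round(gc, 2))
```

Windows flagged this way are only candidates. As the slide notes, codon usage and evolutionary inheritance patterns provide independent lines of evidence, and a region of unusual composition still needs a gene call and a phylogenetic check.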

VARV Proteins with Similarity to Human Proteins
• 3-beta-hydroxysteroid dehydrogenase
• Ankyrin
• CD47 antigen
• Carbonic Anhydrase
• Casein kinase 1
• Complement control protein
• DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide
• DNA ligase
• Glutaredoxin
• Hypothetical protein
• JNK-stimulating phosphatase
• Kelch-like protein
• Lymphocyte activation-associated protein
• Makorin zinc-finger protein
• Myosin heavy chain
• Plasminogen activator inhibitor
• Profilin
• RNA polymerase
• Ribonucleotide reductase M2
• SNF2 transcription activator
• Serine proteinase inhibitor
• Squamous cell carcinoma antigen
• Superoxide dismutase
• Thymidine kinase
• Tumor necrosis factor receptor

Ribonucleotide Reductase Homolog Evolution

TNF Receptor Homolog Evolution

TNF Receptor GenBank nr Hits

VARV B22R BLASTN Results

Genome Comparison: Variola major vs. minor

Genome vs. Gene Phylogeny

Molecular Evolution and Genomic Analyses of Poxviruses
We have a problem…

Poxvirus Gene Prediction
• Little consistency from one genome to another
• Methods employed
  – Minimum ORF size
  – Similarity with previously described proteins

Consistently predict and annotate the gene set for all Poxvirus genomes
• Development of a comprehensive gene prediction tool
  – Discovery of new or “missed” genes
  – Removal of “pseudo” genes
• As an added bonus:
  – Computational annotation of each predicted gene

What is a gene?
• Does it look like a gene?
  – Open Reading Frame
  – Base composition
  – Codon usage
• Is it expressed?
  – Regulatory signals
  – Transcription
  – Translation
• Has it been previously recognized?
  – Similarity searching

Proposed gene finding tool
• Combination of a series of complementary gene prediction algorithms
• DNA Signals
  – ORF detection
  – Base composition
  – Codon preference
  – HMM gene models
• Similarity searching
  – BLAST similarity searches
  – Similarity to identified poxvirus protein domains using an HMM-based domain database
• Promoter detection
  – Neural Network promoter detection tool
• Patterns of amino acid sequence conservation
  – Biodictionary-based analysis
• Knowledge-based integration of all predictive methods
  – Computational conclusions
  – Visualization tool for human inspection

Using High Performance Computing to Speed Up Bioinformatic Applications

Features to consider in porting an application to a cluster environment
• Balancing the processing workload among nodes is critical to successful implementation
• A computational method with a lower percentage load imbalance (PLIB) is more efficient than one with a higher PLIB. The workload is perfectly balanced if PLIB is equal to zero.
• Similarity searching workload can be difficult to estimate
  – Dependent on the nature of both the database and query sequences
    • sequence length
    • number of sequences
    • complexity of the sequences

$$\mathrm{PLIB} = \frac{\mathrm{LargestLoad} - \mathrm{SmallestLoad}}{\mathrm{LargestLoad}} \times 100$$

Data Segmentation
• Database Sequences
  – Utilize when the database size is larger than the physical memory of each computational node
  – Results need to be combined and statistics recalculated
  – Not possible with some applications (PSI-BLAST)
• Query Sequences
  – Flexible and allows for better balancing of the workload
  – Statistics remain valid
  – Database remains intact
  – Best performance when the database can be fully loaded into available memory

Work Flow for Database Segmentation
• Database is split evenly and formatted
• Database fragments are sent to each node
• Query file is distributed to all nodes
• The search is initiated
• Output is collected for merging and formatting

Work Flow for Query Segmentation
• Database is distributed to all nodes
• 90% of the query sequences are split into bins and distributed among the available nodes
  – Balanced for sequence length and number
• The remaining 10% of the query sequences are delivered to nodes as they finish the initial search
• Individual results are merged and reported

Implementation
• Utilizes the LAM/MPI Message Passing Interface package from Indiana University
• The application executables are not altered
  – The implementation wraps the executable and data and sends them to each node
  – Easily accommodates application updates
  – Easily extends to similar applications
• Two wrappers are currently implemented
  – BLAST
  – HMMPFAM
    • Sean Eddy, Washington University School of Medicine, St. Louis, Missouri
• Benchmarks performed on the UAB School of Engineering Linux cluster
  – 2 storage servers (IBM x345)
  – one compile node and 64 compute nodes (IBM x335)
    • 2 x 2.4 GHz Xeon processors per node
    • 2-4 GB of RAM per node
    • 18 GB SCSI hard drive
    • connected via Gigabit Ethernet to a Cisco 4006 switch

[Figure: MPI-BLAST (query segmentation). Total time (sec) and speedup vs. processors (3, 7, 15, 31, 63).]
[Figure: MPI-BLAST (database segmentation). Total time (sec) and speedup vs. processors (4, 8, 16, 32, 64).]
[Figure: PLIB for BLAST in query segmentation. PLIB vs. processors (3, 7, 15, 31, 63).]
[Figure: PLIB for BLAST in database segmentation. PLIB vs. processors (4, 8, 16, 32, 64).]
[Figure: MPI-HMMPFAM (query segmentation). Total time (sec) and speedup vs. processors (3, 7, 15, 31, 63).]
[Figure: MPI-HMMPFAM (database segmentation). Total time (sec) and speedup vs. processors (2, 4, 8, 16, 32, 64).]

Comparison of gene finding methods
• DNA Signal sensor
  – Pros: Based on empirically derived, statistical evidence distinguishing biological signals.
  – Cons: Difficult to distinguish background noise from real signals. Frequently not sensitive enough.
• Content sensor (Glimmer)
  – Pros: Dependent on having a reasonable gene model.
  – Cons: Short genes and genes present due to HGT are more difficult to detect.
• Similarity searching (BLAST, HMM)
  – Pros: Relies on accumulated pre-existing biological data. Clearly detects highly relevant matches.
  – Cons: Limited to pre-existing biological data; sensitive to database errors; difficult to detect more distant relationships.
• Promoter detection
  – Pros: Reflects actual poxvirus biology (gene expression).
  – Cons: Weak signals difficult to detect.
• Bio-dictionaries
  – Pros: Useful for detecting novel genes.
  – Cons: Difficult to implement; no biological evidence.
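
The PLIB definition and the query segmentation work flow from the cluster slides can be sketched in a few lines. This is a minimal illustration and not the LAM/MPI wrapper itself: it uses sequence length as a stand-in for search load, holds back the shortest queries as the reserve pool, and fills the bins with a greedy longest-first assignment. The function names, the load estimate, the choice of reserve and the greedy strategy are all assumptions; only the 90%/10% split and the PLIB formula come from the slides.

```python
import heapq


def plib(loads):
    """Percentage load imbalance: (largest - smallest) / largest * 100."""
    largest, smallest = max(loads), min(loads)
    if largest == 0:
        return 0.0
    return (largest - smallest) / largest * 100


def segment_queries(lengths, n_nodes, initial_fraction=0.9):
    """Split query sequences into per-node bins plus a reserve pool.

    lengths: dict mapping query id -> sequence length (assumed load estimate).
    Returns (bins, reserve). bins holds one list of query ids per node;
    reserve holds the remaining queries, to be handed out as nodes finish.
    """
    # Longest first, so the greedy assignment places the large jobs early.
    ordered = sorted(lengths, key=lengths.get, reverse=True)
    n_initial = round(len(ordered) * initial_fraction)
    initial, reserve = ordered[:n_initial], ordered[n_initial:]

    bins = [[] for _ in range(n_nodes)]
    # Min-heap of (current load, sequence count, node index): each query goes
    # to the least-loaded node, with ties broken by the fewest sequences.
    heap = [(0, 0, node) for node in range(n_nodes)]
    heapq.heapify(heap)
    for qid in initial:
        load, count, node = heapq.heappop(heap)
        bins[node].append(qid)
        heapq.heappush(heap, (load + lengths[qid], count + 1, node))
    return bins, reserve


if __name__ == "__main__":
    # Hypothetical query lengths, for illustration only.
    lengths = {"q1": 900, "q2": 700, "q3": 500, "q4": 400, "q5": 300,
               "q6": 200, "q7": 100, "q8": 100, "q9": 50, "q10": 50}
    bins, reserve = segment_queries(lengths, n_nodes=3)
    loads = [sum(lengths[q] for q in b) for b in bins]
    print(loads, reserve, round(plib(loads), 2))
```

Sequence length alone is a crude load estimate. The slides point out that the real workload also depends on the number and complexity of the sequences, which is one reason a reserve pool handed out on demand helps even out the finish times.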

Gene prediction: Putting it all together
[Figure: prediction tracks (ORFs, Similarity searching, Glimmer, Bio-Dictionary, Promoter detector, G/C plotting) across genome positions 32000 to 40000.]

Now the real work can begin:
• More rigorous comparative analysis
  – Shared and unique sets of gene composition
  – SNP analysis of gene differences
• Whole genome phylogenetic prediction
• Individual gene phylogenetic prediction
• Unique patterns of evolutionary inheritance
• “Clustering” of evolutionary inheritance with pathogenesis
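
As a small illustration of the first comparative step listed above, shared and unique gene sets reduce to set operations once every genome has been annotated consistently. The sketch below assumes that orthologous genes already carry the same identifier in each genome; the function name, genome labels and gene identifiers are hypothetical.

```python
def shared_and_unique(gene_sets):
    """Return (shared, unique) for a mapping genome name -> set of gene ids.

    shared: genes present in every genome.
    unique: dict mapping each genome to the genes found in no other genome.
    """
    if not gene_sets:
        return set(), {}
    shared = set.intersection(*(set(genes) for genes in gene_sets.values()))
    unique = {}
    for name, genes in gene_sets.items():
        # Union of the gene sets of all other genomes.
        others = set().union(
            *(set(g) for other, g in gene_sets.items() if other != name)
        )
        unique[name] = set(genes) - others
    return shared, unique


if __name__ == "__main__":
    # Hypothetical gene content, for illustration only.
    gene_sets = {
        "genome_A": {"g1", "g2", "g3"},
        "genome_B": {"g1", "g2", "g4"},
        "genome_C": {"g1", "g5"},
    }
    shared, unique = shared_and_unique(gene_sets)
    print(sorted(shared), {k: sorted(v) for k, v in unique.items()})
```

The result is only as reliable as the gene calls behind it, which is why consistent prediction across all genomes comes first.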