Taxonomic distribution of large DNA viruses in the sea

Adam Monier, Jean-Michel Claverie & Hiroyuki Ogata
Genome Biology 2008, 9:R106

Virus
- A small infectious agent that can replicate only inside the living cells of other organisms
- Infects all types of organisms: animals, plants, bacteria and archaea
- Found in almost every ecosystem on Earth
- The most abundant type of biological entity
- Consists of two or three parts:
  - DNA or RNA (genetic information)
  - Capsid protein (protects the genetic material)
  - Some may have an envelope

Viruses in marine systems
- Abundant in the marine system: 10^6 to 10^9 virus-like particles per milliliter of sea water
- Infect marine organisms from oxygen-producing phytoplankton to whales
- Regulate the populations of many sea organisms and are important effectors of global biogeochemical fluxes
- Hold great genetic diversity
- May significantly contribute to the evolution of microorganisms in marine ecosystems

- A quantitative description of the marine virosphere
- The determination of the relative abundance of virus families
- The assessment of the level of their genetic diversity

Data set
- The first phase of the Sorcerer II Global Ocean Sampling (GOS) Expedition
- The GOS data comprise a large environmental shotgun sequence collection, with 7.7 million sequencing reads assembled into contigs totaling 4.9 billion bp
- At least 3% of the predicted proteins contained within the GOS data are of viral origin
- Most DNA samples were extracted from the 0.1-0.8 μm size fraction

Methods for determining taxonomic distribution
- 'Binning' is the first step in analyzing microbial populations in metagenomic sequences
- Drawbacks of the use of homology search programs:
  - BLAST scores are highly sensitive to alignment sizes and to insertions/deletions
  - It is difficult to infer evolutionary distances among high-scoring hits from the BLAST scores alone

Phylogenetic analysis
- Phylogenetic analysis is the process used to determine the evolutionary relationships between organisms.
- The results of an analysis can be drawn in a hierarchical diagram called a phylogenetic tree.
- Branches are based on the hypothesized evolutionary relationships between organisms.
- Each member in a branch is assumed to be descended from a common ancestor.

B-family DNA polymerase (PolB)
- A DNA polymerase is an enzyme that catalyzes the polymerization of deoxyribonucleotides into a DNA strand during replication
- PolB sequences are conserved in all known members of nucleocytoplasmic large DNA viruses
- The presence of PolB homologs in bacteria is limited
- PolB genes show strong sequence conservation and an apparently low frequency of recent horizontal transfer
- PolB is a useful marker to examine the taxonomic distribution of large DNA viruses in a metagenomic sequence collection

Limitations of standard phylogenetic methods
- Environmental shotgun sequences are short
- They vary widely in size and correspond to different parts of a selected marker gene
- Standard phylogenetic analysis does not provide an appropriate alignment for such fragments

Phylogenetic mapping
- A new phylogeny-based method developed by the authors
- Analyzes individual metagenomic sequences one by one
- Determines their phylogenetic positions using a reference multiple sequence alignment (MSA) and a reference tree

This paper...
- Examines the taxonomic richness and the relative abundance of different large DNA viruses in marine environments
- Analyzes the GOS data set by phylogenetic mapping
- Uses PolB sequences as reference

Results
- Phylogenetic mapping
- Validation of the mapping results using long PolB fragments
- Comparison of the abundance of viral PolB genes with the bacterial ones
- Geographic distributions of viral PolBs
- Examination of additional ORFs

1. Phylogenetic mapping
- Step 1: identification of PolB fragments
- Step 2: generation of a reference MSA and a maximum likelihood tree
- Step 3: examination of the PolB fragments' phylogenetic positions

Step 1: Identification of PolB fragments
- Searched the GOS data set for PolB-like sequences using the Pfam hidden Markov profile (PF00136)
- Obtained a set of 1,947 sequences, the 'PolB fragments'

Step 2: Reference MSA and maximum likelihood tree
- PolB homologs from known organisms
- Build a reference MSA corresponding to the polymerase domains of PolB homologs (contains 101 sequences)
- Generate a maximum likelihood tree

Step 3: Examination of the PolB fragments' phylogenetic positions
- Reduce the reference MSA (51 representatives) and the reference tree (99 branches)
- Conserve the original topology of the full reference tree
- Align each of the PolB fragments on the reference MSA using the T-Coffee profile method
- Compute the likelihoods for all 99 possible branching positions with ProtML
- Assess the statistical significance of the best tree by the RELL bootstrap method

Taxonomic distribution of the GOS PolB fragments
- Assigned the best branching position for 1,423 PolB fragments
- 1,224 (86%) were mapped on viral branches
- 869 were supported by RELL (bootstrap value ≥ 75%)
- 811 were on viral branches
- [Figure labels: Chloroviruses, Mimiviruses, Phages]

2. Validation of the mapping results using long PolB fragments
- Examined the phylogenetic mapping result and the sequence diversity of the PolB fragments classified in large eukaryotic virus groups (NCLDVs)
- Built a single alignment of the selected long PolB fragments together with the reference PolB sequences from large eukaryotic virus groups

3. Comparison of the abundance of viral PolB genes with the bacterial ones
- Read coverage was used to measure the abundance of the cognate DNA molecules.
- Compute the read coverage of each contig harboring a PolB fragment
- Obtain the median of the read coverage values for each branch

Viral PolBs are more diverse than bacterial PolBs
- Viral branches: a large number of mapped contigs exhibiting a low coverage
- Bacterial branches: a lower number of mapped contigs with a larger read coverage
- Virus populations are numerous and very diverse

4. Geographic distributions of viral PolBs
- Compare the relative abundance of the predicted viral PolB fragments and the associated metadata across different GOS sampling sites

Geographic localization

5. Examination of additional ORFs
- Searched the putative viral contigs against NRDB by BLASTX
- 'Virus-specific' genes found next to the PolB homologs:
  - OtV5 putative major capsid gene [chlorovirus group branch]
  - regA (translation repressor of early genes) or uvsX (recA-like recombination and DNA repair protein gene) [cyanophage P-SSM4 branch]

Prediction of 'new' viral genes
- An ORF similar to RimK, a protein involved in post-translational modification of the ribosomal protein S6, was found on the cyanophage P-SSM4 branch
- No rimK homolog has been found in a viral genome
- Use this viral RimK homolog as a TBLASTN query to screen the entire GOS data set

GOS contigs with putative RimK sequences
- Identified more than 100 contigs harboring RimK homologs with higher similarities than those exhibited by cellular homologs in NRDB
- Many of these contigs have additional ORFs usually specific to phages

Maximum likelihood tree of RimK sequences
- The RimK homologs are closely related to each other and distantly related to bacterial RimK
- This indicates the existence of phages carrying rimK homologs in marine environments: a 'new' viral gene

Conclusion
- The phylogenetic mapping approach provided a comprehensive picture of the taxonomic distribution of large viruses enclosed in the GOS metagenomic data.
- The highest genetic richness corresponded to phages.
- The Mimiviridae represent a major and ubiquitous component of large eukaryotic DNA viruses in diverse marine environments.
- Prediction of 'new' viral genes

Pfam
- Pfam is a large collection of protein families, represented by multiple sequence alignments and hidden Markov models (HMMs)

T-Coffee
- A multiple sequence alignment program
- Compares all the sequences two by two, producing a global alignment and a series of local alignments
- Then combines all these alignments into a multiple alignment
- Allows you to combine results obtained with several alignment methods
- T-Coffee combines all that information and produces a new multiple sequence alignment that best agrees with all these methods

ProtML
- Maximum Likelihood Inference of Protein Phylogeny
- Developed by Felsenstein
- Implements the maximum likelihood method for protein amino acid sequences
- Uses either the Jones-Taylor-Thornton or the Dayhoff probability model of change between amino acids
- Uses a hidden Markov model (HMM) method of inferring different rates of evolution at different amino acid positions

Read coverage
- Read coverage of a contig is the number of reads that contribute to the contig consensus.
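
The read coverage definition, combined with the per-branch median used when comparing viral and bacterial PolB abundance, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `(branch, reads)` input layout, and the example data are hypothetical and do not come from the paper or from any GOS file format.

```python
from collections import defaultdict
from statistics import median


def read_coverage(reads):
    """Read coverage of a contig: the number of reads in its consensus."""
    return len(reads)


def median_coverage_per_branch(contigs):
    """Group contigs by the reference-tree branch they were mapped to and
    return the median read coverage for each branch.

    `contigs` is an iterable of (branch_name, list_of_read_ids) pairs;
    this layout is an assumption made for the example.
    """
    coverage_by_branch = defaultdict(list)
    for branch, reads in contigs:
        coverage_by_branch[branch].append(read_coverage(reads))
    return {b: median(v) for b, v in coverage_by_branch.items()}


# Made-up data: many low-coverage contigs on a viral branch,
# fewer but deeper contigs on a bacterial branch.
example_contigs = [
    ("viral_branch", ["r1"]),
    ("viral_branch", ["r2", "r3"]),
    ("viral_branch", ["r4"]),
    ("bacterial_branch", ["r5", "r6", "r7", "r8"]),
    ("bacterial_branch", ["r9", "r10", "r11", "r12", "r13", "r14"]),
]

result = median_coverage_per_branch(example_contigs)
```

With this toy input the viral branch has more contigs but a lower median coverage than the bacterial branch, which is the pattern the slides describe for the real data.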
