Identify gene markers for different taxonomic groups in Archaea and Bacteria Genomes

Dongying Wu (1,2), Jonathan A. Eisen (1,2)
1. DOE Joint Genome Institute, Walnut Creek, California 94598, USA
2. University of California, Davis, Davis, California 95616, USA

Introduction
The sequencing and phylogenetic analysis of rRNA molecules demonstrated that all organisms could be placed on a single tree of life. Because highly conserved, homologous 16S rRNA genes are present in all organismal lineages, they are the only universal marker that has been adopted by biologists. Unfortunately, phylogenetic trees based on rRNA sequences do not always accurately reflect the evolutionary history of the organisms represented, owing to lateral gene transfer, different rates of evolution in different lineages, and convergent evolution of rRNAs, which can make rRNA sequences from distantly related species more similar to each other over time. There are also difficulties in generating accurate alignments of rRNA genes. For metagenomic studies of bacteria and archaea, more phylogenetic markers are needed in addition to the small subunit rRNA gene.

Phylogenetic analyses using protein markers have been limited in scope. To date, the AMPHORA package developed by our lab includes only 31 protein markers for bacteria. To address this issue, we have started to identify phylogenetic markers at different taxonomic levels systematically. The progress of genome sequencing, especially the phylogenetic-diversity-driven Genomic Encyclopedia of Bacteria and Archaea (GEBA) project, provides a great dataset for our top-down approach to marker identification.

B. Grouping Families at Different Taxonomic Levels into Super-families
Many gene families identified for different phylogenetic groups might be sub-families of families at higher taxonomic levels.
To capture the relationships among the gene families and identify super-families that span multiple phylogenetic groups, we used a tree-based method to sort and cluster the gene families. HMM profiles were built for each of the 5133 gene families that can be marker candidates at local taxonomic levels, and one consensus sequence was emitted from each HMM profile. The consensus sequences were subsequently grouped into single-linkage clusters according to an all-vs-all BLASTP search among the consensus sequences. A neighbor-joining tree was built for each cluster with FastTree. We picked the clades that had only one consensus sequence from any one taxonomic group and whose sequences were distinct from the other consensus sequences in the tree, using the same tree sampling program we used to identify phylogenetic marker candidates at local levels (Figure 2).

Figure 4. The monophyletic value at different taxonomic levels based on a PhyML tree of the ribosomal protein S4 family.

E. Identify Super-families that can be Markers for at least 5 Taxonomic Levels
For each super-family, a phylogenetic tree was built. At each taxonomic level of interest, we calculated the universality (0-100), the evenness (0-100), and the monophyletic value (0-100) of the super-family members in terms of genome distribution. We kept only the super-families that can be marker candidates for at least 5 taxonomic levels: for each taxonomic group of interest, the product of universality, evenness, and monophyletic value must be larger than 729000 (that is, 90 x 90 x 90, so the three scores must have a geometric mean above 90).

Methods and Results

A. Gene Family Building for Different Taxonomic Groups
Only taxonomic groups with more than five completed genomes were selected for marker identification (see Table 1 for the list). Amino acid sequences were downloaded from the JGI Integrated Microbial Genomes (IMG) website. Peptides from 9 gene families with extraordinarily high copy numbers (>1000) in the bacterial-level family building were filtered out first.
All-vs-all BLASTP searches were performed for the entire proteomes in each group (with the e-value cutoff set at 1e-10), followed by MCL clustering (with the inflation value set at 2). Neighbor-joining trees were built for all the MCL families. We’ve developed a tool to parse all the trees and identify clades with single-copy genes across the genomes in the group. HMM profiles were built for the selected clades, and an HMM search against the entire proteomes in the group was used to evaluate how distinct the gene families were. We consider a gene family a marker candidate if all the genomes in the group have exactly one copy of the gene family (high universality and high evenness, see Figure 1) and the family members can be distinctly picked up by the HMM profile built for the family. In total, 5133 families met our requirements (Table 1).

C. Building Phylogenetic Marker Super-families
HMM profiles were built for all the potential marker super-families identified from the consensus sequence trees. Genes belonging to the super-families that had not yet been included were retrieved with hmmsearch.

D. Evaluate the Marker Super-families at Different Taxonomic Levels
A family might be a good marker for certain taxonomic groups but unsuitable for others. To make a good marker, a family needs to be universally distributed across the genomes at a given taxonomic level and present as a single copy in each genome. Furthermore, the genes for a given taxonomic group need to be monophyletic in a phylogenetic tree. We developed a monophyletic measurement because we have to tolerate a certain degree of deviation from strict monophyly. The monophyletic measurement is demonstrated in Figure 3. The monophyletic value calculation for a gene family at different taxonomic levels is exemplified by the ribosomal protein S4 family (see Figure 4).

Figure 1. Gene family universality and evenness calculations.

$$\text{Universality} = 100 \times \frac{\text{Number of Genomes Covered by the Family}}{\text{Total Number of Genomes}}$$

$$\text{Evenness} = 100 \times e^{(\ldots)}$$

The exponent of the evenness score combines a constant 4 with $N_g$ and per-genome terms in $N_i$ and $N_m$, where:
$N_i$: the number of gene family members from genome $i$
$N_m$: the median of $N_i$ over all the genomes with the family
$N_g$: the number of genomes with the family

Figure 2. A neighbor-joining tree of all the consensus sequences in the carbamoyl-phosphate synthase cluster. Our tree parsing and sampling program automatically identifies 10 marker super-families that span multiple taxonomic groups (colored).

Phylogenetic group       Genome Number   Gene Number   Marker Candidates
Archaea                  62              145415        106
Actinobacteria           63              267783        136
Alphaproteobacteria      94              347287        121
Betaproteobacteria       56              266362        311
Gammaproteobacteria      126             483632        118
Deltaproteobacteria      25              102115        206
Epsilonproteobacteria    18              33416         455
Bacteroides              25              71531         286
Chlamydiae               13              13823         560
Chloroflexi              10              33577         323
Cyanobacteria            36              124080        590
Firmicutes               106             312309        87
Spirochaetes             18              38832         176
Thermi                   5               14160         974
Thermotogae              9               17037         684

Table 1. Phylogenetic Marker Candidates for Different Taxonomic Groups.

Figure 3. Monophyletic value calculation. In a phylogenetic tree, a list of taxa (total number $n$) can be divided into a number of monophyletic clades (clade number $c$). $s$ is the number of taxa in a given monophyletic clade.

$$\text{Clade number factor} = 1 - \left[\frac{c-1}{n}\right]^2$$

The clade distribution (sorted by size) is compared to the ideal monophyletic scenario, in which all $n$ taxa fall in a single clade:

Clade Size   Monophyletic Assumption   Square of Distance
$s_1$        $n$                       $(n - s_1)^2$
$s_2$        0                         $s_2^2$
…            0                         …
$s_c$        0                         $s_c^2$

$$\text{Euclidean distance} = \sqrt{(n - s_1)^2 + s_2^2 + \dots + s_c^2}$$

The monophyletic factor is derived from this Euclidean distance.

$$\text{Monophyletic Value} = 100 \times \text{Clade number factor} \times \text{Monophyletic factor}$$

Figure 5. The product of universality, evenness, and monophyletic value at different taxonomic levels for 209 super-families.

Conclusions
We’ve established a protocol for the automatic identification of phylogenetic markers for any given phylogenetic group. The protocol uses BLAST and MCL clustering to generate gene families for a selection of genomes.
Phylogenetic trees are built for the gene families, and clades from the trees are automatically sampled and evaluated for universality and evenness in terms of their distribution across the genomes of interest. HMM profiles are built for universally distributed single-copy genes that form monophyletic clades on phylogenetic trees. HMM searches against the genomes of interest help us decide which families are potential phylogenetic markers. We’ve built gene families and captured potential markers at the following taxonomic levels: Bacteria, Archaea, Actinobacteria, Bacteroides, Chlamydiae, Chloroflexi, Cyanobacteria, Firmicutes, Spirochaetes, Thermi, Thermotogae, and five classes of Proteobacteria. HMM profiles were built for 5133 families that can be potential markers at the taxonomic levels we’ve examined. We’ve also carried out clustering and tree-building analyses for all the taxon-specific marker families. As a result, we’ve identified 209 super-families, each of which can be a potential phylogenetic marker for at least 5 of the taxonomic groups we’ve studied. We are currently using the novel phylogenetic markers we’ve discovered to analyze metagenomic data, in order to evaluate their impact on community diversity and richness studies.
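
The scoring steps described in section E and in Figures 1 and 3 can be illustrated with a minimal Python sketch. It is not the tool used for the poster: all function and variable names are hypothetical, and only the parts of the scoring that are stated explicitly are implemented (universality, the clade number factor, the Euclidean distance to the ideal monophyletic scenario, and the product cutoff of 729000 across at least 5 taxonomic levels). The evenness score and the monophyletic factor are taken as precomputed inputs, since their exact formulas are not assumed here.

```python
import math

PRODUCT_CUTOFF = 729000  # 90 * 90 * 90, the cutoff stated in section E
MIN_LEVELS = 5           # a super-family must pass at >= 5 taxonomic levels


def universality(genomes_with_family, total_genomes):
    """100 x (genomes covered by the family) / (total genomes), as in Figure 1."""
    return 100.0 * genomes_with_family / total_genomes


def clade_number_factor(n_taxa, n_clades):
    """1 - [(c - 1) / n]^2, as in Figure 3."""
    return 1.0 - ((n_clades - 1) / n_taxa) ** 2


def distance_to_ideal(clade_sizes):
    """Euclidean distance between the observed clade sizes (sorted by size)
    and the ideal scenario where all n taxa sit in one clade: (n, 0, ..., 0)."""
    sizes = sorted(clade_sizes, reverse=True)
    n = sum(sizes)
    ideal = [n] + [0] * (len(sizes) - 1)
    return math.sqrt(sum((i - s) ** 2 for i, s in zip(ideal, sizes)))


def monophyletic_value(cn_factor, mono_factor):
    """100 x clade number factor x monophyletic factor.
    mono_factor is assumed to be supplied by the caller (scaled 0-1)."""
    return 100.0 * cn_factor * mono_factor


def is_marker_at_level(univ, even, mono):
    """True if the product of the three 0-100 scores exceeds the cutoff."""
    return univ * even * mono > PRODUCT_CUTOFF


def qualifies_as_super_family_marker(scores_by_level):
    """scores_by_level maps a taxonomic level name to a
    (universality, evenness, monophyletic value) tuple."""
    passed = [level for level, (u, e, m) in scores_by_level.items()
              if is_marker_at_level(u, e, m)]
    return len(passed) >= MIN_LEVELS


if __name__ == "__main__":
    # Hypothetical scores, for illustration only (not data from the poster).
    example = {
        "Archaea": (100, 98, 97),
        "Firmicutes": (96, 95, 93),
        "Cyanobacteria": (100, 100, 100),
        "Chloroflexi": (100, 92, 95),
        "Thermotogae": (100, 100, 91),
        "Spirochaetes": (80, 90, 70),
    }
    print(qualifies_as_super_family_marker(example))
```

Because the cutoff is applied to the product of the three scores, a slightly lower value on one measure can be offset by higher values on the others, whereas a per-measure threshold of 90 would not allow that trade-off.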